
Getting started with Solr

November 23rd, 2010


Apache Solr is a full-fledged search server based on the Lucene toolkit.

Lucene provides the core search algorithms and the index storage those algorithms require. Most basic search requirements can be fulfilled by Lucene alone, without Solr. But plain Lucene leaves gaps in development effort and non-functional aspects, which development teams must then cover in their own designs. This is where Solr adds value.

Solr provides these benefits over using the raw Lucene toolkit:

  • Solr allows search behaviour to be configured through configuration files, rather than through code. Specifying search fields, indexing criteria, and indexing behaviour in code is prone to maintenance problems.
  • Lucene is Java-centric (although ports to other languages exist). Solr, however, provides an HTTP interface that allows any platform to use it. Projects that involve multiple languages or platforms can share the same Solr server.
  • Solr provides an out-of-the-box faceted search (also called drilldown search) facility that allows users to incrementally refine results using filters and "drill down" towards a narrow set of best matches. Many shopping web portals use this feature to let their users refine results step by step.
  • Solr’s query syntax is slightly easier than Lucene’s. A default field can be specified, or Solr’s own dismax syntax can be used to search a fixed set of fields.
  • Solr’s Java client API is much simpler than Lucene’s. Solr abstracts away many of the underlying Lucene concepts.
  • Solr provides a straightforward API to add, update, and delete documents, unlike Lucene.
  • Solr supports a pluggable architecture. For example, post-processor plugins (such as search results highlighting) allow raw results to be modified.
  • Solr facilitates scalability using solutions like caching, memory tweaking, clustering, sharding and load balancing.
  • Solr provides plugins to fetch data from databases and index it. This workflow is probably the most common requirement for any search implementation, and Solr provides it out-of-the-box.

The following sections describe the basics of deploying Solr and using it from the command line.


Directory layout of Solr package

The extracted Solr package has this layout:

/client Contains client APIs in different languages to talk to a Solr server
/contrib/clustering Plugin that provides clustering capabilities for Solr, using Carrot2 clustering framework
/contrib/dataimporthandler Plugin that is useful for indexing data in databases
/contrib/extraction Plugin that is useful for extracting text from PDFs, Word DOCs, etc.
/contrib/velocity Handler to present and manipulate search results using Velocity templates.
/dist Contains Solr core jars and wars that can be deployed in servlet containers or elsewhere, and the solrj client API for Java clients.
/dist/solrj-lib Libraries required by the solrj client API.
/docs Offline documentation and javadocs
/lib Contains Lucene and other jars required by Solr
/src Source code
/example A skeleton standalone Solr server deployment. The default environment is Jetty. When deploying Solr, this is the directory that’s customized and deployed.
/example/etc Jetty or other environment specific configuration files go here
/example/example-DIH An example DB and the Data Import Handler plugin configuration to index that DB
/example/exampledocs Example XML request files to send to the Solr server. Usage: java -jar post.jar <xml filename>
/example/lib Jetty and servlet libraries. Not required if Solr is being deployed in a different environment
/example/logs Solr request logs
/example/multicore It’s possible to host multiple search cores in the same environment. Use case could be separate indexes for different categories of data.
/example/solr This is the main data area of Solr.
/example/solr/conf Contains configuration files used by Solr.

solrconfig.xml – Configuration parameters, memory tuning, different types of request handlers.

schema.xml – Specifies fields and analyzer configuration for indexing and querying. Other files contain data required by different components like the Stop word filter.

/example/solr/data This contains the actual results of indexing.
/example/webapps The solr webapp deployed in Jetty
/example/work Scratch directory for the container environment

Getting Started Guide

1) Copy the skeleton server under /example to the deployment directory.

2) Customize /example/solr/conf/schema.xml as explained in later sections, to model search fields of the application.

3) Start the Solr server. For the default Jetty environment, use this command line with the current directory set to /example:

java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar

STOP.PORT specifies the port on which the server listens for a stop instruction, and STOP.KEY is a secret key that must be passed when stopping the server.

4) If building from source, the WAR will be named something like apache-solr-4.0-snapshot.war. Copy it to /example/webapps and, importantly, rename it to solr.war. Without that renaming, Jetty will give 404 errors for /solr URLs.

5) The Solr server will now be available at http://localhost:8983/solr. 8983 is the default Jetty connector port, as specified in /example/etc/jetty.xml.

6) To stop the server, use the command line:

java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar --stop
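
The start and stop commands differ only in the trailing --stop argument (two plain hyphens), so they can be wrapped in a small helper. The following is a minimal sketch and not part of the Solr distribution: the function name solr_cmd is hypothetical, and it only prints the command line so that it can be inspected before being run from the /example directory.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Solr): prints the Jetty command line
# for starting or stopping the example server.
# Usage: solr_cmd start|stop [stop_port] [stop_key]
solr_cmd() {
  action=${1:-}
  port=${2:-8079}    # STOP.PORT: port on which Jetty listens for the stop instruction
  key=${3:-secret}   # STOP.KEY: secret that must match when stopping
  case "$action" in
    start) echo "java -DSTOP.PORT=$port -DSTOP.KEY=$key -jar start.jar" ;;
    stop)  echo "java -DSTOP.PORT=$port -DSTOP.KEY=$key -jar start.jar --stop" ;;
    *)     echo "usage: solr_cmd start|stop [stop_port] [stop_key]" >&2; return 1 ;;
  esac
}
```

Calling solr_cmd stop prints the stop command with the default port and key. The same port and key must be used for both start and stop, otherwise the stop request will not be accepted.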



Managing the Solr server with Ant during development

Starting and stopping Solr can be conveniently done from an IDE like Eclipse using an Ant script:

<project basedir="." name="ManageSolr">
	<property name="stopport" value="8079" />
	<property name="stopsecret" value="secret" />

	<!-- Start the Jetty-hosted Solr server -->
	<target name="start-solr">
		<java dir="./dist/solr" fork="true" jar="./dist/solr/start.jar">
			<jvmarg value="-DSTOP.PORT=${stopport}" />
			<jvmarg value="-DSTOP.KEY=${stopsecret}" />
		</java>
	</target>

	<!-- Ask the running server to shut down through the stop port -->
	<target name="stop-solr">
		<java dir="./dist/solr" fork="true" jar="./dist/solr/start.jar">
			<jvmarg value="-DSTOP.PORT=${stopport}" />
			<jvmarg value="-DSTOP.KEY=${stopsecret}" />
			<arg value="--stop" />
		</java>
	</target>

	<target name="restart-solr" depends="stop-solr,start-solr" />

	<!-- Post a delete command file to the server using post.jar -->
	<target name="deleteAllDocs">
		<java dir="./dist/solr/exampledocs" fork="true" jar="./dist/solr/exampledocs/post.jar">
			<arg value="${basedir}/deleteAllCommand.xml" />
		</java>
	</target>
</project>
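
The deleteAllDocs target posts a file named deleteAllCommand.xml, whose contents are not shown here. A reasonable assumption is that it holds Solr's XML delete-by-query message with a query that matches every document. The sketch below writes such a file; the function name write_delete_all is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: writes a delete-by-query message that matches all documents.
# Usage: write_delete_all <output file>
write_delete_all() {
  cat > "$1" <<'EOF'
<delete><query>*:*</query></delete>
EOF
}
```

For example, write_delete_all deleteAllCommand.xml creates the file in the current directory, which is where the Ant script expects it if the script is run from its own base directory.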


Customizing Solr installation

The solr server distribution under /example is just that – an example. It should be customized to fit your search requirements. The conf/schema.xml should be changed to model searchable entities of the application, as described in this article.


Multicore configuration and deployment

Multicore configuration allows multiple schemas and indexes in a single solr server process. Multicores are useful when disparate entities with different fields need to be searched using a single server process.

  • The package contains an example multicore configuration in /example/multicore. It contains 2 cores, each with its own schema.xml and solrconfig.xml.
  • Core names and instance directories can be changed in solr.xml.
  • The default multicore schema.xml files are rather simplistic and don’t contain the exhaustive list of field type definitions available in /example/solr/conf/schema.xml. So, copy all files under /example/solr/conf/* into /example/multicore/core0/conf/* and /example/multicore/core1/conf/*
  • Modify the core schema XMLs according to the data they are indexing.
  • The copied solrconfig.xml has a <dataDir> element that points to /example/multicore/data. This is where index and other component data are stored. Since the same solrconfig.xml is copied into both cores, both cores end up pointing to the same data directory and will try to write to the same index, most likely resulting in index corruption. So, just comment out the <dataDir> elements. Each core will then store data in its respective /example/multicore/<coredir>/data.
  • The jar lib directories in the default single core solrconfig.xml don’t match the directory structure of a multicore installation. Those relative paths use solr.home (ie, “/example/solr”) as the base directory. Change the relative paths of /contrib and /dist so that they’re relative to the core’s directory (ie, /example/solr/<coredir>).
  • Finally, the multicore configuration should be made the active configuration, either by specifying “java -Dsolr.home=/example/multicore -jar start.jar” or, preferably, by copying all files under /example/multicore/* into /example/solr, the default Solr home.
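
The copy step in the list above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name copy_conf_to_cores is hypothetical, the directory layout is the one described in this article, and the function does not edit solrconfig.xml. It only lists the lines mentioning the data directory element so that they can be commented out by hand.

```shell
#!/bin/sh
# Hypothetical helper: copies the full single core conf directory into each core
# of the multicore example, then lists <dataDir> lines that need manual review.
# Usage: copy_conf_to_cores <path to example dir> <core name>...
copy_conf_to_cores() {
  example_dir=$1
  shift
  for core in "$@"; do
    core_conf="$example_dir/multicore/$core/conf"
    mkdir -p "$core_conf"
    # "conf/." copies the directory contents rather than the directory itself
    cp -R "$example_dir/solr/conf/." "$core_conf/"
    # Both cores would otherwise share one data directory; show where to comment it out
    if grep -n "<dataDir>" "$core_conf/solrconfig.xml"; then
      echo "$core: comment out the <dataDir> line(s) listed above" >&2
    fi
  done
}
```

With the example package, it would be called as copy_conf_to_cores /example core0 core1. The lib path changes still have to be made by hand in each core's solrconfig.xml.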

Using Solr from command line

The primary method of communicating with Solr is HTTP. An HTTP-capable command line client like curl is useful for this.

Querying: Queries should be sent as

http://localhost:8983/solr/select/?q=<query>

for a single core installation, or as

http://localhost:8983/solr/<core name>/select/?q=<query>

for a multicore installation.
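
A minimal sketch of sending such a query with curl follows. The function names are hypothetical, and the single core URL is assumed to follow the same pattern as the update URL, without the core name. Letting curl encode the query text avoids problems with spaces and special characters in the q parameter.

```shell
#!/bin/sh
# Assumed base URL of the example Jetty deployment.
SOLR_BASE="http://localhost:8983/solr"

# Prints the select handler URL; pass a core name for a multicore installation,
# or nothing for a single core installation.
solr_select_url() {
  if [ -n "${1:-}" ]; then
    echo "$SOLR_BASE/$1/select"
  else
    echo "$SOLR_BASE/select"
  fi
}

# Sends a query. -G makes curl append the urlencoded data to the URL as a
# GET query string, so the query text is escaped by curl.
# Usage: solr_query <query text> [core name]
solr_query() {
  curl -G "$(solr_select_url "${2:-}")" --data-urlencode "q=$1"
}
```

With a server running, solr_query "video science" core0 would query the core named core0, and solr_query "video science" would query a single core installation.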

Inserting or updating documents in a single core installation: Solr’s update handler listens by default on the URL http://localhost:8983/solr/update/ in a single core configuration.

To post an XML file with documents, use the command line:

curl "http://localhost:8983/solr/update/?commit=true" -F "myfile=@updates.xml"
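
The file being posted uses Solr's XML update format, in which documents are wrapped in an add element. The sketch below writes a sample updates.xml with a single document. The function name and the field values are hypothetical, and the field names "id" and "name" are assumptions that must match fields defined in schema.xml.

```shell
#!/bin/sh
# Hypothetical helper: writes a sample update file containing one document.
# The field names "id" and "name" are assumptions; they must exist in schema.xml.
# Usage: write_sample_update <output file>
write_sample_update() {
  cat > "$1" <<'EOF'
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="name">Getting started with Solr</field>
  </doc>
</add>
EOF
}
```

Running write_sample_update updates.xml produces a file that can be posted with the curl command above.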

Inserting or Updating documents in a multi core installation: Each core’s update handler listens by default on the URL: http://localhost:8983/solr/<core name>/update/


Updating with content extraction: The content extracting handler listens on the URL http://localhost:8983/solr/update/extract/ or http://localhost:8983/solr/<core name>/update/extract. Use the command line

curl "http://localhost:8983/solr/update/extract?literal.id=<document id>" -F "myfile=@book.pdf"

where literal.id adds a regular field called "id" to the new document created by the extracting handler.


The query parameters that Solr accepts are documented in the Solr wiki.


Boolean operators in search queries

All Lucene queries are valid in Solr too. However, Solr does provide some additional conveniences.

A default boolean operator can be specified using a <solrQueryParser defaultOperator="AND|OR"/> element in schema.xml.

Each query can also override boolean behaviour using the q.op=AND|OR query param. However, remember that the schema default or q.op affect not just the query terms, but also the facet filter queries.

For example, with AND as the operator, selecting 2 facet values for the same facet field implies that both should be satisfied. This is because internally, a filter query is just a part of the query from Lucene’s point of view.

To restrict boolean logic to just the query terms, use the following syntax:
  • All words should be found: Prefix a + to each word. Example: +video +science (=> only documents that contain both “video” AND “science” are returned)
  • Any one word should be found: This is the default behaviour when queries contain words without any prefix. Example: video science (=> any document which contains either “video” or “science” is returned)
  • Documents which don’t contain a word: Prefix a “-” to each word that should not be present for a successful hit. Example: video -science (=> any document which contains “video” but not “science” is returned).
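
The prefixes above can be assembled mechanically. The following is a minimal sketch with a hypothetical function name, bool_query, that builds a query string from lists of required, optional, and excluded words. It assumes the words themselves contain no spaces or shell wildcard characters.

```shell
#!/bin/sh
# Hypothetical helper: builds a query string in which required words get a "+"
# prefix, excluded words get a "-" prefix, and optional words get no prefix.
# Usage: bool_query "<required words>" "<optional words>" "<excluded words>"
bool_query() {
  q=""
  for w in $1; do q="$q +$w"; done   # every one of these must be present
  for w in $2; do q="$q $w"; done    # any of these may be present
  for w in $3; do q="$q -$w"; done   # none of these may be present
  echo "${q# }"                      # strip the leading space
}

bool_query "video science" "" ""     # prints: +video +science
bool_query "" "video" "science"      # prints: video -science
```

When such a query is sent over HTTP, the + character has to be URL-encoded, because a literal + in a query string is decoded as a space. With curl, passing the query as -G --data-urlencode "q=$(bool_query "video science" "" "")" takes care of that.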