Article relevancy: Apache Solr 1.4.x
Introduction
Apache Solr is a full-fledged search server based on the Lucene toolkit.
Lucene provides the core search algorithms and the index storage those algorithms require. Most basic search requirements can be fulfilled by Lucene alone, without Solr. But using plain Lucene has drawbacks in both development and non-functional aspects, forcing development teams to cover these in their designs. This is where Solr adds value.
Solr provides these benefits over using the raw Lucene toolkit:
- Solr allows search behaviour to be configured through configuration files rather than through code. Specifying search fields, indexing criteria, and indexing behaviour in code is prone to maintenance problems.
- Lucene is Java-centric (though ports to other languages exist). Solr instead exposes an HTTP interface that any platform can use, so projects involving multiple languages or platforms can share the same Solr server.
- Solr provides an out-of-the-box faceted search (also called drilldown search) facility that lets users incrementally refine results using filters and "drill down" towards a narrow set of best matches. Many shopping web portals use this feature to let their users incrementally refine results.
- Solr's query syntax is slightly easier than Lucene's. Either a default field can be specified, or Solr's own dismax syntax can be used, which searches across a fixed set of fields (see the example after this list).
- Solr's Java client API is much simpler and easier than Lucene's. Solr abstracts away many of the underlying Lucene concepts.
- Solr provides a straightforward API for adding, updating, and deleting documents, unlike Lucene.
- Solr supports a pluggable architecture. For example, post-processor plugins (such as search result highlighting) allow raw results to be modified.
- Solr facilitates scalability through caching, memory tuning, clustering, sharding, and load balancing.
- Solr provides plugins to fetch data from databases and index it. This workflow is probably the most common requirement for any search implementation, and Solr provides it out of the box.
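For example, with the standard Lucene syntax the query itself must name the fields to search, while a dismax query searches a configured set of fields. A minimal sketch, assuming hypothetical fields named title and description:

http://localhost:8983/solr/select?q=title:video+OR+description:video
http://localhost:8983/solr/select?defType=dismax&qf=title+description&q=video

The first query names its fields explicitly; in the second, the qf parameter decides which fields the bare term is matched against.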
The following sections describe the basics of deploying Solr and using it from the command line.
Directory layout of the Solr package
The extracted Solr package has this layout:
/client
    Client APIs in different languages for talking to a Solr server
/contrib/clustering
    Plugin that provides clustering capabilities for Solr, using the Carrot2 clustering framework
/contrib/dataimporthandler
    Plugin for indexing data held in databases
/contrib/extraction
    Plugin for extracting text from PDFs, Word documents, etc.
/contrib/velocity
    Handler to present and manipulate search results using Velocity templates
/dist
    Solr core JARs and WARs that can be deployed in servlet containers or elsewhere, and the SolrJ client API for Java clients
/dist/solrj-lib
    Libraries required by the SolrJ client API
/docs
    Offline documentation and javadocs
/lib
    Lucene and other JARs required by Solr
/src
    Source code
/example
    A skeleton standalone Solr server deployment; the default environment is Jetty. When deploying Solr, this is the directory that is customized and deployed.
/example/etc
    Jetty or other environment-specific configuration files
/example/example-DIH
    An example database and the Data Import Handler plugin configuration to index it
/example/exampledocs
    Example XML request files to send to the Solr server. Usage: java -jar post.jar <xml filename>
/example/lib
    Jetty and servlet libraries; not required if Solr is deployed in a different environment
/example/logs
    Solr request logs
/example/multicore
    Example configuration for hosting multiple search cores in the same server, for instance separate indexes for different categories of data
/example/solr
    The main data area of Solr
/example/solr/conf
    Configuration files used by Solr. solrconfig.xml holds configuration parameters, memory tuning, and the different types of request handlers; schema.xml specifies the fields and analyzer configuration for indexing and querying (a sketch follows this layout). Other files hold data required by components such as the stop-word filter.
/example/solr/data
    The actual results of indexing, i.e. the index itself
/example/webapps
    The Solr webapp deployed in Jetty
/example/work
    Scratch directory for the container environment
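Since schema.xml is the file most deployments touch first, here is a minimal sketch of its structure. The type and field names follow the example schema shipped with Solr 1.4, but treat the exact attributes as assumptions to verify against your copy:

<schema name="example" version="1.1">
  <types>
    <!-- untokenized type for exact-match fields such as keys -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <!-- tokenized type for full-text search -->
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>title</defaultSearchField>
</schema>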
Getting Started Guide
1) Copy the skeleton server under /example to the deployment directory.
2) Customize /example/solr/conf/schema.xml as explained in later sections, to model search fields of the application.
3) Start the solr server. For the default Jetty environment, use this command line with current directory set to /example:
java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar
STOP.PORT specifies the port on which the server listens for a stop instruction, and STOP.KEY is a shared secret that must be passed when stopping.
4) If building from source, the WAR will be named something like apache-solr-4.0-snapshot.war. Copy it to /example/webapps and, importantly, rename it to solr.war. Without that renaming, Jetty returns 404 errors for /solr URLs.
5) The Solr server will now be available at http://localhost:8983/solr. 8983 is the default Jetty connector port, as specified in /example/etc/jetty.xml.
6) To stop the server, use the command line:
java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar --stop
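7) To verify the server came up, open the admin page at http://localhost:8983/solr/admin/ in a browser, or from the command line (assuming the /admin/ping handler is enabled in solrconfig.xml, as it is in the shipped example):

curl http://localhost:8983/solr/admin/ping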
Managing the Solr server with Ant during development
Starting and stopping Solr can be done conveniently from an IDE like Eclipse using an Ant script:
<project basedir="." name="ManageSolr">
    <!-- port and shared secret used by Jetty's stop mechanism -->
    <property name="stopport" value="8079"/>
    <property name="stopsecret" value="secret"/>
    <!-- launch the Jetty-based Solr server in a forked JVM -->
    <target name="start-solr">
        <java dir="./dist/solr" fork="true" jar="./dist/solr/start.jar">
            <jvmarg value="-DSTOP.PORT=${stopport}" />
            <jvmarg value="-DSTOP.KEY=${stopsecret}" />
        </java>
    </target>
    <!-- ask the running server to shut down -->
    <target name="stop-solr">
        <java dir="./dist/solr" fork="true" jar="./dist/solr/start.jar">
            <jvmarg value="-DSTOP.PORT=${stopport}" />
            <jvmarg value="-DSTOP.KEY=${stopsecret}" />
            <arg value="--stop" />
        </java>
    </target>
    <target name="restart-solr" depends="stop-solr,start-solr"/>
    <!-- post a delete-all command file to the server using post.jar -->
    <target name="deleteAllDocs">
        <java dir="./dist/solr/exampledocs" fork="true" jar="./dist/solr/exampledocs/post.jar">
            <arg value="${basedir}/deleteAllCommand.xml" />
        </java>
    </target>
</project>
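The deleteAllCommand.xml file referenced by the deleteAllDocs target is not part of the distribution. A minimal version, using Solr's standard XML delete-by-query command, would be:

<delete>
  <query>*:*</query>
</delete>

post.jar commits after posting by default, so the deletion becomes visible immediately.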
Customizing the Solr installation
The Solr server distribution under /example is just that: an example. It should be customized to fit your search requirements. In particular, conf/schema.xml should be changed to model the searchable entities of the application, as described in this article.
Multicore configuration and deployment
Multicore configuration allows multiple schemas and indexes in a single Solr server process. Multicore setups are useful when disparate entities with different fields need to be searched through a single server process.
- The package contains an example multicore configuration in /example/multicore. It contains two cores, each with its own schema.xml and solrconfig.xml.
- Core names and instance directories can be changed in solr.xml (see the sketch after this list).
- The default multicore schema.xml files are rather simplistic and don't contain the exhaustive list of field type definitions available in /example/solr/conf/schema.xml. So, copy all files under /example/solr/conf/* into /example/multicore/core0/conf/* and /example/multicore/core1/conf/*.
- Modify each core's schema.xml according to the data it indexes.
- The copied solrconfig.xml has a <dataDir> element that points to /example/multicore/data. This is where the index and other component data are stored. Since the same solrconfig.xml is copied into both cores, both cores end up pointing to the same data directory and will try to write to the same index, most likely corrupting it. So comment out the <dataDir> elements; each core will then store data in its respective /example/multicore/<coredir>/data.
- The library directories referenced in the default single-core solrconfig.xml don't match the multicore directory structure. Those relative paths use solr home (i.e. /example/solr) as the base directory. Change the relative paths of /contrib and /dist so that they are relative to the core's directory (i.e. /example/solr/<coredir> once the multicore configuration is copied into the default solr home, as described next).
- Finally, make the multicore configuration the active one, either by starting the server with java -Dsolr.solr.home=/example/multicore -jar start.jar, or preferably by copying all files under /example/multicore/* into /example/solr, the default solr home.
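For reference, the solr.xml in the example multicore directory looks roughly like this (a sketch based on the Solr 1.4 example; verify attribute names against your copy):

<solr persistent="false">
  <!-- each core gets a name, used in URLs, and an instance directory -->
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>

Renaming a core here changes the URL it is served under, e.g. /solr/core0/select.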
Using Solr from the command line
The primary way of communicating with Solr is HTTP. An HTTP-capable command-line client such as curl is useful for this.
Querying: Queries are sent as
http://localhost:8983/solr/select/?q=<query>
or, for a multicore installation,
http://localhost:8983/solr/<core name>/select/?q=<query>
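A concrete example, assuming the title field from the schema sketch earlier and a hypothetical category field; the facet parameters illustrate the drilldown feature mentioned in the introduction:

curl "http://localhost:8983/solr/select/?q=title:video&facet=true&facet.field=category&rows=10"

The response is XML by default; adding wt=json switches it to JSON.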
Inserting or updating documents in a single-core installation: Solr's update handler listens by default on http://localhost:8983/solr/update/.
To post an XML file containing documents, use the command line:
curl "http://localhost:8983/solr/update/?commit=true" -F "myfile=@updates.xml"
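The posted file uses Solr's XML update format. A minimal updates.xml, reusing the id and title fields from the schema sketch earlier:

<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title">Introduction to Solr</field>
  </doc>
</add>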
Inserting or updating documents in a multicore installation: each core's update handler listens by default on http://localhost:8983/solr/<core name>/update/.
Updating with content extraction: the content-extraction handler listens on http://localhost:8983/solr/update/extract/ (or http://localhost:8983/solr/<core name>/update/extract/ for multicore). Use the command line:
curl "http://localhost:8983/solr/update/extract?literal.id=book1&commit=true" -F "myfile=@book.pdf"
where literal.id adds a regular field called "id" to the new document created by the extracting handler.
The query parameters that Solr accepts are documented in the Solr wiki.
Boolean operators in search queries
All Lucene queries are valid in Solr too. However, Solr does provide some additional conveniences.
A default boolean operator can be specified using a <solrQueryParser defaultOperator="AND|OR"/> element in schema.xml.
Each query can also override this behaviour using the q.op=AND|OR query parameter. However, remember that the schema default and q.op affect not just the query terms but also the facet filter queries.
For example, selecting two facet values for the same facet field will then require both to be satisfied, because internally a filter query is just another part of the query from Lucene's point of view.
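For example, even with a schema default of OR, the following query returns only documents containing both terms:

http://localhost:8983/solr/select/?q=video+science&q.op=AND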
To restrict boolean logic to just the query terms, use the following prefixes:
- All words must be found: prefix each word with "+". Example: +video +science (only documents containing both "video" and "science" are returned).
- Any one word may match: this is the default behaviour for unprefixed words when the default operator is OR. Example: video science (any document containing either "video" or "science" is returned).
- A word must not be present: prefix it with "-". Example: video -science (any document containing "video" but not "science" is returned).