Archive

Archive for the ‘Search’ Category

Solr on Jetty on Ubuntu

October 14th, 2011 Comments off

This article explains the steps involved in deploying the Apache Solr search engine as a system service on the Jetty servlet container on Ubuntu. It is based on information from the Solr Jetty wiki page and on troubleshooting experiences of others.

Prerequisites:

  • The target system should have at least Java 6 installed (in my case, the OpenJDK 6 JRE is installed)

Steps:

1. In this description, /opt/solr will be the target directory where Solr will be deployed.

 

2. The /example directory in the solr package forms the basis of the installation on the target system. It contains multiple configurations, each suitable for a different use case:

/example-DIH : a multicore configuration with each core demonstrating a different data importing configuration

/multicore : a simple multicore installation

/solr : a basic single core configuration.

Copy the configuration suitable for your application into /example/solr (replacing the one already there if necessary) and discard the rest. A configuration typically consists of /conf and /data (and sometimes also /bin and /lib) subdirectories.

 

Additionally, the /dist and /contrib package directories contain important jars required by some of these configurations:

/dist/apache-solr-dataimporthandler*.jar – if you require data importing capabilities.

/dist/apache-solr-cell-*.jar and /contrib/extraction/lib/*.jar – if you require content extraction from PDF, MS Office and other document files.

These jars should also be deployed on the target system.

 

3. Copy these files to the target system and create the directory structure suggested below under /opt/solr:

|-- dist - All required jars, including additional jars from /contrib
|-- etc - this should probably go into the root /etc directory, as per conventions
|   |-- jetty.xml
|   `-- webdefault.xml
|-- lib
|-- solr
|   |-- bin
|   |-- conf
|   |   |-- admin-extra.html
|   |   |-- dataimport.properties
|   |   |-- elevate.xml
|   |   |-- protwords.txt
|   |   |-- schema.xml
|   |   |-- scripts.conf
|   |   |-- solrconfig.xml
|   |   |-- stopwords.txt
|   |   |-- synonyms.txt
|   |   `-- xml-data-config.xml
|   |-- data
|-- start.jar
|-- webapps
|   `-- solr.war
`-- work

 

4. The solr process should run under its own dedicated credentials, so that permissions can be administered at a fine granularity. So create a system user and group named ‘solr’.

$ sudo adduser --system solr
$ sudo addgroup solr
$ sudo adduser solr solr

5. Create a log directory /var/log/solr for solr and jetty logs.

6. Jetty writes its errors to STDERR by default. Redirect them to a rolling log file by adding this section to /opt/solr/etc/jetty.xml:

    <!-- =========================================================== -->
    <!-- configure logging                                           -->
    <!-- =========================================================== -->
    <New id="ServerLog" class="java.io.PrintStream">
      <Arg>
        <New class="org.mortbay.util.RolloverFileOutputStream">
          <Arg><SystemProperty name="jetty.logs" default="/var/log/solr"/>/yyyy_mm_dd.stderrout.log</Arg>
          <Arg type="boolean">false</Arg>
          <Arg type="int">90</Arg>
          <Arg><Call class="java.util.TimeZone" name="getTimeZone"><Arg>GMT</Arg></Call></Arg>
          <Get id="ServerLogName" name="datedFilename"/>
        </New>
      </Arg>
    </New>
    <Call class="org.mortbay.log.Log" name="info"><Arg>Redirecting stderr/stdout to <Ref id="ServerLogName"/></Arg></Call>
    <Call class="java.lang.System" name="setErr"><Arg><Ref id="ServerLog"/></Arg></Call>
    <Call class="java.lang.System" name="setOut"><Arg><Ref id="ServerLog"/></Arg></Call>

 

7. Now set file and directory permissions so that the solr process user can work correctly.

Use chown to make solr:solr the owner and group:

 

$ sudo chown -R solr:solr /opt/solr
$ sudo chown -R solr:solr /var/log/solr


Use chmod to give the solr user write permissions on the following directories:

/opt/solr/solr/data

/opt/solr/work

/var/log/solr
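
For example (assuming the directory layout above, where the index lives under /opt/solr/solr/data), one way to do this is:

$ sudo chmod -R u+w /opt/solr/solr/data /opt/solr/work /var/log/solr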

 

8. The basic installation should work now. Try it by launching Jetty as a regular process:

 

/opt/solr$ sudo java -Dsolr.solr.home=/opt/solr/solr -jar start.jar

 

This should start solr.

Verify that logs are getting generated under /var/log/solr.

Test it by sending a query to http://localhost:8983/solr/select?q=something using curl.
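
For example (the wildcard query below is just an illustration; it matches all documents):

$ curl "http://localhost:8983/solr/select?q=*:*"

A well-formed XML response indicates the installation is working.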

 

9. Now we need to install solr as a system daemon so that it can start automatically. Download the jetty.sh startup script (link courtesy http://wiki.apache.org/solr/SolrJetty), save it as /etc/init.d/solr, and make it executable.
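
For example:

$ sudo chmod +x /etc/init.d/solr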

The following environment variables need to be set. They can either be inserted in this /etc/init.d/solr script itself, or they can be stored in /etc/default/jetty, which is read by the script.

 

JAVA_HOME=/usr/lib/jvm/default-java

JAVA_OPTIONS="-Xmx64m -Dsolr.solr.home=/opt/solr/solr"

JETTY_HOME=/opt/solr

JETTY_USER=solr

JETTY_GROUP=solr

JETTY_LOGS=/var/log/solr

 

Set the -Xmx parameter as per your requirements.

 

10. Additionally, this startup script has a problem that prevents it from running in Ubuntu. If you try running this right now using

 

$ sudo /etc/init.d/solr

 

you’ll get a

Starting Jetty: FAILED

error.

 

The problem – as explained well in this troubleshooting article – is in this line that attempts to start the daemon:

 

if start-stop-daemon -S -p"$JETTY_PID" $CH_USER -d"$JETTY_HOME" -b -m -a "$JAVA" -- "${RUN_ARGS[@]}" --daemon

 

In Ubuntu, --daemon is not a valid option for start-stop-daemon. Remove that option from the script:

if start-stop-daemon -S -p"$JETTY_PID" $CH_USER -d"$JETTY_HOME" -b -m -a "$JAVA" -- "${RUN_ARGS[@]}"

 

If you try starting it now, it should work:

$ sudo /etc/init.d/solr

 

It should give a

Starting Jetty: OK

message, and ps -ef |grep java should show the "java -jar start.jar" process.

 

11. Finally, it’s time to configure this as an init script. Read this article if you want a background on Ubuntu runlevels and init scripts.

Insert these lines at the top of /etc/init.d/solr to make it an LSB (Linux Standard Base) compliant init script. Without these lines, it’s not possible to configure the run level scripts.

### BEGIN INIT INFO

# Provides:          solr

# Required-Start:    $local_fs $remote_fs $network

# Required-Stop:     $local_fs $remote_fs $network

# Should-Start:      $named

# Should-Stop:       $named

# Default-Start:     2 3 4 5

# Default-Stop:      0 1 6

# Short-Description: Start Solr.

# Description:       Start the solr search engine.

### END INIT INFO

 

Now run the following command:

$ sudo update-rc.d solr defaults
 Adding system startup for /etc/init.d/solr ...
   /etc/rc0.d/K20solr -> ../init.d/solr
   /etc/rc1.d/K20solr -> ../init.d/solr
   /etc/rc6.d/K20solr -> ../init.d/solr
   /etc/rc2.d/S20solr -> ../init.d/solr
   /etc/rc3.d/S20solr -> ../init.d/solr
   /etc/rc4.d/S20solr -> ../init.d/solr
   /etc/rc5.d/S20solr -> ../init.d/solr

As you can see, the run levels 2-5 (they are equivalent in Ubuntu) are now configured to start solr.

Categories: Search, Ubuntu

Content Extraction in Solr

November 28th, 2010 Comments off

Overview

The example solrconfig.xml is already configured for content extraction from any document format that Apache Tika can handle, such as MS Word DOC and PDF.

Content extraction requires libraries found in the /contrib/extraction directory. These include Solr Cell, Apache Tika and Apache POI libraries.

The ExtractingRequestHandler configuration in solrconfig.xml specifies the endpoint at which documents can be submitted for extraction. It’s usually http://localhost:8983/solr/update/extract.

 

Howto

  • To index a document, send the request as

curl "http://localhost:8983/solr/update/extract?literal.id=book1&commit=true" -F "myfile=@book.pdf"

The request goes as a multi-part form encoding.

  • By default, document contents are added to the document field "text". The field can be changed in /solr/conf/solrconfig.xml in the extracting handler’s <requestHandler> element; it has a child element "fmap.content" that specifies which field the content should be indexed under:
  • <str name="fmap.content">text</str>

Since “text” is NOT a stored field, features like result highlighting won’t be available.

If result highlighting is required, modify /solr/conf/schema.xml to include a new *stored* field called "doc_content" which receives document contents from the extracting handler. "doc_content" itself can be copied into the "text" catch-all field so that all queries can be matched against document contents.
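
A minimal sketch of such a schema.xml change (the field name doc_content is only illustrative, and the fmap.content mapping in solrconfig.xml must be changed to match it):

<field name="doc_content" type="text" indexed="true" stored="true" multiValued="true"/>
<copyField source="doc_content" dest="text"/>

and, in the ExtractingRequestHandler configuration: <str name="fmap.content">doc_content</str>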

 

Restrictions of default content extraction

  • Since the extracting handler can specify only a single content field, the contents of multiple files will all go into the same content field. This is a problem if the file containing the search string has to be indicated to the user.
  • There is no out-of-the-box workaround for this in Solr. You have to write a specialized extracting handler that maps each file ("content stream" in Solr terminology) in the multipart request to a separate content field.

Solr search data modelling

November 28th, 2010 Comments off

Overview

Searchable entities of an application need to be modelled as Solr documents and fields for them to be searchable by Solr.

The schema.xml in /solr/conf is where the application search model should be defined.

The <types> element defines the set of field types available in the model.

The <fields> element defines the set of fields of each document in the model. Each field has a type which is defined in the <types> element.

 

<types> section

This section describes the types for all fields in the model. It contains <fieldType> elements. Each <fieldType> has these attributes:

  • name is the name of the field type definition and is referred from the <fields> section
  • class is the subclass of org.apache.solr.schema.FieldType that models this field type definition. Class names starting with "solr" refer to Java classes in the standard Solr packages (the field type classes live in org.apache.solr.schema).
  • sortMissingLast and sortMissingFirst
    The optional sortMissingLast and sortMissingFirst attributes are currently supported on types that are sorted internally as strings. This includes "string","boolean","sint","slong","sfloat","sdouble","pdate"
    - If sortMissingLast="true", then a sort on this field will cause documents without the field to come after documents with the field, regardless of the requested sort order (asc or desc).
    - If sortMissingFirst="true", then a sort on this field will cause documents without the field to come before documents with the field, regardless of the requested sort order.
    - If sortMissingLast="false" and sortMissingFirst="false" (the default), then default lucene sorting will be used which places docs without the field first in an ascending sort and last in a descending sort.

  • omitNorms is set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.
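
For example, the stock example schema declares field types along these lines (a representative sketch, not the complete list):

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>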

Each field type definition has an associated Analyzer to tokenize and filter characters or tokens.

The Trie field types are suitable for numeric fields that involve numeric range queries. The trie concept makes searching such fields faster.

 

Basic field types

string Fields of this type are not analyzed (ie, not tokenized or filtered), but are indexed and stored verbatim.
binary For binary data. Should be sent/retrieved as Base64 encoded strings.
int/tint/pint, long/tlong/plong, float/tfloat/pfloat, double/tdouble/pdouble The regular types (int, float, etc.) and their t- versions differ in their precisionStep values. The precisionStep value is used to generate indexes at different precision levels, to support numeric range queries. Both sets are modelled by TrieField types, but the t- versions have a precisionStep of 8 while the regular types have 0. So numeric range queries will be faster with the t- versions, but indexes will be larger (and probably slower). The p- versions are for when numeric range queries are not needed at all; they are modelled by non-Trie types.
date/tdate/pdate Similar to the above differences in numeric fields. Use tdate for date ranges and date faceting. Dates have to be in a special UTC timezone format, for example: 2011-02-06T05:34:00.299Z. Use org.apache.solr.common.util.DateUtil.getThreadLocalDateFormat().format(new Date()) to get a date in this format.
sint/slong/sfloat/sdouble Sortable fields.

Text field types Being a full-text search solution, the text field types and their configuration become the most critical part of the modelling. Modelling of text fields is explained in detail in the article Solr text field types, analyzers, tokenizers & filters explained.

 

<fields> section

Fields of documents are described in this section using <field> elements.

Each <field> element can have these attributes:

name (mandatory) the name for the field. Very critical information, used in search queries, facet fields.
type (mandatory) the name of a previously defined type from the <types> section
indexed true if this field should be indexed (should be searchable or sortable)
stored true if this field value should be retrievable verbatim in search results.
compressed [false] if this field should be stored using gzip compression (this will only apply if the field type is compressible; among the standard field types, only TextField and StrField are). This is very useful for large data fields, but will probably slow down search results – so it should not be used for fields that involve frequent querying.
multiValued true if this field may contain multiple values per document
omitNorms (expert) set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory).  Only full-text fields or fields that need an index-time boost need norms.
termVectors [false] set to true to store the term vector for a given field. When using MoreLikeThis, fields used for similarity should be stored for best performance.
termPositions Store position information with the term vector.  This will increase storage costs.
termOffsets Store offset information with the term vector. This will increase storage costs.
default a value that should be used if no value is specified when adding a document.

The example deployment itself defines many commonly used fields and types; study them and check if something needed is already available before modelling your own.

<dynamicField> elements can be used to model field names which are not explicitly defined by name, but which match some defined pattern.

<copyField> definitions specify to copy one field to another at the time a document is added to the index.  It’s used either to index the same field differently, or to add multiple fields to the same field for easier/faster searching. For example, all text fields in the document can be copied to a single catch-all field, for faster querying.

<uniqueKey> element specifies the field to be used to determine and enforce document uniqueness.

<defaultSearchField> element specifies the field to be queried when it’s not explicitly specified in the query string using a “field:value” syntax. The catch-all copyfield is usually specified as the default search field.

<solrQueryParser> specifies the query parser configuration. defaultOperator="AND|OR" specifies whether query terms are combined using the AND or the OR operator.
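
Putting these elements together, a minimal schema fragment might look like this (the field names are only illustrative; the types are the ones discussed above):

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="name" type="textgen" indexed="true" stored="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
</fields>

<copyField source="name" dest="text"/>
<uniqueKey>id</uniqueKey>
<defaultSearchField>text</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>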

Faceting – or drilldown – search using Solr

November 26th, 2010 Comments off

Overview

Faceted searching – also called drilldown searching – refers to incrementally refining search results by different criteria at each level. Popular e-shopping sites like Amazon and eBay provide this in their search pages.

Solr has excellent support for faceting. The sections below describe how to use faceting in Java applications, using the SolrJ client API.

 

Steps

Step 1 : Do the first level search and get first level facets

SolrQuery qry = new SolrQuery(strQuery);
String[] fetchFacetFields = new String[]{"categories"};
qry.setFacet(true);
qry.addFacetField(fetchFacetFields);
qry.setIncludeScore(true);
qry.setShowDebugInfo(true);
QueryRequest qryReq = new QueryRequest(qry); 

QueryResponse resp = qryReq.process(solrServer);  

SolrDocumentList results = resp.getResults();
int count = results.size();
System.out.println(count + " hits");
for (int i = 0; i < count; i++) {
    SolrDocument hitDoc = results.get(i);
    System.out.println("#" + (i+1) + ":" + hitDoc.getFieldValue("name"));
    for (Iterator<Entry<String, Object>> flditer = hitDoc.iterator(); flditer.hasNext();) {
        Entry<String, Object> entry = flditer.next();
        System.out.println(entry.getKey() + ": " + entry.getValue());
    }
} 

List<FacetField> facetFields = resp.getFacetFields();
for (int i = 0; i < facetFields.size(); i++) {
    FacetField facetField = facetFields.get(i);
    List<Count> facetInfo = facetField.getValues();
    for (FacetField.Count facetInstance : facetInfo) {
        System.out.println(facetInstance.getName() + " : " + facetInstance.getCount() + " [drilldown qry:" + facetInstance.getAsFilterQuery() + "]");
    }
}

 

The response will contain the number of hits for each instance of the facet.

For example, if the field categories has values movies and songs in the set of matching hits, then each of them is called a facet instance. 

Each facet instance of a FacetField has a name (“songs”), and each has an associated facet instance count and a filter query.

Facet instance count of 10 for “categories:songs” means in the set of all search results, 10 results have the value of categories as songs.

Facet instance filter query is the subquery to go down to the next level of drilldown search, by filtering on the facet instance value.

At this point, a typical drilldown search user interface would display, in a left sidebar, those facet instances that have a nonzero instance count, along with checkboxes and their respective counts. The user can then select the most promising facet to drill down along by checking its checkbox.

 

Step 2: Add facet filter query for next level of refined results

Add the filter query of facet instance to the main query, using addFilterQuery.

The filter query for a single facet instance has the format "<field>:<value>". Example: addFilterQuery("categories:movies");

// filterQueries is a String[] of facet filter queries got using getAsFilterQuery() from previous search
SolrQuery qry = new SolrQuery(strQuery);
if (filterQueries != null) {
    for (String fq : filterQueries) {
        qry.addFilterQuery(fq);
    }
}
qry.setFacet(true);
qry.addFacetField(fetchFacetFields);
qry.setIncludeScore(true);
qry.setShowDebugInfo(true);
QueryRequest qryReq = new QueryRequest(qry);
QueryResponse resp = qryReq.process(solrServer);

For subsequent levels of refinement, add facet instance filter queries to the current level’s main query, and add the list of facet fields required for the next level.

 

Facet filter query syntax

The facet filter queries have some rather intricate syntaxes for achieving various search behaviours, which are described below.

 

Selecting multiple facets

In some drilldown search designs, a user is allowed to specify multiple facet instances for the same field. For example, a categories field may have multiple category facet instances. In such cases, the facet instances should be combined using an OR operator.

Categories [ ]

  Movies (300) [ ]

  Songs (400) [ ]

  Ads (150) []

 

If the user selects "Movies" and "Songs", the filter query should have the semantics of an OR operator –

“..where category=movies OR category=songs”.

This can be specified in solr filter queries by enclosing the facet instances inside parentheses:

<fqfield>:(value1 value2 value3…)

examples:

In command line URL :

fq=categories%3A%28songs+movies%29

where %3A is the character ‘:’, %28 is the character ‘(’ and %29 is the character ‘)’

OR, equivalently

In java

qry.addFilterQuery("categories:(songs movies)");

Whitespaces in facet instances

If facet instances have whitespace within them, then each facet instance should additionally be enclosed in double quotes (%22).

For example, for a facet field "crn" with facet instances “M.Tech. Computer Sc. & Engg.” and “ELECTRICAL ENGINEERING” (note the whitespaces), the syntax:

In URLs:

fq=crn%3A%28%22M.Tech.+Computer+Sc.+%26+Engg.%22+%22ELECTRICAL+ENGINEERING%22%29

OR

In Java:

qry.addFilterQuery("crn:("M.Tech. Computer Sc. & Engg." "ELECTRICAL ENGINEERING")");

 

 

Handling large number of facet values using pagination

Solr provides pagination for facet values and automatically imposes a limit on the number of values returned for each facet field. This limit can be set using the facet.limit query parameter, or setFacetLimit() API, and the facet value offset can be set using facet.offset query parameter.

However, there is no direct API like setFacetOffset() in SolrJ. Instead, use

solrQry.add(FacetParams.FACET_OFFSET, "100");
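
Putting the two together in SolrJ, a sketch (the facet field name is an assumption; FacetParams is in org.apache.solr.common.params):

SolrQuery qry = new SolrQuery("*:*");
qry.setFacet(true);
qry.addFacetField("categories");             // hypothetical facet field
qry.setFacetLimit(20);                       // facet.limit - how many facet values per "page"
qry.add(FacetParams.FACET_OFFSET, "100");    // facet.offset - where this "page" of values starts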

 

 

Facet Query vs Filter Query of facet

The Solr API also contains methods that refer to "facet queries". It’s important not to confuse facet queries with filter queries of facets. At first glance, it looks like the facet query concept is what provides the drilldown capability, but that is not so.

Facet query is a kind of dynamic facet field, applicable only to certain use cases where it makes sense to categorize items in ranges – either numerical or date ranges.

For example, if items have to be categorized into price ranges like [$100-$200], [$200-$300] etc, then facet queries have to be used to “get the count of all items whose price>$100 and price<$200”. Just specifying the price field as a facet field would not be useful here, because it just returns the list of all unique prices available in the search results. What really provides the drilldown capabilities in this case is the facet query concept.

Facet queries are specified using the syntax field:[start TO end]. In URL, it should go in encoded format :

facet.query=age:[20+TO+22]

In API, it’s specified as

solrQuery.addFacetQuery("age:[20 TO 22]");
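
The counts come back from QueryResponse.getFacetQuery(), keyed by the query string. A short sketch, assuming a numeric price field:

SolrQuery qry = new SolrQuery("*:*");
qry.setFacet(true);
qry.addFacetQuery("price:[100 TO 200]");     // hypothetical price ranges
qry.addFacetQuery("price:[200 TO 300]");
QueryResponse resp = new QueryRequest(qry).process(solrServer);
// map of facet query string -> number of matching documents
Map<String, Integer> rangeCounts = resp.getFacetQuery();
System.out.println(rangeCounts.get("price:[100 TO 200]"));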

 

Understanding facet counts

The facet counts are always in the context of the set of search results of main query + filter queries.

Embedded Solr

November 25th, 2010 Comments off

A java application running in a JVM can use the EmbeddedSolrServer to host Solr in the same JVM.

Following snippet shows how to use it:

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class EmbeddedServerExplorer {
    public static void main(String[] args) {
        try {
            // Set "solr.solr.home" to the directory under which /conf and /data are present.
            System.setProperty("solr.solr.home", "solr");
            CoreContainer.Initializer initializer = new CoreContainer.Initializer();
            CoreContainer coreContainer = initializer.initialize();
            EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "embeddedDoc1");
            doc.addField("name", "test embedded server");
            server.add(doc);
            server.commit();
            coreContainer.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Categories: Search

Using Solr from java applications with SolrJ

November 23rd, 2010 Comments off

Overview

SolrJ provides Java wrappers and adaptors to communicate with Solr and translate its results into Java objects. Using SolrJ is much more convenient than using raw HTTP and JSON. Internally, SolrJ uses Apache HttpClient to send HTTP requests.

 

Important classes

The SolrJ API is fairly simple and intuitive; the snippets below show the important classes in use.


 

Setup the client connection to server

solrServer = new CommonsHttpSolrServer("http://localhost:8983/solr");
solrServer.setParser(new XMLResponseParser());

The response parser in the Java client API can be either XML or binary. In other language APIs, JSON is also possible.

 

Add or update document(s)

SolrInputDocument doc = new SolrInputDocument();
// Add fields. The field names should match fields defined in schema.xml
doc.addField(FLD_ID, docId++);
try {
    solrServer.add(doc);
    return true;
} catch (Exception e) {
    LOG.error("addItem error", e);
    return false;
}

Commit changes

For best performance, commit changes only after all documents – or a reasonably sized batch – have been added/updated.

solrServer.commit();

Send a search query

SolrQuery qry = new SolrQuery("name:video");
qry.setIncludeScore(true);
qry.setShowDebugInfo(true);
qry.setRows(100);
QueryRequest qryReq = new QueryRequest(qry);
QueryResponse resp = qryReq.process(solrServer);

SolrQuery.setRows() specifies how many results to return in the response. The actual count of all hits may be much higher. If "field:" is omitted from the query string, then the field specified by <defaultSearchField> in schema.xml is searched.

Handle search results

SolrDocumentList results = resp.getResults();
System.out.println(results.getNumFound() + " total hits");
int count = results.size();
System.out.println(count + " received hits");
for (int i = 0; i < count; i++) {
    SolrDocument hitDoc = results.get(i);
    System.out.println("#" + (i+1) + ":" + hitDoc.getFieldValue("name"));
    for (Iterator<Entry<String, Object>> flditer = hitDoc.iterator(); flditer.hasNext();) {
        Entry<String, Object> entry = flditer.next();
        System.out.println(entry.getKey() + ": " + entry.getValue());
    }
}

SolrDocumentList.getNumFound() is the total number of hits in the index, but each response returns only as many results as specified by SolrQuery.setRows(). These two values can be used for pagination.
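
For example, a sketch of page-by-page retrieval (the page size and query are assumptions):

int pageSize = 20;
SolrQuery qry = new SolrQuery("name:video");
qry.setRows(pageSize);
for (int start = 0; ; start += pageSize) {
    qry.setStart(start);                     // offset into the full result set
    QueryResponse resp = new QueryRequest(qry).process(solrServer);
    SolrDocumentList page = resp.getResults();
    // ... render this page of results ...
    if (start + pageSize >= page.getNumFound()) {
        break;                               // no more pages
    }
}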

Categories: Search

Getting started with Solr

November 23rd, 2010 Comments off

Introduction

Apache Solr is a full-fledged search server based on the Lucene toolkit.

Lucene provides the core search algorithms and index storage required by those algorithms. Most basic search requirements can be fulfilled by Lucene itself without requiring Solr. But using plain Lucene has some drawbacks in development and non-functional aspects, forcing development teams to cover these in their designs. This is where Solr adds value.

Solr provides these benefits over using the raw Lucene toolkit:

  • Solr allows search behaviour to be configured through configuration files, rather than through code. Specifying search fields, indexing criteria, and indexing behaviour in code is prone to maintenance problems.
  • Lucene is Java-centric (though it has ports to other languages). Solr, however, provides an HTTP interface that allows any platform to use it. Projects that involve multiple languages or platforms can use the same Solr server.
  • Solr provides an out-of-the-box faceted search (also called drilldown search) facility, that allows users to incrementally refine results using filters and "drilldown" towards a narrow set of best matches. Many shopping web portals use this feature to allow their users to incrementally refine their results.
  • Solr’s query syntax is slightly easier than Lucene’s: either a default field can be specified, or Solr’s own dismax syntax can be used to search a fixed set of fields.
  • Solr’s java client API is much simpler and easier than Lucene’s. Solr abstracts away many of the underlying Lucene concepts.
  • Solr provides a straightforward API for adding, updating, and deleting documents, unlike Lucene.
  • Solr supports a pluggable architecture. For example, post-processor plugins (such as search results highlighting) allow raw results to be modified.
  • Solr facilitates scalability using solutions like caching, memory tweaking, clustering, sharding and load balancing.
  • Solr provides plugins to fetch database data and index them. This workflow is probably the most common requirement for any search implementation, and solr provides it out-of-the-box.

The following sections describe basics of deploying Solr and using it from command line.

 

Directory layout of Solr package

Extracted Solr package has this layout:

/client Contains client APIs in different languages to talk to a Solr server
/contrib/clustering Plugin that provides clustering capabilities for Solr, using Carrot2 clustering framework
/contrib/dataimporthandler Plugin that is useful for indexing data in databases
/contrib/extraction Plugin that is useful for extracting text from PDFs, Word DOCs, etc.
/contrib/velocity Handler to present and manipulate search results using velocity templates.
/dist Contains Solr core jars and wars that can be deployed in servlet containers or elsewhere, and the solrj client API for java clients.
/dist/solrj-lib Libraries required by solrj client API .
/docs Offline documentation and javadocs
/lib Contains Lucene and other jars required by Solr
/src Source code
/example A skeleton standalone solr server deployment. The default environment is Jetty. When deploying Solr, this is the directory that’s customized and deployed.
/example/etc Jetty or other environment specific configuration files go here
/example/example-DIH An example DB and the Data Import Handler plugin configuration to index that DB
/example/exampledocs Example XML request files to send to the Solr server. Usage: java -jar post.jar <xml filename>
/example/lib Jetty and servlet libraries. Not required if Solr is being deployed in a different environment
/example/logs Solr request logs
/example/multicore It’s possible to host multiple search cores in the same environment. Use case could be separate indexes for different categories of data.
/example/solr This is the main data area of Solr.
/example/solr/conf Contains configuration files used by Solr.

solrconfig.xml – Configuration parameters, memory tuning, different types of request handlers.

schema.xml – Specifies fields and analyzer configuration for indexing and querying. Other files contain data required by different components like the Stop word filter.

/example/solr/data This contains the actual results of indexing.
/example/webapps The solr webapp deployed in Jetty
/example/work Scratch directory for the container environment


Getting Started Guide

1) Copy the skeleton server under /example to the deployment directory.

2) Customize /example/solr/conf/schema.xml as explained in later sections, to model search fields of the application.

3) Start the solr server. For the default Jetty environment, use this command line with current directory set to /example:

java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar

The STOP.PORT specifies the port on which the server should listen for a stop instruction, and STOP.KEY is a secret key that has to be passed while stopping.

4) If building from source, the WAR will be named something like apache-solr-4.0-SNAPSHOT.war. Copy it to /webapps and, importantly, rename it to solr.war. Without that renaming, Jetty will give 404 errors for /solr URLs.

5) The solr server will now be available at http://localhost:8983/solr. 8983 is the default Jetty connector port, as specified in /example/etc/jetty.xml.

6) To stop the server, use the command line:

java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar --stop

 

 

Managing solr server with ant during development

Starting and stopping solr can be conveniently done from an IDE like Eclipse using an Ant script:

<project basedir="." name="ManageSolr">
<property name="stopport" value="8079"></property>
<property name="stopsecret" value="secret"></property>

<target name="start-solr">
	<java dir="./dist/solr" fork="true" jar="./dist/solr/start.jar">
		<jvmarg value="-DSTOP.PORT=${stopport}" />
		<jvmarg value="-DSTOP.KEY=${stopsecret}" />
	</java>
</target>

<target name="stop-solr">
	<java dir="./dist/solr" fork="true" jar="./dist/solr/start.jar">
		<jvmarg value="-DSTOP.PORT=${stopport}" />
		<jvmarg value="-DSTOP.KEY=${stopsecret}" />
		<arg value="--stop" />
	</java>
</target>

<target name="restart-solr" depends="stop-solr,start-solr">
</target>

<target name="deleteAllDocs">
	<java dir="./dist/solr/exampledocs" fork="true" jar="./dist/solr/exampledocs/post.jar">
		<arg value="${basedir}/deleteAllCommand.xml" />
	</java>
</target>
</project>

 

Customizing Solr installation

The solr server distribution under /example is just that – an example. It should be customized to fit your search requirements. The conf/schema.xml should be changed to model searchable entities of the application, as described in this article.

 

Multicore configuration and deployment

Multicore configuration allows multiple schemas and indexes in a single solr server process. Multicores are useful when disparate entities with different fields need to be searched using a single server process.

  • The package contains an example multicore configuration in /example/multicore.  It contains 2 cores, each with its own schema.xml and solrconfig.xml.
  • Core names and instance directories can be changed in solr.xml.
  • The default multicore schema.xmls are rather simplistic and don’t contain the exhaustive list of field type definitions available in /example/solr/conf/schema.xml. So, copy all files under /example/solr/conf/* into /example/multicore/core0/conf/* and /example/multicore/core1/conf/*.
  • Modify the core schema XMLs according to the data they are indexing.
  • The copied solrconfig.xml has a <dataDir> element that points to /example/multicore/data. This is where the index and other component data are stored. Since the same solrconfig.xml is copied into both cores, both cores end up pointing to the same data directory and will try to write to the same index, most likely resulting in index corruption. So, just comment out the <dataDir> elements. Each core will then store its data in its respective /example/multicore/<coredir>/data.
  • The jar lib directories in the default single-core solrconfig.xml don’t match the default directory structure of a multicore setup. Those relative paths are resolved with solr home (ie, /example/solr) as the base directory. Change the relative paths of /contrib and /dist such that they’re relative *to the core’s directory* (ie, /example/solr/<coredir>).
  • Finally, the multicore configuration should be made the active configuration, either by specifying "java -Dsolr.solr.home=/example/multicore -jar start.jar", or preferably, by copying all files under /example/multicore/* into /example/solr, the default solr home.

Using Solr from command line

The primary method of communicating with Solr is HTTP. An HTTP-capable command line client like curl is useful for this.

Querying: Queries should be sent as

http://localhost:8983/solr/select/?q=<query>

or

http://localhost:8983/solr/<core name>/select/?q=<query>

for multicore installation
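
For example, to fetch the top 10 results for a hypothetical name field as JSON (wt selects the response format):

curl "http://localhost:8983/solr/select/?q=name:video&rows=10&wt=json"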

Inserting or Updating documents in a single core installation: Solr update handler listens by default on the URL: http://localhost:8983/solr/update/ in a single core configuration.

To post an XML file with documents, use command line

curl "http://localhost:8983/solr/update/?commit=true" -F "myfile=@updates.xml"

Inserting or Updating documents in a multi core installation: Each core’s update handler listens by default on the URL: http://localhost:8983/solr/<core name>/update/

 

Updating with content extraction: Content extracting handler listens on the URL http://localhost:8983/solr/update/extract/ or http://localhost:8983/solr/<core name>/update/extract. Use the command line

curl "http://localhost:8983/solr/update/extract?literal.id=book1&commit=true" -F "myfile=@book.pdf"

where literal.id adds a regular field called "id" to the new document created by extracting handler.

 

The query parameters that Solr accepts are documented in Solr wiki.

 

Boolean operators in search queries

All Lucene queries are valid in Solr too. However, solr does provide some additional conveniences.

A default boolean operator can be specified using a <solrQueryParser defaultOperator="AND|OR"/> element in schema.xml.

Each query can also override the boolean behaviour using the q.op=AND|OR query param. However, remember that the schema default or q.op affects not just the query terms, but also the facet filter queries.

For example, selecting 2 facet values for the same facet field will now imply that both should be satisfied. This is because internally, a filter query is just a part of the query from Lucene point of view.

    To restrict boolean logic to just the query terms, use the following syntax:
  • All words should be found: Prefix a + in front of each word. Example: +video +science (=> only documents that contain both "video" AND "science" are returned)
  • Any one word should be found: This is the default behaviour when queries contain words without any prefix. Example: video science (=> any document which contains either "video" or "science" is returned)
  • Documents which don’t contain a word: Prefix a "-" in front of each word that should not be present for a successful hit. Example: video -science (=> any document which contains "video" but not "science" is returned).

Solr text field types, analyzers, tokenizers & filters explained

November 23rd, 2010 Comments off

Overview

Solr’s world view consists of documents, where each document consists of searchable fields. The rules for searching each field are defined using field type definitions. A field type definition describes the analyzers, tokenizers and filters which control searching behaviour for all fields of that type.

 

When a document is added/updated, its fields are analyzed and tokenized, and those tokens are stored in solr’s index. When a query is sent, the query is again analyzed, tokenized and then matched against tokens in the index. This critical function of tokenization is handled by Tokenizer components.

 

In addition to tokenizers, there are TokenFilter components, whose job is to modify the token stream.

There are also CharFilter components, whose job is to modify individual characters. For example, HTML text can be filtered to modify HTML entities like &amp; to regular &.

 

Defining text field types in schema.xml

Here’s a typical text field type definition:

    <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

What this type definition specifies is:

  • When indexing a field of this type, use an analyzer composed of
    • a WhitespaceTokenizerFactory object
    • a StopFilterFactory
    • a WordDelimiterFilterFactory
    • a LowerCaseFilterFactory
  • When querying a field of this type, use an analyzer composed of
    • a WhitespaceTokenizerFactory object
    • a SynonymFilterFactory
    • a StopFilterFactory
    • a WordDelimiterFilterFactory
    • a LowerCaseFilterFactory

If there is only one analyzer element, then the same analyzer is used for both indexing and querying.

      It’s important to use the same or similar analyzers that process text in a compatible manner at index and query time. For example, if an indexing analyzer lowercases words, then the query analyzer should do the same to enable finding the indexed words.

     

    Under the hood

    Solr builds a TokenizerChain instance for each of these analyzers. A TokenizerChain is composed of 1 TokenizerFactory instance, 0-n TokenFilterFactory instances, and 0-n CharFilterFactory instances. These factory instances are responsible for creating their respective objects from the Lucene framework. For example, a TokenizerFactory creates a Lucene Tokenizer; its concrete implementation WhitespaceTokenizerFactory creates a Lucene WhitespaceTokenizer.

    A TokenizerChain works as follows:

    • Raw input is provided by a Reader instance
    • CharReader (is-a CharStream) wraps the raw Reader
    • Each CharFilterFactory creates a character filter that modifies input CharStream and outputs a CharStream. So CharFilterFactories can be chained.
    • TokenizerFactory creates a Tokenizer from the CharStream.
    • Tokenizer is-a TokenStream, and can be passed to TokenFilterFactories.
    • Each TokenFilterFactory modifies the token stream and outputs another TokenStream. So these can be chained.

    Commonly used CharFilterFactories

    solr.MappingCharFilterFactory Maps a set of characters to another set of characters. The mapping file is specified by the mapping attribute, and should be present under /solr/conf.
    Example: <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    The mapping file should have this format:
    # Ä => A
    "\u00C4" => "A"
    # Å => A
    "\u00C5" => "A"
    solr.HTMLStripCharFilterFactory Strips HTML/XML from the input stream. The input need not be an HTML document, as only constructs that look like HTML will be removed.
    Removes HTML/XML tags while keeping the content
    Attributes within tags are also removed, and attribute quoting is optional.
    Removes XML processing instructions: <?foo bar?>
    Removes XML comments
    Removes XML elements starting with <! and ending with >
    Removes contents of <script> and <style> elements.
    Handles XML comments inside these elements (normal comment processing won’t always work)
    Replaces numeric character entity references like &#65; or &#x7f;. The terminating ‘;’ is optional if the entity reference is followed by whitespace.
    Replaces all named character entity references.
    The terminating ‘;’ is mandatory to avoid false matches on something like "Alpha&Omega Corp". Example: <charFilter class="solr.HTMLStripCharFilterFactory"/>
    The text
    my <a href=”www.foo.bar”>link</a>
    becomes
    my link

     

    Commonly used TokenizerFactories

    solr.WhitespaceTokenizerFactory A tokenizer that divides text at whitespace, as defined by java.lang.Character.isWhitespace(). Adjacent sequences of non-whitespace characters form tokens.

    Example: HELLO\t\t\tWORLD.txt is tokenized into 2 tokens: HELLO and WORLD.txt
    solr.KeywordTokenizerFactory Treats the entire field as one token, regardless of its content. This is a lot like the "string" field type, in that no tokenization happens at all. Use it if a text field requires no tokenization, but does require char filters, or token filtering like LowerCaseFilter and TrimFilter. Example: http://example.com/I-am+example?Text=-Hello is retained as http://example.com/I-am+example?Text=-Hello
    solr.StandardTokenizerFactory A good general purpose tokenizer.

    • Splits words at punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.
    • Not suitable for file names because the .extension is treated as part of token.
    • Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
    • Recognizes email addresses and internet hostnames as one token.

    Example: the input This sentence can’t be "tokenized_Correctly" by www.google.com or IBM or NATO 10.1.9.5 test@email.org product-number 123-456949 file.txt is tokenized as: This sentence can’t be tokenized Correctly by www.google.com or IBM or NATO 10.1.9.5 test@email.org product number 123-456949 file.txt

    solr.PatternTokenizerFactory Uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: “pattern” and “group”.

    • “pattern” is the regular expression.
    • “group” says which group to extract into tokens.

    group=-1 (the default) is equivalent to “split”. In this case, the tokens will be equivalent to the output from (without empty tokens): String.split(java.lang.String) Using group >= 0 selects the matching group as the token. For example, if you have:

      pattern = '([^']+)'
      group = 0
      input = aaa 'bbb' 'ccc'

    the output will be two tokens: ‘bbb’ and ‘ccc’ (including the ‘ marks). With the same input but using group=1, the output would be: bbb and ccc (no ‘ marks)

    solr.NGramTokenizerFactory Not clear when and where to use, but the idea is that the input is split into 1-sized, then 2-sized, then 3-sized, etc. tokens. Perhaps useful for partial matching… It takes "minGram" and "maxGram" arguments, but again, not clear how to set them. Example: email becomes e m a i l em ma ai il

     

    Commonly used TokenFilterFactories

    solr.WordDelimiterFilterFactory Splits words into subwords and performs optional transformations on subword groups. One use for WordDelimiterFilter is to help match words with different delimiters. One way of doing so is to specify generateWordParts="1" catenateWords="1" in the analyzer used for indexing, and generateWordParts="1" in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as WhitespaceTokenizer). By default, words are split into subwords with the following rules:

    • split on intra-word delimiters (all non alpha-numeric characters).
      • "Wi-Fi" -> "Wi", "Fi"
    • split on case transitions (can be turned off – see splitOnCaseChange parameter)
      • "PowerShot" -> "Power", "Shot"
    • split on letter-number transitions (can be turned off – see splitOnNumerics parameter)
      • "SD500" -> "SD", "500"
    • leading and trailing intra-word delimiters on each subword are ignored
      • "//hello---there, 'dude'" -> "hello", "there", "dude"
    • trailing “‘s” are removed for each subword (can be turned off – see stemEnglishPossessive parameter)
      • "O'Neil's" -> "O", "Neil"
        • Note: this step isn’t performed in a separate filter because of possible subword combinations.

    Splitting is affected by the following parameters:

    • splitOnCaseChange="1" causes lowercase => uppercase transitions to generate a new part [Solr 1.3]:
      • "PowerShot" => "Power" "Shot"
      • "TransAM" => "Trans" "AM"
      • default is true ("1"); set to 0 to turn off
    • splitOnNumerics="1" causes alphabet => number transitions to generate a new part [Solr 1.3]:
      • "j2se" => "j" "2" "se"
      • default is true ("1"); set to 0 to turn off
    • stemEnglishPossessive="1" causes trailing "'s" to be removed for each subword.
      • "Doug's" => "Doug"
      • default is true ("1"); set to 0 to turn off

    There are also a number of parameters that affect what tokens are present in the final output and if subwords are combined:

    • generateWordParts="1" causes parts of words to be generated:
      • "PowerShot" => "Power" "Shot" (if splitOnCaseChange=1)
      • "Power-Shot" => "Power" "Shot"
      • default is 0
    • generateNumberParts="1" causes number subwords to be generated:
      • "500-42" => "500" "42"
      • default is 0
    • catenateWords="1" causes maximum runs of word parts to be catenated:
      • "wi-fi" => "wifi"
      • default is 0
    • catenateNumbers="1" causes maximum runs of number parts to be catenated:
      • "500-42" => "50042"
      • default is 0
    • catenateAll="1" causes all subword parts to be catenated:
      • "wi-fi-4000" => "wifi4000"
      • default is 0
    • preserveOriginal="1" causes the original token to be indexed without modifications (in addition to the tokens produced due to other options)
      • default is 0

    Example of generateWordParts="1" and catenateWords="1":

    • "PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot" (where 0,1,1 are token positions)
    • "A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
    • "Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
    solr.SynonymFilterFactory Matches strings of tokens and replaces them with other strings of tokens.

    • The synonyms parameter names an external file defining the synonyms.
    • If ignoreCase is true, matching will lowercase before checking equality.
    • If expand is true, a synonym will be expanded to all equivalent synonyms. If it is false, all equivalent synonyms will be reduced to the first in the list.

    Synonym file format (one rule per line):
    i-pod, i pod => ipod
    sea biscuit, sea biscit => seabiscuit

    solr.StopFilterFactory Discards common words. Example: <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> The stop words file should be in /solr/conf and lists one word per line, for example:
    # Standard english stop words
    a
    an
    solr.SnowballPorterFilterFactory Uses the Tartarus Snowball stemmer framework for different languages. Set the "language" attribute. Not clear how this is different from PorterStemFilterFactory. Example: running gives run
    solr.HyphenatedWordsFilterFactory Combines words split by hyphens. Use only at indexing time.
    solr.KeepWordFilterFactory Retains only words specified in the “words” file.
    solr.LengthFilterFactory Retains only tokens whose length falls between “min” and “max”
    solr.LowerCaseFilterFactory Changes all text to lower case.
    solr.PorterStemFilterFactory Transforms the token stream according to the Porter stemming algorithm. The input token stream should already be lowercase (pass it through a LowerCaseFilter). Example: running is tokenized to run
    solr.ReversedWildcardFilterFactory
    solr.ReverseStringFilterFactory
    Useful if wildcard queries like "Apache*" should be supported. Factory for ReversedWildcardFilter-s. When this factory is added to an analysis chain, it will be used both for filtering the tokens during indexing, and to determine the query processing of this field during search. This class supports the following init arguments:

    • withOriginal – if true, then produce both original and reversed tokens at the same positions. If false, then produce only reversed tokens.
    • maxPosAsterisk – maximum position (1-based) of the asterisk wildcard (‘*’) that triggers the reversal of query term. Asterisk that occurs at positions higher than this value will not cause the reversal of query term. Defaults to 2, meaning that asterisks on positions 1 and 2 will cause a reversal.
    • maxPosQuestion – maximum position (1-based) of the question mark wildcard (‘?’) that triggers the reversal of query term. Defaults to 1. Set this to 0, and maxPosAsterisk to 1, to reverse only pure suffix queries (i.e. ones with a single leading asterisk).
    • maxFractionAsterisk – additional parameter that triggers the reversal if asterisk (‘*’) position is less than this fraction of the query token length. Defaults to 0.0f (disabled).
    • minTrailing – minimum number of trailing characters in query token after the last wildcard character. For good performance this should be set to a value larger than 1. Defaults to 2.

    Note 1: This filter always reverses input tokens during indexing. Note 2: Query tokens without wildcard characters will never be reversed.
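
    As an illustration, the stock text_rev field type in the 1.4 example schema configures it roughly like this (the exact attribute values may differ in your version):

    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>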

     

    Predefined text field types (in v1.4.x schema)

    The default deployment contains a set of predefined text field types. The following table gives their tokenization details and examples.

    text Indexing behaviour:

    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    
    <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a ‘gap’ for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    

    - Tokenizes at whitespaces

    - Stop words are removed

    - Words delimiters are used to generate word tokens.

    generateWordParts=1 => wi-fi will generate wi and fi

    generateNumberParts = 1 => 3.5 will generate 3 and 5

    catenateWords=1 => wi-fi will generate wi, fi and wifi

    catenateNumbers = 1 => 3.5 will generate 3,5 and 35

    catenateAll = 1 => wi-fi-35 will generate wi, fi, wifi, 35 and wifi35.

    catenateAll = 0 => wi-fi-35 will generate wi, fi, wifi and 35, but not wifi35.

    splitOnCaseChange=1 => camelCase will generate camel and case.

    - All text is changed to lower case.

    - The Snowball porter stemmer will convert running to “run”

    Querying behaviour:

    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    

    In querying, only the synonym filter is additional. So something like TV which is in the synonym group “Television, Televisions, TV, TVs” results in this query token stream: televis televis tv tvs (“televis” is because “television” has been stemmed by Snowball Porter).

    textgen Very similar to “text” but without stemming.
    Indexing behaviour:

    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a ‘gap’ for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    

    - Tokenizes at whitespaces

    - Stop words are removed

    - Words delimiters are used to generate word tokens.

    generateWordParts=1 => wi-fi will generate wi and fi

    generateNumberParts = 1 => 3.5 will generate 3 and 5

    catenateWords=1 => wi-fi will generate wi, fi and wifi

    catenateNumbers = 1 => 3.5 will generate 3,5 and 35

    catenateAll = 1 => wi-fi-35 will generate wi, fi, wifi, 35 and wifi35.

    catenateAll = 0 => wi-fi-35 will generate wi, fi, wifi and 35, but not wifi35.

    splitOnCaseChange=1 => camelCase will generate camel and case.

    - All text is changed to lower case.

    - Note that there is no stemmer, which is what makes this different from “text” type.

    Querying behaviour:

    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    

    In querying, only the synonym filter is additional. So something like TV, which is in the synonym group “Television, Televisions, TV, TVs”, results in this query token stream: television televisions tv tvs.

    For file paths and filenames, “textgen” seems to give the most appropriate results.

    textTight : Very similar again to “text”, but its WordDelimiterFilter has generateWordParts="0" and generateNumberParts="0". So “wi-fi” will give just “wifi”, “HELLO_WORLD” will give just “helloworld”, and “d:\filepath\filename.ext” will give just “dfilepathfilenameext”.
    text_ws : Just simple whitespace tokenization.
    text_rev : Similar to “textgen”, this is a general unstemmed text field that indexes tokens normally and also reversed (via ReversedWildcardFilterFactory), to enable more efficient leading wildcard queries.

    Testing Solr schema, analyzers and tokenization

    November 19th, 2010 Comments off

    Introduction

    Using tests to tune the accuracy of search results is critical. That accuracy depends to a great extent on the analyzers, tokenizers and filters used in the Solr schema.

    Testing and refining their behaviour on a standalone Solr server is unproductive and time consuming, involving repeated cycles of deleting documents, stopping the server, changing the schema, restarting the server, and reindexing documents.

    It would be desirable if these analyzer tweaks could be tested quickly on small fragments of text, to ascertain how they will be tokenized and searched, before modifying the Solr schema.

    The following snippets help you unit test and functionally test tokenization behaviour.

    The first snippet below can be used to test the behaviour of combinations of tokenizers, token filters and char filters, and to examine their resulting token streams. Such tests would be useful in a unit test suite.

    The second snippet can be used for integration tests where a Solr schema, as it would exist in a production server, is loaded and tested.

     

    Unit testing Solr tokenizers, token filters and char filters

    This Java snippet uses Solr core, SolrJ and Lucene classes to run a piece of text through a tokenizer-filter chain and print the resulting tokens. The code can easily be adapted into a JUnit test case with automated result matching; a sketch follows the 1.4.x snippet below.

    For Solr 1.4.x:

    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.solr.analysis.LowerCaseFilterFactory;
    import org.apache.solr.analysis.SnowballPorterFilterFactory;
    import org.apache.solr.analysis.TokenFilterFactory;
    import org.apache.solr.analysis.TokenizerFactory;
    import org.apache.solr.analysis.WhitespaceTokenizerFactory;

    // Wrapper class added so the snippet compiles as a standalone file; the name is arbitrary.
    public class AnalyzerChainTester {
        public static void main(String[] args) {
            try {
                // The text to analyze is passed as the first command line argument.
                StringReader inputText = new StringReader(args[0]);

                // Tokenize at whitespace.
                TokenizerFactory tkf = new WhitespaceTokenizerFactory();
                Tokenizer tkz = tkf.create(inputText);

                // Lower-case each token.
                LowerCaseFilterFactory lcf = new LowerCaseFilterFactory();
                TokenStream lcts = lcf.create(tkz);

                // Stem with the English Snowball (Porter) stemmer.
                TokenFilterFactory fcf = new SnowballPorterFilterFactory();
                Map<String, String> params = new HashMap<String, String>();
                params.put("language", "English");
                fcf.init(params);
                TokenStream ts = fcf.create(lcts);

                // Print the term text of each token in the final stream.
                TermAttribute termAttrib = (TermAttribute) ts.getAttribute(TermAttribute.class);

                while (ts.incrementToken()) {
                    String term = termAttrib.term();
                    System.out.println(term);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

            System.exit(0);
        }
    }
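
    As noted above, the same chain drops naturally into a unit test suite. Below is a minimal JUnit 4 sketch, assuming the same Solr 1.4.x classes; the test class name is arbitrary, and the expected terms are simply what whitespace tokenization, lower-casing and English Snowball stemming should produce for the sample input:

    import static org.junit.Assert.assertEquals;

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.solr.analysis.LowerCaseFilterFactory;
    import org.apache.solr.analysis.SnowballPorterFilterFactory;
    import org.apache.solr.analysis.WhitespaceTokenizerFactory;
    import org.junit.Test;

    public class StemmingChainTest {

        @Test
        public void lowerCasesAndStems() throws Exception {
            // Build the same tokenizer-filter chain as the snippet above.
            TokenStream ts = new WhitespaceTokenizerFactory()
                    .create(new StringReader("Running RUNS"));
            ts = new LowerCaseFilterFactory().create(ts);

            SnowballPorterFilterFactory spf = new SnowballPorterFilterFactory();
            Map<String, String> params = new HashMap<String, String>();
            params.put("language", "English");
            spf.init(params);
            ts = spf.create(ts);

            // Collect the emitted terms so they can be asserted on.
            TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
            List<String> actual = new ArrayList<String>();
            while (ts.incrementToken()) {
                actual.add(term.term());
            }

            // Both "Running" and "RUNS" should stem to "run".
            assertEquals(Arrays.asList("run", "run"), actual);
        }
    }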

     

    For Solr 3.3.x:

    The code for Solr 3.3.x is slightly different, because some portions of the API have been changed or deprecated:

    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.solr.analysis.LowerCaseFilterFactory;
    import org.apache.solr.analysis.SnowballPorterFilterFactory;
    import org.apache.solr.analysis.TokenFilterFactory;
    import org.apache.solr.analysis.TokenizerFactory;
    import org.apache.solr.analysis.WhitespaceTokenizerFactory;

    // Wrapper class added so the snippet compiles as a standalone file; the name is arbitrary.
    public class AnalyzerChainTester33 {
        public static void main(String[] args) {
            try {
                StringReader inputText = new StringReader("RUNNING runnable");

                // The 3.x factories expect a luceneMatchVersion argument.
                Map<String, String> tkargs = new HashMap<String, String>();
                tkargs.put("luceneMatchVersion", "LUCENE_33");

                TokenizerFactory tkf = new WhitespaceTokenizerFactory();
                tkf.init(tkargs);
                Tokenizer tkz = tkf.create(inputText);

                LowerCaseFilterFactory lcf = new LowerCaseFilterFactory();
                lcf.init(tkargs);
                TokenStream lcts = lcf.create(tkz);

                TokenFilterFactory fcf = new SnowballPorterFilterFactory();
                Map<String, String> params = new HashMap<String, String>();
                params.put("language", "English");
                fcf.init(params);
                TokenStream ts = fcf.create(lcts);

                // TermAttribute has been replaced by CharTermAttribute in Lucene 3.x.
                CharTermAttribute termAttrib = (CharTermAttribute) ts.getAttribute(CharTermAttribute.class);

                while (ts.incrementToken()) {
                    String term = termAttrib.toString();
                    System.out.println(term);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

            System.exit(0);
        }
    }

     

    Functional testing of Solr schema.xml

    For functional tests, it is more useful to test the actual Solr search model itself, rather than individual tokenizer chains.

    The snippet below shows how the schema.xml can be loaded and an analysis run on a piece of input text against a dummy field, to examine the resulting index and query tokens:

    For Solr 1.4.x:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.StringReader;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Map.Entry;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.solr.core.SolrConfig;
    import org.apache.solr.schema.FieldType;
    import org.apache.solr.schema.IndexSchema;

    public class SchemaTester {
        public static void main(String[] args) {
            try {
                // Load solrconfig.xml and schema.xml exactly as a running Solr server would.
                InputStream solrCfgIs = new FileInputStream(
                        "solr/conf/solrconfig.xml");
                SolrConfig solrConfig = new SolrConfig(null, solrCfgIs);

                InputStream solrSchemaIs = new FileInputStream(
                        "solr/conf/schema.xml");
                IndexSchema solrSchema = new IndexSchema(solrConfig, null,
                        solrSchemaIs);

                // Dump all analyzer definitions in the schema...
                Map<String, FieldType> fieldTypes = solrSchema.getFieldTypes();
                for (Iterator<Entry<String, FieldType>> iter = fieldTypes.entrySet().iterator();
                    iter.hasNext();) {

                    Entry<String, FieldType> entry = iter.next();
                    FieldType fldType = entry.getValue();
                    Analyzer analyzer = fldType.getAnalyzer();
                    System.out.println(entry.getKey() + ":" + analyzer.toString());

                }

                //String inputText = "HELLO_WORLD d:\\filepath\\filename.ext wi-fi wi-fi-3500 running TV camelCase test-hyphenated file.txt";
                String inputText = args[0];

                // Name of the field type in your schema.xml. ex: "textgen"
                FieldType fieldTypeText = fieldTypes.get("textgen");

                System.out.println("Indexing analysis:");
                Analyzer analyzer = fieldTypeText.getAnalyzer();
                TokenStream tokenStream = analyzer.tokenStream("dummyfield",
                        new StringReader(inputText));
                TermAttribute termAttr = (TermAttribute) tokenStream.getAttribute(TermAttribute.class);
                while (tokenStream.incrementToken()) {
                    System.out.println(termAttr.term());
                }

                System.out.println("\n\nQuerying analysis:");
                Analyzer qryAnalyzer = fieldTypeText.getQueryAnalyzer();
                TokenStream qrytokenStream = qryAnalyzer.tokenStream("dummyfield",
                        new StringReader(inputText));
                TermAttribute termAttr2 = (TermAttribute) qrytokenStream.getAttribute(TermAttribute.class);
                while (qrytokenStream.incrementToken()) {
                    System.out.println(termAttr2.term());
                }

            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
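
    The index-time and query-time loops in SchemaTester differ only in which analyzer they use, so if you run this kind of analysis often it may be worth pulling the token dump into a small helper method. A possible sketch for the 1.4.x API (the class and method names are just illustrative; for 3.3.x, the only changes are CharTermAttribute and its toString() in place of TermAttribute and term()):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class TokenDumper {

        /** Runs the given text through the analyzer and prints one term per line. */
        public static void printTokens(Analyzer analyzer, String text) throws Exception {
            TokenStream ts = analyzer.tokenStream("dummyfield", new StringReader(text));
            TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
            while (ts.incrementToken()) {
                System.out.println(term.term());
            }
        }
    }

    The two blocks in SchemaTester then reduce to printTokens(fieldTypeText.getAnalyzer(), inputText) and printTokens(fieldTypeText.getQueryAnalyzer(), inputText).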

     

    For Solr 3.3.x:

    import java.io.FileReader;
    import java.io.StringReader;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Map.Entry;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.solr.core.SolrConfig;
    import org.apache.solr.schema.FieldType;
    import org.apache.solr.schema.IndexSchema;
    import org.xml.sax.InputSource;

    // Same SchemaTester, updated for the Solr 3.3.x API (InputSource instead of InputStream,
    // CharTermAttribute instead of TermAttribute).
    public class SchemaTester {
        public static void main(String[] args) {
            try {
                InputSource solrCfgIs = new InputSource(
                        new FileReader("solr/conf/solrconfig.xml"));
                SolrConfig solrConfig = new SolrConfig(null, solrCfgIs);

                InputSource solrSchemaIs = new InputSource(
                        new FileReader("solr/conf/schema.xml"));
                IndexSchema solrSchema = new IndexSchema(solrConfig, null,
                        solrSchemaIs);

                // Dump all analyzer definitions in the schema...
                Map<String, FieldType> fieldTypes = solrSchema.getFieldTypes();
                for (Iterator<Entry<String, FieldType>> iter = fieldTypes.entrySet().iterator();
                    iter.hasNext();) {

                    Entry<String, FieldType> entry = iter.next();
                    FieldType fldType = entry.getValue();
                    Analyzer analyzer = fldType.getAnalyzer();
                    System.out.println(entry.getKey() + ":" + analyzer.toString());

                }

                String inputText = "Proof of the pudding lies in its eating";
                // Name of the field type in your schema.xml. ex: "text_en"
                FieldType fieldTypeText = fieldTypes.get("text_en");

                System.out.println("Indexing analysis:");
                Analyzer analyzer = fieldTypeText.getAnalyzer();
                TokenStream tokenStream = analyzer.tokenStream("dummyfield",
                        new StringReader(inputText));
                CharTermAttribute termAttr = (CharTermAttribute) tokenStream.getAttribute(CharTermAttribute.class);
                while (tokenStream.incrementToken()) {
                    System.out.println(termAttr.toString());
                }

                System.out.println("\n\nQuerying analysis:");
                Analyzer qryAnalyzer = fieldTypeText.getQueryAnalyzer();
                TokenStream qrytokenStream = qryAnalyzer.tokenStream("dummyfield",
                        new StringReader(inputText));
                CharTermAttribute termAttr2 = (CharTermAttribute) qrytokenStream.getAttribute(CharTermAttribute.class);
                while (qrytokenStream.incrementToken()) {
                    System.out.println(termAttr2.toString());
                }

            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    Dependencies

    These snippets require the following jars from the Solr package:

    • apache-solr-core-*.jar
    • apache-solr-solrj-*.jar
    • lucene-analyzers-*.jar
    • lucene-core-*.jar
    • lucene-snowball-*.jar
    • lucene-spatial-*.jar (only for v3.3)
    • commons-io-*.jar (only for v3.3)
    • slf4j-api-*.jar
    • slf4j-jdk14-*.jar
    Categories: Search