Solr text field types, analyzers, tokenizers & filters explained
November 2010. Apache Solr 1.4.x
Overview
Solr’s world view consists of documents, where each document consists of searchable fields. The rules for searching each field are defined using field type definitions. A field type definition describes the analyzers, tokenizers and filters which control searching behaviour for all fields of that type.
When a document is added or updated, its fields are analyzed and tokenized, and those tokens are stored in Solr’s index. When a query is sent, the query is likewise analyzed and tokenized, and then matched against tokens in the index. This critical function of tokenization is handled by Tokenizer components.
In addition to tokenizers, there are TokenFilter components, whose job is to modify the token stream.
There are also CharFilter components, whose job is to modify individual characters before tokenization. For example, HTML text can be filtered to convert HTML entities like &amp; to a regular &.
Defining text field types in schema.xml
Here’s a typical text field type definition:
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
What this type definition specifies is:
- When indexing a field of this type, use an analyzer composed of
  - a WhitespaceTokenizerFactory
  - a StopFilterFactory
  - a WordDelimiterFilterFactory
  - a LowerCaseFilterFactory
- When querying a field of this type, use an analyzer composed of
  - a WhitespaceTokenizerFactory
  - a SynonymFilterFactory
  - a StopFilterFactory
  - a WordDelimiterFilterFactory
  - a LowerCaseFilterFactory
- If there is only one analyzer element, then the same analyzer is used for both indexing and querying.
- It’s important that the index-time and query-time analyzers process text in a compatible manner. For example, if the indexing analyzer lowercases words, then the query analyzer should do the same so that the indexed words can be found.
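The single-analyzer case can be sketched as follows. This is a minimal illustration and is not taken from the stock schema: the type name text_lower and the field name title are hypothetical, while the tokenizer and filter classes are the ones used in the definition above.

```xml
<!-- Hypothetical field type (goes inside <types> in schema.xml).
     There is one <analyzer> with no type attribute, so the same chain
     is applied at index time and at query time. -->
<fieldType name="text_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that uses the type above (goes inside <fields>). -->
<field name="title" type="text_lower" indexed="true" stored="true"/>
```

With a definition like this, both the indexed text “Hello World” and the query “hello” pass through the same lowercasing step, so the query token hello matches the indexed token hello.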
Under the hood
Solr builds a TokenizerChain instance for each of these analyzers. A TokenizerChain is composed of 1 TokenizerFactory instance, 0-n TokenFilterFactory instances, and 0-n CharFilterFactory instances. These factory instances are responsible for creating their respective objects from the Lucene framework. For example, a TokenizerFactory creates a Lucene Tokenizer; its concrete implementation WhitespaceTokenizerFactory creates a Lucene WhitespaceTokenizer.
The class design diagram shows how a TokenizerChain works:
- Raw input is provided by a Reader instance
- CharReader (is-a CharStream) wraps the raw Reader
- Each CharFilterFactory creates a character filter that takes the input CharStream and outputs a modified CharStream, so CharFilterFactories can be chained.
- TokenizerFactory creates a Tokenizer from the CharStream.
- Tokenizer is-a TokenStream, and can be passed to TokenFilterFactories.
- Each TokenFilterFactory modifies the token stream and outputs another TokenStream. So these can be chained.
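In schema.xml terms, the chain described above could look like the following sketch. The type name text_html is hypothetical and the particular combination of components is only an illustration; the point is the order, which mirrors the TokenizerChain: char filters first, then exactly one tokenizer, then token filters.

```xml
<!-- Hypothetical field type illustrating the order of a TokenizerChain. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- 0-n char filters: each one reads a CharStream and outputs a CharStream -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- exactly 1 tokenizer: turns the CharStream into a TokenStream -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- 0-n token filters: each one reads a TokenStream and outputs a TokenStream -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```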
Commonly used CharFilterFactories
| solr.MappingCharFilterFactory | Maps a set of characters to another set of characters. The mapping file is specified by the mapping attribute, and should be present under /solr/conf.
Example: <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
The mapping file should have this format:
# Ä => A
"\u00C4" => "A"
# Å => A
"\u00C5" => "A" |
| solr.HTMLStripCharFilterFactory | Strips HTML/XML from the input stream. The input need not be an HTML document, as only constructs that look like HTML will be removed.
- Removes HTML/XML tags while keeping the content.
- Attributes within tags are also removed, and attribute quoting is optional.
- Removes XML processing instructions: <?foo bar?>
- Removes XML comments.
- Removes XML elements starting with <! and ending with >
- Removes contents of <script> and <style> elements.
- Handles XML comments inside these elements (normal comment processing won’t always work).
- Replaces numeric character entity references like &#65; or &#x7f;. The terminating ‘;’ is optional if the entity reference is followed by whitespace.
- Replaces all named character entity references. The terminating ‘;’ is mandatory to avoid false matches on something like “Alpha&Omega Corp”.
Example: <charFilter class="solr.HTMLStripCharFilterFactory"/>
The text my <a href="www.foo.bar">link</a> becomes my link |
Commonly used TokenizerFactories
| solr.WhitespaceTokenizerFactory | A tokenizer that divides text at whitespace, as defined by java.lang.Character.isWhitespace(). Adjacent sequences of non-whitespace characters form tokens.
Example: HELLO\t\t\tWORLD.txt (where \t is a tab character) is tokenized into 2 tokens: HELLO and WORLD.txt |
| solr.KeywordTokenizerFactory | Treats the entire field as one token, regardless of its content. This is a lot like the “string” field type, in that no tokenization happens at all. Use it if a text field requires no tokenization, but does require char filters, or token filters like LowerCaseFilter and TrimFilter.
Example: http://example.com/I-am+example?Text=-Hello is retained as http://example.com/I-am+example?Text=-Hello |
| solr.StandardTokenizerFactory | A good general purpose tokenizer.
Example: This sentence\t can’t be “tokenized_Correctly” by www.google.com or IBM or NATO 10.1.9.5 test@email.org product-number 123-456949 file.txt
is tokenized as
This sentence can’t be tokenized Correctly by www.google.com or IBM or NATO 10.1.9.5 test@email.org product number 123-456949 file.txt |
| solr.PatternTokenizerFactory | Uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: “pattern” and “group”.
group=-1 (the default) is equivalent to “split”. In this case, the tokens will be equivalent to the output from String.split(java.lang.String) (without empty tokens).
Using group >= 0 selects the matching group as the token. For example, with
pattern = '([^']+)'
group = 0
input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks) |
| solr.NGramTokenizerFactory | Splits the input into character n-grams: first all 1-character tokens, then all 2-character tokens, then all 3-character tokens, and so on. This can be useful for partial (substring) matching. The range of sizes is controlled by the “minGramSize” and “maxGramSize” arguments; pick them to cover the shortest and longest fragments that a query should be able to match.
Example (sizes 1 to 2): email becomes e m a i l em ma ai il |
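The tokenizers in this table could be wired into field types roughly as follows. All three type names are hypothetical. The attribute names minGramSize and maxGramSize are my understanding of what NGramTokenizerFactory reads and should be checked against the deployed version.

```xml
<!-- Hypothetical: keep the whole value as one token, then trim and lowercase it.
     Useful for case-insensitive exact matching. -->
<fieldType name="string_ci" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical: emit the text between single quotes as tokens.
     group="1" selects the first capture group, so the quote marks are dropped. -->
<fieldType name="text_quoted" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="'([^']+)'" group="1"/>
  </analyzer>
</fieldType>

<!-- Hypothetical: character n-grams of size 1 and 2 (assumed attribute names). -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="2"/>
  </analyzer>
</fieldType>
```

Following the PatternTokenizerFactory example in the table, the input aaa 'bbb' 'ccc' run through text_quoted would be expected to give the tokens bbb and ccc.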
Commonly used TokenFilterFactories
| solr.WordDelimiterFilterFactory | Splits words into subwords and performs optional transformations on subword groups. One use for WordDelimiterFilter is to help match words with different delimiters. One way of doing so is to specify generateWordParts="1" catenateWords="1" in the analyzer used for indexing, and generateWordParts="1" in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as WhitespaceTokenizer).
By default, words are split into subwords at intra-word delimiters (wi-fi gives wi and fi) and at case changes (camelCase gives camel and Case).
Splitting is affected by parameters such as splitOnCaseChange.
There are also a number of parameters that affect what tokens are present in the final output and if subwords are combined: generateWordParts, generateNumberParts, catenateWords, catenateNumbers and catenateAll.
Example of generateWordParts="1" and catenateWords="1": wi-fi generates wi, fi and wifi.
|
| solr.SynonymFilterFactory | Matches strings of tokens and replaces them with other strings of tokens.
Synonym file format:
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit |
| solr.StopFilterFactory | Discards common words.
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
The stop words file should be in /solr/conf. Format:
#Standard english stop words
a
an |
| solr.SnowballPorterFilterFactory | Uses the Tartarus Snowball stemmer framework, which provides stemmers for different languages. Set the “language” attribute to choose one. PorterStemFilterFactory, by contrast, takes no language and always applies the English Porter algorithm. Example: running gives run |
| solr.HyphenatedWordsFilterFactory | Combines words split by hyphens. Use only at indexing time. |
| solr.KeepWordFilterFactory | Retains only words specified in the “words” file. |
| solr.LengthFilterFactory | Retains only tokens whose length falls between “min” and “max” |
| solr.LowerCaseFilterFactory | Changes all text to lower case. |
| solr.PorterStemFilterFactory | Transforms the token stream according to the Porter stemming algorithm. The input token stream should already be lowercase (pass it through a LowerCaseFilter first). Example: running is stemmed to run |
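| solr.ReversedWildcardFilterFactory solr.ReverseStringFilterFactory |
Useful if leading wildcard queries like “*pache” should be supported efficiently. Factory for ReversedWildcardFilter-s. When this factory is added to an analysis chain, it will be used both for filtering the tokens during indexing, and to determine the query processing of this field during search. This class supports the following init arguments:
- withOriginal: if true, produce both original and reversed tokens at the same positions; if false, produce only reversed tokens.
- maxPosAsterisk: maximum position (1-based) of the asterisk wildcard (‘*’) that triggers the reversal of a query term.
- maxPosQuestion: maximum position (1-based) of the question mark wildcard (‘?’) that triggers the reversal of a query term.
- maxFractionAsterisk: additionally triggers the reversal if the asterisk position is less than this fraction of the query token length.
- minTrailing: minimum number of trailing characters in a query token after the last wildcard character.
Note 1: This filter always reverses input tokens during indexing.
Note 2: Query tokens without wildcard characters will never be reversed. |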
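As a rough sketch of how ReversedWildcardFilterFactory might be placed in an analysis chain: the type name text_rev_sketch is hypothetical, the attribute names are my understanding of the factory's init arguments, and the values are illustrative rather than recommended settings. The filter sits only in the index analyzer here, on the assumption that Solr inspects the index-time chain to decide how to process wildcard queries on the field.

```xml
<!-- Hypothetical field type: tokens are indexed both as-is and reversed,
     so that a leading wildcard query can be rewritten as a trailing one. -->
<fieldType name="text_rev_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- withOriginal="true" keeps the original token next to the reversed one;
         the numeric values below are illustrative only -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <!-- same tokenization and lowercasing, but no reversal filter -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```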
Predefined text field types (in v1.4.x schema)
The default deployment contains a set of predefined text field types. The following table gives their tokenization details and examples.
| text | Indexing behaviour:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a ‘gap’ for more accurate phrase queries. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
- Tokenizes at whitespace.
- Stop words are removed.
- Word delimiters are used to generate word tokens:
  - generateWordParts=1 => wi-fi will generate wi and fi
  - generateNumberParts=1 => 3.5 will generate 3 and 5
  - catenateWords=1 => wi-fi will generate wi, fi and wifi
  - catenateNumbers=1 => 3.5 will generate 3, 5 and 35
  - catenateAll=1 => wi-fi-35 will generate wi, fi, wifi, 35 and wifi35
  - catenateAll=0 => wi-fi-35 will generate wi, fi, wifi and 35, but not wifi35
  - splitOnCaseChange=1 => camelCase will generate camel and case
- All text is changed to lower case.
- The Snowball Porter stemmer will convert running to “run”.
Querying behaviour:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
In querying, the synonym filter is added, and catenateWords and catenateNumbers are set to 0. So something like TV, which is in the synonym group “Television, Televisions, TV, TVs”, results in this query token stream: televis televis tv tvs (“televis” is because “television” has been stemmed by Snowball Porter). |
| textgen | Very similar to “text” but without stemming and, as in the textgen definition shown earlier, with splitOnCaseChange="0".
Indexing behaviour:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a ‘gap’ for more accurate phrase queries. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
- Tokenizes at whitespace.
- Stop words are removed.
- Word delimiters are used to generate word tokens:
  - generateWordParts=1 => wi-fi will generate wi and fi
  - generateNumberParts=1 => 3.5 will generate 3 and 5
  - catenateWords=1 => wi-fi will generate wi, fi and wifi
  - catenateNumbers=1 => 3.5 will generate 3, 5 and 35
  - catenateAll=1 => wi-fi-35 will generate wi, fi, wifi, 35 and wifi35
  - catenateAll=0 => wi-fi-35 will generate wi, fi, wifi and 35, but not wifi35
  - splitOnCaseChange=0 => camelCase is kept as one token (with “text”, it is split into camel and case)
- All text is changed to lower case.
- Note that there is no stemmer, which is the main thing that makes this different from the “text” type.
Querying behaviour:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
In querying, the synonym filter is added, and catenateWords and catenateNumbers are set to 0. So something like TV, which is in the synonym group “Television, Televisions, TV, TVs”, results in this query token stream: television televisions tv tvs
For file paths and filenames, “textgen” seems to give the most appropriate results. |
| textTight | Very similar again to “text”, but differs in that WordDelimiterFilter has generateWordParts="0" and generateNumberParts="0". So “wi-fi” will give just “wifi”, “HELLO_WORLD” will give just “helloworld”, and “d:\filepath\filename.ext” will give just “dfilepathfilenameext”. |
| text_ws | Just simple whitespace tokenization. |
| text_rev | Similar to “textgen”, this is for a general unstemmed text field that indexes tokens normally and also reversed (via ReversedWildcardFilterFactory), to enable more efficient leading wildcard queries. |
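To put these predefined types to use, fields in schema.xml simply reference them by name. The sketch below is illustrative only: every field name is hypothetical, and the choice of type per field is a suggestion based on the behaviour described in the table, not a rule.

```xml
<!-- Hypothetical field declarations (inside <fields> in schema.xml). -->

<!-- free text where stemming helps recall: "running" also matches "run" -->
<field name="description" type="text" indexed="true" stored="true"/>

<!-- file paths and filenames: unstemmed, word-delimiter aware -->
<field name="filepath" type="textgen" indexed="true" stored="true"/>

<!-- product codes: "wi-fi" is indexed as the single token "wifi" -->
<field name="sku" type="textTight" indexed="true" stored="true"/>

<!-- whitespace-separated tags, no further processing -->
<field name="tags" type="text_ws" indexed="true" stored="true" multiValued="true"/>

<!-- a title field, plus a reversed copy for leading wildcard queries -->
<field name="title" type="textgen" indexed="true" stored="true"/>
<field name="title_rev" type="text_rev" indexed="true" stored="false"/>

<!-- copyField sits next to the field declarations, outside <fields> -->
<copyField source="title" dest="title_rev"/>
```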





