
Testing Solr schema, analyzers and tokenization

November 19th, 2010

Introduction

Using tests to tune the accuracy of search results is critical. Accuracy depends to a great extent on the analyzers, tokenizers and filters configured in the Solr schema.

Testing and refining their behaviour on a standalone Solr server is unproductive and time consuming: every change means deleting documents, stopping the server, editing the schema, restarting the server and reindexing the documents.

It would be far more convenient if analyzer tweaks could be tested quickly on small fragments of text, to see how they will be tokenized and searched, before modifying the Solr schema.

The following snippets help you unit test and functionally test tokenization behaviour.

The first snippet below tests combinations of tokenizers, token filters and char filters by examining the token streams they produce. Such tests are useful in a unit test suite.

The second snippet can be used for integration tests where a Solr schema, as it would exist in a production server, is loaded and tested.

 

Unit testing Solr tokenizers, token filters and char filters

This Java snippet uses Solr core, SolrJ and Lucene classes to run a piece of text through a tokenizer-filter chain and print the resulting tokens. It can easily be adapted into a JUnit test case that asserts the expected tokens (a sketch of such a test follows the two versions below).

For Solr 1.4.x:

public static void main(String[] args) {
	try {
		// The text to analyze is passed as the first command-line argument
		StringReader inputText = new StringReader(args[0]);

		// Tokenizer: split the input on whitespace
		TokenizerFactory tkf = new WhitespaceTokenizerFactory();
		Tokenizer tkz = tkf.create(inputText);

		// Filter 1: lowercase every token
		LowerCaseFilterFactory lcf = new LowerCaseFilterFactory();
		TokenStream lcts = lcf.create(tkz);

		// Filter 2: stem tokens with the English Snowball stemmer
		TokenFilterFactory fcf = new SnowballPorterFilterFactory();
		Map<String, String> params = new HashMap<String, String>();
		params.put("language", "English");
		fcf.init(params);
		TokenStream ts = fcf.create(lcts);

		// Walk the resulting token stream and print each term
		TermAttribute termAttrib = (TermAttribute) ts.getAttribute(TermAttribute.class);
		while (ts.incrementToken()) {
			String term = termAttrib.term();
			System.out.println(term);
		}
	} catch (Exception e) {
		e.printStackTrace();
	}

	System.exit(0);
}

 

For Solr 3.3.x:

The code for Solr 3.3.x is slightly different, because some portions of the API have been changed or deprecated:

public static void main(String[] args) {
	try {
		StringReader inputText = new StringReader("RUNNING runnable");

		// From 3.x onwards, the factories expect a luceneMatchVersion argument
		Map<String, String> tkargs = new HashMap<String, String>();
		tkargs.put("luceneMatchVersion", "LUCENE_33");

		// Tokenizer: split the input on whitespace
		TokenizerFactory tkf = new WhitespaceTokenizerFactory();
		tkf.init(tkargs);
		Tokenizer tkz = tkf.create(inputText);

		// Filter 1: lowercase every token
		LowerCaseFilterFactory lcf = new LowerCaseFilterFactory();
		lcf.init(tkargs);
		TokenStream lcts = lcf.create(tkz);

		// Filter 2: stem tokens with the English Snowball stemmer
		TokenFilterFactory fcf = new SnowballPorterFilterFactory();
		Map<String, String> params = new HashMap<String, String>();
		params.put("language", "English");
		fcf.init(params);
		TokenStream ts = fcf.create(lcts);

		// TermAttribute is deprecated in 3.x; use CharTermAttribute instead
		CharTermAttribute termAttrib = (CharTermAttribute) ts.getAttribute(CharTermAttribute.class);
		while (ts.incrementToken()) {
			String term = termAttrib.toString();
			System.out.println(term);
		}
	} catch (Exception e) {
		e.printStackTrace();
	}

	System.exit(0);
}
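
To get automated matching of results, the same chain can be wrapped in a JUnit test case. The sketch below is a minimal JUnit 4 adaptation of the Solr 3.3.x snippet above; the TokenChainTest class name is invented for illustration, and the expected tokens in the assertion assume the English Snowball stemmer, so adjust them to whatever your own chain actually produces.

import static org.junit.Assert.assertEquals;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.SnowballPorterFilterFactory;
import org.apache.solr.analysis.WhitespaceTokenizerFactory;
import org.junit.Test;

public class TokenChainTest {

	@Test
	public void lowercasesAndStems() throws Exception {
		// Expected output of the whitespace -> lowercase -> Snowball(English) chain;
		// change these values if your chain produces something different
		assertEquals(Arrays.asList("run", "runnabl"), tokenize("RUNNING runnable"));
	}

	// Builds the same chain as the snippet above and collects the emitted terms
	private List<String> tokenize(String input) throws Exception {
		Map<String, String> commonArgs = new HashMap<String, String>();
		commonArgs.put("luceneMatchVersion", "LUCENE_33");

		WhitespaceTokenizerFactory tkf = new WhitespaceTokenizerFactory();
		tkf.init(commonArgs);
		Tokenizer tkz = tkf.create(new StringReader(input));

		LowerCaseFilterFactory lcf = new LowerCaseFilterFactory();
		lcf.init(commonArgs);
		TokenStream lcts = lcf.create(tkz);

		SnowballPorterFilterFactory fcf = new SnowballPorterFilterFactory();
		Map<String, String> stemArgs = new HashMap<String, String>();
		stemArgs.put("language", "English");
		fcf.init(stemArgs);
		TokenStream ts = fcf.create(lcts);

		List<String> tokens = new ArrayList<String>();
		CharTermAttribute termAttrib = ts.getAttribute(CharTermAttribute.class);
		while (ts.incrementToken()) {
			tokens.add(termAttrib.toString());
		}
		return tokens;
	}
}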

 

Functional testing of Solr schema.xml

For functional tests, it is more useful to exercise the actual Solr schema itself, rather than individual tokenizer chains.

The snippet below shows how schema.xml can be loaded and a piece of input text analyzed against a dummy field, so that the resulting index-time and query-time tokens can be examined (a JUnit version of the same check follows the two versions below):

For Solr 1.4.x:

public class SchemaTester {
	public static void main(String[] args) {
		try {
			// Load solrconfig.xml and schema.xml the same way the server would
			InputStream solrCfgIs = new FileInputStream(
					"solr/conf/solrconfig.xml");
			SolrConfig solrConfig = new SolrConfig(null, solrCfgIs);

			InputStream solrSchemaIs = new FileInputStream(
					"solr/conf/schema.xml");
			IndexSchema solrSchema = new IndexSchema(solrConfig, null,
					solrSchemaIs);

			// Dump all analyzer definitions in the schema...
			Map<String, FieldType> fieldTypes = solrSchema.getFieldTypes();
			for (Entry<String, FieldType> entry : fieldTypes.entrySet()) {
				FieldType fldType = entry.getValue();
				Analyzer analyzer = fldType.getAnalyzer();
				System.out.println(entry.getKey() + ":" + analyzer.toString());
			}

			//String inputText = "HELLO_WORLD d:\filepath\filename.ext wi-fi wi-fi-3500 running TV camelCase test-hyphenated file.txt";
			String inputText = args[0];

			// Name of the field type in your schema.xml. ex: "textgen"
			FieldType fieldTypeText = fieldTypes.get("textgen");

			// Run the text through the index-time analyzer of the field type
			System.out.println("Indexing analysis:");
			Analyzer analyzer = fieldTypeText.getAnalyzer();
			TokenStream tokenStream = analyzer.tokenStream("dummyfield",
					new StringReader(inputText));
			TermAttribute termAttr = (TermAttribute) tokenStream.getAttribute(TermAttribute.class);
			while (tokenStream.incrementToken()) {
				System.out.println(termAttr.term());
			}

			// Run the same text through the query-time analyzer of the field type
			System.out.println("\n\nQuerying analysis:");
			Analyzer qryAnalyzer = fieldTypeText.getQueryAnalyzer();
			TokenStream qrytokenStream = qryAnalyzer.tokenStream("dummyfield",
					new StringReader(inputText));
			TermAttribute termAttr2 = (TermAttribute) qrytokenStream.getAttribute(TermAttribute.class);
			while (qrytokenStream.incrementToken()) {
				System.out.println(termAttr2.term());
			}

		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}

 

For Solr 3.3.x:

public static void main(String[] args) {
	try {
		// From 3.x onwards, the config and schema are loaded through an InputSource
		InputSource solrCfgIs = new InputSource(
				new FileReader("solr/conf/solrconfig.xml"));
		SolrConfig solrConfig = new SolrConfig(null, solrCfgIs);

		InputSource solrSchemaIs = new InputSource(
				new FileReader("solr/conf/schema.xml"));
		IndexSchema solrSchema = new IndexSchema(solrConfig, null,
				solrSchemaIs);

		// Dump all analyzer definitions in the schema...
		Map<String, FieldType> fieldTypes = solrSchema.getFieldTypes();
		for (Entry<String, FieldType> entry : fieldTypes.entrySet()) {
			FieldType fldType = entry.getValue();
			Analyzer analyzer = fldType.getAnalyzer();
			System.out.println(entry.getKey() + ":" + analyzer.toString());
		}

		String inputText = "Proof of the pudding lies in its eating";

		// Name of the field type in your schema.xml. ex: "text_en"
		FieldType fieldTypeText = fieldTypes.get("text_en");

		// Run the text through the index-time analyzer of the field type
		System.out.println("Indexing analysis:");
		Analyzer analyzer = fieldTypeText.getAnalyzer();
		TokenStream tokenStream = analyzer.tokenStream("dummyfield",
				new StringReader(inputText));
		CharTermAttribute termAttr = (CharTermAttribute) tokenStream.getAttribute(CharTermAttribute.class);
		while (tokenStream.incrementToken()) {
			System.out.println(termAttr.toString());
		}

		// Run the same text through the query-time analyzer of the field type
		System.out.println("\n\nQuerying analysis:");
		Analyzer qryAnalyzer = fieldTypeText.getQueryAnalyzer();
		TokenStream qrytokenStream = qryAnalyzer.tokenStream("dummyfield",
				new StringReader(inputText));
		CharTermAttribute termAttr2 = (CharTermAttribute) qrytokenStream.getAttribute(CharTermAttribute.class);
		while (qrytokenStream.incrementToken()) {
			System.out.println(termAttr2.toString());
		}

	} catch (Exception e) {
		e.printStackTrace();
	}
}
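
The same schema-level analysis can also be folded into an automated functional test. Below is a minimal JUnit 4 sketch against the Solr 3.3.x API. The solr/conf paths and the text_en field type are carried over from the snippet above, and the SchemaAnalysisTest class name is invented for illustration; the assertion that index-time and query-time analysis agree is only one possible check, so replace it with whatever behaviour your schema is actually meant to guarantee.

import static org.junit.Assert.assertEquals;

import java.io.FileReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.solr.core.SolrConfig;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.IndexSchema;
import org.junit.Test;
import org.xml.sax.InputSource;

public class SchemaAnalysisTest {

	@Test
	public void indexAndQueryAnalysisAgree() throws Exception {
		// Load the config and schema exactly as in the snippet above
		SolrConfig solrConfig = new SolrConfig(null,
				new InputSource(new FileReader("solr/conf/solrconfig.xml")));
		IndexSchema schema = new IndexSchema(solrConfig, null,
				new InputSource(new FileReader("solr/conf/schema.xml")));

		FieldType fieldType = schema.getFieldTypes().get("text_en");
		String input = "Proof of the pudding lies in its eating";

		// One possible check: for plain prose, the index-time and query-time
		// chains of this field type are expected to emit the same tokens
		assertEquals(tokens(fieldType.getAnalyzer(), input),
				tokens(fieldType.getQueryAnalyzer(), input));
	}

	// Runs the input through the given analyzer and collects the emitted terms
	private List<String> tokens(Analyzer analyzer, String input) throws Exception {
		TokenStream ts = analyzer.tokenStream("dummyfield", new StringReader(input));
		CharTermAttribute termAttr = ts.getAttribute(CharTermAttribute.class);
		List<String> result = new ArrayList<String>();
		while (ts.incrementToken()) {
			result.add(termAttr.toString());
		}
		return result;
	}
}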

Dependencies

These snippets require the following jars from the Solr distribution:

  • apache-solr-core-*.jar
  • apache-solr-solrj-*.jar
  • lucene-analyzers-*.jar
  • lucene-core-*.jar
  • lucene-snowball-*.jar
  • lucene-spatial-*.jar (only for v3.3)
  • commons-io-*.jar (only for v3.3)
  • slf4j-api-*.jar
  • slf4j-jdk14-*.jar

