
Solr search data modelling

November 28th, 2010

Overview

Searchable entities of an application need to be modelled as Solr documents and fields for them to be searchable by Solr.

The schema.xml in /solr/conf is where the application search model should be defined.

The <types> element defines the set of field types available in the model.

The <fields> element defines the set of fields of each document in the model. Each field has a type which is defined in the <types> element.
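
As a rough sketch of how these two elements sit inside schema.xml (the schema name, the version attribute value and the single type and field shown are illustrative assumptions, not a complete schema):

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Minimal skeleton; "example" and version="1.2" are assumed values -->
<schema name="example" version="1.2">
  <types>
    <!-- field type definitions go here -->
    <fieldType name="string" class="solr.StrField"/>
  </types>
  <fields>
    <!-- each field refers to a type defined above by its name -->
    <field name="id" type="string" indexed="true" stored="true"/>
  </fields>
</schema>
```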

 

<types> section

This section describes the types for all fields in the model. It contains <fieldType> elements. Each <fieldType> has these attributes:

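  • name is the name of the field type definition and is referenced from the <fields> section
  • class is the subclass of org.apache.solr.schema.FieldType that models this field type definition.  Class names starting with "solr." are a shorthand that Solr resolves against its own packages; for field types this is org.apache.solr.schema (so solr.StrField is org.apache.solr.schema.StrField), while analysis components such as tokenizers and filters live in org.apache.solr.analysis.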
  • sortMissingLast and sortMissingFirst
    The optional sortMissingLast and sortMissingFirst attributes are currently supported on types that are sorted internally as strings. These include "string", "boolean", "sint", "slong", "sfloat", "sdouble" and "pdate".
    - If sortMissingLast="true", then a sort on this field will cause documents without the field to come after documents with the field, regardless of the requested sort order (asc or desc).
    - If sortMissingFirst="true", then a sort on this field will cause documents without the field to come before documents with the field, regardless of the requested sort order.
    - If sortMissingLast="false" and sortMissingFirst="false" (the default), then default lucene sorting will be used which places docs without the field first in an ascending sort and last in a descending sort.

  • omitNorms is set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.
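
The attributes above can be illustrated with a pair of simple definitions. This is a sketch modelled on the kind of definitions found in the Solr example schema; the attribute values are one reasonable choice, not the only one:

```xml
<!-- Inside <types>: documents lacking a value sort after those that have one,
     and norms are omitted because these are not full-text fields -->
<fieldType name="string"  class="solr.StrField"  sortMissingLast="true" omitNorms="true"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
```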

Each field type definition has an associated Analyzer to tokenize and filter characters or tokens.
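
For instance, a text type whose analyzer only splits on whitespace could be sketched as follows. The type name text_ws and the positionIncrementGap value are assumptions made for illustration:

```xml
<!-- Inside <types>: a TextField with a very simple analyzer -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- splits the input on whitespace; no filters are applied -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Filters, when needed, are listed as <filter> elements after the tokenizer and are applied in the order given.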

The Trie field types are suitable for numeric fields that are used in numeric range queries. They index each value at several levels of precision, so a range query has to match fewer terms and runs faster.

 

Basic field types

  • string: Fields of this type are not analyzed (i.e., not tokenized or filtered), but are indexed and stored verbatim.
  • binary: For binary data. Values should be sent and retrieved as Base64-encoded strings.
  • int/tint/pint, long/tlong/plong, float/tfloat/pfloat, double/tdouble/pdouble: The regular types (int, float, etc.) and their t- versions differ in their precisionStep values. The precisionStep value is used to index each value at different precision levels, to support numeric range queries. Both sets are modelled by Trie field types, but the t- versions have a precisionStep of 8 while the regular types have 0. So numeric range queries will be faster with the t- versions, but their indexes will be larger (and probably slower). The p- versions are for when numeric range queries are not needed at all. They are modelled by non-Trie types.
  • date/tdate/pdate: The differences are similar to those between the numeric field types above. Use tdate for date ranges and date faceting. Dates have to be in ISO 8601 format in the UTC timezone, like this example: 2011-02-06T05:34:00.299Z. Use org.apache.solr.common.util.DateUtil.getThreadLocalDateFormat().format(new Date()) to get a date in this format.
  • sint/slong/sfloat/sdouble: Sortable fields.
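
The three flavours of a numeric type can be sketched as below for integers, using the precisionStep values described above. The extra attributes (omitNorms, positionIncrementGap) follow the style of the Solr example schema and should be treated as assumptions:

```xml
<!-- Inside <types> -->
<!-- regular: Trie-based, precisionStep 0 (no extra precision levels indexed) -->
<fieldType name="int"  class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<!-- t- version: Trie-based, precisionStep 8 for faster range queries -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<!-- p- version: plain, non-Trie type for when range queries are not needed -->
<fieldType name="pint" class="solr.IntField" omitNorms="true"/>
```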

Text field types

Since Solr is a full-text search solution, the text field types and their configuration are the most critical part of the modelling. Modelling of text fields is explained in detail in the article Solr text field types, analyzers, tokenizers & filters explained.

 

<fields> section

Fields of documents are described in this section using <field> elements.

Each <field> element can have these attributes:

  • name (mandatory): the name of the field. This is critical information, because it is used in search queries and as facet fields.
  • type (mandatory): the name of a previously defined type from the <types> section.
  • indexed: true if this field should be indexed (that is, if it should be searchable or sortable).
  • stored: true if this field value should be retrievable verbatim in search results.
  • compressed [false]: true if this field should be stored using gzip compression (this only applies if the field type is compressible; among the standard field types, only TextField and StrField are). This is very useful for large data fields, but will probably slow down search results, so it should not be used for fields that are queried frequently.
  • multiValued: true if this field may contain multiple values per document.
  • omitNorms (expert): set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.
  • termVectors [false]: set to true to store the term vector for a given field. When using MoreLikeThis, fields used for similarity should be stored for best performance.
  • termPositions: store position information with the term vector. This will increase storage costs.
  • termOffsets: store offset information with the term vector. This will increase storage costs.
  • default: a value that should be used if no value is specified when adding a document.
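
A few field definitions using these attributes might look like the sketch below. The field names (id, title, description, tags, price, status) are hypothetical, and the types string, text and tint are assumed to be defined in the <types> section:

```xml
<!-- Inside <fields>; all names here are illustrative -->
<field name="id"          type="string" indexed="true" stored="true"/>
<field name="title"       type="text"   indexed="true" stored="true"/>
<!-- term vectors stored so the field can be used for MoreLikeThis -->
<field name="description" type="text"   indexed="true" stored="true" termVectors="true"/>
<!-- several values per document; norms not needed for exact-match tags -->
<field name="tags"        type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="price"       type="tint"   indexed="true" stored="true"/>
<!-- value used when a document is added without this field -->
<field name="status"      type="string" indexed="true" stored="true" default="active"/>
```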

The example deployment itself defines many commonly used fields and types; study them and check if something needed is already available before modelling your own.

<dynamicField> elements can be used to model field names which are not explicitly defined by name, but which match some defined pattern.
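
As an illustration, the following sketch maps any field whose name ends in a given suffix to a type. The suffix patterns are a common convention rather than a requirement, and the types string and tint are assumed to exist in <types>:

```xml
<!-- Inside <fields>: a document field named "color_s" or "stock_i"
     is accepted without being declared explicitly -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_i" type="tint"   indexed="true" stored="true"/>
```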

<copyField> definitions copy the value of one field into another at the time a document is added to the index.  This is used either to index the same field differently, or to collect multiple fields into a single field for easier/faster searching. For example, all text fields in the document can be copied to a single catch-all field, for faster querying.
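
A catch-all field of this kind could be sketched as follows. The field names title, description and text are hypothetical, and a type named text is assumed to be defined in <types>:

```xml
<!-- Inside <fields>: indexed but not stored, since it only exists for searching;
     multiValued because it receives values from several source fields -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

<!-- Directly under <schema>: copy the source fields into the catch-all field -->
<copyField source="title"       dest="text"/>
<copyField source="description" dest="text"/>
```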

<uniqueKey> element specifies the field to be used to determine and enforce document uniqueness.

<defaultSearchField> element specifies the field to be queried when no field is explicitly specified in the query string using the “field:value” syntax. The catch-all copyField destination is usually specified as the default search field.

<solrQueryParser> specifies query parser configuration. Its defaultOperator="AND|OR" attribute specifies whether the clauses of a query are combined using the AND operator or the OR operator by default.
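
Put together, these last three elements could look like the sketch below, placed directly under <schema> after the <fields> section. The field names id and text are assumptions for illustration:

```xml
<!-- documents with the same "id" value replace each other -->
<uniqueKey>id</uniqueKey>

<!-- a query like "solr" without a field prefix searches the "text" field -->
<defaultSearchField>text</defaultSearchField>

<!-- "solr search" is treated as "solr OR search" -->
<solrQueryParser defaultOperator="OR"/>
```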


