
Content Extraction in Solr

November 28th, 2010

Overview

The example solrconfig.xml is already configured for content extraction from any document format that Apache Tika can handle, such as MS Word DOC and PDF.

Content extraction requires libraries found in the /contrib/extraction directory. These include Solr Cell, Apache Tika and Apache POI libraries.

The ExtractingRequestHandler configuration in solrconfig.xml specifies the endpoint at which documents can be submitted for extraction. It’s usually http://localhost:8983/solr/update/extract.
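For orientation, the handler declaration looks roughly like the fragment below. This is a sketch modeled on the example configuration shipped with Solr 1.4; the class name, the startup attribute and the default parameters are assumptions that may differ in your version, so check your own solrconfig.xml before copying anything.

```xml
<!-- solrconfig.xml: registers the extraction endpoint at /update/extract.
     Attribute values below are assumed from the Solr 1.4 example config. -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- Field that receives the extracted body text -->
    <str name="fmap.content">text</str>
    <!-- Lower-case the metadata field names produced by Tika -->
    <str name="lowernames">true</str>
    <!-- Prefix applied to fields that are not defined in schema.xml -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

The name attribute is what determines the URL path, so a handler named /update/extract is reached at http://localhost:8983/solr/update/extract on a default installation.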

 

Howto

  • To index a document, send a request such as

curl "http://localhost:8983/solr/update/extract?literal.id=book1&commit=true" -F myfile=@book.pdf

The request is sent with multipart/form-data encoding.

  • By default, the extracted content is added to the document field “text”. To change this, edit the extracting handler’s <requestHandler> element in /solr/conf/solrconfig.xml; its “fmap.content” child element specifies the field under which the content is indexed:

<str name="fmap.content">text</str>

Since “text” is NOT a stored field, features like result highlighting won’t be available.

If result highlighting is required, modify /solr/conf/schema.xml to add a new *stored* field, called “doc_content” here, and point the extracting handler’s “fmap.content” setting at it so that it receives the document contents. “doc_content” can in turn be copied into the “text” catch-all field so that all queries are still matched against document contents.
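The changes described above might look like the following. This is a minimal sketch, not a tested configuration: the field type “text” is an assumption taken from the Solr 1.4 example schema (newer example schemas use other type names), and “doc_content” is simply the field name used in this post.

```xml
<!-- schema.xml: stored field that receives the extracted document text.
     type="text" is assumed to exist in your schema. -->
<field name="doc_content" type="text" indexed="true" stored="true"/>

<!-- schema.xml: copy the contents into the catch-all field so that
     queries against "text" still match the document body. -->
<copyField source="doc_content" dest="text"/>

<!-- solrconfig.xml, inside the extracting handler's defaults list:
     send the extracted content to the new field instead of "text". -->
<str name="fmap.content">doc_content</str>
```

After re-indexing, highlighting can then be requested on the stored field, for example by adding hl=true&hl.fl=doc_content to a query.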

 

Restrictions of default content extraction

  • Since the extracting handler can specify only a single content field, the contents of multiple files sent in one request all go into the same field. This is a problem if the user has to be shown which file contains the search string.
  • There is no out-of-the-box workaround for this in Solr. A specialized extracting handler has to be written that maps each file (“content stream” in Solr terminology) in the multipart request to a separate content field.

