December 28th, 2011


When I was deploying my website, I ran into a slow page load problem. One of the pages had 9 non-interlaced screenshot PNG files, each about 700 x 500 pixels and between 40 KB and 350 KB in file size.

I wondered if deploying these images on Amazon CloudFront would improve response times. Amazon CloudFront is the Content Delivery Network (CDN) offering from Amazon and is one of the cloud services that constitute Amazon Web Services (AWS). 

CDNs are supposed to improve response times by replicating resources across multiple servers around the world, and serving a requested resource from the server closest to the requesting client. The implicit assumption is that the root cause of latency is geographical distance (the greater the distance, the more routers involved in between), so serving files from a server that is physically closer should reduce latency.

Since my site was already hosted on Amazon’s EC2, it made sense to try their CloudFront CDN, rather than some other vendor’s CDN. Though this was not a performance critical page, it did provide the opportunity to experiment with CloudFront for a realistic scenario, and the knowledge gained may prove useful in future. So I started experimenting…



I decided to use Amazon S3 as the origin server for Cloudfront (the origin server is the server from which Cloudfront picks up the resources to replicate). I opted for the "reduced redundancy storage" setting instead of "standard redundancy" for the S3 bucket, to minimize costs (and also because these images are already available on my development machine and web server; standard redundancy makes more sense for user content or critical backups).


Evaluation criteria

Better response times would be great.

Even if there were no improvement in response times, a CDN would still reduce the load on my rather underpowered EC2 micro instance web server, and spare me some connections for more dynamic content, like my SaaS products. So I was already somewhat biased towards using Cloudfront or some other file server before evaluating them.

But CloudFront, like other AWS services, is a metered service. So the evaluation also needed to keep costs in mind.


Performance measurements

For response time measurements, I decided to use different tools to get a complete picture:

  • The first set of measurements was taken using browsers. All 3 major browsers – Chrome, Firefox and IE – provide excellent profiling tools for developers.
  • However, browser measurements are not enough. The system should also be tested for scalability. What happens to response times when there are dozens of concurrent connections requesting the page? Can the page be rendered for all those users without much increase in response times? With a single web server on an underpowered machine, this is clearly not possible. But putting a CDN in the mix should shift at least some of the load from my puny single unscalable web server to Amazon’s scalable mammoth delivery network. I used the Apache JMeter and Apache Bench (ab) tools to load the server.

Browser measurements


Chrome’s developer tools network tab, Firefox Firebug network tab, and IE’s developer tools Network tab provide profiling information.

Chrome and Firefox (via Firebug and Firebug NetExport plugin) can export profiling data to JSON format files called .har files.

IE exports to XML files which have a similar schema to the JSON .har files but expressed in XML.


Each browser was tested 5 times with a complete cache cleanup in between. The cache cleanup ensured that all images were downloaded in each test. However, cache cleanup does not clear the browsers’ DNS caches, which means DNS lookup timings are usually manifested only in the first test.


A python script was used to parse these files, calculate averages and produce the below HTML table of averages.
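For illustration (this is not the original script, and it assumes the jq tool is available), the per-entry times in a .har file can be averaged like this, using a toy two-entry file:

```shell
# Create a toy .har file with two entries (times in milliseconds),
# then average the per-entry load times with jq.
cat > sample.har <<'EOF'
{"log":{"entries":[{"time":120},{"time":80}]}}
EOF
jq '[.log.entries[].time] | add / length' sample.har
```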



Legend to the table:

1st column => the image file name

"OwnServer" => tests in which images were downloaded from my Apache web server running on EC2 and EBS

"Cloudfront" => tests in which images were downloaded from Cloudfront distribution with S3 as origin server

T => Total time for request and response (including the blocked, DNS lookup, connect, send, wait and receive phases)

R => Total time for just receiving all the data

W => Time spent waiting before response started

All figures are in milliseconds



Analysis of browser results

The metrics to pay attention to here are R (the average receive times) and W (the wait times).

I didn’t pay much attention to T (the average total times) because I felt they are misleading. The problem is that browsers download embedded resources like <img>s using a small number of connections. When there are more resources than there are connections, the extra resources are blocked until some connections are freed. These blocked times manifest in the T values, but they are not deterministic and are also not similar across browsers since connection implementations differ. Hence, total times should be ignored in my opinion.

What can we observe from the R(eceive) and W(ait) times?

  • Chrome: For 5 out of 9 images, R(eceive) times from Cloudfront are less than from my own server. For the other 4 images, receive times from Cloudfront are slightly higher. So it’s almost a tie. However, W(ait) times are consistently less for Cloudfront. So, Cloudfront leads.
  • Firefox: For 6 out of 9 images, R(eceive) times from Cloudfront are lower. W(ait) times are also consistently less, except in one case, which seems to be an anomaly. Cloudfront leads again.
  • IE: For 6 out of 9 images, R(eceive) times from Cloudfront are lower. W(ait) times are also consistently less, except in 2 cases, which seem to be anomalies. Cloudfront leads again.

Browser Conclusions

Cloudfront does make the site faster, but not as consistently or drastically as expected, at least in my tests (I’m in India and my nearest edge locations seem to be Singapore or Hong Kong).

One possible factor may be that resources need to get lots of hits for Cloudfront to cache and serve them more effectively. I’m not sure about this, but the Cloudfront documentation does seem to hint that more popular resources will benefit more.


Load measurements using apache bench (ab)


ab is incapable of downloading a web page and all its embedded resources. So I ran ab requests on just one of the image files – the biggest one at 350KB.

I set different values for -n and -c options. -k was enabled to simulate browser behaviour by keeping connections alive.
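A typical invocation looked like the following (the URL is a placeholder, not the actual image location):

```shell
# -n: total requests, -c: concurrency, -k: HTTP keep-alive (browser-like)
ab -n 50 -c 5 -k "http://www.example.com/images/big-screenshot.png"
```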




Own server

  Run                                 Total time    Mean time/request
  50 requests, 1 user                               5.85 s
  50 requests, 5 concurrent users     211.97 s      4.24 s
  50 requests, 10 concurrent users    217.34 s      4.35 s
  50 requests, 25 concurrent users
  80 requests, 40 concurrent users
  100 requests, 50 concurrent users

(For each run, the max time taken by 90% of requests, the max time taken by 50% of requests, and the data transferred were also recorded.)


Analysis of ab results

Results are so all over the place that I found it difficult to draw any conclusion!

The 50th percentile results in some tests clearly favour Cloudfront, but not consistently.


I also found it hard to understand some of the raw values (not shown here). For example, in the last test with 100 requests across 50 concurrent users, total time was 477.1s but the longest request was 454s! How that can be is beyond me. I’m guessing that a request sent fairly early never got a response. It’s possible that this was because load was too much for my puny 512 kbps bandwidth.

Another thing to notice is that data volume with Cloudfront is at least 25% higher at higher loads. I’m guessing that this is because of TCP retransmissions, though why it appears only when communicating with Cloudfront is not clear.



I’m reluctant to draw any concrete conclusion from ab results except that 50% of requests seem to be faster most of the time when using Cloudfront.


Load measurements using Apache JMeter


JMeter was used to test the following loads:

  • 50 total requests with 1 user. Retrieve embedded resources using a pool of 9 threads (9 because the page had 9 images).
  • 50 total requests across 5 concurrent users. Retrieve embedded resources using pools of 5 threads each (only 5 because JMeter creates a separate pool for each virtual user, so 5 users x 5 threads = 25 threads would be created; I was afraid that higher pool sizes might make bandwidth contention a factor in the timings).
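As an aside, the same plan can also be run without the GUI, which gives cleaner timings on a loaded machine (the .jmx file name below is a placeholder):

```shell
# -n: non-GUI mode, -t: test plan to run, -l: file to log sample results to
jmeter -n -t image-page-test.jmx -l results.jtl
```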






50 total requests, 1 user, 9 downloading threads

50 total requests, 5 concurrent users, 5 downloading threads per user

  Total time: overall 77 s, but only 48 s if 7 anomalous measurements are removed. Own server never finished all 50 requests – probably socket timeouts.

  90% of requests: actually 232.6 s, but 34 out of 43 (80%) were within 60 s.


Analysis of results

When simulating a single user, using Cloudfront didn’t show any major improvement in speed.

But when simulating 5 concurrent users with 5 resource downloading threads per user, I saw interesting results. 7 results timed out with extremely high times like 270 seconds. These I put down as anomalies, possibly because I was overloading my bandwidth.

With those anomalies excluded, the average time per request was just 48 seconds when using Cloudfront, compared to 75 seconds when not. Also, 80% of the remaining requests completed within 60 seconds when using Cloudfront, compared to 110.5 seconds when not.



So load testing with JMeter shows that Cloudfront is better at higher loads.


Measurements using

The service provides automated testing for websites, from client locations around the world.

5 tests were conducted from each location and each method of serving images.



Its results come out as follows:

            Own server    Cloudfront
New York    8.772 s       8.911 s
London      8.791 s       8.703 s



Doesn’t look like Cloudfront has improved page speeds.


Cost analysis

If the choice is between storing content on an EC2 EBS drive and serving it from EC2 web server, vs. storing it in S3 and serving it via Cloudfront, the following cost components are relevant (as of Dec 2011):

Assume the content being stored is ‘B’ GB in size (for simplicity, I’ll assume just 1 file of ‘B’ GB).

Assume 1 user requests this file every second of every day, which comes to 86,400 requests/day or 2,592,000 requests/month.

Via EBS and EC2:

  EBS storage: $0.10 per GB = $0.10B
  EBS I/O requests: $0.10 per 1 million requests x 2.592 million requests = $0.2592
  Data transfer through elastic IP: $0.01B
  Total: $0.11B + $0.2592. If that file is 1 GB in size, this comes to $0.37.

Via S3 and Cloudfront:

  S3 reduced redundancy storage: $0.093B (ignoring S3 I/O request costs by assuming this file will be stored just once, and then always served via Cloudfront)
  Cloudfront data transfer: $0.19B. But as we have seen with the ab tests, at higher loads more data is transferred due to TCP retransmissions; assuming 20% extra data is transferred, this comes to $0.228B.
  Cloudfront HTTP requests: $0.009 per 10,000 requests x 2,592,000/10,000 = $2.3328
  Total: $2.3328 + $0.321B. If that file is 1 GB in size, this comes to $2.65.

So cost-wise too, Cloudfront comes out costlier than serving off EBS and EC2. It’s really the HTTP request costs that tilt the choice away from Cloudfront.



Final conclusion

In my case, my website is not a high traffic site. I also didn’t observe any drastic improvement in page speeds, except possibly at high loads (shown by the JMeter results). And cost wise, it’s indeed cheaper to stick with EBS and EC2.

So, should I use Cloudfront or not? I think it’s not needed for my site at the moment.

Categories: AWS
December 27th, 2011

JMeter is commonly used to stress test webpages by simulating multiple users concurrently visiting a webpage URL. However, for this simulation to be accurate, JMeter needs to be configured correctly so that it behaves like a browser.

In this article, I explain what settings to configure, to make JMeter simulate browser requests fairly accurately.


Before configuring JMeter correctly, let’s understand how browsers work:

  • When a user enters a webpage URL in a browser, the browser connects to the server, starts downloading the page, and starts parsing it.
  • As it’s parsing, it’ll encounter embedded URLs like javascript, CSS and image files.
  • The browser then creates more threads, each of which opens a new connection and fetches one of these embedded URLs. Most browsers use a limited number of connections per server (6 in the case of Firefox at the time of writing) and cap the total number of downloading threads (48 in the case of Firefox at the time of writing).
  • The page is considered loaded when all these embedded URLs have been fetched.

JMeter can simulate this behaviour if the following 2 settings are configured:

  • Retrieve All Embedded Resources from HTML Files


    This checkbox is found near the bottom of HTTP Request Defaults config elements and HTTP Request samplers.

    Check the checkbox to make JMeter download embedded resources like javascript, CSS and images, just as a browser would.

    Add a View Results in Tree listener element if you want to see which embedded resources are downloaded and their metrics. Note that View Results in Table bytes don’t include the embedded resources.

  • Use concurrent pool. Size=n


    The behaviour of this checkbox and pool size are as follows:

    Retrieve all embedded resources from HTML files Use concurrent pool Behaviour
    Checked Unchecked The main page and its embedded resources will be downloaded in the same thread.

    For example, if the Thread Group is simulating 3 users, JMeter creates 3 threads – one for each simulated user – named "Thread Group 1-1" to "Thread Group 1-3".

    Each of these threads will download all embedded resources sequentially in the context of their respective thread.

    If page P has resources A, B and C, JMeter will download them as follows:
    ~ThreadGroup1-1 : P, A, B, C (downloaded one after another)
    ~ThreadGroup1-2 : P, A, B, C (downloaded one after another)
    ~ThreadGroup1-3 : P, A, B, C (downloaded one after another)

    Checked Checked.
    Pool size=x
    As usual, JMeter creates threads named "Thread Group 1-k" to simulate users.

    In addition, for every one of these threads simulating a user, JMeter creates separate threadpools of size x with thread names like pool-n-thread-m.

    The main page is downloaded by the user’s thread "Thread Group 1-k" while the embedded resources are downloaded by its associated threadpool with thread names like pool-n-thread-m.


    So to simulate browsers, check the ‘Use concurrent pool’ checkbox and specify a reasonable pool size (4-8 seems typical for browsers).

    However, when setting the concurrent pool size, keep in mind the number of users being simulated, because a separate threadpool is created for each of these simulated users. If there are many users, too many threads may get created and start affecting the response times adversely due to bandwidth contention at the JMeter side. If many users are to be simulated, it’s recommended to distribute JMeter testing to multiple machines.

October 14th, 2011

This article explains steps involved in deploying Apache Solr search engine as a system service on the Jetty servlet container on Ubuntu OS. This article is based on information from the Solr Jetty wiki page and on troubleshooting experiences of others.


  • The target system should have at least Java 6 installed (in my case, the OpenJDK 6 JRE is installed)


1. In this description, /opt/solr will be the target directory where Solr will be deployed.


2. The /example directory in the solr package forms the basis of the installation on the target system. It contains multiple configurations, each suitable for a different use case:

/example-DIH : a multicore configuration with each core demonstrating a different data importing configuration

/multicore : a simple multicore installation

/solr : a basic single core configuration.

Copy the configuration suitable for your application into /example/solr (replacing the one already there if necessary) and discard the rest. A configuration typically consists of /conf and /data (and sometimes also /bin and /lib) sub directories.


Additionally, the /dist and /contrib package directories contain important jars required by some of these configurations:

/dist/apache-solr-dataimporthandler*.jars – if you require data importing capabilities.

/dist/apache-solr-cell-*.jars, /contrib/extraction/lib/*.jars – if you require content extraction from PDF, MS Office and other document files.

These jars should also be deployed on the target system.


3. Copy these files to the target system and create the directory structure suggested below under /opt/solr:

|-- dist - All required jars, including additional jars from /contrib
|-- etc - this should probably go into the root /etc directory, as per conventions
|   |-- jetty.xml
|   `-- webdefault.xml
|-- lib
|-- solr
|   |-- bin
|   |-- conf
|   |   |-- admin-extra.html
|   |   |--
|   |   |-- elevate.xml
|   |   |-- protwords.txt
|   |   |-- schema.xml
|   |   |-- scripts.conf
|   |   |-- solrconfig.xml
|   |   |-- stopwords.txt
|   |   |-- synonyms.txt
|   |   `-- xml-data-config.xml
|   |-- data
|-- start.jar
|-- webapps
|   `-- solr.war
`-- work


4. The solr process should run with its own dedicated credentials, so that authorizations can be administered at a fine granularity. So create a system user and group named ‘solr’.

$ sudo adduser --system solr
$ sudo addgroup solr
$ sudo adduser solr solr

5. Create a log directory /var/log/solr for solr and jetty logs.

6. Jetty outputs its errors to STDERR by default. Redirect it to a rolling log file by adding this section to /opt/solr/etc/jetty.xml.

    <!-- =========================================================== -->
    <!-- configure logging                                           -->
    <!-- =========================================================== -->
    <New id="ServerLog" class="java.io.PrintStream">
      <Arg>
        <New class="org.mortbay.util.RolloverFileOutputStream">
          <Arg><SystemProperty name="jetty.logs" default="/var/log/solr"/>/yyyy_mm_dd.stderrout.log</Arg>
          <Arg type="boolean">false</Arg>
          <Arg type="int">90</Arg>
          <Arg><Call class="java.util.TimeZone" name="getTimeZone"><Arg>GMT</Arg></Call></Arg>
          <Get id="ServerLogName" name="datedFilename"/>
        </New>
      </Arg>
    </New>

    <Call class="org.mortbay.log.Log" name="info"><Arg>Redirecting stderr/stdout to <Ref id="ServerLogName"/></Arg></Call>
    <Call class="java.lang.System" name="setErr"><Arg><Ref id="ServerLog"/></Arg></Call>
    <Call class="java.lang.System" name="setOut"><Arg><Ref id="ServerLog"/></Arg></Call>


7. Now we need to set file and directory permissions so that the solr process user can work correctly.

Use chown to make solr:solr as the owner and group.


$ sudo chown -R solr:solr /opt/solr
$ sudo chown -R solr:solr /var/log/solr

Use chmod to give write permissions to solr:solr for the following directories:
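Assuming the directory layout from step 3, the directories that Solr and Jetty write to at runtime are the index data, Jetty work and log directories; a sketch:

```shell
# Assumed layout from step 3: grant the solr user write access to the
# directories written to at runtime.
sudo chmod -R u+w /opt/solr/solr/data   # Lucene index files
sudo chmod -R u+w /opt/solr/work        # Jetty's servlet work directory
sudo chmod -R u+w /var/log/solr         # Jetty and Solr logs
```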





8. The basic installation should work now. Try by launching jetty as a regular process:


/opt/solr$ sudo java -Dsolr.solr.home=/opt/solr/solr -jar start.jar


This should start solr.

Verify that logs are getting generated under /var/log/solr.

Test it by sending a query to http://localhost:8983/solr/select?q=something using curl.
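In full, such a smoke test could look like this (the query term is arbitrary):

```shell
# Expect an XML response; numFound="0" is fine – it proves Solr is answering.
curl "http://localhost:8983/solr/select?q=something"
```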


9. Now we need to install solr as a system daemon so that it can start automatically. Download the Jetty startup script and save it as /etc/init.d/solr. Give it executable rights.

The following environment variables need to be set. They can either be inserted in this /etc/init.d/solr script itself, or they can be stored in /etc/default/jetty, which is read by the script.



JAVA_OPTIONS="-Xmx64m -Dsolr.solr.home=/opt/solr/solr"






Set the -Xmx parameters as per your requirements.
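For reference, a minimal /etc/default/jetty might look like the sketch below; the variable names are the ones the stock Jetty startup script reads, and the values assume the layout used in this article:

```shell
# /etc/default/jetty – sourced by the /etc/init.d/solr startup script
JETTY_HOME=/opt/solr                  # directory containing start.jar
JETTY_USER=solr                       # run as the dedicated solr user
JETTY_LOGS=/var/log/solr              # log directory created in step 5
JAVA_OPTIONS="-Xmx64m -Dsolr.solr.home=/opt/solr/solr"
```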


10. Additionally, this startup script has a problem that prevents it from running in Ubuntu. If you try running this right now using


$ sudo /etc/init.d/solr start


you’ll get a

Starting Jetty: FAILED



The problem – as explained well in this troubleshooting article – is in this line that attempts to start the daemon:


if start-stop-daemon -S -p"$JETTY_PID" $CH_USER -d"$JETTY_HOME" -b -m -a "$JAVA" -- "${RUN_ARGS[@]}" --daemon


In Ubuntu, --daemon is not a valid option for start-stop-daemon. Remove that option from the script:

if start-stop-daemon -S -p"$JETTY_PID" $CH_USER -d"$JETTY_HOME" -b -m -a "$JAVA" -- "${RUN_ARGS[@]}"


If you try starting it now, it should work:

$ sudo /etc/init.d/solr start


It should give a

Starting Jetty: OK

message, and ps -ef | grep java should show the "java -jar start.jar" process.


11. Finally, it’s time to configure this as an init script. Read this article if you want a background on Ubuntu runlevels and init scripts.

Insert these lines at the top of /etc/init.d/solr to make it a LSB (Linux Standard Base) compliant init script. Without these lines, it’s not possible to configure the run level scripts.


### BEGIN INIT INFO
# Provides:          solr
# Required-Start:    $local_fs $remote_fs $network
# Required-Stop:     $local_fs $remote_fs $network
# Should-Start:      $named
# Should-Stop:       $named
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start Solr.
# Description:       Start the solr search engine.
### END INIT INFO



Now run the following command:

$ sudo update-rc.d solr defaults
 Adding system startup for /etc/init.d/solr ...
   /etc/rc0.d/K20solr -> ../init.d/solr
   /etc/rc1.d/K20solr -> ../init.d/solr
   /etc/rc6.d/K20solr -> ../init.d/solr
   /etc/rc2.d/S20solr -> ../init.d/solr
   /etc/rc3.d/S20solr -> ../init.d/solr
   /etc/rc4.d/S20solr -> ../init.d/solr
   /etc/rc5.d/S20solr -> ../init.d/solr

As you can see, the run levels 2-5 (they are equivalent in Ubuntu) are now configured to start solr.

Categories: Search, Ubuntu
September 25th, 2011

Ubuntu has 2 different mechanisms for starting system services:

  • The traditional mechanism based on run levels, and scripts in /etc/init.d and /etc/rcn.d directories
  • A new mechanism known as upstart.

Some services are started using one mechanism and others using the other. If you want to control the services, it’s necessary to understand these mechanisms.

Run levels and init.d scripts – the traditional mechanism

Linux has the concept of run levels in all distros, as part of the Linux Standard Base specification. They can be considered to be “modes” in which Linux runs.

Run level   Name                              Description
0           Halt                              Shuts down the system
1           Single-user mode                  Mode for administrative tasks
2           Multi-user mode                   Does not configure network interfaces and does not export network services
3           Multi-user mode with networking   Starts the system normally
4           Not used / user definable         For special purposes
5           Multi-user mode with GUI          Run level 3 + display manager
6           Reboot                            Reboots the system
s or S      Single-user mode                  Does not configure network interfaces or start daemons

In Ubuntu (and Debian), run levels 2 to 5 are equivalent and configured with the same set of services.

Get Current run level

Use the runlevel command to get the current run level. runlevel is available in Ubuntu as well as Red Hat based distros like CentOS (not sure about other distros).

karthik@ubuntuLynx:~$ runlevel
N 2

/etc/init.d directory

The /etc/init.d directory contains scripts, which can start / stop / restart services. These are invoked with a start|stop argument at startup and shutdown.

/etc/rcn.d directories

The /etc/rcn.d directories specify which scripts in /etc/init.d are enabled for run level n.

For example, /etc/rc2.d specifies which scripts in /etc/init.d are enabled for run level 2. At startup and shutdown, only these enabled scripts are invoked.

Entries in /etc/rcn.d directories are symlinks to scripts in /etc/init.d, but with a special prefix of the form [S|K]nn (for example, S20solr):


S means the script is enabled for this run level.

K means the script is disabled for this run level.

nn is a sequence number that can be used to control the sequence of starting services, so that services which depend on other services are started only after those other services are started.

Below is a listing of /etc/rc2.d. It shows that tomcat6, dovecot and postfix are not automatically started in run level 2. However, they can be started manually.


Enabling and disabling run level services

Use the chkconfig --list command to get an overview of all services and their status. If it's not installed, install it using sudo apt-get install chkconfig. It gives a status listing like this:

karthik@ubuntukarmic:~$ chkconfig --list
acpi-support              0:off  1:off  2:on   3:on   4:on   5:on   6:off
acpid                     0:off  1:off  2:off  3:off  4:off  5:off  6:off
alsa-utils                0:off  1:off  2:off  3:off  4:off  5:off  6:off

Use the update-rc.d command to enable or disable a service at a run level:

Syntax: sudo update-rc.d name enable|disable runlevel

Example: sudo update-rc.d dovecot disable 2

To install the default start and stop links for a new script (start in run levels 2-5, stop in 0, 1 and 6):

sudo update-rc.d dovecot defaults


When creating new init scripts, ensure that the script has the following section (this is an example – change values appropriately) at the top to make it LSB (Linux Standard Base) compliant. Without this section, update-rc.d will complain with a “missing LSB information” warning and may not configure the run level scripts correctly.

### BEGIN INIT INFO
# Provides:          solr
# Required-Start:    $local_fs $remote_fs $network
# Required-Stop:     $local_fs $remote_fs $network
# Should-Start:      $named
# Should-Stop:       $named
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start Solr.
# Description:       Start the solr search engine.
### END INIT INFO



Upstart – the new mechanism

Upstart jobs are configured in the /etc/init directory, in .conf files.

Use the service command to start and stop upstart services:

sudo service <servicename> start|stop

For disabling an upstart service from starting up, open the respective /etc/init/[service].conf file and comment out the lines that begin with start on.


#start on (net-device-up
#          and local-filesystems
#         and runlevel [2345])


This will disable the service from starting at startup, but allow manual starts using service start command.

For completely disabling a service – both from automatic and manual starts – it’s better to uninstall the package, but it’s also possible to just rename the .conf file to .conf.disabled.


November 28th, 2010


The example solrconfig.xml is already configured for content extraction from any document format – MS Word DOC, PDF, and so on – that can be handled by Apache Tika.

Content extraction requires libraries found in the /contrib/extraction directory. These include Solr Cell, Apache Tika and Apache POI libraries.

The ExtractingRequestHandler configuration in solrconfig.xml specifies the endpoint at which documents can be submitted for extraction. It’s usually http://localhost:8983/solr/update/extract.



  • To index a document, send the request as

curl "http://localhost:8983/solr/update/extract" -F myfile=@book.pdf

The request goes as a multi-part form encoding.

  • By default, document contents are added into the document field “text”. The field can be changed in /solr/conf/solrconfig.xml in the extracting handler’s <requestHandler> element; it has a child element “fmap.content” that specifies which field content should be indexed under:
  • <str name="fmap.content">text</str>
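As an alternative to editing solrconfig.xml, the mapping can also be supplied per request via the fmap.content parameter; a sketch (the literal.id value and the doc_content field name are hypothetical):

```shell
# Extract book.pdf, mapping extracted content to the doc_content field.
# literal.id supplies the document's unique key; commit=true commits at once.
curl "http://localhost:8983/solr/update/extract?literal.id=book1&fmap.content=doc_content&commit=true" \
     -F "myfile=@book.pdf"
```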

Since “text” is NOT a stored field, features like result highlighting won’t be available.

If result highlighting is required, modify /solr/conf/schema.xml to include a new *stored* field called “doc_content” which receives document contents from the extracting handler. “doc_content” itself can be copied into the “text” catch-all field so that all queries can be matched against document contents.
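A sketch of what that could look like in schema.xml (the type name is an assumption; adjust to your schema):

```xml
<!-- Stored, so that highlighting can return snippets from document contents -->
<field name="doc_content" type="text" indexed="true" stored="true"/>

<!-- Make document contents searchable through the catch-all "text" field -->
<copyField source="doc_content" dest="text"/>
```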


Restrictions of default content extraction

  • Since the extracting handler can specify only a single content field, the contents of multiple files will all go into the same content field. This is a problem if the content file containing the search string has to be indicated to the user.
  • There is no out-of-the-box workaround for this in Solr. It’s necessary to write a specialized extracting handler to map each file (“content stream” in Solr terminology) in the multipart request to a separate content field.

November 28th, 2010


Searchable entities of an application need to be modelled as Solr documents and fields for them to be searchable by Solr.

The schema.xml in /solr/conf is where the application search model should be defined.

The <types> element defines the set of field types available in the model.

The <fields> element defines the set of fields of each document in the model. Each field has a type which is defined in the <types> element.


<types> section

This section describes types for all fields in the model. Contains <fieldType> elements. Each <fieldType> has these attributes:

  • name is the name of the field type definition and is referred from the <fields> section
  • class is the subclass of org.apache.solr.schema.FieldType that models this field type definition. Class names starting with "solr." refer to Java classes in standard Solr packages such as org.apache.solr.schema.
  • sortMissingLast and sortMissingFirst
    The optional sortMissingLast and sortMissingFirst attributes are currently supported on types that are sorted internally as strings. This includes "string","boolean","sint","slong","sfloat","sdouble","pdate"
    - If sortMissingLast="true", then a sort on this field will cause documents without the field to come after documents with the field, regardless of the requested sort order (asc or desc).
    - If sortMissingFirst="true", then a sort on this field will cause documents without the field to come before documents with the field, regardless of the requested sort order.
    - If sortMissingLast="false" and sortMissingFirst="false" (the default), then default lucene sorting will be used which places docs without the field first in an ascending sort and last in a descending sort.

  • omitNorms is set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.

Each field type definition has an associated Analyzer to tokenize and filter characters or tokens.

The Trie field types are suitable for numeric fields that involve numeric range queries. The trie concept makes searching such fields faster.


Basic field types

string Fields of this type are not analyzed (ie, not tokenized or filtered), but are indexed and stored verbatim.
binary For binary data. Should be sent/retrieved as Base64 encoded strings.
int/tint/pint, float/tfloat/pfloat, long/tlong/plong, double/tdouble/pdouble
The regular types (int, float, etc) and their t- versions differ in their precisionStep values. The precisionStep value is used to generate indexes at different precision levels, to support numeric range queries. Both sets are modelled by TrieField types, but the t- versions have a precisionStep of 8 while the regular types have 0. So numeric range queries will be faster with the t- versions, but indexes will be larger (and probably slower). The p- versions are for when numeric range queries are not needed at all. They are modelled by non-Trie types.
date/tdate/pdate Similar to the above differences in numeric fields. Use tdate for date ranges and date faceting. Dates have to be in a special UTC timezone format, like this example: 2011-02-06T05:34:00.299Z. Use org.apache.solr.common.util.DateUtil.getThreadLocalDateFormat().format(new Date()) to get a date in this format.
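Outside of Solr's DateUtil helper, the same wire format can be produced with plain SimpleDateFormat (a standalone sketch; the pattern and UTC timezone are inferred from the example date above):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateFormat {
    // Formats a date the way Solr expects: ISO-8601 in UTC with a trailing 'Z',
    // e.g. 2011-02-06T05:34:00.299Z
    public static String format(Date d) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(d);
    }

    public static void main(String[] args) {
        System.out.println(format(new Date(0L))); // the epoch, in Solr date format
    }
}
```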
Sortable fields

sint/slong/sfloat/sdouble The sortable numeric types store values in a string representation that sorts in numeric order, and so support the sortMissingFirst/sortMissingLast attributes described earlier. With the Trie types available, they are mainly relevant for legacy schemas.

Text field types Being a full text search solution, the text field types and their configuration become the most critical part of the modelling. Modelling of text fields is explained in detail in the article Solr text field types, analyzers, tokenizers & filters explained.


<fields> section

Fields of documents are described in this section using <field> elements.

Each <field> element can have these attributes:

name (mandatory) the name for the field. Very critical information, used in search queries, facet fields.
type (mandatory) the name of a previously defined type from the <types> section
indexed true if this field should be indexed (should be searchable or sortable)
stored true if this field value should be retrievable verbatim in search results.
compressed [false] if this field should be stored using gzip compression (this will only apply if the field type is compressable; among the standard field types, only TextField and StrField are). This is very useful for large data fields, but will probably slow down search results, so it should not be used for fields that are queried frequently.
multiValued true if this field may contain multiple values per document
omitNorms (expert) set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory).  Only full-text fields or fields that need an index-time boost need norms.
termVectors [false] set to true to store the term vector for a given field. When using MoreLikeThis, fields used for similarity should be stored for best performance.
termPositions Store position information with the term vector.  This will increase storage costs.
termOffsets Store offset information with the term vector. This will increase storage costs.
default a value that should be used if no value is specified when adding a document.

The example deployment itself defines many commonly used fields and types; study them and check if something needed is already available before modelling your own.

<dynamicField> elements can be used to model field names which are not explicitly defined by name, but which match some defined pattern.

<copyField> definitions specify that one field be copied to another at the time a document is added to the index. This is used either to index the same field differently, or to combine multiple fields into one field for easier/faster searching. For example, all text fields in the document can be copied to a single catch-all field, for faster querying.

<uniqueKey> element specifies the field to be used to determine and enforce document uniqueness.

<defaultSearchField> element specifies the field to be queried when it’s not explicitly specified in the query string using a “field:value” syntax. The catch-all copyfield is usually specified as the default search field.
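A hypothetical fragment tying these elements together (the field names here are illustrative, not from the example schema): every searchable field is copied into a catch-all "text" field, which also serves as the default search field.

```xml
<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>
<copyField source="name" dest="text"/>
<uniqueKey>id</uniqueKey>
<defaultSearchField>text</defaultSearchField>
```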

<solrQueryParser> specifies query parser configuration. defaultOperator="AND|OR" specifies whether queries are combined using the AND operator or the OR operator.

November 26th, 2010


Faceted searching – also called drilldown searching – refers to incrementally refining search results by different criteria at each level. Popular e-shopping sites like Amazon and eBay provide this in their search pages.

Solr has excellent support for faceting. The sections below describe how to use faceting in java applications, using the solrj client API.



Step 1 : Do the first level search and get first level facets

SolrQuery qry = new SolrQuery(strQuery);
// Enable faceting and declare the facet fields for this level
qry.setFacet(true);
String[] fetchFacetFields = new String[]{"categories"};
for (String ff : fetchFacetFields) {
    qry.addFacetField(ff);
}
QueryRequest qryReq = new QueryRequest(qry);

QueryResponse resp = qryReq.process(solrServer);

SolrDocumentList results = resp.getResults();
int count = results.size();
System.out.println(count + " hits");
for (int i = 0; i < count; i++) {
    SolrDocument hitDoc = results.get(i);
    System.out.println("#" + (i+1) + ":" + hitDoc.getFieldValue("name"));
    for (Iterator<Entry<String, Object>> flditer = hitDoc.iterator(); flditer.hasNext();) {
        Entry<String, Object> entry = flditer.next();
        System.out.println(entry.getKey() + ": " + entry.getValue());
    }
}

List<FacetField> facetFields = resp.getFacetFields();
for (FacetField facetField : facetFields) {
    List<Count> facetInfo = facetField.getValues();
    for (FacetField.Count facetInstance : facetInfo) {
        System.out.println(facetInstance.getName() + " : " + facetInstance.getCount()
                + " [drilldown qry:" + facetInstance.getAsFilterQuery() + "]");
    }
}


The response will contain the number of hits for each instance of the facet.

For example, if the field categories has values movies and songs in the set of matching hits, then each of them is called a facet instance. 

Each facet instance of a FacetField has a name (“songs”), and each has an associated facet instance count and a filter query.

Facet instance count of 10 for “categories:songs” means in the set of all search results, 10 results have the value of categories as songs.

Facet instance filter query is the subquery to go down to the next level of drilldown search, by filtering on the facet instance value.

At this point, a typical drilldown search user interface would display in its left sidebar those facet instances that have a nonzero instance count, each with a checkbox and its respective count. The user can then select the most promising facet to drill down along, and check its checkbox.


Step 2: Add facet filter query for next level of refined results

Add the filter query of facet instance to the main query, using addFilterQuery.

The filter query for a single facet instance is of the format "<field>:<value>". Example: addFilterQuery("categories:movies");

// filterQueries is a String[] of facet filter queries obtained using getAsFilterQuery() from the previous search
SolrQuery qry = new SolrQuery(strQuery);
if (filterQueries != null) {
    for (String fq : filterQueries) {
        qry.addFilterQuery(fq);
    }
}
QueryRequest qryReq = new QueryRequest(qry);
QueryResponse resp = qryReq.process(solrServer);

For subsequent levels of refinement, add facet instance filter queries to the current level’s main query, and add the list of facet fields required for the next level.


Facet filter query syntax

The facet filter queries have some rather intricate syntaxes for achieving various search behaviours, which are described below.


Selecting multiple facets

In some drilldown search designs, a user is allowed to specify multiple facet instances for the same field. For example, a categories field may have multiple category facet instances. In such cases, the facet instances should be combined using an OR operator.

Categories [ ]

  Movies (300) [ ]

  Songs (400) [ ]

  Ads (150) [ ]


If the user selects "Movies" and "Songs", the filter query should have the semantics of an OR operator:

"..where category=movies OR category=songs".

This can be specified in solr filter queries by enclosing the facet instances inside parentheses:

<fqfield>:(value1 value2 value3…)


In a command line URL:

http://localhost:8983/solr/select?q=<query>&fq=categories%3A%28songs+movies%29

where %3A is character ':', %28 is character '(' and %29 is character ')'

Or, equivalently in Java:

qry.addFilterQuery("categories:(songs movies)");
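The encoded form used in the URL can be reproduced with java.net.URLEncoder, which is a quick way to check the escaping (a standalone sketch; the class and method names are illustrative):

```java
public class FacetFqEncoding {
    // URL-encodes a filter query value: ':' becomes %3A, '(' becomes %28,
    // ')' becomes %29, and spaces become '+'
    static String encode(String fq) {
        try {
            return java.net.URLEncoder.encode(fq, "UTF-8");
        } catch (java.io.UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("categories:(songs movies)"));
        // prints categories%3A%28songs+movies%29
    }
}
```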

Whitespaces in facet instances

If facet instances have whitespaces within them, then multiple facet instances should be specified simply by enclosing them in double quotes (%22).

For example, for a facet field "crn" with facet instances “M.Tech. Computer Sc. & Engg.” and “ELECTRICAL ENGINEERING” (note the whitespaces), the syntax:

In URLs:

http://localhost:8983/solr/select?q=<query>&fq=crn%3A%28%22M.Tech.+Computer+Sc.+%26+Engg.%22+%22ELECTRICAL+ENGINEERING%22%29

In Java:

qry.addFilterQuery("crn:("M.Tech. Computer Sc. & Engg." "ELECTRICAL ENGINEERING")");



Handling large number of facet values using pagination

Solr provides pagination for facet values and automatically imposes a limit on the number of values returned for each facet field. This limit can be set using the facet.limit query parameter, or setFacetLimit() API, and the facet value offset can be set using facet.offset query parameter.

However, there is no direct API like setFacetOffset() in SolrJ…instead, use

solrQry.add(FacetParams.FACET_OFFSET, "100");



Facet Query vs Filter Query of facet

The Solr API also contains methods that refer to "facet queries". It’s important not to confuse facet queries with filter queries of facets. At first glance, it looks like the facet query concept is what provides the drilldown possibility. But that is not so.

Facet query is a kind of dynamic facet field, applicable only to certain use cases where it makes sense to categorize items in ranges – either numerical or date ranges .

For example, if items have to be categorized into price ranges like [$100-$200], [$200-$300] etc, then facet queries have to be used to “get the count of all items whose price>$100 and price<$200”. Just specifying the price field as a facet field would not be useful here, because it just returns the list of all unique prices available in the search results. What really provides the drilldown capabilities in this case is the facet query concept.

Facet queries are specified using the syntax field:[start TO end]. In a URL, it goes in encoded format:

facet.query=age%3A%5B20+TO+22%5D
In API, it’s specified as

solrQuery.addFacetQuery("age:[20 TO 22]");


Understanding facet counts

The facet counts are always in the context of the set of search results of the main query + filter queries.

November 25th, 2010

A java application running in a JVM can use the EmbeddedSolrServer to host Solr in the same JVM.

Following snippet shows how to use it:

public class EmbeddedServerExplorer {
    public static void main(String[] args) {
        try {
            // Set "solr.solr.home" to the directory under which /conf and /data are present.
            System.setProperty("solr.solr.home", "solr");
            CoreContainer.Initializer initializer = new CoreContainer.Initializer();
            CoreContainer coreContainer = initializer.initialize();
            EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "embeddedDoc1");
            doc.addField("name", "test embedded server");
            server.add(doc);
            server.commit();
            coreContainer.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
November 23rd, 2010


SolrJ provides java wrappers and adaptors to communicate with Solr and translate its results to java objects. Using SolrJ is much more convenient than using raw HTTP and JSON. Internally, SolrJ uses Apache HttpClient to send HTTP requests.


Important classes

SolrJ API is fairly simple and intuitive. The diagram below shows important SolrJ classes.



Set up the client connection to the server

solrServer = new CommonsHttpSolrServer("http://localhost:8983/solr");
solrServer.setParser(new XMLResponseParser());

Response parser in java client API can be either XML or binary. In other language APIs, JSON is possible.


Add or update document(s)

SolrInputDocument doc = new SolrInputDocument();
// Add fields. The field names should match fields defined in schema.xml
doc.addField(FLD_ID, docId++);
try {
    solrServer.add(doc);
    return true;
} catch (Exception e) {
    LOG.error("addItem error", e);
    return false;
}

Commit changes

For best performance, commit changes only after all documents, or a reasonably sized batch of them, have been added or updated.


Send a search query

SolrQuery qry = new SolrQuery("name:video");
QueryRequest qryReq = new QueryRequest(qry);
QueryResponse resp = qryReq.process(solrServer);

SolrQuery.setRows() specifies how many results to return in the response. The actual count of all hits may be much higher. If “field:” is omitted from query string, then the field specified by <defaultSearchField> in schema.xml is searched.

Handle search results

SolrDocumentList results = resp.getResults();
System.out.println(results.getNumFound() + " total hits");
int count = results.size();
System.out.println(count + " received hits");
for (int i = 0; i < count; i++) {
    SolrDocument hitDoc = results.get(i);
    System.out.println("#" + (i+1) + ":" + hitDoc.getFieldValue("name"));
    for (Iterator<Entry<String, Object>> flditer = hitDoc.iterator(); flditer.hasNext();) {
        Entry<String, Object> entry = flditer.next();
        System.out.println(entry.getKey() + ": " + entry.getValue());
    }
}
SolrDocumentList.getNumFound() is the total number of hits in the index. But each response will contain only as many results as specified by SolrQuery.setRows(). These two attributes can be used for pagination.
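The pagination arithmetic implied by getNumFound() and setRows() is straightforward; a sketch with hypothetical helper names:

```java
public class SearchPager {
    // SolrQuery.setStart() offset for a 1-based page number, given rows per page
    static int startFor(int page, int rows) {
        return (page - 1) * rows;
    }

    // total number of pages for numFound hits, rounding up
    static int pageCount(long numFound, int rows) {
        return (int) ((numFound + rows - 1) / rows);
    }

    public static void main(String[] args) {
        System.out.println(startFor(3, 10));   // offset to request page 3
        System.out.println(pageCount(95, 10)); // pages needed for 95 hits
    }
}
```

Pass startFor(page, rows) to SolrQuery.setStart() and rows to SolrQuery.setRows() to fetch a given page.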

November 23rd, 2010


Apache Solr is a full-fledged search server based on the Lucene toolkit.

Lucene provides the core search algorithms and the index storage required by those algorithms. Most basic search requirements can be fulfilled by Lucene itself without requiring Solr. But using plain Lucene has some drawbacks in development and non-functional aspects, forcing development teams to cover these in their designs. This is where Solr adds value.

Solr provides these benefits over using the raw Lucene toolkit:

  • Solr allows search behaviour to be configured through configuration files, rather than through code. Specifying search fields, indexing criteria, and indexing behaviour in code is prone to maintenance problems.
  • Lucene is Java centric (but also has ports to other languages). Solr however provides an HTTP interface that allows any platform to use it. Projects that involve multiple languages or platforms can use the same Solr server.
  • Solr provides an out-of-the-box faceted search (also called drilldown search) facility, that allows users to incrementally refine results using filters and "drilldown" towards a narrow set of best matches. Many shopping web portals use this feature to allow their users to incrementally refine their results.
  • Solr’s query syntax is slightly easier than Lucene’s. Either a default field can be specified, or solr provides a syntax of its own called dismax, that searches a fixed set of fields.
  • Solr’s java client API is much simpler and easier than Lucene’s. Solr abstracts away many of the underlying Lucene concepts.
  • Solr provides straightforward add, update, and delete document API, unlike Lucene.
  • Solr supports a pluggable architecture. For example, post processor plugins (example: search results highlighting) allow raw results to be modified.
  • Solr facilitates scalability using solutions like caching, memory tweaking, clustering, sharding and load balancing.
  • Solr provides plugins to fetch database data and index them. This workflow is probably the most common requirement for any search implementation, and solr provides it out-of-the-box.

The following sections describe basics of deploying Solr and using it from command line.


Directory layout of Solr package

Extracted Solr package has this layout:

/client Contains client APIs in different languages to talk to a Solr server
/contrib/clustering Plugin that provides clustering capabilities for Solr, using Carrot2 clustering framework
/contrib/dataimporthandler Plugin that is useful for indexing data in databases
/contrib/extraction Plugin that is useful for extracting text from PDFs, Word DOCs, etc.
/contrib/velocity Handler to present and manipulate search results using velocity templates.
/dist Contains Solr core jars and wars that can be deployed in servlet containers or elsewhere, and the solrj client API for java clients.
/dist/solrj-lib Libraries required by solrj client API .
/docs Offline documentation and javadocs
/lib Contains Lucene and other jars required by Solr
/src Source code
/example A skeleton standalone solr server deployment. Default environment is Jetty. When deploying Solr, this is the directory that’s customized and deployed.
/example/etc Jetty or other environment specific configuration files go here
/example/example-DIH An example DB and the Data Import Handler plugin configuration to index that DB
/example/exampledocs Example XML request files to send to Solr server. Usage: java -jar post.jar <xml filename>
/example/lib Jetty and servlet libraries. Not required if Solr is being deployed in a different environment
/example/logs Solr request logs
/example/multicore It’s possible to host multiple search cores in the same environment. Use case could be separate indexes for different categories of data.
/example/solr This is the main data area of Solr.
/example/solr/conf Contains configuration files used by Solr.

solrconfig.xml – Configuration parameters, memory tuning, different types of request handlers.

schema.xml – Specifies fields and analyzer configuration for indexing and querying. Other files contain data required by different components like the Stop word filter.

/example/solr/data This contains the actual results of indexing.
/example/webapps The solr webapp deployed in Jetty
/example/work Scratch directory for the container environment

Getting Started Guide

1) Copy the skeleton server under /example to the deployment directory.

2) Customize /example/solr/conf/schema.xml as explained in later sections, to model search fields of the application.

3) Start the solr server. For the default Jetty environment, use this command line with current directory set to /example:

java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar

The STOP.PORT specifies the port on which server should listen for a stop instruction, and STOP.KEY is just a kind of secret key to be passed while stopping.

4) If building from source, the WAR will be named something like apache-solr-4.0-snapshot.war. Copy this to /webapps and, importantly, rename it to solr.war. Without that renaming, Jetty will give 404 errors for /solr URLs.

5) The solr server will now be available at http://localhost:8983/solr. 8983 is the default Jetty connector port, as specified in /example/etc/jetty.xml

6) To stop the server, use the command line:

java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar --stop



Managing solr server with ant during development

Starting and stopping solr can be conveniently done from an IDE like Eclipse using an Ant script:

<project basedir="." name="ManageSolr">
	<property name="stopport" value="8079"/>
	<property name="stopsecret" value="secret"/>

	<target name="start-solr">
		<java dir="./dist/solr" fork="true" jar="./dist/solr/start.jar">
			<jvmarg value="-DSTOP.PORT=${stopport}" />
			<jvmarg value="-DSTOP.KEY=${stopsecret}" />
		</java>
	</target>

	<target name="stop-solr">
		<java dir="./dist/solr" fork="true" jar="./dist/solr/start.jar">
			<jvmarg value="-DSTOP.PORT=${stopport}" />
			<jvmarg value="-DSTOP.KEY=${stopsecret}" />
			<arg value="--stop" />
		</java>
	</target>

	<target name="restart-solr" depends="stop-solr,start-solr"/>

	<target name="deleteAllDocs">
		<java dir="./dist/solr/exampledocs" fork="true" jar="./dist/solr/exampledocs/post.jar">
			<arg value="${basedir}/deleteAllCommand.xml" />
		</java>
	</target>
</project>
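The contents of deleteAllCommand.xml are not shown above; a minimal delete-everything command file (using Solr's standard delete-by-query XML syntax) would be:

```xml
<delete><query>*:*</query></delete>
```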


Customizing Solr installation

The solr server distribution under /example is just that – an example. It should be customized to fit your search requirements. The conf/schema.xml should be changed to model searchable entities of the application, as described in this article.


Multicore configuration and deployment

Multicore configuration allows multiple schemas and indexes in a single solr server process. Multicores are useful when disparate entities with different fields need to be searched using a single server process.

  • The package contains an example multicore configuration in /example/multicore.  It contains 2 cores, each with its own schema.xml and solrconfig.xml.
  • Core names and instance directories can be changed in solr.xml.
  • The default multicore schema.xmls are rather simplistic and don’t contain the exhaustive list of field type definitions available in /example/solr/conf/schema.xml. So, copy all files under /example/solr/conf/* into /example/multicore/core0/conf/* and /example/multicore/core1/conf/*
  • Modify the core schema XMLs according to the data they are indexing
  • The copied solrconfig.xml has a <dataDir> element that points to /example/multicore/data. This is where the index and other component data are stored. Since the same solrconfig is copied into both cores, both cores end up pointing to the same data directory and will try to write to the same index, most likely resulting in index corruption. So, just comment out the <dataDir> elements. Then each core will store data in its respective /example/multicore/<coredir>/data.
  • The jar lib directories in the default single core solrconfig.xml don’t match the default directory structure of a multicore deployment. Those relative paths use solr home (ie, /example/solr) as the base directory. Change the relative paths of /contrib and /dist such that they’re relative to the core’s directory (ie, /example/multicore/<coredir>).
  • Finally, the multicore configuration should be made the active configuration, either by specifying "java -Dsolr.solr.home=/example/multicore -jar start.jar", or preferably, by copying all files under /example/multicore/* into /example/solr, the default solr home.
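For reference, the solr.xml that names the cores and their instance directories looks roughly like this (core names here match the example multicore configuration):

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>
```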

Using Solr from command line

The primary method of communicating with Solr is HTTP. An HTTP-capable command line client like curl is useful for this.

Querying: Queries should be sent as

http://localhost:8983/solr/select/?q=<query>

for a single core installation, or

http://localhost:8983/solr/<core name>/select/?q=<query>

for a multicore installation

Inserting or Updating documents in a single core installation: Solr update handler listens by default on the URL: http://localhost:8983/solr/update/ in a single core configuration.

To post an XML file with documents, use command line

curl "http://localhost:8983/solr/update/?commit=true" -F "myfile=@updates.xml"

Inserting or Updating documents in a multi core installation: Each core’s update handler listens by default on the URL: http://localhost:8983/solr/<core name>/update/


Updating with content extraction: Content extracting handler listens on the URL http://localhost:8983/solr/update/extract/ or http://localhost:8983/solr/<core name>/update/extract. Use the command line

curl "http://localhost:8983/solr/update/extract?" -F "myfile=@book.pdf"

where adds a regular field called "id" to the new document created by extracting handler.


The query parameters that Solr accepts are documented in Solr wiki.


Boolean operators in search queries

All Lucene queries are valid in Solr too. However, Solr does provide some additional conveniences.

A default boolean operator can be specified using a <solrQueryParser defaultOperator="AND|OR"/> element in schema.xml.

Each query can also override the boolean behaviour using the q.op=AND|OR query param. However, remember that the schema default or q.op affects not just the query terms, but also the facet filter queries.

For example, selecting 2 facet values for the same facet field will now imply that both should be satisfied. This is because internally, a filter query is just a part of the query from Lucene point of view.

To restrict boolean logic to just the query terms, use the following syntax:
  • All words should be found: Prefix a + in front of each word. Example: +video +science (=> only documents that contain both “video” AND “science” are returned)
  • Any one word should be found: This is the default behaviour when queries contain words without any prefix. Example: video science (=> any document which contains either “video” or “science” is returned)
  • Documents which don’t contain a word: Prefix a "-" in front of each word that should not be present, for a successful hit. Example: video -science (=> any document which contains “video” but not “science” is returned)