Tips for speeding up batch indexing

Overview

Indexing documents tends to fall into two general patterns: adding documents one at a time as they are created (as in a web application), and adding a bunch of documents at once (batch indexing).

The following settings and alternate workflows can make batch indexing faster.

StemmingAnalyzer cache

By default, the stemming analyzer uses a least-recently-used (LRU) cache to bound its memory use, so the cache cannot grow very large when the analyzer is reused over a long period of time. However, the LRU cache can slow down indexing by almost 200% compared to a stemming analyzer with an “unbounded” cache.

When you’re indexing in large batches with a one-shot instance of the analyzer, consider using an unbounded cache:

w = myindex.writer()
# Get the analyzer object from a text field
stem_ana = w.schema["content"].format.analyzer
# Set the cachesize to -1 to indicate unbounded caching
stem_ana.cachesize = -1
# Reset the analyzer to pick up the changed attribute
stem_ana.clear()

# Use the writer to index documents...
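The general trade-off between a bounded LRU cache and an unbounded one can be sketched with the standard library alone. This is only an illustration and does not reproduce Whoosh's own cache: toy_stem is a hypothetical stand-in for a real stemmer, and functools.lru_cache is used in place of the analyzer's internal cache. With a bound smaller than the working set, the LRU cache keeps evicting entries and recomputing them, while the unbounded cache computes each distinct word once.

```python
from functools import lru_cache


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Same function behind two caches: one bounded to 2 entries, one unbounded.
bounded_stem = lru_cache(maxsize=2)(toy_stem)
unbounded_stem = lru_cache(maxsize=None)(toy_stem)

# Four distinct words, cycled three times (12 lookups in total).
words = ["indexing", "indexed", "writers", "segments"] * 3

for w in words:
    bounded_stem(w)
    unbounded_stem(w)

bounded_info = bounded_stem.cache_info()
unbounded_info = unbounded_stem.cache_info()

print("bounded:  ", bounded_info)
print("unbounded:", unbounded_info)
```

In this toy run the bounded cache misses on every lookup because the four words cycle through a two-entry cache, whereas the unbounded cache misses only on the first occurrence of each word. The price of the unbounded cache is that it never releases memory, which is why it suits a one-shot batch job better than a long-lived analyzer.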

The limitmb parameter

The limitmb parameter to whoosh.index.Index.writer() controls the maximum memory (in megabytes) the writer will use for the indexing pool. The higher the number, the faster indexing will be.

The default value of 128 is actually somewhat low, considering many people have multiple gigabytes of RAM these days. Setting it higher can speed up indexing considerably:

from whoosh import index

ix = index.open_dir("indexdir")
writer = ix.writer(limitmb=256)

Note

The actual memory used will be higher than this value because of interpreter overhead (up to twice as much!). It is very useful as a tuning parameter, but not for trying to exactly control the memory usage of Whoosh.

The procs parameter

The procs parameter to whoosh.index.Index.writer() controls the number of processors the writer will use for indexing (via the multiprocessing module):

from whoosh import index

ix = index.open_dir("indexdir")
writer = ix.writer(procs=4)
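If you would rather not hard-code the number of processes, one possible approach is to derive it from the machine's core count. The following is a minimal sketch, not part of Whoosh: pick_procs and its reserve argument are hypothetical names, and leaving one core free is just an assumed policy.

```python
import os


def pick_procs(reserve=1):
    """Return a worker count that leaves `reserve` cores free, never below 1."""
    # os.cpu_count() can return None when the count cannot be determined.
    cores = os.cpu_count() or 1
    return max(1, cores - reserve)


procs = pick_procs()
# The result would then be passed to the writer, e.g. ix.writer(procs=procs).
print(procs)
```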

Note that when you use multiprocessing, the limitmb parameter controls the amount of memory used by each process, so the total memory limit will be limitmb * procs:

# Each process gets a limit of 128 MB, for a total of 512 MB
writer = ix.writer(procs=4, limitmb=128)
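To make the arithmetic explicit, here is a rough budgeting sketch. The helper memory_budget_mb is hypothetical, not a Whoosh function; its overhead_factor of 2.0 is an assumption taken from the earlier note that actual usage can be up to twice the limit.

```python
def memory_budget_mb(limitmb=128, procs=1, overhead_factor=2.0):
    """Estimate the nominal and worst-case memory use of a writer, in MB."""
    # Each process gets its own pool of up to `limitmb` megabytes.
    nominal = limitmb * procs
    # Assumed worst case: interpreter overhead up to doubles the usage.
    worst_case = nominal * overhead_factor
    return nominal, worst_case


nominal, worst_case = memory_budget_mb(limitmb=128, procs=4)
print(nominal, worst_case)
```

Treat the worst-case figure as a planning number for how much RAM to leave free, not as a guarantee.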

The multisegment parameter

The procs parameter causes the default writer to use multiple processors to do much of the indexing, but the writer still uses a single process to merge the pool of each sub-writer into a single segment.

You can get much better indexing speed by also passing the multisegment=True keyword argument. Instead of merging the results of the sub-writers, it has each sub-writer write out its own new segment:

from whoosh import index

ix = index.open_dir("indexdir")
writer = ix.writer(procs=4, multisegment=True)

The drawback is that instead of creating a single new segment, this option creates a number of new segments at least equal to the number of processes you use.

For example, if you use procs=4, the writer will create four new segments. (If you merge old segments or call add_reader on the parent writer, the parent writer will also write a segment, meaning you’ll get five new segments.)

So, while multisegment=True is much faster than a normal writer, you should only use it for large batch indexing jobs (or perhaps only for indexing from scratch). It should not be the only method you use for indexing, because otherwise the number of segments will tend to increase forever!
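The growth in segment count is easy to estimate. The sketch below is a hypothetical helper, not part of Whoosh, and it assumes that no segments are merged between runs, so the result is a lower bound on what repeated multisegment jobs would leave behind.

```python
def min_segments_after(runs, procs, existing=0, parent_writes=False):
    """Lower bound on the segment count after `runs` multisegment jobs."""
    # Each run writes one segment per process, plus one more if the
    # parent writer also writes a segment (see the example above).
    per_run = procs + (1 if parent_writes else 0)
    # Assumption: nothing is merged between runs.
    return existing + runs * per_run


one_run = min_segments_after(runs=1, procs=4)
one_run_with_parent = min_segments_after(runs=1, procs=4, parent_writes=True)
month_of_daily_runs = min_segments_after(runs=30, procs=4)
print(one_run, one_run_with_parent, month_of_daily_runs)
```

Under these assumptions, a daily job with procs=4 would leave at least 120 segments after 30 runs.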