`collectors` module¶

This module contains “collector” objects. Collectors provide a way to gather “raw” results from a whoosh.matching.Matcher object, implement sorting, filtering, collation, etc., and produce a whoosh.searching.Results object.

The basic collectors are:

TopCollector: Returns the top N matching results sorted by score, using block-quality optimizations to skip blocks of documents that can’t contribute to the top N. The whoosh.searching.Searcher.search() method uses this type of collector by default or when you specify a limit.
UnlimitedCollector: Returns all matching results sorted by score. The whoosh.searching.Searcher.search() method uses this type of collector when you specify limit=None or you specify a limit equal to or greater than the number of documents in the searcher.
SortingCollector: Returns all matching results sorted by a whoosh.sorting.Facet object. The whoosh.searching.Searcher.search() method uses this type of collector when you use the sortedby parameter.
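The selection rules in the list above can be summarised in a small plain-Python sketch. The function name `pick_collector` and its string return values are hypothetical, and the precedence of `sortedby` over `limit` is an assumption. This is an illustration of the rules as described, not Whoosh's own dispatch code.

```python
def pick_collector(doc_count, limit=10, sortedby=None):
    """Return the name of the basic collector the rules above suggest.

    Hypothetical helper for illustration only; not part of Whoosh.
    """
    if sortedby is not None:
        # Assumption: an explicit sort order takes precedence over the limit
        return "SortingCollector"
    if limit is None or limit >= doc_count:
        # Every match is wanted, so no top-N cut-off applies
        return "UnlimitedCollector"
    return "TopCollector"


print(pick_collector(doc_count=1000))
print(pick_collector(doc_count=1000, limit=None))
print(pick_collector(doc_count=5, limit=10))
print(pick_collector(doc_count=1000, sortedby="title"))
```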

Here’s an example of a simple collector that instead of remembering the matched documents just counts up the number of matches:

from whoosh.collectors import Collector

class CountingCollector(Collector):
    def prepare(self, top_searcher, q, context):
        # Always call super method in prepare
        Collector.prepare(self, top_searcher, q, context)

        self.count = 0

    def collect(self, sub_docnum):
        # Called once per matched document; just count it
        self.count += 1

# mysearcher and myquery are an existing Searcher and Query
c = CountingCollector()
mysearcher.search_with_collector(myquery, c)
print(c.count)

There are also several wrapping collectors that extend or modify the functionality of other collectors. The whoosh.searching.Searcher.search() method uses many of these when you specify various parameters.

NOTE: collectors are not designed to be reentrant or thread-safe. It is generally a good idea to create a new collector for each search.

Base classes¶

class whoosh.collectors.Collector¶

Base class for collectors.

all_ids()¶

Returns a sequence of docnums matched in this collector. (Only valid after the collector is run.)

The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.

abstract collect(sub_docnum)¶

This method is called for every matched document. It should do the work of adding a matched document to the results, and it should return an object to use as a “sorting key” for the given document (such as the document’s score, a key generated by a facet, or just None). Subclasses must implement this method.

If you want the score for the current document, use self.matcher.score().

Overriding methods should add the current document offset (self.offset) to the sub_docnum to get the top-level document number for the matching document to add to results.

Parameters:	sub_docnum – the document number of the current match within the current sub-searcher. You must add `self.offset` to this number to get the document’s top-level document number.
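The relationship between a sub-searcher's document numbers and top-level document numbers can be sketched in plain Python. This is a model only: the segment sizes are made up, the helper names are hypothetical, and it assumes each sub-searcher's offset is the running total of the document counts before it.

```python
# Plain-Python model: three sub-searchers holding 4, 2 and 3 documents
segment_sizes = [4, 2, 3]


def segment_offsets(sizes):
    """Return the top-level docnum at which each segment starts."""
    offsets, total = [], 0
    for size in sizes:
        offsets.append(total)
        total += size
    return offsets


offsets = segment_offsets(segment_sizes)


def global_docnum(segment_index, sub_docnum):
    # Same arithmetic as `self.offset + sub_docnum` inside collect()
    return offsets[segment_index] + sub_docnum


print(offsets)              # [0, 4, 6]
print(global_docnum(0, 3))  # 3
print(global_docnum(1, 0))  # 4
print(global_docnum(2, 2))  # 8
```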

collect_matches()¶

This method calls Collector.matches() and then for each matched document calls Collector.collect(). Sub-classes that want to intervene between finding matches and adding them to the collection (for example, to filter out certain documents) can override this method.

computes_count()¶

Returns True if the collector naturally computes the exact number of matching documents. Collectors that use block optimizations will return False since they might skip blocks containing matching documents.

Note that if this method returns False you can still call count(), but it means that method might have to do more work to calculate the number of matching documents.

count()¶

Returns the total number of documents matched in this collector. (Only valid after the collector is run.)

The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.

finish()¶

This method is called after a search.

Subclasses can override this to perform clean-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.runtime: The time (in seconds) the search took.

matches()¶

Yields a series of relative document numbers for matches in the current subsearcher.

prepare(top_searcher, q, context)¶

This method is called before a search.

Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.top_searcher: The top-level searcher.
self.q: The query object
self.context: context.needs_current controls whether a wrapping collector requires that this collector’s matcher be in a valid state at every call to collect(). If this is False, the collector is free to use faster methods that don’t necessarily keep the matcher updated, such as matcher.all_ids().

Parameters:	top_searcher – the top-level `whoosh.searching.Searcher` object. q – the `whoosh.query.Query` object being searched for. context – a `whoosh.searching.SearchContext` object containing information about the search.

remove(global_docnum)¶

Removes a document from the collector. Note that this method uses the global document number, as opposed to Collector.collect(), which takes a segment-relative docnum.

abstract results()¶

Returns a Results object containing the results of the search. Subclasses must implement this method.

set_subsearcher(subsearcher, offset)¶

This method is called each time the collector starts on a new sub-searcher.

Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.subsearcher: The current sub-searcher. If the top-level searcher is atomic, this is the same as the top-level searcher.
self.offset: The document number offset of the current searcher. You must add this number to the document number passed to Collector.collect() to get the top-level document number for use in results.
self.matcher: A whoosh.matching.Matcher object representing the matches for the query in the current sub-searcher.

abstract sort_key(sub_docnum)¶

Returns a sorting key for the current match. This should return the same value returned by Collector.collect(), but without the side effect of adding the current document to the results.

If the collector has been prepared with context.needs_current=True, this method can use self.matcher to get information, for example the score. Otherwise, it should only use the provided sub_docnum, since the matcher may be in an inconsistent state.

Subclasses must implement this method.

class whoosh.collectors.ScoredCollector(replace=10)¶

Base class for collectors that sort the results based on document score.

Parameters:	replace – Number of matches between attempts to replace the matcher with a more efficient version.

collect(sub_docnum)¶

If you want the score for the current document, use self.matcher.score().

Overriding methods should add the current document offset (self.offset) to the sub_docnum to get the top-level document number for the matching document to add to results.

Parameters:	sub_docnum – the document number of the current match within the current sub-searcher. You must add `self.offset` to this number to get the document’s top-level document number.

matches()¶

Yields a series of relative document numbers for matches in the current subsearcher.

prepare(top_searcher, q, context)¶

This method is called before a search.

Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.top_searcher: The top-level searcher.
self.q: The query object
self.context: context.needs_current controls whether a wrapping collector requires that this collector’s matcher be in a valid state at every call to collect(). If this is False, the collector is free to use faster methods that don’t necessarily keep the matcher updated, such as matcher.all_ids().

Parameters:	top_searcher – the top-level `whoosh.searching.Searcher` object. q – the `whoosh.query.Query` object being searched for. context – a `whoosh.searching.SearchContext` object containing information about the search.

sort_key(sub_docnum)¶

Returns a sorting key for the current match. This should return the same value returned by Collector.collect(), but without the side effect of adding the current document to the results.

Subclasses must implement this method.

class whoosh.collectors.WrappingCollector(child)¶

Base class for collectors that wrap other collectors.

all_ids()¶

Returns a sequence of docnums matched in this collector. (Only valid after the collector is run.)

The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.

collect(sub_docnum)¶

If you want the score for the current document, use self.matcher.score().

Overriding methods should add the current document offset (self.offset) to the sub_docnum to get the top-level document number for the matching document to add to results.

Parameters:	sub_docnum – the document number of the current match within the current sub-searcher. You must add `self.offset` to this number to get the document’s top-level document number.

collect_matches()¶

This method calls Collector.matches() and then for each matched document calls Collector.collect(). Sub-classes that want to intervene between finding matches and adding them to the collection (for example, to filter out certain documents) can override this method.

count()¶

Returns the total number of documents matched in this collector. (Only valid after the collector is run.)

The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.

finish()¶

This method is called after a search.

Subclasses can override this to perform clean-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.runtime: The time (in seconds) the search took.

matches()¶

Yields a series of relative document numbers for matches in the current subsearcher.

prepare(top_searcher, q, context)¶

This method is called before a search.

Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.top_searcher: The top-level searcher.
self.q: The query object
self.context: context.needs_current controls whether a wrapping collector requires that this collector’s matcher be in a valid state at every call to collect(). If this is False, the collector is free to use faster methods that don’t necessarily keep the matcher updated, such as matcher.all_ids().

Parameters:	top_searcher – the top-level `whoosh.searching.Searcher` object. q – the `whoosh.query.Query` object being searched for. context – a `whoosh.searching.SearchContext` object containing information about the search.

remove(global_docnum)¶

Removes a document from the collector. Note that this method uses the global document number, as opposed to Collector.collect(), which takes a segment-relative docnum.

results()¶

Returns a Results object containing the results of the search. Subclasses must implement this method.

set_subsearcher(subsearcher, offset)¶

This method is called each time the collector starts on a new sub-searcher.

Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.subsearcher: The current sub-searcher. If the top-level searcher is atomic, this is the same as the top-level searcher.
self.offset: The document number offset of the current searcher. You must add this number to the document number passed to Collector.collect() to get the top-level document number for use in results.
self.matcher: A whoosh.matching.Matcher object representing the matches for the query in the current sub-searcher.

sort_key(sub_docnum)¶

Returns a sorting key for the current match. This should return the same value returned by Collector.collect(), but without the side effect of adding the current document to the results.

Subclasses must implement this method.

Basic collectors¶

class whoosh.collectors.TopCollector(limit=10, usequality=True, **kwargs)¶

A collector that only returns the top “N” scored results.

Parameters:	limit – the maximum number of results to return. usequality – whether to use block-quality optimizations. Turning this off may be useful for debugging.
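As a rough illustration of what keeping only the top “N” scored results involves, here is a minimal plain-Python sketch built on the standard `heapq` module. The function `top_n` and the sample scores are hypothetical, and this is not Whoosh's implementation. It only shows the general technique of keeping a bounded min-heap so that the weakest retained score is always the one a new match has to beat.

```python
import heapq


def top_n(scored_matches, limit=10):
    """Keep the `limit` best (score, docnum) pairs from an iterable."""
    heap = []  # min-heap: heap[0] is the weakest hit kept so far
    for score, docnum in scored_matches:
        if len(heap) < limit:
            heapq.heappush(heap, (score, docnum))
        elif score > heap[0][0]:
            # The new hit beats the weakest retained hit, so swap them
            heapq.heapreplace(heap, (score, docnum))
    # Best score first; ties broken by lower document number
    return sorted(heap, key=lambda item: (-item[0], item[1]))


matches = [(1.5, 0), (0.2, 1), (3.1, 2), (2.4, 3), (0.9, 4)]
best = top_n(matches, limit=3)
print(best)  # [(3.1, 2), (2.4, 3), (1.5, 0)]
```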

class whoosh.collectors.UnlimitedCollector(reverse=False)¶

A collector that returns all scored results.

Parameters:	reverse – if True, reverse the order of the results.

class whoosh.collectors.SortingCollector(sortedby, limit=10, reverse=False)¶

A collector that returns results sorted by a given whoosh.sorting.Facet object. See Sorting and faceting for more information.

Parameters:	sortedby – see Sorting and faceting. reverse – If True, reverse the overall results. Note that you can reverse individual facets in a multi-facet sort key as well.

Wrappers¶

class whoosh.collectors.FilterCollector(child, allow=None, restrict=None)¶

A collector that lets you allow and/or restrict certain document numbers in the results:

from whoosh import collectors, query

uc = collectors.UnlimitedCollector()

ins = query.Term("chapter", "rendering")
outs = query.Term("status", "restricted")
fc = collectors.FilterCollector(uc, allow=ins, restrict=outs)

mysearcher.search_with_collector(myquery, fc)
print(fc.results())

This collector discards a document if:

The allowed set is not None and a document number is not in the set, or
The restrict set is not None and a document number is in the set.

(So, if the same document number is in both sets, that document will be discarded.)
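The two discard rules above can be written as a single predicate. The sketch below is a plain-Python model that assumes `allow` and `restrict` have already been resolved to sets of document numbers (or None). The function name `is_discarded` is hypothetical.

```python
def is_discarded(docnum, allow=None, restrict=None):
    """Model of the two discard rules; `allow`/`restrict` are sets or None."""
    if allow is not None and docnum not in allow:
        return True
    if restrict is not None and docnum in restrict:
        return True
    return False


allow = {1, 2, 3}
restrict = {3, 4}
kept = [d for d in range(6) if not is_discarded(d, allow, restrict)]
print(kept)  # [1, 2]; document 3 is in both sets, so it is discarded
```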

If you have a reference to the collector, you can use FilterCollector.filtered_count to get the number of matching documents filtered out of the results by the collector.

Parameters:	child – the collector to wrap. allow – a query, Results object, or set-like object containing document numbers that are allowed in the results, or None (meaning everything is allowed). restrict – a query, Results object, or set-like object containing document numbers to disallow from the results, or None (meaning nothing is disallowed).

class whoosh.collectors.FacetCollector(child, groupedby, maptype=None)¶

A collector that creates groups of documents based on whoosh.sorting.Facet objects. See Sorting and faceting for more information.

This collector is used if you specify a groupedby parameter in the whoosh.searching.Searcher.search() method. You can use the whoosh.searching.Results.groups() method to access the facet groups.

If you have a reference to the collector, you can also use FacetCollector.facetmaps to access the groups directly:

from whoosh import collectors, sorting

uc = collectors.UnlimitedCollector()
fc = collectors.FacetCollector(uc, sorting.FieldFacet("category"))
mysearcher.search_with_collector(myquery, fc)
print(fc.facetmaps)

Parameters:	groupedby – see Sorting and faceting. maptype – a `whoosh.sorting.FacetMap` type to use for any facets that don’t specify their own.
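Conceptually, grouping by a facet maps each facet key to the documents that share it. The following plain-Python sketch shows that idea with a dictionary of lists. The helper `group_by_key` and the sample documents are hypothetical, and the actual structure stored in facetmaps depends on the maptype in use.

```python
from collections import defaultdict


def group_by_key(docs, key_fn):
    """Map each facet key to the list of docnums that share it."""
    groups = defaultdict(list)
    for docnum, doc in docs:
        groups[key_fn(doc)].append(docnum)
    return dict(groups)


docs = [
    (0, {"category": "fruit"}),
    (1, {"category": "veg"}),
    (2, {"category": "fruit"}),
]
groups = group_by_key(docs, lambda d: d["category"])
print(groups)  # {'fruit': [0, 2], 'veg': [1]}
```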

class whoosh.collectors.CollapseCollector(child, keyfacet, limit=1, order=None)¶

A collector that collapses results based on a facet. That is, it eliminates all but the top N results that share the same facet key. Documents with an empty key for the facet are never eliminated.

The “top” results within each group are determined by the result ordering (e.g. highest score in a scored search) or an optional second “ordering” facet.

If you have a reference to the collector you can use CollapseCollector.collapsed_counts to access the number of documents eliminated based on each key:

from whoosh import collectors

tc = collectors.TopCollector(limit=20)
cc = collectors.CollapseCollector(tc, "group", limit=3)
mysearcher.search_with_collector(myquery, cc)
print(cc.collapsed_counts)

See Collapsing results for more information.

Parameters:

child – the collector to wrap.
keyfacet – a whoosh.sorting.Facet to use for collapsing. All but the top N documents that share a key will be eliminated from the results.
limit – the maximum number of documents to keep for each key.
order – an optional whoosh.sorting.Facet to use to determine the “top” document(s) to keep when collapsing. The default (order=None) uses the results order (e.g. the highest score in a scored search).
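The collapsing behaviour described for this class can be modelled in a few lines of plain Python. The sketch assumes the hits arrive already ranked best-first and that an empty key is represented by None. The function `collapse` is hypothetical and is not how Whoosh implements the collector.

```python
from collections import Counter


def collapse(ranked_hits, limit=1):
    """ranked_hits: (docnum, key) pairs, best first.

    Returns (kept_docnums, collapsed_counts).
    """
    seen = Counter()       # how many hits have been kept per key
    collapsed = Counter()  # how many hits were eliminated per key
    kept = []
    for docnum, key in ranked_hits:
        if key is None:
            kept.append(docnum)  # empty key: never eliminated
        elif seen[key] < limit:
            seen[key] += 1
            kept.append(docnum)
        else:
            collapsed[key] += 1
    return kept, dict(collapsed)


hits = [(7, "a"), (3, "a"), (9, None), (1, "b"), (4, "a"), (5, None)]
kept, collapsed_counts = collapse(hits, limit=2)
print(kept)              # [7, 3, 9, 1, 5]
print(collapsed_counts)  # {'a': 1}
```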

class whoosh.collectors.TimeLimitCollector(child, timelimit, greedy=False, use_alarm=True)¶

A collector that raises a TimeLimit exception if the search does not complete within a certain number of seconds:

from whoosh import collectors

uc = collectors.UnlimitedCollector()
tlc = collectors.TimeLimitCollector(uc, timelimit=5.8)
try:
    mysearcher.search_with_collector(myquery, tlc)
except collectors.TimeLimit:
    print("The search ran out of time!")

# We can still get partial results from the collector
print(tlc.results())

IMPORTANT: On Unix systems (systems where signal.SIGALRM is defined), the code uses signals to stop searching immediately when the time limit is reached. Windows does not support this functionality, so there the search only checks the time between found documents; if a matcher is slow, the search could exceed the time limit.

Parameters:

child – the collector to wrap.
timelimit – the maximum amount of time (in seconds) to allow for searching. If the search takes longer than this, it will raise a TimeLimit exception.
greedy – if True, the collector will finish adding the most recent hit before raising the TimeLimit exception.
use_alarm – if True (the default), the collector will try to use signal.SIGALRM (on UNIX).
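To make the effect of the greedy option and the "check the time between found documents" fallback concrete, here is a plain-Python model. Everything in it is hypothetical: the local `TimeLimit` class is a stand-in rather than the Whoosh exception, `collect_with_limit` and `fake_clock` are illustration-only helpers, and a fake clock is used so the behaviour is deterministic.

```python
import itertools


class TimeLimit(Exception):
    """Stand-in for the exception of the same name (model only)."""


def fake_clock(step):
    """Return a clock function that advances by `step` on every call."""
    ticks = itertools.count(0, step)
    return lambda: next(ticks)


def collect_with_limit(hits, timelimit, clock, greedy=False):
    """Collect hits, checking the clock once per hit."""
    start = clock()
    collected = []
    for hit in hits:
        if clock() - start > timelimit:
            if greedy:
                collected.append(hit)  # finish the current hit first
            raise TimeLimit(collected)  # carry the partial results
        collected.append(hit)
    return collected


partial = {}
for greedy in (False, True):
    try:
        collect_with_limit("abcd", timelimit=5, clock=fake_clock(2), greedy=greedy)
    except TimeLimit as exc:
        partial[greedy] = exc.args[0]

print(partial[False])  # ['a', 'b']
print(partial[True])   # ['a', 'b', 'c']
```

Because the clock is only consulted between hits, a single slow step can overshoot the limit, which is the limitation noted for systems without signal.SIGALRM.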

class whoosh.collectors.TermsCollector(child, settype=<class 'set'>)¶

A collector that remembers which terms appeared in each matched document.

This collector is used if you specify terms=True in the whoosh.searching.Searcher.search() method.

If you have a reference to the collector, you can also use TermsCollector.termdocs and TermsCollector.docterms to access the term lists directly:

from whoosh import collectors

uc = collectors.UnlimitedCollector()
tc = collectors.TermsCollector(uc)
mysearcher.search_with_collector(myquery, tc)
# tc.termdocs is a dictionary mapping (fieldname, text) tuples to
# sets of document numbers
print(tc.termdocs)
# tc.docterms is a dictionary mapping docnums to lists of
# (fieldname, text) tuples
print(tc.docterms)
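The shape of the two mappings described in the comments above can be illustrated with a plain-Python model. The helper `build_term_maps` and the sample matches are hypothetical. The sketch only mirrors the documented shapes (terms to sets of document numbers, and document numbers to lists of terms), not how the collector gathers them.

```python
from collections import defaultdict


def build_term_maps(matches):
    """matches: iterable of (docnum, [(fieldname, text), ...])."""
    termdocs = defaultdict(set)    # (fieldname, text) -> set of docnums
    docterms = defaultdict(list)   # docnum -> list of (fieldname, text)
    for docnum, terms in matches:
        for term in terms:
            termdocs[term].add(docnum)
            docterms[docnum].append(term)
    return dict(termdocs), dict(docterms)


matches = [
    (0, [("content", "render")]),
    (1, [("content", "render"), ("title", "shade")]),
]
termdocs, docterms = build_term_maps(matches)
print(termdocs[("content", "render")])  # {0, 1}
print(docterms[1])  # [('content', 'render'), ('title', 'shade')]
```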