Release notes
- Whoosh 2.x release notes
- Whoosh 1.x release notes
- Whoosh 0.3 release notes
Quick start
- A quick introduction
- The Index and Schema objects
- The IndexWriter object
- The Searcher object
Introduction to Whoosh
- About Whoosh
- What is Whoosh?
- What can Whoosh do for you?
- Getting help with Whoosh
Glossary
Designing a schema
- About schemas and fields
- Built-in field types
- Creating a Schema
- Modifying the schema after indexing
- Dynamic fields
- Advanced schema setup
How to index documents
- Creating an Index object
- Clearing the index
- Indexing documents
  - Indexing and storing different values for the same field
  - Finishing adding documents
- Merging segments
- Deleting documents
- Updating documents
- Incremental indexing
- Clearing the index
How to search
- The Searcher object
- Results object
- Scoring and sorting
  - Scoring
  - Sorting
- Highlighting snippets and More Like This
- Filtering results
- Which terms from my query matched?
- Collapsing results
- Time limited searches
- Convenience methods
- Combining Results objects
Parsing user queries
- Overview
- Using the default parser
- Common customizations
- Advanced customization
The default query language
- Overview
- Individual terms and phrases
- Boolean operators
- Fields
- Inexact terms
- Ranges
- Boosting query elements
- Making a term from literal text
Indexing and parsing dates/times
- Indexing dates
- Parsing date queries
- About time zones and basetime
- Date parser notes
- Limitations
Query objects
About analyzers
- Overview
- Using analyzers
- Advanced Analysis
Stemming, variations, and accent folding
- The problem
- Stemming
- Variations
- Lemmatization
- Character folding
Indexing and searching N-grams
- Overview
Sorting and faceting
- Overview
- Sorting
- Grouping
- Facet types
- MultiFacet
- Missing values
- Using overlapping groups
- Using a custom sort order
- Expert: writing your own facet
How to create highlighted search result excerpts
- Overview
- Requirements
- How to
- The character limit
- Customizing the highlights
- Highlighter object
- Speeding up highlighting
  - PinpointFragmenter
  - PinpointFragmenter limitations
- Using the low-level API
  - Usage
Query expansion and Key word extraction
- Overview
- Usage
- Expansion models
“Did you mean… ?” Correcting errors in user queries
- Overview
- Pulling suggestions from an indexed field
- Pulling suggestions from a word list
- Merging two or more correctors
- Correcting user queries
Field caches
- Customizing cache behaviour
- Creating a custom caching policy
Tips for speeding up batch indexing
- Overview
- StemmingAnalyzer cache
- The limitmb parameter
- The procs parameter
- The multisegment parameter
Concurrency, locking, and versioning
- Concurrency
- Locking
  - Lock files
- Versioning
Indexing and searching document hierarchies
- Overview
- Using nested document indexing
- Using query-time joins
Whoosh recipes
- General
  - Get the stored fields for a document from the document number
- Analysis
  - Eliminate words shorter/longer than N
  - Allow optional case-sensitive searches
- Searching
  - Find every document
  - iTunes-style search-as-you-type
- Shortcuts
  - Look up documents by a field value
- Sorting and scoring
  - Score results based on the position of the matched term
- Results
  - How many hits were there?
  - Which terms matched in each hit?
- Global information
Whoosh API
- analysis module
  - Analyzers
  - Tokenizers
  - Filters
  - Token classes and functions
    - Token
    - unstopped()
- codec.base module
  - Classes
- collectors module
  - Base classes
  - Basic collectors
  - Wrappers
- columns module
  - Base classes
  - Basic columns
  - Technical columns
  - Experimental columns
    - ClampedNumericColumn
- fields module
  - Schema class
    - Schema
    - SchemaClass
  - FieldType base class
    - FieldType
  - Pre-made field types
    - ID
    - IDLIST
    - STORED
    - KEYWORD
    - TEXT
    - NUMERIC
    - DATETIME
    - BOOLEAN
    - NGRAM
    - NGRAMWORDS
  - Exceptions
    - FieldConfigurationError
    - UnknownFieldError
- filedb.filestore module
  - Base class
    - Storage
  - Implementation classes
    - FileStorage
    - RamStorage
  - Helper functions
    - copy_storage()
    - copy_to_ram()
  - Exceptions
    - ReadOnlyError
- filedb.filetables module
  - Hash file
    - HashWriter
      - HashWriter.add()
      - HashWriter.add_all()
    - HashReader
  - Ordered Hash file
    - OrderedHashWriter
    - OrderedHashReader
- filedb.structfile module
  - Classes
- formats module
  - Base class
    - Format
  - Formats
- highlight module
  - Manual highlighting
    - Highlighter
    - highlight()
  - Fragmenters
  - Scorers
    - FragmentScorer
    - BasicFragmentScorer
  - Formatters
  - Utility classes
    - Fragment
- support.bitvector module
  - Base classes
    - DocIdSet
    - BaseBitSet
  - Implementation classes
- index module
  - Functions
  - Base class
    - Index
  - Implementation
    - FileIndex
  - Exceptions
- lang.morph_en module
  - variations()
- lang.porter module
  - stem()
- lang.wordnet module
  - Thesaurus
    - Thesaurus
  - Low-level functions
- matching module
  - Matchers
  - Exceptions
    - ReadTooFar
    - NoQualityAvailable
- qparser module
  - Parser object
    - QueryParser
    - Pre-made configurations
  - Plug-ins
  - Syntax node objects
- query module
  - Base classes
  - Query classes
    - Term
    - Variations
    - FuzzyTerm
    - Phrase
    - And
    - Or
    - DisjunctionMax
    - Not
    - Prefix
    - Wildcard
    - Regex
    - TermRange
    - NumericRange
    - DateRange
    - Every
    - NullQuery
  - Binary queries
  - Span queries
  - Special queries
  - Exceptions
    - QueryError
- reading module
  - Classes
  - Exceptions
    - TermNotFound
- scoring module
  - Base classes
  - Scoring algorithm classes
  - Scoring utility classes
- searching module
  - Searching classes
    - Searcher
  - Results classes
  - Exceptions
    - NoTermsException
    - TimeLimit
- sorting module
  - Base types
    - FacetType
      - FacetType.categorizer()
    - Categorizer
  - Facet types
  - Facets object
    - Facets
  - FacetType objects
- spelling module
  - Corrector objects
  - QueryCorrector objects
- support.charset module
  - default_charset
  - charset_table_to_dict()
- support.levenshtein module
  - relative()
  - distance()
- util module
  - fib()
  - make_binary_tree()
  - make_weighted_tree()
  - synchronized()
  - unclosed()
- writing module
  - Writer
    - IndexWriter
  - Utility writers
    - BufferedWriter
    - AsyncWriter
  - Exceptions
    - IndexingError
Technical notes
- How to implement a new backend
  - Index
  - IndexWriter
  - IndexReader
  - Matcher
- filedb notes
  - Files created

Whoosh recipes¶

General¶

Get the stored fields for a document from the document number¶

stored_fields = searcher.stored_fields(docnum)

Analysis¶

Eliminate words shorter/longer than N¶

Use a StopFilter and the minsize and maxsize keyword arguments. If you just want to filter based on size and not common words, set the stoplist to None:

sf = analysis.StopFilter(stoplist=None, minsize=2, maxsize=40)

Allow optional case-sensitive searches¶

A quick and easy way to do this is to index both the original and lowercased versions of each word. If the user searches for an all-lowercase word, it acts as a case-insensitive search, but if they search for a word with any uppercase characters, it acts as a case-sensitive search:

class CaseSensitivizer(analysis.Filter):
    def __call__(self, tokens):
        for t in tokens:
            yield t
            if t.mode == "index":
               low = t.text.lower()
               if low != t.text:
                   t.text = low
                   yield t

ana = analysis.RegexTokenizer() | CaseSensitivizer()
[t.text for t in ana("The new SuperTurbo 5000", mode="index")]
# ["The", "the", "new", "SuperTurbo", "superturbo", "5000"]

Searching¶

Find every document¶

myquery = query.Every()

iTunes-style search-as-you-type¶

Use the whoosh.analysis.NgramWordAnalyzer as the analyzer for the field you want to search as the user types. You can save space in the index by turning off positions in the field using phrase=False, since phrase searching on N-gram fields usually doesn’t make much sense:

# For example, to search the "title" field as the user types
analyzer = analysis.NgramWordAnalyzer()
title_field = fields.TEXT(analyzer=analyzer, phrase=False)
schema = fields.Schema(title=title_field)

See the documentation for the NgramWordAnalyzer class for information on the available options.

Shortcuts¶

Look up documents by a field value¶

# Single document (unique field value)
stored_fields = searcher.document(id="bacon")

# Multiple documents
for stored_fields in searcher.documents(tag="cake"):
    ...

Sorting and scoring¶

See Sorting and faceting.

Score results based on the position of the matched term¶

The following scoring function uses the position of the first occurance of a term in each document to calculate the score, so documents with the given term earlier in the document will score higher:

from whoosh import scoring

def pos_score_fn(searcher, fieldname, text, matcher):
    poses = matcher.value_as("positions")
    return 1.0 / (poses[0] + 1)

pos_weighting = scoring.FunctionWeighting(pos_score_fn)
with myindex.searcher(weighting=pos_weighting) as s:
    ...

Results¶

How many hits were there?¶

The number of scored hits:

found = results.scored_length()

Depending on the arguments to the search, the exact total number of hits may be known:

if results.has_exact_length():
    print("Scored", found, "of exactly", len(results), "documents")

Usually, however, the exact number of documents that match the query is not known, because the searcher can skip over blocks of documents it knows won’t show up in the “top N” list. If you call len(results) on a query where the exact length is unknown, Whoosh will run an unscored version of the original query to get the exact number. This is faster than the scored search, but may still be noticeably slow on very large indexes or complex queries.

As an alternative, you might display the estimated total hits:

found = results.scored_length()
if results.has_exact_length():
    print("Scored", found, "of exactly", len(results), "documents")
else:
    low = results.estimated_min_length()
    high = results.estimated_length()

    print("Scored", found, "of between", low, "and", high, "documents")

Which terms matched in each hit?¶

# Use terms=True to record term matches for each hit
results = searcher.search(myquery, terms=True)

for hit in results:
    # Which terms matched in this hit?
    print("Matched:", hit.matched_terms())

    # Which terms from the query didn't match in this hit?
    print("Didn't match:", myquery.all_terms() - hit.matched_terms())

Global information¶

How many documents are in the index?¶

# Including documents that are deleted but not yet optimized away
numdocs = searcher.doc_count_all()

# Not including deleted documents
numdocs = searcher.doc_count()

What fields are in the index?¶

return myindex.schema.names()

Is term X in the index?¶

return ("content", "wobble") in searcher

How many times does term X occur in the index?¶

# Number of times content:wobble appears in all documents
freq = searcher.frequency("content", "wobble")

# Number of documents containing content:wobble
docfreq = searcher.doc_frequency("content", "wobble")

Is term X in document Y?¶

# Check if the "content" field of document 500 contains the term "wobble"

# Without term vectors, skipping through list...
postings = searcher.postings("content", "wobble")
postings.skip_to(500)
return postings.id() == 500

# ...or the slower but easier way
docset = set(searcher.postings("content", "wobble").all_ids())
return 500 in docset

# If field has term vectors, skipping through list...
vector = searcher.vector(500, "content")
vector.skip_to("wobble")
return vector.id() == "wobble"

# ...or the slower but easier way
wordset = set(searcher.vector(500, "content").all_ids())
return "wobble" in wordset

Whoosh recipes