`highlight` module

The highlight module contains classes and functions for displaying short excerpts from hit documents in the search results you present to the user, with query terms highlighted.

The highlighting system has four main elements.

Fragmenters chop up the original text into fragments, based on the locations of matched terms in the text.
Scorers assign a score to each fragment, so the system can rank fragments by whatever criterion the scorer implements.
Order functions control the order in which the top-scoring fragments are presented to the user. For example, you can show the fragments in the order they appear in the document (FIRST) or show higher-scoring fragments first (SCORE).
Formatters turn the fragment objects into human-readable output, such as an HTML string.

See How to create highlighted search result excerpts for more information.

Manual highlighting

class whoosh.highlight.Highlighter(fragmenter=None, scorer=None, formatter=None, always_retokenize=False, order=<function FIRST>)

whoosh.highlight.highlight(text, terms, analyzer, fragmenter, formatter, top=3, scorer=None, minscore=1, order=<function FIRST>, mode='query')

Fragmenters

class whoosh.highlight.Fragmenter

fragment_matches(text, matched_tokens)

Yields Fragment objects based on the text and the matched terms.

Parameters:
text – the string being highlighted.
matched_tokens – a list of `analysis.Token` objects representing the term matches in the string.

fragment_tokens(text, all_tokens)

Yields Fragment objects based on the tokenized text.

Parameters:
text – the string being highlighted.
all_tokens – an iterator of `analysis.Token` objects from the string.

must_retokenize()

Returns True if this fragmenter requires retokenized text.

If this method returns True, the fragmenter’s fragment_tokens method will be called with an iterator of ALL tokens from the text, with the tokens for matched terms having the matched attribute set to True.

If this method returns False, the fragmenter’s fragment_matches method will be called with a LIST of matching tokens.

class whoosh.highlight.WholeFragmenter(charlimit=32768)

Doesn’t fragment the token stream. This object just returns the entire stream as one “fragment”. This is useful if you want to highlight the entire text.

Note that even if you use the WholeFragmenter, the highlight code will return no fragment if no terms matched in the given field. To return the whole fragment even in that case, call highlights() with minscore=0:

from whoosh import highlight, query

# mysearcher is assumed to be an open Searcher for your index
# Query where no terms match in the "text" field
q = query.Term("tag", "new")

r = mysearcher.search(q)
r.fragmenter = highlight.WholeFragmenter()
r.formatter = highlight.UppercaseFormatter()
# Since no terms in the "text" field matched, we get no fragments back
assert r[0].highlights("text") == ""

# If we lower the minimum score to 0, we get a fragment even though it
# has no matching terms
assert r[0].highlights("text", minscore=0) == "This is the text field."

class whoosh.highlight.SentenceFragmenter(maxchars=200, sentencechars='.!?', charlimit=32768)

Breaks the text up on sentence end punctuation characters (“.”, “!”, or “?”). This object works by looking in the original text for a sentence end as the next character after each token’s ‘endchar’.

When highlighting with this fragmenter, you should use an analyzer that does NOT remove stop words, for example:

sa = StandardAnalyzer(stoplist=None)

Parameters:	maxchars – The maximum number of characters allowed in a fragment.

class whoosh.highlight.ContextFragmenter(maxchars=200, surround=20, charlimit=32768)

Looks for matched terms and aggregates them with their surrounding context.

Parameters:
maxchars – The maximum number of characters allowed in a fragment.
surround – The number of extra characters of context to add both before the first matched term and after the last matched term.

class whoosh.highlight.PinpointFragmenter(maxchars=200, surround=20, autotrim=False, charlimit=32768)

This is a NON-RETOKENIZING fragmenter. It builds fragments from the positions of the matched terms.

Parameters:
maxchars – The maximum number of characters allowed in a fragment.
surround – The number of extra characters of context to add both before the first matched term and after the last matched term.
autotrim – automatically trims text before the first space and after the last space in the fragments, to try to avoid truncated words at the start and end. For short fragments or fragments with long runs between spaces this may give strange results.

Scorers

class whoosh.highlight.FragmentScorer

class whoosh.highlight.BasicFragmentScorer

Formatters

class whoosh.highlight.UppercaseFormatter(between='...')

Returns a string in which the matched terms are in UPPERCASE.

Parameters:	between – the text to add between fragments.

class whoosh.highlight.HtmlFormatter(tagname='strong', between='...', classname='match', termclass='term', maxclasses=5, attrquote='"')

Returns a string containing HTML formatting around the matched terms.

This formatter wraps matched terms in an HTML element with two class names. The first class name (set with the constructor argument classname) is the same for each match. The second class name (set with the constructor argument termclass) is different depending on which term matched. This allows you to give different formatting (for example, different background colors) to the different terms in the excerpt.

>>> from whoosh.highlight import HtmlFormatter
>>> hf = HtmlFormatter(tagname="span", classname="match", termclass="term")
>>> hf(mytext, myfragments)
'The <span class="match term0">template</span> <span class="match term1">geometry</span> is...'

This object maintains a dictionary mapping terms to HTML class names (e.g. term0 and term1 above), so that multiple excerpts will use the same class for the same term. If you want to re-use the same HtmlFormatter object with different searches, you should call HtmlFormatter.clear() between searches to clear the mapping.

Parameters:
tagname – the tag to wrap around matching terms.
between – the text to add between fragments.
classname – the class name to add to the elements wrapped around matching terms.
termclass – the class name prefix for the second class which is different for each matched term.
maxclasses – the maximum number of term classes to produce. This limits the number of classes you have to define in CSS by recycling term class names. For example, if you set maxclasses to 3 and have 5 terms, the 5 terms will use the CSS classes term0, term1, term2, term0, term1.
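The recycling described for maxclasses can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not Whoosh's implementation; assign_term_classes is a hypothetical helper written for this example.

```python
def assign_term_classes(terms, termclass="term", maxclasses=5):
    """Map each distinct term to a class name, recycling after maxclasses."""
    classes = {}
    for term in terms:
        if term not in classes:
            # The next free number wraps around once maxclasses is reached
            classes[term] = "%s%d" % (termclass, len(classes) % maxclasses)
    return classes


classes = assign_term_classes(
    ["alfa", "bravo", "charlie", "delta", "echo"], maxclasses=3
)
print(list(classes.values()))
# → ['term0', 'term1', 'term2', 'term0', 'term1']
```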

class whoosh.highlight.GenshiFormatter(qname='strong', between='...')

Returns a Genshi event stream containing HTML formatting around the matched terms.

Parameters:
qname – the QName for the tag to wrap around matched terms.
between – the text to add between fragments.

Utility classes

class whoosh.highlight.Fragment(text, matches, startchar=0, endchar=-1)

Represents a fragment (extract) from a hit document. This object is mainly used to keep track of the start and end points of the fragment and the “matched” character ranges inside; it does not contain the text of the fragment or do much else.

The useful attributes are:

Fragment.text: The entire original text from which this fragment is taken.
Fragment.matches: An ordered list of objects representing the matched terms in the fragment. These objects have startchar and endchar attributes.
Fragment.startchar: The index of the first character in the fragment.
Fragment.endchar: The index of the last character in the fragment.
Fragment.matched_terms: A set of the text of the matched terms in the fragment (if available).

Parameters:
text – the source text of the fragment.
matches – a list of objects which have `startchar` and `endchar` attributes, and optionally a `text` attribute.
startchar – the index into `text` at which the fragment starts. The default is 0.
endchar – the index into `text` at which the fragment ends. The default is -1, which is interpreted as the length of `text`.