Release notes
- Whoosh 2.x release notes
- Whoosh 1.x release notes
- Whoosh 0.3 release notes
Quick start
- A quick introduction
- The Index and Schema objects
- The IndexWriter object
- The Searcher object
Introduction to Whoosh
- About Whoosh
- What is Whoosh?
- What can Whoosh do for you?
- Getting help with Whoosh
Glossary
Designing a schema
- About schemas and fields
- Built-in field types
- Creating a Schema
- Modifying the schema after indexing
- Dynamic fields
- Advanced schema setup
How to index documents
- Creating an Index object
- Clearing the index
- Indexing documents
  - Indexing and storing different values for the same field
  - Finishing adding documents
- Merging segments
- Deleting documents
- Updating documents
- Incremental indexing
- Clearing the index
How to search
- The Searcher object
- Results object
- Scoring and sorting
  - Scoring
  - Sorting
- Highlighting snippets and More Like This
- Filtering results
- Which terms from my query matched?
- Collapsing results
- Time limited searches
- Convenience methods
- Combining Results objects
Parsing user queries
- Overview
- Using the default parser
- Common customizations
- Advanced customization
The default query language
- Overview
- Individual terms and phrases
- Boolean operators
- Fields
- Inexact terms
- Ranges
- Boosting query elements
- Making a term from literal text
Indexing and parsing dates/times
- Indexing dates
- Parsing date queries
- About time zones and basetime
- Date parser notes
- Limitations
Query objects
About analyzers
- Overview
- Using analyzers
- Advanced Analysis
Stemming, variations, and accent folding
- The problem
- Stemming
- Variations
- Lemmatization
- Character folding
Indexing and searching N-grams
- Overview
Sorting and faceting
- Overview
- Sorting
- Grouping
- Facet types
- MultiFacet
- Missing values
- Using overlapping groups
- Using a custom sort order
- Expert: writing your own facet
How to create highlighted search result excerpts
- Overview
- Requirements
- How to
- The character limit
- Customizing the highlights
- Highlighter object
- Speeding up highlighting
  - PinpointFragmenter
  - PinpointFragmenter limitations
- Using the low-level API
  - Usage
Query expansion and Key word extraction
- Overview
- Usage
- Expansion models
“Did you mean… ?” Correcting errors in user queries
- Overview
- Pulling suggestions from an indexed field
- Pulling suggestions from a word list
- Merging two or more correctors
- Correcting user queries
Field caches
- Customizing cache behaviour
- Creating a custom caching policy
Tips for speeding up batch indexing
- Overview
- StemmingAnalyzer cache
- The limitmb parameter
- The procs parameter
- The multisegment parameter
Concurrency, locking, and versioning
- Concurrency
- Locking
  - Lock files
- Versioning
Indexing and searching document hierarchies
- Overview
- Using nested document indexing
- Using query-time joins
Whoosh recipes
- General
  - Get the stored fields for a document from the document number
- Analysis
  - Eliminate words shorter/longer than N
  - Allow optional case-sensitive searches
- Searching
  - Find every document
  - iTunes-style search-as-you-type
- Shortcuts
  - Look up documents by a field value
- Sorting and scoring
  - Score results based on the position of the matched term
- Results
  - How many hits were there?
  - Which terms matched in each hit?
- Global information
Whoosh API
- analysis module
  - Analyzers
  - Tokenizers
  - Filters
  - Token classes and functions
    - Token
    - unstopped()
- codec.base module
  - Classes
- collectors module
  - Base classes
  - Basic collectors
  - Wrappers
- columns module
  - Base classes
  - Basic columns
  - Technical columns
  - Experimental columns
    - ClampedNumericColumn
- fields module
  - Schema class
    - Schema
    - SchemaClass
  - FieldType base class
    - FieldType
  - Pre-made field types
    - ID
    - IDLIST
    - STORED
    - KEYWORD
    - TEXT
    - NUMERIC
    - DATETIME
    - BOOLEAN
    - NGRAM
    - NGRAMWORDS
  - Exceptions
    - FieldConfigurationError
    - UnknownFieldError
- filedb.filestore module
  - Base class
    - Storage
  - Implementation classes
    - FileStorage
    - RamStorage
  - Helper functions
    - copy_storage()
    - copy_to_ram()
  - Exceptions
    - ReadOnlyError
- filedb.filetables module
  - Hash file
    - HashWriter
      - HashWriter.add()
      - HashWriter.add_all()
    - HashReader
  - Ordered Hash file
    - OrderedHashWriter
    - OrderedHashReader
- filedb.structfile module
  - Classes
- formats module
  - Base class
    - Format
  - Formats
- highlight module
  - Manual highlighting
    - Highlighter
    - highlight()
  - Fragmenters
  - Scorers
    - FragmentScorer
    - BasicFragmentScorer
  - Formatters
  - Utility classes
    - Fragment
- support.bitvector module
  - Base classes
    - DocIdSet
    - BaseBitSet
  - Implementation classes
- index module
  - Functions
  - Base class
    - Index
  - Implementation
    - FileIndex
  - Exceptions
- lang.morph_en module
  - variations()
- lang.porter module
  - stem()
- lang.wordnet module
  - Thesaurus
    - Thesaurus
  - Low-level functions
- matching module
  - Matchers
  - Exceptions
    - ReadTooFar
    - NoQualityAvailable
- qparser module
  - Parser object
    - QueryParser
    - Pre-made configurations
  - Plug-ins
  - Syntax node objects
- query module
  - Base classes
  - Query classes
    - Term
    - Variations
    - FuzzyTerm
    - Phrase
    - And
    - Or
    - DisjunctionMax
    - Not
    - Prefix
    - Wildcard
    - Regex
    - TermRange
    - NumericRange
    - DateRange
    - Every
    - NullQuery
  - Binary queries
  - Span queries
  - Special queries
  - Exceptions
    - QueryError
- reading module
  - Classes
  - Exceptions
    - TermNotFound
- scoring module
  - Base classes
  - Scoring algorithm classes
  - Scoring utility classes
- searching module
  - Searching classes
    - Searcher
  - Results classes
  - Exceptions
    - NoTermsException
    - TimeLimit
- sorting module
  - Base types
    - FacetType
      - FacetType.categorizer()
    - Categorizer
  - Facet types
  - Facets object
    - Facets
  - FacetType objects
- spelling module
  - Corrector objects
  - QueryCorrector objects
- support.charset module
  - default_charset
  - charset_table_to_dict()
- support.levenshtein module
  - relative()
  - distance()
- util module
  - fib()
  - make_binary_tree()
  - make_weighted_tree()
  - synchronized()
  - unclosed()
- writing module
  - Writer
    - IndexWriter
  - Utility writers
    - BufferedWriter
    - AsyncWriter
  - Exceptions
    - IndexingError
Technical notes
- How to implement a new backend
  - Index
  - IndexWriter
  - IndexReader
  - Matcher
- filedb notes
  - Files created

Whoosh 2.x release notes¶

Whoosh 2.7¶

Removed on-disk word graph implementation of spell checking in favor of much simpler and faster FSA implementation over the term file.
Many bug fixes.
Removed backwards compatibility with indexes created by versions prior to 2.5. You may need to re-index if you are using an old index that hasn’t been updated.
This is the last 2.x release before a major overhaul that will break backwards compatibility.

Whoosh 2.5¶

Whoosh 2.5 will read existing indexes, but segments created by 2.5 will not be readable by older versions of Whoosh.
As a replacement for field caches to speed up sorting, Whoosh now supports adding a sortable=True keyword argument to fields. This makes Whoosh store a sortable representation of the field’s values in a “column” format (which associates a “key” value with each document). This is more robust, efficient, and customizable than the old behavior. You should now specify sortable=True on fields that you plan on using to sort or group search results.

(You can still sort/group on fields that don’t have sortable=True, however it will use more RAM and be slower as Whoosh caches the field values in memory.)

Fields that use sortable=True can avoid specifying stored=True. The field’s value will still be available on Hit objects (the value will be retrieved from the column instead of from the stored fields). This may actually be faster for certain types of values.
Whoosh will now detect common types of OR queries and use optimized read-ahead matchers to speed them up by several times.
Whoosh now includes pure-Python implementations of the Snowball stemmers and stop word lists for various languages adapted from NLTK. These are available through the whoosh.analysis.LanguageAnalyzer analyzer or through the lang= keyword argument to the TEXT field.
You can now use the whoosh.filedb.filestore.Storage.create() and whoosh.filedb.filestore.Storage.destory() methods as a consistent API to set up and tear down different types of storage.
Many bug fixes and speed improvements.
Switched unit tests to use py.test instead of nose.
Removed obsolete SpellChecker class.

Whoosh 2.4¶

By default, Whoosh now assembles the individual files of a segment into a single file when committing. This has a small performance penalty but solves a problem where Whoosh can keep too many files open. Whoosh is also now smarter about using mmap.
Added functionality to index and search hierarchical documents. See Indexing and searching document hierarchies.
Rewrote the Directed Acyclic Word Graph implementation (used in spell checking) to be faster and more space-efficient. Word graph files created by previous versions will be ignored, meaning that spell checking may become slower unless/until you replace the old segments (for example, by optimizing).
Rewrote multiprocessing indexing to be faster and simpler. You can now do myindex.writer(procs=n) to get a multiprocessing writer, or myindex.writer(procs=n, multisegment=True) to get a multiprocessing writer that leaves behind multiple segments, like the old MultiSegmentWriter. (MultiSegmentWriter is still available as a function that returns the new class.)
When creating Term query objects for special fields (e.g. NUMERIC or BOOLEAN), you can now use the field’s literal type instead of a string as the second argument, for example Term("num", 20) or Term("bool", True). (This change may cause problems interacting with functions that expect query objects to be pure textual, such as spell checking.)
All writing to and reading from on-disk indexes is now done through “codec” objects. This architecture should make it easier to add optional or experimental features, and maintain backwards compatibility.
Fixes issues #75, #137, #206, #213, #215, #219, #223, #226, #230, #233, #238, #239, #240, #241, #243, #244, #245, #252, #253, and other bugs. Thanks to Thomas Waldmann and Alexei Gousev for the help!

Whoosh 2.3.2¶

Fixes bug in BM25F scoring function, leading to increased precision in search results.
Fixes issues #203, #205, #206, #208, #209, #212.

Whoosh 2.3.1¶

Fixes issue #200.

Whoosh 2.3¶

Added a whoosh.query.Regex term query type, similar to whoosh.query.Wildcard. The parser does not allow regex term queries by default. You need to add the whoosh.qparser.RegexPlugin plugin. After you add the plugin, you can use r"expression" query syntax for regular expression term queries. For example, r"foo.*bar".
Added the whoosh.qparser.PseudoFieldPlugin parser plugin. This plugin lets you create “pseudo-fields” that run a transform function on whatever query syntax the user applies the field to. This is fairly advanced functionality right now; I’m trying to think of ways to make its power easier to access.
The documents in the lists in the dictionary returned by Results.groups() by default are now in the same relative order as in the results. This makes it much easier to display the “top N” results in each category, for example.
The groupids keyword argument to Searcher.search has been removed. Instead you can now pass a whoosh.sorting.FacetMap object to the Searcher.search method’s maptype argument to control how faceted documents are grouped, and/or set the maptype argument on individual whoosh.sorting.FacetType` objects to set custom grouping per facet. See Sorting and faceting for more information.
Calling Searcher.documents() or Searcher.document_numbers() with no arguments now yields all documents/numbers.
Calling Writer.update_document() with no unique fields is now equivalent to calling Writer.add_document() with the same arguments.
Fixed a problem with keyword expansion where the code was building a cache that was fast on small indexes, but unacceptably slow on large indexes.
Added the hyphen (-) to the list of characters that match a “wildcard” token, to make parsing slightly more predictable. A true fix will have to wait for another parser rewrite.
Fixed an unused __future__ import and use of float("nan") which were breaking under Python 2.5.
Fixed a bug where vectored fields with only one term stored an empty term vector.
Various other bug fixes.

Whoosh 2.2¶

Fixes several bugs, including a bad bug in BM25F scoring.
Added allow_overlap option to whoosh.sorting.StoredFieldFacet.
In add_document(), You can now pass query-like strings for BOOLEAN and DATETIME fields (e.g boolfield="true" and dtfield="20101131-16:01") as an alternative to actual bool or datetime objects. The implementation of this is incomplete: it only works in the default filedb backend, and if the field is stored, the stored value will be the string, not the parsed object.
Added whoosh.analysis.CompoundWordFilter and whoosh.analysis.TeeFilter.

Whoosh 2.1¶

This release fixes several bugs, and contains speed improvments to highlighting. See How to create highlighted search result excerpts for more information.

Whoosh 2.0¶

Improvements¶

Whoosh is now compatible with Python 3 (tested with Python 3.2). Special thanks to Vinay Sajip who did the work, and also Jordan Sherer who helped fix later issues.
Sorting and grouping (faceting) now use a new system of “facet” objects which are much more flexible than the previous field-based system.

For example, to sort by first name and then score:
```
from whoosh import sorting

mf = sorting.MultiFacet([sorting.FieldFacet("firstname"),
                         sorting.ScoreFacet()])
results = searcher.search(myquery, sortedby=mf)
```
In addition to the previously supported sorting/grouping by field contents and/or query results, you can now use numeric ranges, date ranges, score, and more. The new faceting system also supports overlapping groups.

(The old “Sorter” API still works but is deprecated and may be removed in a future version.)

See Sorting and faceting for more information.
Completely revamped spell-checking to make it much faster, easier, and more flexible. You can enable generation of the graph files use by spell checking using the spelling=True argument to a field type:
```
schema = fields.Schema(text=fields.TEXT(spelling=True))
```
(Spelling suggestion methods will work on fields without spelling=True but will slower.) The spelling graph will be updated automatically as new documents are added – it is no longer necessary to maintain a separate “spelling index”.

You can get suggestions for individual words using whoosh.searching.Searcher.suggest():
```
suglist = searcher.suggest("content", "werd", limit=3)
```
Whoosh now includes convenience methods to spell-check and correct user queries, with optional highlighting of corrections using the whoosh.highlight module:
```
from whoosh import highlight, qparser

# User query string
qstring = request.get("q")

# Parse into query object
parser = qparser.QueryParser("content", myindex.schema)
qobject = parser.parse(qstring)

results = searcher.search(qobject)

if not results:
  correction = searcher.correct_query(gobject, gstring)
  # correction.query = corrected query object
  # correction.string = corrected query string

  # Format the corrected query string with HTML highlighting
  cstring = correction.format_string(highlight.HtmlFormatter())
```
Spelling suggestions can come from field contents and/or lists of words. For stemmed fields the spelling suggestions automatically use the unstemmed forms of the words.

There are APIs for spelling suggestions and query correction, so highly motivated users could conceivably replace the defaults with more sophisticated behaviors (for example, to take context into account).

See “Did you mean… ?” Correcting errors in user queries for more information.
whoosh.query.FuzzyTerm now uses the new word graph feature as well and so is much faster.
You can now set a boost factor for individual documents as you index them, to increase the score of terms in those documents in searches. See the documentation for the add_document() for more information.
Added built-in recording of which terms matched in which documents. Use the terms=True argument to whoosh.searching.Searcher.search() and use whoosh.searching.Hit.matched_terms() and whoosh.searching.Hit.contains_term() to check matched terms.
Whoosh now supports whole-term quality optimizations, so for example if the system knows that a UnionMatcher cannot possibly contribute to the “top N” results unless both sub-matchers match, it will replace the UnionMatcher with an IntersectionMatcher which is faster to compute. The performance improvement is not as dramatic as from block quality optimizations, but it can be noticeable.
Fixed a bug that prevented block quality optimizations in queries with words not in the index, which could severely degrade performance.
Block quality optimizations now use the actual scoring algorithm to calculate block quality instead of an approximation, which fixes issues where ordering of results could be different for searches with and without the optimizations.
the BOOLEAN field type now supports field boosts.
Re-architected the query parser to make the code easier to understand. Custom parser plugins from previous versions will probably break in Whoosh 2.0.
Various bug-fixes and performance improvements.
Removed the “read lock”, which caused more problems than it solved. Now when opening a reader, if segments are deleted out from under the reader as it is opened, the code simply retries.

Compatibility¶

The term quality optimizations required changes to the on-disk formats. Whoosh 2.0 if backwards-compatible with the old format. As you rewrite an index using Whoosh 2.0, by default it will use the new formats for new segments, making the index incompatible with older versions.

To upgrade an existing index to use the new formats immediately, use Index.optimize().
Removed the experimental TermTrackingCollector since it is replaced by the new built-in term recording functionality.
Removed the experimental Searcher.define_facets feature until a future release when it will be replaced by a more robust and useful feature.
Reader iteration methods (__iter__, iter_from, iter_field, etc.) now yield whoosh.reading.TermInfo objects.
The arguments to whoosh.query.FuzzyTerm changed.

Whoosh 2.x release notes

Whoosh 2.x release notes

Table Of Contents

Whoosh 2.x release notes¶

Whoosh 2.7¶

Whoosh 2.5¶

Whoosh 2.4¶

Whoosh 2.3.2¶

Whoosh 2.3.1¶

Whoosh 2.3¶

Whoosh 2.2¶

Whoosh 2.1¶

Whoosh 2.0¶

Improvements¶

Compatibility¶