`reading` module¶

This module contains classes that allow reading from an index.

Classes¶

class whoosh.reading.IndexReader¶

Do not instantiate this object directly. Instead use Index.reader().

all_doc_ids()¶: Returns an iterator of all (undeleted) document IDs in the reader.

all_stored_fields()¶: Yields the stored fields for all non-deleted documents.

abstract all_terms()¶: Yields (fieldname, text) tuples for every term in the index.

close()¶: Closes the open files associated with this reader.

codec()¶: Returns the whoosh.codec.base.Codec object used to read this reader’s segment. If this reader is not atomic (reader.is_atomic() == False), returns None.

column_reader(fieldname, column=None, reverse=False, translate=False)¶

Parameters:

fieldname – the name of the field for which to get a reader.
column – if passed, use this Column object instead of the one associated with the field in the Schema.
reverse – if True, reverses the order of keys returned by the reader’s sort_key() method. If the column type is not reversible, this will raise a NotImplementedError.
translate – if True, wrap the reader to call the field’s from_bytes() method on the returned values.

Returns:

a whoosh.columns.ColumnReader object.

corrector(fieldname)¶: Returns a whoosh.spelling.Corrector object that suggests corrections based on the terms in the given field.

abstract doc_count()¶: Returns the total number of UNDELETED documents in this reader.

abstract doc_count_all()¶: Returns the total number of documents, DELETED OR UNDELETED, in this reader.
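The two counts can be combined to find out how many documents the reader still holds that are marked deleted. A minimal sketch, assuming `reader` is an object obtained from Index.reader(); the helper name `deleted_doc_count` is illustrative and not part of Whoosh:

```python
def deleted_doc_count(reader):
    """Return how many documents in the reader are marked deleted.

    doc_count_all() counts every document, deleted or not, while
    doc_count() counts only undeleted ones, so the difference is the
    number of deleted documents the reader still knows about.
    """
    return reader.doc_count_all() - reader.doc_count()
```

A result greater than zero should agree with has_deletions() returning True.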

abstract doc_field_length(docnum, fieldname, default=0)¶: Returns the number of terms in the given field in the given document. This is used by some scoring algorithms.

abstract doc_frequency(fieldname, text)¶: Returns how many documents the given term appears in.

expand_prefix(fieldname, prefix)¶: Yields terms in the given field that start with the given prefix.
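Because expand_prefix() yields terms lazily, it can back a simple "complete as you type" lookup without reading the whole term list. The sketch below is one way to do that; `first_completions` and its `limit` parameter are hypothetical names, and depending on the Whoosh version and field type the yielded terms may be bytes rather than text:

```python
from itertools import islice


def first_completions(reader, fieldname, prefix, limit=10):
    """Return up to `limit` indexed terms that start with `prefix`."""
    # expand_prefix() is a generator, so islice() stops pulling terms
    # as soon as enough completions have been collected.
    return list(islice(reader.expand_prefix(fieldname, prefix), limit))
```

For example, `first_completions(reader, "content", "rend", limit=5)` would return at most five terms from the "content" field.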

abstract field_length(fieldname)¶: Returns the total number of terms in the given field. This is used by some scoring algorithms.

field_terms(fieldname)¶: Yields all term values (converted from on-disk bytes) in the given field.

first_id(fieldname, text)¶: Returns the first ID in the posting list for the given term. This may be optimized in certain backends.

abstract frequency(fieldname, text)¶: Returns the total number of instances of the given term in the collection.

generation()¶: Returns the generation of the index being read, or -1 if the backend is not versioned.

abstract has_deletions()¶: Returns True if the underlying index/segment has deleted documents.

abstract has_vector(docnum, fieldname)¶: Returns True if the given document has a term vector for the given field.

abstract indexed_field_names()¶: Returns an iterable of strings representing the names of the indexed fields. This may include additional names not explicitly listed in the Schema if you use “glob” fields.

abstract is_deleted(docnum)¶: Returns True if the given document number is marked deleted.

iter_docs()¶: Yields a series of (docnum, stored_fields_dict) tuples for the undeleted documents in the reader.

iter_field(fieldname, prefix='')¶: Yields (text, terminfo) tuples for all terms in the given field.

iter_from(fieldname, text)¶: Yields ((fieldname, text), terminfo) tuples for all terms in the reader, starting at the given term.

iter_postings()¶: Low-level method, yields all postings in the reader as (fieldname, text, docnum, weight, valuestring) tuples.

iter_prefix(fieldname, prefix)¶: Yields (text, terminfo) tuples for all terms in the given field with a certain prefix.

leaf_readers()¶: Returns a list of (IndexReader, docbase) pairs for the child readers of this reader if it is a composite reader. If this is not a composite reader, it returns [(self, 0)].
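The (IndexReader, docbase) pairs make it possible to work on each child reader separately and still report document numbers that are valid in the parent reader. The following sketch assumes that docbase is the offset to add to a child reader's local document numbers; `iter_global_docnums` is an illustrative name, not a Whoosh function:

```python
def iter_global_docnums(reader):
    """Yield parent-level document numbers for all undeleted documents."""
    for leaf, docbase in reader.leaf_readers():
        # Each leaf numbers its own documents starting from 0; docbase
        # shifts those local numbers into the parent reader's numbering.
        for local_docnum in leaf.all_doc_ids():
            yield docbase + local_docnum
```

For a reader that is not composite, leaf_readers() returns [(self, 0)], so the function simply yields the reader's own document numbers.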

lexicon(fieldname)¶: Yields all bytestrings in the given field.

abstract max_field_length(fieldname)¶: Returns the maximum length of the field across all documents. This is used by some scoring algorithms.

abstract min_field_length(fieldname)¶: Returns the minimum length of the field across all documents. This is used by some scoring algorithms.

most_distinctive_terms(fieldname, number=5, prefix='')¶: Returns the top ‘number’ terms with the highest tf*idf scores as a list of (score, text) tuples.
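To make the idea of a tf*idf ranking concrete, the sketch below scores terms using only statistics the reader exposes through iter_field() and doc_count(). It uses a common textbook weighting (term frequency times log(N / df) + 1); this is an assumption for illustration, and the exact formula Whoosh applies inside most_distinctive_terms() may differ. The name `top_tfidf_terms` is hypothetical:

```python
import heapq
import math


def top_tfidf_terms(reader, fieldname, number=5):
    """Rank a field's terms by a simple tf*idf score, highest first."""
    total_docs = reader.doc_count()
    scored = []
    for text, terminfo in reader.iter_field(fieldname):
        tf = terminfo.weight()         # total occurrences of the term
        df = terminfo.doc_frequency()  # number of documents containing it
        if df == 0:
            continue
        # Terms found in few documents get a larger idf factor.
        idf = math.log(total_docs / df) + 1.0
        scored.append((tf * idf, text))
    # nlargest() returns (score, text) tuples sorted from highest score down.
    return heapq.nlargest(number, scored)
```

The result has the same (score, text) shape as the list described above, so a term that is frequent but confined to a few documents ranks above one that appears everywhere.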

most_frequent_terms(fieldname, number=5, prefix='')¶: Returns the top ‘number’ most frequent terms in the given field as a list of (frequency, text) tuples.

abstract postings(fieldname, text)¶

Returns a Matcher for the postings of the given term.

>>> pr = reader.postings("content", "render")
>>> pr.skip_to(10)
>>> pr.id()
12

Parameters:

fieldname – the field name or field number of the term.
text – the text of the term.

Return type:

`whoosh.matching.Matcher`
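A Matcher is consumed by checking whether it is still active, reading the current posting, and advancing. The helper below is a minimal sketch of that loop using the Matcher methods is_active(), id(), weight() and next(); the name `posting_list` is illustrative only:

```python
def posting_list(matcher):
    """Collect (docnum, weight) pairs from a Matcher, in posting order."""
    pairs = []
    while matcher.is_active():
        # id() is the current document number; weight() is the term's
        # weight in that document.
        pairs.append((matcher.id(), matcher.weight()))
        matcher.next()
    return pairs
```

It could be called as `posting_list(reader.postings("content", "render"))`. Building a full list is only sensible for short posting lists; for long ones, process each posting inside the loop instead.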

segment()¶: Returns the whoosh.index.Segment object used by this reader. If this reader is not atomic (reader.is_atomic() == False), returns None.

storage()¶: Returns the whoosh.filedb.filestore.Storage object used by this reader to read its files. If the reader is not atomic (reader.is_atomic() == False), returns None.

abstract stored_fields(docnum)¶

Returns the stored fields for the given document number.

Parameters:	numerickeys – use field numbers as the dictionary keys instead of field names.

abstract term_info(fieldname, text)¶: Returns a TermInfo object allowing access to various statistics about the given term.

terms_from(fieldname, prefix)¶: Yields (fieldname, text) tuples for every term in the index starting at the given prefix.

terms_within(fieldname, text, maxdist, prefix=0)¶

Returns a generator of words in the given field within maxdist Damerau-Levenshtein edit distance of the given text.

Important: the terms are returned in no particular order. The only criterion is that they are within maxdist edits of text. You may want to run this method multiple times with increasing maxdist values to ensure you get the closest matches first. You may also have additional information (such as term frequency or an acoustic matching algorithm) you can use to rank terms with the same edit distance.

Parameters:

maxdist – the maximum edit distance.
prefix – require suggestions to share a prefix of this length with the given word. This is often justifiable since most misspellings do not involve the first letter of the word. Using a prefix dramatically decreases the time it takes to generate the list of words.
seen – an optional set object. Words that appear in the set will not be yielded.
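The advice about calling terms_within() repeatedly with increasing maxdist values can be sketched as follows. The wrapper keeps its own set of words already returned, so each word is reported once, at the smallest maxdist that produced it. `closest_first` and `max_dist` are hypothetical names introduced for this example:

```python
def closest_first(reader, fieldname, text, max_dist=2, prefix=0):
    """Yield (maxdist, word) pairs, nearest candidates first.

    The first element is the maxdist value at which the word was first
    returned, not necessarily its exact edit distance from `text`.
    """
    already_yielded = set()
    for dist in range(1, max_dist + 1):
        for word in reader.terms_within(fieldname, text, dist, prefix=prefix):
            if word not in already_yielded:
                already_yielded.add(word)
                yield dist, word
```

Words within the same round still arrive in no particular order, so a secondary ranking such as term frequency can be applied per round if needed.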

abstract vector(docnum, fieldname, format_=None)¶

Returns a Matcher object for the given term vector.

>>> docnum = searcher.document_number(path=u'/a/b/c')
>>> v = searcher.vector(docnum, "content")
>>> v.all_as("frequency")
[(u"apple", 3), (u"bear", 2), (u"cab", 2)]

Parameters:

docnum – the document number of the document for which you want the term vector.
fieldname – the field name or field number of the field for which you want the term vector.

Return type:

`whoosh.matching.Matcher`

vector_as(astype, docnum, fieldname)¶

Returns an iterator of (termtext, value) pairs for the terms in the given term vector. This is a convenient shortcut to calling vector() and using the Matcher object when all you want are the terms and/or values.

>>> docnum = searcher.document_number(path=u'/a/b/c')
>>> searcher.vector_as("frequency", docnum, "content")
[(u"apple", 3), (u"bear", 2), (u"cab", 2)]

Parameters:

docnum – the document number of the document for which you want the term vector.
fieldname – the field name or field number of the field for which you want the term vector.
astype – a string containing the name of the format you want the term vector’s data in, for example “weights”.

class whoosh.reading.MultiReader(readers, generation=None)¶: Do not instantiate this object directly. Instead use Index.reader().

class whoosh.reading.TermInfo(weight=0, df=0, minlength=None, maxlength=0, maxweight=0, minid=None, maxid=0)¶

Represents a set of statistics about a term. This object is returned by IndexReader.term_info(). These statistics may be useful for optimizations and scoring algorithms.

doc_frequency()¶: Returns the number of documents the term appears in.

max_id()¶: Returns the highest document ID this term appears in.

max_length()¶: Returns the length of the longest field value the term appears in.

max_weight()¶: Returns the number of times the term appears in the document in which it appears the most.

min_id()¶: Returns the lowest document ID this term appears in.

min_length()¶: Returns the length of the shortest field value the term appears in.

weight()¶: Returns the total frequency of the term across all documents.
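As an illustration of how these accessors fit together, the sketch below condenses a TermInfo object into a small dictionary. It relies only on the methods listed above; the function name `term_summary` and the dictionary keys are made up for this example:

```python
def term_summary(terminfo):
    """Summarize a TermInfo-like object as a plain dictionary."""
    docs = terminfo.doc_frequency()
    occurrences = terminfo.weight()
    return {
        "docs": docs,
        "occurrences": occurrences,
        # Average number of occurrences per document containing the term.
        "mean_per_doc": occurrences / docs if docs else 0.0,
        # Most occurrences found in any single document.
        "peak_in_one_doc": terminfo.max_weight(),
        # Lowest and highest document IDs in which the term appears.
        "id_span": (terminfo.min_id(), terminfo.max_id()),
    }
```

It could be used as `term_summary(reader.term_info("content", "render"))`.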

Exceptions¶

exception whoosh.reading.TermNotFound¶
