Indexing and searching N-grams
Overview
N-gram indexing is a powerful method for getting fast, “search as you type” functionality like that in iTunes. It is also useful for quick and effective indexing of languages such as Chinese and Japanese, which are written without word breaks.
An N-gram is a group of N characters: bigrams are groups of two characters, trigrams are groups of three, and so on. For example, the bigrams of “search” are “se”, “ea”, “ar”, “rc”, and “ch”.
Whoosh includes two methods for analyzing N-gram fields: an N-gram tokenizer, and a filter that breaks tokens into N-grams.
whoosh.analysis.NgramTokenizer tokenizes the entire field into N-grams.
This is most useful for Chinese/Japanese/Korean text, where it is more
effective to index bigrams of characters than individual characters. Using
this tokenizer with languages that separate words with spaces leads to
spaces inside the tokens.
>>> from whoosh.analysis import NgramTokenizer
>>> ngt = NgramTokenizer(minsize=2, maxsize=4)
>>> [token.text for token in ngt(u"hi there")]
[u'hi', u'hi ', u'hi t', u'i ', u'i t', u'i th', u' t', u' th', u' the', u'th',
u'the', u'ther', u'he', u'her', u'here', u'er', u'ere', u're']
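For CJK text, one common setup is to make the tokenizer the analyzer of a regular TEXT field. The following is a minimal sketch, not from the original page; the path/content schema is an assumption for illustration:
>>> from whoosh.analysis import NgramTokenizer
>>> from whoosh.fields import Schema, ID, TEXT
>>> # Assumed example schema: with a single argument, NgramTokenizer
>>> # emits N-grams of exactly that size, i.e. character bigrams here.
>>> schema = Schema(path=ID(stored=True),
...                 content=TEXT(analyzer=NgramTokenizer(2)))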
whoosh.analysis.NgramFilter breaks individual tokens into N-grams as
part of an analysis pipeline. This is more useful for languages that
separate words with spaces.
>>> from whoosh.analysis import StandardAnalyzer, NgramFilter
>>> my_analyzer = StandardAnalyzer() | NgramFilter(minsize=3, maxsize=4)
>>> [token.text for token in my_analyzer(u"rendering shaders")]
[u'ren', u'rend', u'end', u'ende', u'nde', u'nder', u'der', u'deri', u'eri',
u'erin', u'rin', u'ring', u'ing', u'sha', u'shad', u'had', u'hade', u'ade',
u'ader', u'der', u'ders', u'ers']
Whoosh includes two pre-configured field types for N-grams:
whoosh.fields.NGRAM and whoosh.fields.NGRAMWORDS. The only
difference is that NGRAM runs all text through the N-gram filter, including
whitespace and punctuation, while NGRAMWORDS extracts words from the text
using a tokenizer, then runs each word through the N-gram filter.
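To make the field types concrete, here is an end-to-end sketch of “search as you type” using an NGRAMWORDS field. It is illustrative only, not from the original page; the field names and documents are assumptions:
import tempfile
from whoosh import index
from whoosh.fields import Schema, ID, NGRAMWORDS
from whoosh.qparser import QueryParser

# Assumed example schema: index the 2- to 4-grams of each word in the
# title, and store the title so it can be displayed in results.
schema = Schema(path=ID(stored=True),
                title=NGRAMWORDS(minsize=2, maxsize=4, stored=True))

ix = index.create_in(tempfile.mkdtemp(), schema)
writer = ix.writer()
writer.add_document(path=u"/a", title=u"Rendering shaders")
writer.add_document(path=u"/b", title=u"Shading renderers")
writer.commit()

# Search with whatever partial string the user has typed so far; the
# N-gram field types break the query text into the same N-grams used at
# indexing time, so u"rend" matches both titles above.
with ix.searcher() as searcher:
    results = searcher.search(QueryParser("title", ix.schema).parse(u"rend"))
    for hit in results:
        print(hit["title"])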
TBD.