Indexing and searching N-grams

Overview
N-gram indexing is a powerful method for getting fast, "search as you type" functionality like that of iTunes. It is also useful for quick and effective indexing of languages such as Chinese and Japanese, which are written without word breaks.
"N-grams" refers to groups of N characters: bigrams are groups of two characters, trigrams are groups of three characters, and so on.
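For example, here are the character bigrams and trigrams of a single word, computed with plain Python slicing (this snippet is purely illustrative and does not involve Whoosh):
>>> text = u"whoosh"
>>> [text[i:i + 2] for i in range(len(text) - 1)]  # bigrams
[u'wh', u'ho', u'oo', u'os', u'sh']
>>> [text[i:i + 3] for i in range(len(text) - 2)]  # trigrams
[u'who', u'hoo', u'oos', u'osh']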
Whoosh includes two methods for analyzing N-gram fields: an N-gram tokenizer and a filter that breaks tokens into N-grams.
whoosh.analysis.NgramTokenizer tokenizes the entire field into N-grams. This is most useful for languages such as Chinese, Japanese, and Korean, where it is better to index bigrams of characters than individual characters. Note that using this tokenizer with languages written in the Roman alphabet produces tokens that contain spaces.
>>> from whoosh.analysis import NgramTokenizer
>>> ngt = NgramTokenizer(minsize=2, maxsize=4)
>>> [token.text for token in ngt(u"hi there")]
[u'hi', u'hi ', u'hi t', u'i ', u'i t', u'i th', u' t', u' th', u' the', u'th',
u'the', u'ther', u'he', u'her', u'here', u'er', u'ere', u're']
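To use the tokenizer when indexing, you can pass it as the analyzer of a field. A minimal sketch (the field name body is just an example):
>>> from whoosh.fields import Schema, TEXT
>>> from whoosh.analysis import NgramTokenizer
>>> # "body" is an illustrative field name; character bigrams suit CJK text
>>> schema = Schema(body=TEXT(analyzer=NgramTokenizer(minsize=2, maxsize=2)))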
whoosh.analysis.NgramFilter breaks individual tokens into N-grams as part of an analysis pipeline. This is more useful for languages with word separation.
>>> from whoosh.analysis import StandardAnalyzer, NgramFilter
>>> my_analyzer = StandardAnalyzer() | NgramFilter(minsize=2, maxsize=4)
>>> [token.text for token in my_analyzer(u"rendering shaders")]
[u'ren', u'rend', u'end', u'ende', u'nde', u'nder', u'der', u'deri', u'eri',
u'erin', u'rin', u'ring', u'ing', u'sha', u'shad', u'had', u'hade', u'ade',
u'ader', u'der', u'ders', u'ers']
Whoosh includes two pre-configured field types for N-grams: whoosh.fields.NGRAM and whoosh.fields.NGRAMWORDS. The only difference is that NGRAM runs all text through the N-gram filter, including whitespace and punctuation, while NGRAMWORDS extracts words from the text using a tokenizer, then runs each word through the N-gram filter.
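As a sketch of putting these field types together for "search as you type", here is a minimal end-to-end example. The directory name indexdir and the field names path and title are assumptions for illustration, and the expected output assumes a fresh index containing only the one document added below:
>>> import os
>>> from whoosh import index
>>> from whoosh.fields import Schema, ID, NGRAMWORDS
>>> from whoosh.query import Term
>>> schema = Schema(path=ID(stored=True),
...                 title=NGRAMWORDS(minsize=2, maxsize=4))
>>> os.mkdir("indexdir")
>>> ix = index.create_in("indexdir", schema)
>>> writer = ix.writer()
>>> writer.add_document(path=u"/a", title=u"rendering shaders")
>>> writer.commit()
>>> searcher = ix.searcher()
>>> # the partial word a user has typed so far matches an indexed N-gram
>>> [hit["path"] for hit in searcher.search(Term("title", u"rend"))]
[u'/a']
>>> searcher.close()
Because the N-grams are indexed directly, each keystroke can be answered with an ordinary term query rather than a more expensive prefix or wildcard search.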
TBD.