Indexing and searching N-grams
N-gram indexing is a powerful method for getting fast, “search as you type” functionality like that in iTunes. It is also useful for quick and effective indexing of languages without word breaks, such as Chinese and Japanese.
N-grams are groups of N characters: bigrams are groups of two characters, trigrams are groups of three characters, and so on.
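As a rough illustration of this definition, independent of Whoosh, the following sketch slides a window of N characters across a string. The name `char_ngrams` is a hypothetical helper chosen for this example and is not part of the Whoosh API.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string of length L contains L - n + 1 windows of size n (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = char_ngrams("whoosh", 2)   # ['wh', 'ho', 'oo', 'os', 'sh']
trigrams = char_ngrams("whoosh", 3)  # ['who', 'hoo', 'oos', 'osh']
```

A string shorter than N simply yields no N-grams.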
Whoosh includes two methods for analyzing N-gram fields: an N-gram tokenizer, and a filter that breaks tokens into N-grams.
whoosh.analysis.NgramTokenizer tokenizes the entire field into N-grams.
This is most useful for Chinese, Japanese, and Korean text, where it helps
to index bigrams of characters rather than individual characters. Using this
tokenizer with Roman-alphabet languages produces tokens that contain spaces.
>>> from whoosh.analysis import NgramTokenizer
>>> ngt = NgramTokenizer(minsize=2, maxsize=4)
>>> [token.text for token in ngt(u"hi there")]
[u'hi', u'hi ', u'hi t', u'i ', u'i t', u'i th', u' t', u' th', u' the', u'th',
u'the', u'ther', u'he', u'her', u'here', u'er', u'ere', u're']
whoosh.analysis.NgramFilter breaks individual tokens into N-grams as
part of an analysis pipeline. This is more useful for languages with word
separation.
>>> from whoosh.analysis import StandardAnalyzer, NgramFilter
>>> my_analyzer = StandardAnalyzer() | NgramFilter(minsize=3, maxsize=4)
>>> [token.text for token in my_analyzer(u"rendering shaders")]
[u'ren', u'rend', u'end', u'ende', u'nde', u'nder', u'der', u'deri', u'eri',
u'erin', u'rin', u'ring', u'ing', u'sha', u'shad', u'had', u'hade', u'ade',
u'ader', u'der', u'ders', u'ers']
Whoosh includes two pre-configured field types for N-grams: whoosh.fields.NGRAM and whoosh.fields.NGRAMWORDS. The only difference is that NGRAM runs all text through the N-gram filter, including whitespace and punctuation, while NGRAMWORDS extracts words from the text using a tokenizer, then runs each word through the N-gram filter.
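The difference between the two field types can be sketched in plain Python. This is only an illustration of the behavior described above, not Whoosh's implementation: `ngrams_of`, `whole_text_ngrams`, and `per_word_ngrams` are hypothetical helpers, and the `\w+` word pattern is an assumed stand-in for whatever tokenizer the field is configured with.

```python
import re


def ngrams_of(text, minsize, maxsize):
    """All substrings of text with length minsize..maxsize, ordered by start."""
    grams = []
    for start in range(len(text) - minsize + 1):
        for size in range(minsize, maxsize + 1):
            # Skip windows that would run past the end of the text.
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def whole_text_ngrams(text, minsize=2, maxsize=4):
    """NGRAM-style: gram the raw text, so spaces and punctuation are included."""
    return ngrams_of(text, minsize, maxsize)


def per_word_ngrams(text, minsize=2, maxsize=4):
    """NGRAMWORDS-style: split into words first, then gram each word."""
    grams = []
    for word in re.findall(r"\w+", text):  # assumed word pattern
        grams.extend(ngrams_of(word, minsize, maxsize))
    return grams


print(whole_text_ngrams("hi there")[:3])  # ['hi', 'hi ', 'hi t']
print(per_word_ngrams("hi there")[:3])    # ['hi', 'th', 'the']
```

In the per-word version no gram ever spans a word boundary, which is why it tends to suit languages that separate words with spaces.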