fields module¶
Contains functions and classes related to fields.
Schema class¶
-
class
whoosh.fields.
Schema
(**fields)¶ Represents the collection of fields in an index. Maps field names to FieldType objects which define the behavior of each field.
Low-level parts of the index use field numbers instead of field names for compactness. This class has several methods for converting between the field name, field number, and field object itself.
All keyword arguments to the constructor are treated as fieldname = fieldtype pairs. The fieldtype can be an instantiated FieldType object, or a FieldType sub-class (in which case the Schema will instantiate it with the default constructor before adding it).
For example:
s = Schema(content = TEXT, title = TEXT(stored = True), tags = KEYWORD(stored = True))
-
add
(name, fieldtype, glob=False)¶ Adds a field to this schema.
Parameters: - name – The name of the field.
- fieldtype – An instantiated fields.FieldType object, or a FieldType subclass. If you pass an instantiated object, the schema will use that as the field configuration for this field. If you pass a FieldType subclass, the schema will automatically instantiate it with the default constructor.
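The class-or-instance rule described above can be sketched in plain Python. This is only an illustration: `MiniSchema` and `FieldStub` are hypothetical stand-ins, not Whoosh classes, and the real Schema.add() may perform validation that is not shown here.

```python
class FieldStub:
    """Hypothetical stand-in for a FieldType subclass."""

    def __init__(self, stored=False):
        self.stored = stored


class MiniSchema:
    """Hypothetical, simplified schema mapping field names to field objects."""

    def __init__(self, **fields):
        self._fields = {}
        for name, fieldtype in fields.items():
            self.add(name, fieldtype)

    def add(self, name, fieldtype):
        # If a class was passed rather than an instance, instantiate it
        # with its default constructor before storing it.
        if isinstance(fieldtype, type):
            fieldtype = fieldtype()
        self._fields[name] = fieldtype

    def names(self):
        return sorted(self._fields)


# Mix a bare class and a configured instance, as in the Schema example.
schema = MiniSchema(content=FieldStub, title=FieldStub(stored=True))
schema.add("tags", FieldStub)
```

Passing the bare class is a convenience for the default configuration; passing an instance is how you supply non-default options such as stored=True.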
-
copy
()¶ Returns a shallow copy of the schema. The field instances are not deep copied, so they are shared between schema copies.
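To illustrate what sharing field instances means in practice, here is a plain-Python sketch of shallow-copy semantics. It uses a dictionary and a hypothetical `FieldStub` class rather than a real Schema, so treat it as an analogy and not as Whoosh's implementation.

```python
import copy


class FieldStub:
    """Hypothetical stand-in for a field object."""

    def __init__(self, stored=False):
        self.stored = stored


fields = {"title": FieldStub(stored=True), "content": FieldStub()}

# Shallow copy: a new container that refers to the same field objects.
fields_copy = copy.copy(fields)

# Adding a field to the copy does not affect the original container...
fields_copy["tags"] = FieldStub()

# ...but mutating a shared field object is visible through both.
fields_copy["title"].stored = False
```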
-
items
()¶ Returns a list of (“fieldname”, field_object) pairs for the fields in this schema.
-
names
(check_names=None)¶ Returns a list of the names of the fields in this schema.
Parameters: check_names – (optional) a sequence of field names to check against the schema’s dynamic fields. Names the schema accepts are included in the result list; unsupported names are omitted. Static field names may also appear in check_names without creating duplicates in the result list.
-
scorable_names
()¶ Returns a list of the names of fields that store field lengths.
-
stored_names
()¶ Returns a list of the names of fields that are stored.
-
-
class
whoosh.fields.
SchemaClass
(*args, **kwargs)¶ Allows you to define a schema using declarative syntax, similar to Django models:
class MySchema(SchemaClass):
    path = ID
    date = DATETIME
    content = TEXT
You can use inheritance to share common fields between schemas:
class Parent(SchemaClass):
    path = ID(stored=True)
    date = DATETIME

class Child1(Parent):
    content = TEXT(positions=False)

class Child2(Parent):
    tags = KEYWORD
This class overrides __new__ so instantiating your sub-class always results in an instance of Schema.
>>> class MySchema(SchemaClass):
...     title = TEXT(stored=True)
...     content = TEXT
...
>>> s = MySchema()
>>> type(s)
<class 'whoosh.fields.Schema'>
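The `__new__` technique can be sketched in plain Python. The names `PlainSchema` and `DeclarativeBase` are hypothetical, and this sketch collects fields by walking the class hierarchy, which is an assumption about one way to do it and not a description of how Whoosh collects them.

```python
class PlainSchema:
    """Hypothetical stand-in for the ordinary schema class."""

    def __init__(self, **fields):
        self.fields = dict(fields)


class DeclarativeBase:
    """Hypothetical base class: instantiating a subclass yields a PlainSchema."""

    def __new__(cls, *args, **kwargs):
        fields = {}
        # Walk from the most basic class to the most derived one so that
        # subclasses inherit (and may override) their parents' fields.
        for klass in reversed(cls.__mro__):
            for name, value in vars(klass).items():
                if not name.startswith("_"):
                    fields[name] = value
        # Returning an object that is not an instance of cls means Python
        # skips cls.__init__ and hands back the PlainSchema directly.
        return PlainSchema(**fields)


class Parent(DeclarativeBase):
    path = "ID"        # strings stand in for field types in this sketch


class Child(Parent):
    content = "TEXT"


s = Child()
```

Here `type(s)` is `PlainSchema`, and `s.fields` contains both the inherited `path` and the locally declared `content`.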
FieldType base class¶
-
class
whoosh.fields.
FieldType
(format, analyzer, scorable=False, stored=False, unique=False, multitoken_query='default', sortable=False, vector=None)¶ Represents a field configuration.
The FieldType object supports the following attributes:
- format (formats.Format): the storage format for posting blocks.
- analyzer (analysis.Analyzer): the analyzer to use to turn text into terms.
- scorable (boolean): whether searches against this field may be scored. This controls whether the index stores per-document field lengths for this field.
- stored (boolean): whether the content of this field is stored for each document. For example, in addition to indexing the title of a document, you usually want to store the title so it can be presented as part of the search results.
- unique (boolean): whether this field’s value is unique to each document. For example, ‘path’ or ‘ID’. IndexWriter.update_document() will use fields marked as ‘unique’ to find the previous version of a document being updated.
- multitoken_query is a string indicating what kind of query to use when a “word” in a user query parses into multiple tokens. The string is interpreted by the query parser. The strings understood by the default query parser are “first” (use first token only), “and” (join the tokens with an AND query), “or” (join the tokens with OR), “phrase” (join the tokens with a phrase query), and “default” (use the query parser’s default join type).
- vector (formats.Format or boolean): the format to use to store term vectors. If not a Format object, any true value means to use the index format as the term vector format. Any false value means don’t store term vectors for this field.
The constructor for the base field type simply lets you supply your own attribute values. Subclasses may configure some or all of this for you.
-
clean
()¶ Clears any cached information in the field and any child objects.
-
index
(value, **kwargs)¶ Returns an iterator of (btext, frequency, weight, encoded_value) tuples for each unique word in the input value.
The default implementation uses the analyzer attribute to tokenize the value into strings, then encodes them into bytes using UTF-8.
-
parse_query
(fieldname, qstring, boost=1.0)¶ When self_parsing() returns True, the query parser will call this method to parse basic query text.
-
parse_range
(fieldname, start, end, startexcl, endexcl, boost=1.0)¶ When self_parsing() returns True, the query parser will call this method to parse range query text. If this method returns None instead of a query object, the parser will fall back to parsing the start and end terms using process_text().
-
process_text
(qstring, mode='', **kwargs)¶ Analyzes the given string and returns an iterator of token texts.
>>> field = fields.TEXT()
>>> list(field.process_text("The ides of March"))
["ides", "march"]
-
self_parsing
()¶ Subclasses should override this method to return True if they want the query parser to call the field’s parse_query() method instead of running the analyzer on text in this field. This is useful where the field needs full control over how queries are interpreted, such as in the numeric field type.
-
separate_spelling
()¶ Returns True if the field stores unstemmed words in a separate field for spelling suggestions.
-
sortable_terms
(ixreader, fieldname)¶ Returns an iterator of the “sortable” tokens in the given reader and field. These values can be used for sorting. The default implementation simply returns all tokens in the field.
This can be overridden by field types such as NUMERIC where some values in a field are not useful for sorting.
-
spellable_words
(value)¶ Returns an iterator of each unique word (in sorted order) in the input value, suitable for inclusion in the field’s word graph.
The default behavior is to call the field analyzer with the keyword argument no_morph=True, which should make the analyzer skip any morphological transformation filters (e.g. stemming) to preserve the original form of the words. Exotic field types may need to override this behavior.
-
spelling_fieldname
(fieldname)¶ Returns the name of a field to use for spelling suggestions instead of this field.
Parameters: fieldname – the name of this field.
-
subfields
()¶ Returns an iterator of (name_prefix, fieldobject) pairs for the fields that need to be indexed when content is put in this field. The default implementation simply yields ("", self).
-
supports
(name)¶ Returns True if the underlying format supports the given posting value type.
>>> field = TEXT()
>>> field.supports("positions")
True
>>> field.supports("chars")
False
-
to_bytes
(value)¶ Returns a bytes representation of the given value, appropriate to be written to disk. The default implementation assumes a unicode value and encodes it using UTF-8.
-
to_column_value
(value)¶ Returns an object suitable to be inserted into the document values column for this field. The default implementation simply calls self.to_bytes(value).
-
tokenize
(value, **kwargs)¶ Analyzes the given string and returns an iterator of Token objects (note: for performance reasons, actually the same token yielded over and over with different attributes).
Pre-made field types¶
-
class
whoosh.fields.
ID
(stored=False, unique=False, field_boost=1.0, sortable=False, analyzer=None)¶ Configured field type that indexes the entire value of the field as one token. This is useful for data you don’t want to tokenize, such as the path of a file.
Parameters: stored – Whether the value of this field is stored with the document.
-
class
whoosh.fields.
IDLIST
(stored=False, unique=False, expression=None, field_boost=1.0)¶ Configured field type for fields containing IDs separated by whitespace and/or punctuation (or anything else, using the expression param).
Parameters: - stored – Whether the value of this field is stored with the document.
- unique – Whether the value of this field is unique per-document.
- expression – The regular expression object to use to extract tokens. The default expression breaks tokens on CRs, LFs, tabs, spaces, commas, and semicolons.
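As a rough sketch of the default splitting behavior, the following uses the standard re module only. The pattern is written from the description above (break on CRs, LFs, tabs, spaces, commas, and semicolons) and is an assumption, not necessarily the exact expression Whoosh uses; `split_ids` is a hypothetical helper.

```python
import re

# Assumed pattern: a token is any run of characters that are not
# CR, LF, tab, space, comma, or semicolon.
ID_TOKEN = re.compile(r"[^\r\n\t ,;]+")


def split_ids(value, expression=ID_TOKEN):
    """Return the ID tokens that the expression extracts from value."""
    return expression.findall(value)


tokens = split_ids("a1, b2;c3\td4\r\ne5")

# A custom expression changes what counts as a token, for example
# treating only runs of digits as IDs.
digits_only = split_ids("id-10 id-20", expression=re.compile(r"\d+"))
```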
-
class
whoosh.fields.
STORED
¶ Configured field type for fields you want to store but not index.
-
class
whoosh.fields.
KEYWORD
(stored=False, lowercase=False, commas=False, scorable=False, unique=False, field_boost=1.0, sortable=False, vector=None, analyzer=None)¶ Configured field type for fields containing space-separated or comma-separated keyword-like data (such as tags). The default is to not store positional information (so phrase searching is not allowed in this field) and to not make the field scorable.
Parameters: - stored – Whether to store the value of the field with the document.
- commas – Whether this is a comma-separated field. If this is False (the default), it is treated as a space-separated field.
- scorable – Whether this field is scorable.
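The difference between space-separated and comma-separated values can be sketched in plain Python. `split_keywords` is a hypothetical helper that mirrors the commas and lowercase options in the signature; stripping whitespace around comma-separated items is an assumption, and Whoosh’s own analyzers may treat edge cases differently.

```python
def split_keywords(value, commas=False, lowercase=False):
    """Split keyword-like text into individual keywords."""
    if commas:
        # Comma-separated: each keyword may itself contain spaces.
        tokens = [part.strip() for part in value.split(",")]
        tokens = [t for t in tokens if t]
    else:
        # Space-separated (the default): any whitespace ends a keyword.
        tokens = value.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return tokens


space_tags = split_keywords("red green blue")
comma_tags = split_keywords("New York, Paris", commas=True, lowercase=True)
```

The comma-separated form is what you would want when individual keywords can contain spaces, as with "New York" here.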
-
class
whoosh.fields.
TEXT
(analyzer=None, phrase=True, chars=False, stored=False, field_boost=1.0, multitoken_query='default', spelling=False, sortable=False, lang=None, vector=None, spelling_prefix='spell_')¶ Configured field type for text fields (for example, the body text of an article). The default is to store positional information to allow phrase searching. This field type is always scorable.
Parameters: - analyzer – The analysis.Analyzer to use to index the field contents. See the analysis module for more information. If you omit this argument, the field uses analysis.StandardAnalyzer.
- phrase – Whether to store positional information to allow phrase searching.
- chars – Whether to store character ranges along with positions. If this is True, “phrase” is also implied.
- stored – Whether to store the value of this field with the document. Since this field type generally contains a lot of text, you should avoid storing it with the document unless you need to, for example to allow fast excerpts in the search results.
- spelling – if True, and if the field’s analyzer changes the form of term text (such as a stemming analyzer), this field will store extra information in a separate field (named using the spelling_prefix keyword argument) so that spelling suggestions can use the unchanged word forms.
- sortable – If True, make this field sortable using the default column type. If you pass a whoosh.columns.Column instance instead of True, the field will use the given column type.
- lang – automatically configure a whoosh.analysis.LanguageAnalyzer for the given language. This is ignored if you also specify an analyzer.
- vector – if this value evaluates to true, store a list of the terms in this field in each document. If the value is an instance of whoosh.formats.Format, the index will use the object to store the term vector. Any other true value (e.g. vector=True) will use the field’s index format to store the term vector as well.
-
class
whoosh.fields.
NUMERIC
(numtype=<class 'int'>, bits=32, stored=False, unique=False, field_boost=1.0, decimal_places=0, shift_step=4, signed=True, sortable=False, default=None)¶ Special field type that lets you index integer or floating point numbers in relatively short fixed-width terms. The field converts numbers to sortable bytes for you before indexing.
You specify the numeric type of the field (int or float) when you create the NUMERIC object. The default is int. For int, you can specify a size in bits (32 or 64). For both int and float you can specify a signed keyword argument (default is True).
>>> schema = Schema(path=STORED, position=NUMERIC(int, 64, signed=False))
>>> ix = storage.create_index(schema)
>>> with ix.writer() as w:
...     w.add_document(path="/a", position=5820402204)
...
You can also use the NUMERIC field to store Decimal instances by specifying a type of int or long and the decimal_places keyword argument. This simply multiplies each number by (10 ** decimal_places) before storing it as an integer. Of course this may throw away decimal precision (by truncating, not rounding) and imposes the same maximum value limits as int/long, but these may be acceptable for certain applications.
>>> from decimal import Decimal
>>> schema = Schema(path=STORED, position=NUMERIC(int, decimal_places=4))
>>> ix = storage.create_index(schema)
>>> with ix.writer() as w:
...     w.add_document(path="/a", position=Decimal("123.45"))
...
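The scaling described above can be worked through with the standard decimal module alone. `to_scaled_int` and `from_scaled_int` are hypothetical helpers written from that description; they do not call Whoosh, and the field’s actual encoding to sortable bytes is not shown.

```python
from decimal import Decimal


def to_scaled_int(value, decimal_places):
    """Multiply by 10 ** decimal_places and truncate to an integer."""
    scaled = Decimal(value) * (10 ** decimal_places)
    # int() on a Decimal truncates toward zero; it does not round.
    return int(scaled)


def from_scaled_int(stored, decimal_places):
    """Recover a Decimal by shifting the decimal point back."""
    return Decimal(stored).scaleb(-decimal_places)


exact = to_scaled_int(Decimal("123.45"), 4)       # fits in 4 places
truncated = to_scaled_int(Decimal("1.23456"), 4)  # fifth place is dropped
restored = from_scaled_int(truncated, 4)
```

With decimal_places=4, a value such as 1.23456 comes back as 1.2345: the extra digit is lost at indexing time and cannot be recovered.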
Parameters: - numtype – the type of numbers that can be stored in this field, either int or float. If you use Decimal, use the decimal_places argument to control how many decimal places the field will store.
- bits – When numtype is int, the number of bits to use to store the number: 8, 16, 32, or 64.
- stored – Whether the value of this field is stored with the document.
- unique – Whether the value of this field is unique per-document.
- decimal_places – specifies the number of decimal places to save when storing Decimal instances. If you set this, you will always get Decimal instances back from the field.
- shift_step – The number of bits of precision to shift away at each tiered indexing level. Values should generally be 1-8. Lower values yield faster searches but take up more space. A value of 0 means no tiered indexing.
- signed – Whether the numbers stored in this field may be negative.
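When choosing bits and signed, it can help to check the range of integers each combination can hold. The sketch below uses the standard fixed-width integer ranges; `int_range` and `fits` are hypothetical helpers, and how Whoosh itself reacts to an out-of-range value is not covered here.

```python
def int_range(bits, signed=True):
    """Return the (lowest, highest) integer representable in the given width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def fits(value, bits, signed=True):
    """Return True if value lies within the range for bits/signed."""
    low, high = int_range(bits, signed)
    return low <= value <= high


# The value from the example above needs more than 32 bits.
needs_64 = not fits(5820402204, 32, signed=False)
ok_in_64 = fits(5820402204, 64, signed=False)
```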
-
class
whoosh.fields.
DATETIME
(stored=False, unique=False, sortable=False)¶ Special field type that lets you index datetime objects. The field converts the datetime objects to sortable text for you before indexing.
Since this field is based on Python’s datetime module it shares all the limitations of that module, such as the inability to represent dates before year 1 in the proleptic Gregorian calendar. However, since this field stores datetimes as an integer number of microseconds, it could easily represent a much wider range of dates if the Python datetime implementation ever supports them.
>>> schema = Schema(path=STORED, date=DATETIME)
>>> ix = storage.create_index(schema)
>>> w = ix.writer()
>>> w.add_document(path="/a", date=datetime.now())
>>> w.commit()
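The idea of representing a datetime as an integer number of microseconds can be sketched with the standard library. Counting from datetime.min is an assumption made for this illustration, and `datetime_to_micros` and `micros_to_datetime` are hypothetical helpers, not Whoosh functions.

```python
from datetime import datetime, timedelta


def datetime_to_micros(dt):
    """Return the microseconds elapsed between datetime.min and dt."""
    delta = dt - datetime.min
    return (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds


def micros_to_datetime(micros):
    """Invert datetime_to_micros."""
    return datetime.min + timedelta(microseconds=micros)


moment = datetime(2024, 2, 29, 12, 30, 15, 123456)
as_int = datetime_to_micros(moment)
round_trip = micros_to_datetime(as_int)
```

Because later datetimes always map to larger integers, sorting or range-querying the integers gives the same order as the original datetimes.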
Parameters: - stored – Whether the value of this field is stored with the document.
- unique – Whether the value of this field is unique per-document.
-
class
whoosh.fields.
BOOLEAN
(stored=False, field_boost=1.0)¶ Special field type that lets you index boolean values (True and False). The field converts the boolean values to text for you before indexing.
>>> schema = Schema(path=STORED, done=BOOLEAN)
>>> ix = storage.create_index(schema)
>>> w = ix.writer()
>>> w.add_document(path="/a", done=False)
>>> w.commit()
Parameters: stored – Whether the value of this field is stored with the document.
-
class
whoosh.fields.
NGRAM
(minsize=2, maxsize=4, stored=False, field_boost=1.0, queryor=False, phrase=False, sortable=False)¶ Configured field that indexes text as N-grams. For example, with a field type NGRAM(3,4), the value “hello” will be indexed as tokens “hel”, “hell”, “ell”, “ello”, “llo”. This field type chops the entire text into N-grams, including whitespace and punctuation. See NGRAMWORDS for a field type that breaks the text into words before chopping the words into N-grams.
Parameters: - minsize – The minimum length of the N-grams.
- maxsize – The maximum length of the N-grams.
- stored – Whether to store the value of this field with the document. Since this field type generally contains a lot of text, you should avoid storing it with the document unless you need to, for example to allow fast excerpts in the search results.
- queryor – if True, combine the N-grams with an Or query. The default is to combine N-grams with an And query.
- phrase – store positions on the N-grams to allow exact phrase searching. The default is off.
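The “hello” example above can be reproduced with a short plain-Python sketch. `ngrams` is a hypothetical helper that only shows how the text is chopped; it does not reproduce any other processing the field’s analyzer may apply.

```python
def ngrams(text, minsize, maxsize):
    """Return every substring of text whose length is minsize..maxsize."""
    grams = []
    for start in range(len(text) - minsize + 1):
        for size in range(minsize, maxsize + 1):
            end = start + size
            if end > len(text):
                break  # longer sizes would run past the end of the text
            grams.append(text[start:end])
    return grams


hello_grams = ngrams("hello", 3, 4)

# Whitespace is part of the text, so N-grams can span word boundaries.
spaced_grams = ngrams("a b", 2, 2)
```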
-
class
whoosh.fields.
NGRAMWORDS
(minsize=2, maxsize=4, stored=False, field_boost=1.0, tokenizer=None, at=None, queryor=False, sortable=False)¶ Configured field that chops text into words using a tokenizer, lowercases the words, and then chops the words into N-grams.
Parameters: - minsize – The minimum length of the N-grams.
- maxsize – The maximum length of the N-grams.
- stored – Whether to store the value of this field with the document. Since this field type generally contains a lot of text, you should avoid storing it with the document unless you need to, for example to allow fast excerpts in the search results.
- tokenizer – an instance of whoosh.analysis.Tokenizer used to break the text into words.
- at – if ‘start’, only takes N-grams from the start of the word. If ‘end’, only takes N-grams from the end. Otherwise the default is to take all N-grams from each word.
- queryor – if True, combine the N-grams with an Or query. The default is to combine N-grams with an And query.
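The effect of the at parameter can be sketched in plain Python. `word_ngrams` is a hypothetical helper; splitting words with the regular expression `\w+` stands in for the tokenizer and is an assumption, as is the order in which the N-grams are produced.

```python
import re


def word_ngrams(text, minsize=2, maxsize=4, at=None):
    """Lowercase text, split it into words, and chop each word into N-grams."""
    grams = []
    for word in re.findall(r"\w+", text.lower()):
        for size in range(minsize, maxsize + 1):
            if size > len(word):
                break  # the word is too short for this and larger sizes
            if at == "start":
                grams.append(word[:size])
            elif at == "end":
                grams.append(word[-size:])
            else:
                for start in range(len(word) - size + 1):
                    grams.append(word[start:start + size])
    return grams


prefixes = word_ngrams("Hello", 2, 3, at="start")
suffixes = word_ngrams("Hello", 2, 3, at="end")
everything = word_ngrams("Hello", 2, 3)
```

Taking N-grams only from the start of each word yields fewer terms and suits prefix-style matching such as search-as-you-type.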