Parsing user queries¶
Overview¶
The job of a query parser is to convert a query string submitted by a user
into query objects (objects from the whoosh.query
module).
For example, the user query:
rendering shading
might be parsed into query objects like this:
And([Term("content", u"rendering"), Term("content", u"shading")])
Whoosh includes a powerful, modular parser for user queries in the
whoosh.qparser
module. The default parser implements a query language
similar to the one that ships with Lucene. However, by changing plugins or using
functions such as whoosh.qparser.MultifieldParser()
,
whoosh.qparser.SimpleParser()
or whoosh.qparser.DisMaxParser()
, you
can change how the parser works, get a simpler parser or change the query
language syntax.
(In previous versions of Whoosh, the query parser was based on pyparsing
.
The new hand-written parser is less brittle and more flexible.)
Note
Remember that you can directly create query objects programmatically using
the objects in the whoosh.query
module. If you are not processing
actual user queries, this is preferable to building a query string just to
parse it.
Using the default parser¶
To create a whoosh.qparser.QueryParser
object, pass it the name of the
default field to search and the schema of the index you’ll be searching.
from whoosh.qparser import QueryParser
parser = QueryParser("content", schema=myindex.schema)
Tip
You can instantiate a QueryParser
object without specifying a schema,
however the parser will not process the text of the user query. This is
useful for debugging, when you want to see how QueryParser will build a
query, but don’t want to make up a schema just for testing.
Once you have a QueryParser
object, you can call parse()
on it to parse a
query string into a query object:
>>> parser.parse(u"alpha OR beta gamma")
And([Or([Term('content', u'alpha'), Term('content', u'beta')]), Term('content', u'gamma')])
See the query language reference for the features and syntax of the default parser’s query language.
Common customizations¶
Searching for any terms instead of all terms by default¶
If the user doesn’t explicitly specify AND
or OR
clauses:
physically based rendering
…by default, the parser treats the words as if they were connected by AND
,
meaning all the terms must be present for a document to match:
physically AND based AND rendering
To change the parser to use OR
instead, so that any of the terms may be
present for a document to match, i.e.:
physically OR based OR rendering
…configure the QueryParser using the group
keyword argument like this:
from whoosh import qparser
parser = qparser.QueryParser(fieldname, schema=myindex.schema,
group=qparser.OrGroup)
The Or query lets you specify that documents that contain more of the query
terms score higher. For example, if the user searches for foo bar
, a
document with four occurances of foo
would normally outscore a document
that contained one occurance each of foo
and bar
. However, users
usually expect documents that contain more of the words they searched for
to score higher. To configure the parser to produce Or groups with this
behavior, use the factory()
class method of OrGroup
:
og = qparser.OrGroup.factory(0.9)
parser = qparser.QueryParser(fieldname, schema, group=og)
where the argument to factory()
is a scaling factor on the bonus
(between 0 and 1).
Letting the user search multiple fields by default¶
The default QueryParser configuration takes terms without explicit fields and assigns them to the default field you specified when you created the object, so for example if you created the object with:
parser = QueryParser("content", schema=myschema)
And the user entered the query:
three blind mice
The parser would treat it as:
content:three content:blind content:mice
However, you might want to let the user search multiple fields by default. For
example, you might want “unfielded” terms to search both the title
and
content
fields.
In that case, you can use a whoosh.qparser.MultifieldParser
. This is
just like the normal QueryParser, but instead of a default field name string, it
takes a sequence of field names:
from whoosh.qparser import MultifieldParser
mparser = MultifieldParser(["title", "content"], schema=myschema)
When this MultifieldParser instance parses three blind mice
, it treats it
as:
(title:three OR content:three) (title:blind OR content:blind) (title:mice OR content:mice)
Simplifying the query language¶
Once you have a parser:
parser = qparser.QueryParser("content", schema=myschema)
you can remove features from it using the
remove_plugin_class()
method.
For example, to remove the ability of the user to specify fields to search:
parser.remove_plugin_class(qparser.FieldsPlugin)
To remove the ability to search for wildcards, which can be harmful to query performance:
parser.remove_plugin_class(qparser.WildcardPlugin)
See qparser module for information about the plugins included with Whoosh’s query parser.
Changing the AND, OR, ANDNOT, ANDMAYBE, and NOT syntax¶
The default parser uses English keywords for the AND, OR, ANDNOT, ANDMAYBE, and NOT functions:
parser = qparser.QueryParser("content", schema=myschema)
You can replace the default OperatorsPlugin
object to
replace the default English tokens with your own regular expressions.
The whoosh.qparser.OperatorsPlugin
implements the ability to use AND,
OR, NOT, ANDNOT, and ANDMAYBE clauses in queries. You can instantiate a new
OperatorsPlugin
and use the And
, Or
, Not
, AndNot
, and
AndMaybe
keyword arguments to change the token patterns:
# Use Spanish equivalents instead of AND and OR
op = qparser.OperatorsPlugin(And=" Y ", Or=" O ")
parser.replace_plugin(op)
Further, you may change the syntax of the NOT
operator:
np = qparser.OperatorsPlugin(Not=' NO ')
parser.replace_plugin(np)
The arguments can be pattern strings or precompiled regular expression objects.
For example, to change the default parser to use typographic symbols instead of words for the AND, OR, ANDNOT, ANDMAYBE, and NOT functions:
parser = qparser.QueryParser("content", schema=myschema)
# These are regular expressions, so we have to escape the vertical bar
op = qparser.OperatorsPlugin(And="&", Or="\\|", AndNot="&!", AndMaybe="&~", Not="\\-")
parser.replace_plugin(op)
Adding less-than, greater-than, etc.¶
Normally, the way you match all terms in a field greater than “apple” is with an open ended range:
field:{apple to]
The whoosh.qparser.GtLtPlugin
lets you specify the same search like
this:
field:>apple
The plugin lets you use >
, <
, >=
, <=
, =>
, or =<
after
a field specifier, and translates the expression into the equivalent range:
date:>='31 march 2001'
date:[31 march 2001 to]
Adding fuzzy term queries¶
Fuzzy queries are good for catching misspellings and similar words.
The whoosh.qparser.FuzzyTermPlugin
lets you search for “fuzzy” terms,
that is, terms that don’t have to match exactly. The fuzzy term will match any
similar term within a certain number of “edits” (character insertions,
deletions, and/or transpositions – this is called the “Damerau-Levenshtein
edit distance”).
To add the fuzzy plugin:
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())
Once you add the fuzzy plugin to the parser, you can specify a fuzzy term by
adding a ~
followed by an optional maximum edit distance. If you don’t
specify an edit distance, the default is 1
.
For example, the following “fuzzy” term query:
cat~
would match cat
and all terms in the index within one “edit” of cat,
for example cast
(insert s
), at
(delete c
), and act
(transpose c
and a
).
If you wanted cat
to match bat
, it requires two edits (delete c
and
insert b
) so you would need to set the maximum edit distance to 2
:
cat~2
Because each additional edit you allow increases the number of possibilities
that must be checked, edit distances greater than 2
can be very slow.
It is often useful to require that the first few characters of a fuzzy term
match exactly. This is called a prefix. You can set the length of the prefix
by adding a slash and a number after the edit distance. For example, to use
a maximum edit distance of 2
and a prefix length of 3
:
johannson~2/3
You can specify a prefix without specifying an edit distance:
johannson~/3
The default prefix distance is 0
.
Allowing complex phrase queries¶
The default parser setup allows phrase (proximity) queries such as:
"whoosh search library"
The default phrase query tokenizes the text between the quotes and creates a search for those terms in proximity.
If you want to do more complex proximity searches, you can replace the phrase
plugin with the whoosh.qparser.SequencePlugin
, which allows any query
between the quotes. For example:
"(john OR jon OR jonathan~) peters*"
The sequence syntax lets you add a “slop” factor just like the regular phrase:
"(john OR jon OR jonathan~) peters*"~2
To replace the default phrase plugin with the sequence plugin:
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())
Alternatively, you could keep the default phrase plugin and give the sequence
plugin different syntax by specifying a regular expression for the start/end
marker when you create the sequence plugin. The regular expression should have
a named group slop
for the slop factor. For example:
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.SequencePlugin("!(~(?P<slop>[1-9][0-9]*))?"))
This would allow you to use regular phrase queries and sequence queries at the same time:
"regular phrase" AND !sequence query~2!
Advanced customization¶
QueryParser arguments¶
QueryParser supports two extra keyword arguments:
group
The query class to use to join sub-queries when the user doesn’t explicitly specify a boolean operator, such as
AND
orOR
. This lets you change the default operator fromAND
toOR
.This will be the
whoosh.qparser.AndGroup
orwhoosh.qparser.OrGroup
class (not an instantiated object) unless you’ve written your own custom grouping syntax you want to use.termclass
The query class to use to wrap single terms.
This must be a
whoosh.query.Query
subclass (not an instantiated object) that accepts a fieldname string and term text unicode string in its__init__
method. The default iswhoosh.query.Term
.This is useful if you want to change the default term class to
whoosh.query.Variations
, or if you’ve written a custom term class you want the parser to use instead of the ones shipped with Whoosh.
>>> from whoosh.qparser import QueryParser, OrGroup
>>> orparser = QueryParser("content", schema=myschema, group=OrGroup)
Configuring plugins¶
The query parser’s functionality is provided by a set of plugins. You can remove plugins to remove functionality, add plugins to add functionality, or replace default plugins with re-configured or rewritten versions.
The whoosh.qparser.QueryParser.add_plugin()
,
whoosh.qparser.QueryParser.remove_plugin_class()
, and
whoosh.qparser.QueryParser.replace_plugin()
methods let you manipulate
the plugins in a QueryParser
object.
See qparser module for information about the available plugins.
Creating custom operators¶
- Decide whether you want a
PrefixOperator
,PostfixOperator
, orInfixOperator
. - Create a new
whoosh.qparser.syntax.GroupNode
subclass to hold nodes affected by your operator. This object is responsible for generating awhoosh.query.Query
object corresponding to the syntax. - Create a regular expression pattern for the operator’s query syntax.
- Create an
OperatorsPlugin.OpTagger
object from the above information. - Create a new
OperatorsPlugin
instance configured with your custom operator(s). - Replace the default
OperatorsPlugin
in your parser with your new instance.
For example, if you were creating a BEFORE
operator:
from whoosh import qparser, query
optype = qparser.InfixOperator
pattern = " BEFORE "
class BeforeGroup(qparser.GroupNode):
merging = True
qclass = query.Ordered
Create an OpTagger for your operator:
btagger = qparser.OperatorPlugin.OpTagger(pattern, BeforeGroup,
qparser.InfixOperator)
By default, infix operators are left-associative. To make a right-associative infix operator, do this:
btagger = qparser.OperatorPlugin.OpTagger(pattern, BeforeGroup,
qparser.InfixOperator,
leftassoc=False)
Create an OperatorsPlugin
instance with your
new operator, and replace the default operators plugin in your query parser:
qp = qparser.QueryParser("text", myschema)
my_op_plugin = qparser.OperatorsPlugin([(btagger, 0)])
qp.replace_plugin(my_op_plugin)
Note that the list of operators you specify with the first argument is IN ADDITION TO the default operators (AND, OR, etc.). To turn off one of the default operators, you can pass None to the corresponding keyword argument:
cp = qparser.OperatorsPlugin([(optagger, 0)], And=None)
If you want ONLY your list of operators and none of the default operators,
use the clean
keyword argument:
cp = qparser.OperatorsPlugin([(optagger, 0)], clean=True)
Operators earlier in the list bind more closely than operators later in the list.