collectors module
This module contains “collector” objects. Collectors provide a way to gather “raw” results from a whoosh.matching.Matcher object, implement sorting, filtering, collation, etc., and produce a whoosh.searching.Results object.
The basic collectors are:
- TopCollector
  Returns the top N matching results sorted by score, using block-quality optimizations to skip blocks of documents that can’t contribute to the top N. The whoosh.searching.Searcher.search() method uses this type of collector by default or when you specify a limit.
- UnlimitedCollector
  Returns all matching results sorted by score. The whoosh.searching.Searcher.search() method uses this type of collector when you specify limit=None or a limit equal to or greater than the number of documents in the searcher.
- SortingCollector
  Returns all matching results sorted by a whoosh.sorting.Facet object. The whoosh.searching.Searcher.search() method uses this type of collector when you use the sortedby parameter.
Here’s an example of a simple collector that instead of remembering the matched documents just counts up the number of matches:
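    from whoosh.collectors import Collector

    class CountingCollector(Collector):
        def prepare(self, top_searcher, q, context):
            # Always call super method in prepare
            Collector.prepare(self, top_searcher, q, context)
            self.count = 0

        def collect(self, sub_docnum):
            # Called once per matched document; only the tally is kept
            self.count += 1

    # mysearcher and myquery are an existing Searcher and Query
    c = CountingCollector()
    mysearcher.search_with_collector(myquery, c)
    print(c.count)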
There are also several wrapping collectors that extend or modify the functionality of other collectors. The whoosh.searching.Searcher.search() method uses many of these when you specify various parameters.
NOTE: collectors are not designed to be reentrant or thread-safe. It is generally a good idea to create a new collector for each search.
Base classes
- class whoosh.collectors.Collector
Base class for collectors.
- all_ids()
Returns a sequence of docnums matched in this collector. (Only valid after the collector is run.)
The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.
- abstract collect(sub_docnum)
This method is called for every matched document. It should do the work of adding a matched document to the results, and it should return an object to use as a “sorting key” for the given document (such as the document’s score, a key generated by a facet, or just None). Subclasses must implement this method.
If you want the score for the current document, use self.matcher.score().
Overriding methods should add the current document offset (self.offset) to the sub_docnum to get the top-level document number for the matching document to add to results.
Parameters: sub_docnum – the document number of the current match within the current sub-searcher. You must add self.offset to this number to get the document’s top-level document number.
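The offset arithmetic described above can be illustrated with a small stand-alone sketch. It does not use Whoosh: the `segment_offsets` and `to_global` helpers are hypothetical names, and the sketch assumes each sub-searcher numbers its documents from zero and that a sub-searcher's offset is the number of documents in the sub-searchers before it.

```python
# Hypothetical model, not Whoosh code: three sub-searchers (segments)
# holding 4, 3, and 5 documents, each numbering its documents from 0.
segment_sizes = [4, 3, 5]


def segment_offsets(sizes):
    """Return the first top-level docnum of each segment (assumed layout)."""
    offsets = []
    total = 0
    for size in sizes:
        offsets.append(total)
        total += size
    return offsets


def to_global(offset, sub_docnum):
    """Same arithmetic as `self.offset + sub_docnum` in a collect() override."""
    return offset + sub_docnum


offsets = segment_offsets(segment_sizes)
# Document 2 of the second segment, expressed as a top-level docnum
global_docnum = to_global(offsets[1], 2)
print(offsets, global_docnum)
```

A collector that stores `sub_docnum` without adding the offset would report the same small numbers for every sub-searcher, so results from different sub-searchers would collide.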
- collect_matches()
This method calls Collector.matches() and then for each matched document calls Collector.collect(). Subclasses that want to intervene between finding matches and adding them to the collection (for example, to filter out certain documents) can override this method.
- computes_count()
Returns True if the collector naturally computes the exact number of matching documents. Collectors that use block optimizations will return False since they might skip blocks containing matching documents.
Note that if this method returns False you can still call count(), but that method might have to do more work to calculate the number of matching documents.
- count()
Returns the total number of documents matched in this collector. (Only valid after the collector is run.)
The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.
- finish()
This method is called after a search.
Subclasses can override this to perform clean-up work, but they should still call the superclass’s method because it sets a necessary attribute on the collector object:
- self.runtime
  The time (in seconds) the search took.
- matches()
Yields a series of relative document numbers for matches in the current subsearcher.
- prepare(top_searcher, q, context)
This method is called before a search.
Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:
- self.top_searcher
  The top-level searcher.
- self.q
  The query object.
- self.context
  context.needs_current controls whether a wrapping collector requires that this collector’s matcher be in a valid state at every call to collect(). If this is False, the collector is free to use faster methods that don’t necessarily keep the matcher updated, such as matcher.all_ids().
Parameters:
- top_searcher – the top-level whoosh.searching.Searcher object.
- q – the whoosh.query.Query object being searched for.
- context – a whoosh.searching.SearchContext object containing information about the search.
- remove(global_docnum)
Removes a document from the collector. Note that this method uses the global document number, as opposed to Collector.collect(), which takes a segment-relative docnum.
- abstract results()
Returns a Results object containing the results of the search. Subclasses must implement this method.
- set_subsearcher(subsearcher, offset)
This method is called each time the collector starts on a new sub-searcher.
Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:
- self.subsearcher
  The current sub-searcher. If the top-level searcher is atomic, this is the same as the top-level searcher.
- self.offset
  The document number offset of the current searcher. You must add this number to the document number passed to Collector.collect() to get the top-level document number for use in results.
- self.matcher
  A whoosh.matching.Matcher object representing the matches for the query in the current sub-searcher.
- abstract sort_key(sub_docnum)
Returns a sorting key for the current match. This should return the same value returned by Collector.collect(), but without the side effect of adding the current document to the results.
If the collector has been prepared with context.needs_current=True, this method can use self.matcher to get information, for example the score. Otherwise, it should only use the provided sub_docnum, since the matcher may be in an inconsistent state.
Subclasses must implement this method.
- class whoosh.collectors.ScoredCollector(replace=10)
Base class for collectors that sort the results based on document score.
Parameters: replace – Number of matches between attempts to replace the matcher with a more efficient version.
- collect(sub_docnum)
This method is called for every matched document. It should do the work of adding a matched document to the results, and it should return an object to use as a “sorting key” for the given document (such as the document’s score, a key generated by a facet, or just None). Subclasses must implement this method.
If you want the score for the current document, use self.matcher.score().
Overriding methods should add the current document offset (self.offset) to the sub_docnum to get the top-level document number for the matching document to add to results.
Parameters: sub_docnum – the document number of the current match within the current sub-searcher. You must add self.offset to this number to get the document’s top-level document number.
- matches()
Yields a series of relative document numbers for matches in the current subsearcher.
- prepare(top_searcher, q, context)
This method is called before a search.
Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:
- self.top_searcher
  The top-level searcher.
- self.q
  The query object.
- self.context
  context.needs_current controls whether a wrapping collector requires that this collector’s matcher be in a valid state at every call to collect(). If this is False, the collector is free to use faster methods that don’t necessarily keep the matcher updated, such as matcher.all_ids().
Parameters:
- top_searcher – the top-level whoosh.searching.Searcher object.
- q – the whoosh.query.Query object being searched for.
- context – a whoosh.searching.SearchContext object containing information about the search.
- sort_key(sub_docnum)
Returns a sorting key for the current match. This should return the same value returned by Collector.collect(), but without the side effect of adding the current document to the results.
If the collector has been prepared with context.needs_current=True, this method can use self.matcher to get information, for example the score. Otherwise, it should only use the provided sub_docnum, since the matcher may be in an inconsistent state.
Subclasses must implement this method.
- class whoosh.collectors.WrappingCollector(child)
Base class for collectors that wrap other collectors.
- all_ids()
Returns a sequence of docnums matched in this collector. (Only valid after the collector is run.)
The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.
- collect(sub_docnum)
This method is called for every matched document. It should do the work of adding a matched document to the results, and it should return an object to use as a “sorting key” for the given document (such as the document’s score, a key generated by a facet, or just None). Subclasses must implement this method.
If you want the score for the current document, use self.matcher.score().
Overriding methods should add the current document offset (self.offset) to the sub_docnum to get the top-level document number for the matching document to add to results.
Parameters: sub_docnum – the document number of the current match within the current sub-searcher. You must add self.offset to this number to get the document’s top-level document number.
- collect_matches()
This method calls Collector.matches() and then for each matched document calls Collector.collect(). Subclasses that want to intervene between finding matches and adding them to the collection (for example, to filter out certain documents) can override this method.
- count()
Returns the total number of documents matched in this collector. (Only valid after the collector is run.)
The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.
- finish()
This method is called after a search.
Subclasses can override this to perform clean-up work, but they should still call the superclass’s method because it sets a necessary attribute on the collector object:
- self.runtime
  The time (in seconds) the search took.
- matches()
Yields a series of relative document numbers for matches in the current subsearcher.
- prepare(top_searcher, q, context)
This method is called before a search.
Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:
- self.top_searcher
  The top-level searcher.
- self.q
  The query object.
- self.context
  context.needs_current controls whether a wrapping collector requires that this collector’s matcher be in a valid state at every call to collect(). If this is False, the collector is free to use faster methods that don’t necessarily keep the matcher updated, such as matcher.all_ids().
Parameters:
- top_searcher – the top-level whoosh.searching.Searcher object.
- q – the whoosh.query.Query object being searched for.
- context – a whoosh.searching.SearchContext object containing information about the search.
- remove(global_docnum)
Removes a document from the collector. Note that this method uses the global document number, as opposed to Collector.collect(), which takes a segment-relative docnum.
- results()
Returns a Results object containing the results of the search. Subclasses must implement this method.
- set_subsearcher(subsearcher, offset)
This method is called each time the collector starts on a new sub-searcher.
Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:
- self.subsearcher
  The current sub-searcher. If the top-level searcher is atomic, this is the same as the top-level searcher.
- self.offset
  The document number offset of the current searcher. You must add this number to the document number passed to Collector.collect() to get the top-level document number for use in results.
- self.matcher
  A whoosh.matching.Matcher object representing the matches for the query in the current sub-searcher.
- sort_key(sub_docnum)
Returns a sorting key for the current match. This should return the same value returned by Collector.collect(), but without the side effect of adding the current document to the results.
If the collector has been prepared with context.needs_current=True, this method can use self.matcher to get information, for example the score. Otherwise, it should only use the provided sub_docnum, since the matcher may be in an inconsistent state.
Subclasses must implement this method.
Basic collectors
- class whoosh.collectors.TopCollector(limit=10, usequality=True, **kwargs)
A collector that only returns the top “N” scored results.
Parameters:
- limit – the maximum number of results to return.
- usequality – whether to use block-quality optimizations. Turning them off may be useful for debugging.
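As a rough illustration of keeping only the top N scored results, here is a stand-alone sketch built on a bounded min-heap. It is not the TopCollector implementation and does not model block-quality optimizations; the `top_n` function and the sample `(score, docnum)` pairs are hypothetical.

```python
import heapq


def top_n(scored_docs, limit=10):
    """Keep only the `limit` best (score, docnum) pairs seen so far."""
    heap = []  # min-heap: heap[0] is always the weakest result still kept
    for score, docnum in scored_docs:
        # Negate the docnum so that ties on score favour the lower docnum
        item = (score, -docnum)
        if len(heap) < limit:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new match beats the weakest kept result, so swap them
            heapq.heapreplace(heap, item)
    # Best first: highest score, then lowest document number
    return [(score, -neg) for score, neg in sorted(heap, reverse=True)]


matches = [(1.5, 0), (0.2, 1), (3.1, 2), (2.4, 3), (3.1, 4)]
best = top_n(matches, limit=3)
print(best)
```

Because the heap never holds more than `limit` items, memory use stays bounded no matter how many documents match.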
- class whoosh.collectors.UnlimitedCollector(reverse=False)
A collector that returns all scored results.
Parameters: reverse – if True, reverse the order of the results.
- class whoosh.collectors.SortingCollector(sortedby, limit=10, reverse=False)
A collector that returns results sorted by a given whoosh.sorting.Facet object. See Sorting and faceting for more information.
Parameters:
- sortedby – see Sorting and faceting.
- reverse – If True, reverse the overall results. Note that you can reverse individual facets in a multi-facet sort key as well.
Wrappers
- class whoosh.collectors.FilterCollector(child, allow=None, restrict=None)
A collector that lets you allow and/or restrict certain document numbers in the results:

    from whoosh import collectors, query

    uc = collectors.UnlimitedCollector()
    ins = query.Term("chapter", "rendering")
    outs = query.Term("status", "restricted")
    fc = collectors.FilterCollector(uc, allow=ins, restrict=outs)
    mysearcher.search_with_collector(myquery, fc)
    print(fc.results())

This collector discards a document if:
- The allowed set is not None and a document number is not in the set, or
- The restrict set is not None and a document number is in the set.
(So, if the same document number is in both sets, that document will be discarded.)
If you have a reference to the collector, you can use FilterCollector.filtered_count to get the number of matching documents filtered out of the results by the collector.
Parameters:
- child – the collector to wrap.
- allow – a query, Results object, or set-like object containing document numbers that are allowed in the results, or None (meaning everything is allowed).
- restrict – a query, Results object, or set-like object containing document numbers to disallow from the results, or None (meaning nothing is disallowed).
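The discard rules above can be written out as a small stand-alone sketch. It works on plain Python sets of document numbers and does not use Whoosh; `is_discarded` and `filter_docnums` are hypothetical helper names, and the returned count only mimics what FilterCollector.filtered_count is described as reporting.

```python
def is_discarded(docnum, allow=None, restrict=None):
    """Apply the two discard rules to a single document number."""
    if allow is not None and docnum not in allow:
        return True  # an allow set exists and the document is not in it
    if restrict is not None and docnum in restrict:
        return True  # a restrict set exists and the document is in it
    return False


def filter_docnums(docnums, allow=None, restrict=None):
    """Return the kept docnums and the number of docnums filtered out."""
    kept = []
    filtered_count = 0
    for docnum in docnums:
        if is_discarded(docnum, allow, restrict):
            filtered_count += 1
        else:
            kept.append(docnum)
    return kept, filtered_count


# Document 3 is in both sets, so it is discarded
kept, filtered_count = filter_docnums(range(6), allow={1, 2, 3}, restrict={3, 4})
print(kept, filtered_count)
```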
- class whoosh.collectors.FacetCollector(child, groupedby, maptype=None)
A collector that creates groups of documents based on whoosh.sorting.Facet objects. See Sorting and faceting for more information.
This collector is used if you specify a groupedby parameter in the whoosh.searching.Searcher.search() method. You can use the whoosh.searching.Results.groups() method to access the facet groups.
If you have a reference to the collector, you can also use FacetCollector.facetmaps to access the groups directly:

    from whoosh import collectors, sorting

    uc = collectors.UnlimitedCollector()
    fc = collectors.FacetCollector(uc, sorting.FieldFacet("category"))
    mysearcher.search_with_collector(myquery, fc)
    print(fc.facetmaps)

Parameters:
- groupedby – see Sorting and faceting.
- maptype – a whoosh.sorting.FacetMap type to use for any facets that don’t specify their own.
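To give a feel for what grouping documents by a facet key means, here is a stand-alone sketch that does not use Whoosh. It assumes the simplest grouping shape, a dictionary mapping each key to a list of document numbers in match order; the `group_docnums` helper and the sample data are hypothetical.

```python
from collections import defaultdict


def group_docnums(matches):
    """Group (docnum, key) pairs into {key: [docnum, ...]} in match order."""
    groups = defaultdict(list)
    for docnum, key in matches:
        groups[key].append(docnum)
    return dict(groups)


# Hypothetical matches: (document number, value of a "category" field)
matches = [(0, "fruit"), (1, "tool"), (2, "fruit"), (5, "tool"), (7, "toy")]
groups = group_docnums(matches)
print(groups)
```

Other groupings, such as a count per key instead of a list, follow the same pattern with a different value type, which is roughly the role the maptype parameter is described as playing.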
- class whoosh.collectors.CollapseCollector(child, keyfacet, limit=1, order=None)
A collector that collapses results based on a facet. That is, it eliminates all but the top N results that share the same facet key. Documents with an empty key for the facet are never eliminated.
The “top” results within each group are determined by the result ordering (e.g. highest score in a scored search) or an optional second “ordering” facet.
If you have a reference to the collector, you can use CollapseCollector.collapsed_counts to access the number of documents eliminated based on each key:

    from whoosh.collectors import CollapseCollector, TopCollector

    tc = TopCollector(limit=20)
    cc = CollapseCollector(tc, "group", limit=3)
    mysearcher.search_with_collector(myquery, cc)
    print(cc.collapsed_counts)

See Collapsing results for more information.
Parameters:
- child – the collector to wrap.
- keyfacet – a whoosh.sorting.Facet to use for collapsing. All but the top N documents that share a key will be eliminated from the results.
- limit – the maximum number of documents to keep for each key.
- order – an optional whoosh.sorting.Facet to use to determine the “top” document(s) to keep when collapsing. The default (order=None) uses the results order (e.g. the highest score in a scored search).
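The collapsing behaviour described above can be sketched without Whoosh as follows. This is a simplified model that assumes the matches are already in result order (best first), which is not how a collector sees them during a real search; the `collapse` function, the sample data, and the use of None for an empty key are all assumptions made for illustration.

```python
from collections import defaultdict


def collapse(ranked_docs, limit=1):
    """Keep at most `limit` documents per key from (docnum, key) pairs.

    `ranked_docs` must be ordered best first. A key of None models an
    empty facet key, which is never collapsed.
    """
    kept = []
    seen = defaultdict(int)  # how many documents were kept per key
    collapsed_counts = defaultdict(int)  # how many were eliminated per key
    for docnum, key in ranked_docs:
        if key is None:
            kept.append(docnum)
        elif seen[key] < limit:
            seen[key] += 1
            kept.append(docnum)
        else:
            collapsed_counts[key] += 1
    return kept, dict(collapsed_counts)


ranked = [(7, "a"), (2, "b"), (9, "a"), (4, None), (1, "a"), (5, "b")]
kept, counts = collapse(ranked, limit=1)
print(kept, counts)
```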
- class whoosh.collectors.TimeLimitCollector(child, timelimit, greedy=False, use_alarm=True)
A collector that raises a TimeLimit exception if the search does not complete within a certain number of seconds:

    from whoosh import collectors

    uc = collectors.UnlimitedCollector()
    tlc = collectors.TimeLimitCollector(uc, timelimit=5.8)
    try:
        mysearcher.search_with_collector(myquery, tlc)
    except collectors.TimeLimit:
        print("The search ran out of time!")

    # We can still get partial results from the collector
    print(tlc.results())

IMPORTANT: On Unix systems (systems where signal.SIGALRM is defined), the code uses signals to stop searching immediately when the time limit is reached. Windows does not support this functionality, so there the search only checks the time between found documents; if a matcher is slow, the search could exceed the time limit.
Parameters:
- child – the collector to wrap.
- timelimit – the maximum amount of time (in seconds) to allow for searching. If the search takes longer than this, it will raise a TimeLimit exception.
- greedy – if True, the collector will finish adding the most recent hit before raising the TimeLimit exception.
- use_alarm – if True (the default), the collector will try to use signal.SIGALRM (on UNIX).
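The difference between greedy and non-greedy behaviour when the time is checked between documents can be sketched as follows. This stand-alone model does not use Whoosh or signals: the `TimeLimit` class here is a local stand-in, and `collect_with_limit`, `fake_clock`, and `run` are hypothetical helpers. The fake clock advances one unit per call so that the outcome is deterministic.

```python
import time


class TimeLimit(Exception):
    """Local stand-in for the exception described above (not the Whoosh class)."""


def collect_with_limit(docnums, results, timelimit, greedy=False,
                       clock=time.monotonic):
    """Append docnums to `results`, checking the clock between documents."""
    start = clock()
    for docnum in docnums:
        expired = clock() - start > timelimit
        if expired and not greedy:
            raise TimeLimit  # stop before adding the current hit
        results.append(docnum)
        if expired:
            raise TimeLimit  # greedy: the current hit was added first


def fake_clock():
    """Hypothetical clock that advances one 'second' per call."""
    ticks = iter(range(1000))
    return lambda: next(ticks)


def run(greedy):
    results = []
    try:
        collect_with_limit([10, 11, 12, 13], results, timelimit=2,
                           greedy=greedy, clock=fake_clock())
    except TimeLimit:
        pass  # partial results remain available in `results`
    return results


partial = run(greedy=False)
partial_greedy = run(greedy=True)
print(partial, partial_greedy)
```

Because the clock is only consulted between documents, a single slow step can overrun the limit, which is the limitation the note above describes for systems without SIGALRM.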
- class whoosh.collectors.TermsCollector(child, settype=<class 'set'>)
A collector that remembers which terms appeared in each matched document.
This collector is used if you specify terms=True in the whoosh.searching.Searcher.search() method.
If you have a reference to the collector, you can also use TermsCollector.termdocs and TermsCollector.docterms to access the term information directly:

    from whoosh import collectors

    uc = collectors.UnlimitedCollector()
    tc = collectors.TermsCollector(uc)
    mysearcher.search_with_collector(myquery, tc)

    # tc.termdocs is a dictionary mapping (fieldname, text) tuples to
    # sets of document numbers
    print(tc.termdocs)

    # tc.docterms is a dictionary mapping docnums to lists of
    # (fieldname, text) tuples
    print(tc.docterms)
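The two dictionary shapes described in the comments above can be modelled with a short stand-alone sketch. It does not use Whoosh; the `index_terms` helper and the sample matches are hypothetical and only show how the two mappings relate to each other.

```python
from collections import defaultdict


def index_terms(matches):
    """Build both mappings from (docnum, [(fieldname, text), ...]) pairs."""
    termdocs = defaultdict(set)  # (fieldname, text) -> set of docnums
    docterms = defaultdict(list)  # docnum -> list of (fieldname, text)
    for docnum, terms in matches:
        for term in terms:
            termdocs[term].add(docnum)
            docterms[docnum].append(term)
    return dict(termdocs), dict(docterms)


# Hypothetical matches: which query terms were found in which document
matches = [
    (3, [("title", "render"), ("body", "shade")]),
    (8, [("body", "shade")]),
]
termdocs, docterms = index_terms(matches)
print(termdocs)
print(docterms)
```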