collectors module

This module contains “collector” objects. Collectors provide a way to gather “raw” results from a whoosh.matching.Matcher object, implement sorting, filtering, collation, etc., and produce a whoosh.searching.Results object.

The basic collectors are:

TopCollector
Returns the top N matching results sorted by score, using block-quality optimizations to skip blocks of documents that can’t contribute to the top N. The whoosh.searching.Searcher.search() method uses this type of collector by default or when you specify a limit.
UnlimitedCollector
Returns all matching results sorted by score. The whoosh.searching.Searcher.search() method uses this type of collector when you specify limit=None or you specify a limit equal to or greater than the number of documents in the searcher.
SortingCollector
Returns all matching results sorted by a whoosh.sorting.Facet object. The whoosh.searching.Searcher.search() method uses this type of collector when you use the sortedby parameter.

Here’s an example of a simple collector that instead of remembering the matched documents just counts up the number of matches:

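from whoosh.collectors import Collector

class CountingCollector(Collector):
    def prepare(self, top_searcher, q, context):
        # Always call super method in prepare
        Collector.prepare(self, top_searcher, q, context)

        self.count = 0

    def collect(self, sub_docnum):
        # Called once per matched document; just count it
        self.count += 1

# mysearcher and myquery are assumed to be an open Searcher and a Query
c = CountingCollector()
mysearcher.search_with_collector(myquery, c)
print(c.count)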

There are also several wrapping collectors that extend or modify the functionality of other collectors. The whoosh.searching.Searcher.search() method uses many of these when you specify various parameters.

NOTE: collectors are not designed to be reentrant or thread-safe. It is generally a good idea to create a new collector for each search.

Base classes

class whoosh.collectors.Collector

Base class for collectors.

all_ids()

Returns a sequence of docnums matched in this collector. (Only valid after the collector is run.)

The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.

abstract collect(sub_docnum)

This method is called for every matched document. It should do the work of adding a matched document to the results, and it should return an object to use as a “sorting key” for the given document (such as the document’s score, a key generated by a facet, or just None). Subclasses must implement this method.

If you want the score for the current document, use self.matcher.score().

Overriding methods should add the current document offset (self.offset) to the sub_docnum to get the top-level document number for the matching document to add to results.

Parameters: sub_docnum – the document number of the current match within the current sub-searcher. You must add self.offset to this number to get the document’s top-level document number.
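
To make the offset arithmetic concrete, here is a small stand-alone sketch that does not import Whoosh. The segment sizes and the helper name to_global are made up for illustration; the only thing taken from the description above is that the top-level number is self.offset plus sub_docnum.

```python
# Hypothetical index with three segments (sub-searchers) of 5, 3 and 4 documents.
segment_sizes = [5, 3, 4]

# The offset of each segment is the number of documents in the segments before it.
offsets = []
total = 0
for size in segment_sizes:
    offsets.append(total)
    total += size


def to_global(offset, sub_docnum):
    # Mirrors the rule from the text: top-level docnum = self.offset + sub_docnum.
    return offset + sub_docnum


# Document 2 of the second segment, then document 0 of the third segment.
print(to_global(offsets[1], 2))  # 7
print(to_global(offsets[2], 0))  # 8
```
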
collect_matches()

This method calls Collector.matches() and then for each matched document calls Collector.collect(). Sub-classes that want to intervene between finding matches and adding them to the collection (for example, to filter out certain documents) can override this method.
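
The override point described above can be illustrated with a toy model. MiniCollector and EvenOnlyCollector below are hypothetical stand-ins written for this sketch, not Whoosh classes; they only mimic the matches(), collect(), and collect_matches() relationship.

```python
class MiniCollector:
    """Stand-in for a collector; NOT the whoosh.collectors.Collector class."""

    def __init__(self, sub_docnums):
        self._sub_docnums = sub_docnums
        self.offset = 0
        self.collected = []

    def matches(self):
        # Yields relative document numbers, like Collector.matches().
        yield from self._sub_docnums

    def collect(self, sub_docnum):
        self.collected.append(self.offset + sub_docnum)

    def collect_matches(self):
        # Default behaviour: collect every match.
        for sub_docnum in self.matches():
            self.collect(sub_docnum)


class EvenOnlyCollector(MiniCollector):
    def collect_matches(self):
        # Intervene between finding a match and collecting it.
        for sub_docnum in self.matches():
            if (self.offset + sub_docnum) % 2 == 0:
                self.collect(sub_docnum)


plain = MiniCollector([0, 1, 2, 3, 4])
plain.collect_matches()

filtered = EvenOnlyCollector([0, 1, 2, 3, 4])
filtered.collect_matches()

print(plain.collected)     # [0, 1, 2, 3, 4]
print(filtered.collected)  # [0, 2, 4]
```
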

computes_count()

Returns True if the collector naturally computes the exact number of matching documents. Collectors that use block optimizations will return False since they might skip blocks containing matching documents.

If this method returns False you can still call count(), but that call may need to do extra work to calculate the number of matching documents.

count()

Returns the total number of documents matched in this collector. (Only valid after the collector is run.)

The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.

finish()

This method is called after a search.

Subclasses can override this to perform clean-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.runtime
The time (in seconds) the search took.
matches()

Yields a series of relative document numbers for matches in the current subsearcher.

prepare(top_searcher, q, context)

This method is called before a search.

Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.top_searcher
The top-level searcher.
self.q
The query object.
self.context
context.needs_current controls whether a wrapping collector requires that this collector’s matcher be in a valid state at every call to collect(). If this is False, the collector is free to use faster methods that don’t necessarily keep the matcher updated, such as matcher.all_ids().
Parameters:
  • top_searcher – the top-level whoosh.searching.Searcher object.
  • q – the whoosh.query.Query object being searched for.
  • context – a whoosh.searching.SearchContext object containing information about the search.
remove(global_docnum)

Removes a document from the collector. Note that this method takes the global document number, unlike Collector.collect(), which takes a segment-relative docnum.

abstract results()

Returns a Results object containing the results of the search. Subclasses must implement this method.

set_subsearcher(subsearcher, offset)

This method is called each time the collector starts on a new sub-searcher.

Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.subsearcher
The current sub-searcher. If the top-level searcher is atomic, this is the same as the top-level searcher.
self.offset
The document number offset of the current searcher. You must add this number to the document number passed to Collector.collect() to get the top-level document number for use in results.
self.matcher
A whoosh.matching.Matcher object representing the matches for the query in the current sub-searcher.
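
The order in which these hooks are described (prepare before the search, set_subsearcher for each sub-searcher, finish afterwards) can be sketched with a toy driver. This is an illustration inferred from the descriptions above, not Whoosh's actual search loop; RecordingCollector, run, and the segment data are hypothetical, and the hook signatures are simplified.

```python
import time


class RecordingCollector:
    """Toy collector that records the order of hook calls (not a Whoosh class)."""

    def __init__(self):
        self.calls = []
        self.docnums = []

    def prepare(self):
        self.calls.append("prepare")
        self._start = time.perf_counter()

    def set_subsearcher(self, sub_docnums, offset):
        self.calls.append("set_subsearcher")
        self.sub_docnums = sub_docnums
        self.offset = offset

    def collect_matches(self):
        self.calls.append("collect_matches")
        for sub_docnum in self.sub_docnums:
            self.docnums.append(self.offset + sub_docnum)

    def finish(self):
        self.calls.append("finish")
        self.runtime = time.perf_counter() - self._start


def run(collector, segments):
    # segments: list of (matching sub_docnums, segment size) pairs, made up here.
    collector.prepare()
    offset = 0
    for sub_docnums, size in segments:
        collector.set_subsearcher(sub_docnums, offset)
        collector.collect_matches()
        offset += size
    collector.finish()


rc = RecordingCollector()
run(rc, [([0, 3], 5), ([1], 3)])
print(rc.calls)
print(rc.docnums)  # [0, 3, 6]
```
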
abstract sort_key(sub_docnum)

Returns a sorting key for the current match. This should return the same value returned by Collector.collect(), but without the side effect of adding the current document to the results.

If the collector has been prepared with context.needs_current=True, this method can use self.matcher to get information, for example the score. Otherwise, it should only use the provided sub_docnum, since the matcher may be in an inconsistent state.

Subclasses must implement this method.
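
As a rough illustration of the contract between sort_key() and collect(), the toy class below computes the key in one place and reuses it. ScoreKeyCollector and its scores dictionary are hypothetical; a real collector would typically read the score from self.matcher instead.

```python
class ScoreKeyCollector:
    """Toy collector, not a Whoosh class; scores come from a plain dict."""

    def __init__(self, scores):
        self.scores = scores  # hypothetical mapping: sub_docnum -> score
        self.offset = 0
        self.items = []

    def sort_key(self, sub_docnum):
        # Pure lookup: computes the key without touching self.items.
        return self.scores[sub_docnum]

    def collect(self, sub_docnum):
        # Same key as sort_key(), plus the side effect of recording the match.
        key = self.sort_key(sub_docnum)
        self.items.append((key, self.offset + sub_docnum))
        return key


c = ScoreKeyCollector({0: 1.5, 1: 2.5})
peeked = c.sort_key(1)      # nothing is recorded yet
collected = c.collect(1)    # records the match and returns the same key
print(peeked == collected, c.items)  # True [(2.5, 1)]
```
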

class whoosh.collectors.ScoredCollector(replace=10)

Base class for collectors that sort the results based on document score.

Parameters: replace – Number of matches between attempts to replace the matcher with a more efficient version.
collect(sub_docnum)

This method is called for every matched document. It should do the work of adding a matched document to the results, and it should return an object to use as a “sorting key” for the given document (such as the document’s score, a key generated by a facet, or just None). Subclasses must implement this method.

If you want the score for the current document, use self.matcher.score().

Overriding methods should add the current document offset (self.offset) to the sub_docnum to get the top-level document number for the matching document to add to results.

Parameters: sub_docnum – the document number of the current match within the current sub-searcher. You must add self.offset to this number to get the document’s top-level document number.
matches()

Yields a series of relative document numbers for matches in the current subsearcher.

prepare(top_searcher, q, context)

This method is called before a search.

Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.top_searcher
The top-level searcher.
self.q
The query object.
self.context
context.needs_current controls whether a wrapping collector requires that this collector’s matcher be in a valid state at every call to collect(). If this is False, the collector is free to use faster methods that don’t necessarily keep the matcher updated, such as matcher.all_ids().
Parameters:
  • top_searcher – the top-level whoosh.searching.Searcher object.
  • q – the whoosh.query.Query object being searched for.
  • context – a whoosh.searching.SearchContext object containing information about the search.
sort_key(sub_docnum)

Returns a sorting key for the current match. This should return the same value returned by Collector.collect(), but without the side effect of adding the current document to the results.

If the collector has been prepared with context.needs_current=True, this method can use self.matcher to get information, for example the score. Otherwise, it should only use the provided sub_docnum, since the matcher may be in an inconsistent state.

Subclasses must implement this method.

class whoosh.collectors.WrappingCollector(child)

Base class for collectors that wrap other collectors.

all_ids()

Returns a sequence of docnums matched in this collector. (Only valid after the collector is run.)

The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.

collect(sub_docnum)

This method is called for every matched document. It should do the work of adding a matched document to the results, and it should return an object to use as a “sorting key” for the given document (such as the document’s score, a key generated by a facet, or just None). Subclasses must implement this method.

If you want the score for the current document, use self.matcher.score().

Overriding methods should add the current document offset (self.offset) to the sub_docnum to get the top-level document number for the matching document to add to results.

Parameters: sub_docnum – the document number of the current match within the current sub-searcher. You must add self.offset to this number to get the document’s top-level document number.
collect_matches()

This method calls Collector.matches() and then for each matched document calls Collector.collect(). Sub-classes that want to intervene between finding matches and adding them to the collection (for example, to filter out certain documents) can override this method.

count()

Returns the total number of documents matched in this collector. (Only valid after the collector is run.)

The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.

finish()

This method is called after a search.

Subclasses can override this to perform clean-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.runtime
The time (in seconds) the search took.
matches()

Yields a series of relative document numbers for matches in the current subsearcher.

prepare(top_searcher, q, context)

This method is called before a search.

Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.top_searcher
The top-level searcher.
self.q
The query object.
self.context
context.needs_current controls whether a wrapping collector requires that this collector’s matcher be in a valid state at every call to collect(). If this is False, the collector is free to use faster methods that don’t necessarily keep the matcher updated, such as matcher.all_ids().
Parameters:
  • top_searcher – the top-level whoosh.searching.Searcher object.
  • q – the whoosh.query.Query object being searched for.
  • context – a whoosh.searching.SearchContext object containing information about the search.
remove(global_docnum)

Removes a document from the collector. Note that this method takes the global document number, unlike Collector.collect(), which takes a segment-relative docnum.

results()

Returns a Results object containing the results of the search. Subclasses must implement this method.

set_subsearcher(subsearcher, offset)

This method is called each time the collector starts on a new sub-searcher.

Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object:

self.subsearcher
The current sub-searcher. If the top-level searcher is atomic, this is the same as the top-level searcher.
self.offset
The document number offset of the current searcher. You must add this number to the document number passed to Collector.collect() to get the top-level document number for use in results.
self.matcher
A whoosh.matching.Matcher object representing the matches for the query in the current sub-searcher.
sort_key(sub_docnum)

Returns a sorting key for the current match. This should return the same value returned by Collector.collect(), but without the side effect of adding the current document to the results.

If the collector has been prepared with context.needs_current=True, this method can use self.matcher to get information, for example the score. Otherwise, it should only use the provided sub_docnum, since the matcher may be in an inconsistent state.

Subclasses must implement this method.

Basic collectors

class whoosh.collectors.TopCollector(limit=10, usequality=True, **kwargs)

A collector that only returns the top “N” scored results.

Parameters:
  • limit – the maximum number of results to return.
  • usequality – whether to use block-quality optimizations. This may be useful for debugging.
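
The idea of keeping only the top N scored results can be sketched with a bounded min-heap. This is a generic illustration using the standard heapq module, not Whoosh's implementation, and it does not model the block-quality optimizations; the function name top_n and the sample scores are made up.

```python
import heapq


def top_n(scored_docs, limit=10):
    """Return the `limit` highest-scoring (score, docnum) pairs, best first."""
    heap = []  # min-heap holding the best `limit` items seen so far
    for docnum, score in scored_docs:
        item = (score, docnum)
        if len(heap) < limit:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Better than the current worst kept item: replace it.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


docs = [(0, 1.2), (1, 0.4), (2, 3.1), (3, 2.2), (4, 0.9)]
print(top_n(docs, limit=3))  # [(3.1, 2), (2.2, 3), (1.2, 0)]
```

Because the heap never grows beyond limit items, memory use stays bounded no matter how many documents match.
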
class whoosh.collectors.UnlimitedCollector(reverse=False)

A collector that returns all scored results.

Parameters: reverse – If True, reverse the overall results.
class whoosh.collectors.SortingCollector(sortedby, limit=10, reverse=False)

A collector that returns results sorted by a given whoosh.sorting.Facet object. See Sorting and faceting for more information.

Parameters:
  • sortedby – see Sorting and faceting.
  • reverse – If True, reverse the overall results. Note that you can reverse individual facets in a multi-facet sort key as well.

Wrappers

class whoosh.collectors.FilterCollector(child, allow=None, restrict=None)

A collector that lets you allow and/or restrict certain document numbers in the results:

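from whoosh import collectors, query
from whoosh.collectors import FilterCollector

# mysearcher and myquery are assumed to be an open Searcher and a Query
uc = collectors.UnlimitedCollector()

# Allow only documents in the "rendering" chapter; exclude restricted ones
ins = query.Term("chapter", "rendering")
outs = query.Term("status", "restricted")
fc = FilterCollector(uc, allow=ins, restrict=outs)

mysearcher.search_with_collector(myquery, fc)
print(fc.results())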

This collector discards a document if:

  • The allowed set is not None and a document number is not in the set, or
  • The restrict set is not None and a document number is in the set.

(So, if the same document number is in both sets, that document will be discarded.)
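
The two discard rules can be written as a small predicate over plain sets of document numbers. This is only a sketch of the logic stated above; is_discarded and the sample sets are hypothetical and not part of Whoosh.

```python
def is_discarded(docnum, allow=None, restrict=None):
    """Apply the two rules above to one document number."""
    if allow is not None and docnum not in allow:
        return True
    if restrict is not None and docnum in restrict:
        return True
    return False


allow = {1, 2, 3}
restrict = {3, 4}
kept = [d for d in range(6) if not is_discarded(d, allow, restrict)]
print(kept)  # [1, 2]
```

Document 3 is in both sets, so it is discarded, matching the note above.
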

If you have a reference to the collector, you can use FilterCollector.filtered_count to get the number of matching documents filtered out of the results by the collector.

Parameters:
  • child – the collector to wrap.
  • allow – a query, Results object, or set-like object containing document numbers that are allowed in the results, or None (meaning everything is allowed).
  • restrict – a query, Results object, or set-like object containing document numbers to disallow from the results, or None (meaning nothing is disallowed).
class whoosh.collectors.FacetCollector(child, groupedby, maptype=None)

A collector that creates groups of documents based on whoosh.sorting.Facet objects. See Sorting and faceting for more information.

This collector is used if you specify a groupedby parameter in the whoosh.searching.Searcher.search() method. You can use the whoosh.searching.Results.groups() method to access the facet groups.
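
Conceptually, grouping by a facet maps each facet key to the documents that share it. The sketch below shows that idea with a plain dictionary; the field values and document numbers are invented, and the structure Whoosh actually builds depends on the facet map type in use.

```python
from collections import defaultdict

# Hypothetical stored values of a "category" field, keyed by document number.
categories = {0: "fruit", 1: "veg", 2: "fruit", 3: "grain", 4: "veg"}
matched_docnums = [0, 1, 2, 4]

groups = defaultdict(list)
for docnum in matched_docnums:
    # Only matched documents are grouped; document 3 never appears.
    groups[categories[docnum]].append(docnum)

print(dict(groups))  # {'fruit': [0, 2], 'veg': [1, 4]}
```
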

If you have a reference to the collector, you can also use FacetCollector.facetmaps to access the groups directly:

from whoosh import collectors, sorting
from whoosh.collectors import FacetCollector

uc = collectors.UnlimitedCollector()
fc = FacetCollector(uc, sorting.FieldFacet("category"))
mysearcher.search_with_collector(myquery, fc)
print(fc.facetmaps)
Parameters:
class whoosh.collectors.CollapseCollector(child, keyfacet, limit=1, order=None)

A collector that collapses results based on a facet. That is, it eliminates all but the top N results that share the same facet key. Documents with an empty key for the facet are never eliminated.

The “top” results within each group are determined by the result ordering (e.g. highest score in a scored search) or an optional second “ordering” facet.
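
The collapsing rule can be sketched over a list that is already in final result order. This is a simplified model, not Whoosh's algorithm (which has to work while results are still being collected); the function collapse and the sample data are hypothetical.

```python
from collections import Counter


def collapse(ranked_docs, limit=1):
    """ranked_docs: (docnum, key) pairs, best result first; key may be None."""
    kept = []
    seen = Counter()       # how many documents were kept per key
    collapsed = Counter()  # how many documents were eliminated per key
    for docnum, key in ranked_docs:
        if key is None:
            # Documents with an empty key are never eliminated.
            kept.append(docnum)
        elif seen[key] < limit:
            seen[key] += 1
            kept.append(docnum)
        else:
            collapsed[key] += 1
    return kept, dict(collapsed)


ranked = [(7, "a"), (3, "b"), (9, "a"), (1, None), (4, "a"), (8, None)]
kept, collapsed = collapse(ranked, limit=1)
print(kept)       # [7, 3, 1, 8]
print(collapsed)  # {'a': 2}
```
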

If you have a reference to the collector you can use CollapseCollector.collapsed_counts to access the number of documents eliminated based on each key:

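from whoosh.collectors import CollapseCollector, TopCollector

tc = TopCollector(limit=20)
# Keep at most 3 documents for each value of the "group" field
cc = CollapseCollector(tc, "group", limit=3)
mysearcher.search_with_collector(myquery, cc)
print(cc.collapsed_counts)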

See Collapsing results for more information.

Parameters:
  • child – the collector to wrap.
  • keyfacet – a whoosh.sorting.Facet to use for collapsing. All but the top N documents that share a key will be eliminated from the results.
  • limit – the maximum number of documents to keep for each key.
  • order – an optional whoosh.sorting.Facet to use to determine the “top” document(s) to keep when collapsing. The default (order=None) uses the results order (e.g. the highest score in a scored search).
class whoosh.collectors.TimeLimitCollector(child, timelimit, greedy=False, use_alarm=True)

A collector that raises a TimeLimit exception if the search does not complete within a certain number of seconds:

from whoosh import collectors
from whoosh.collectors import TimeLimitCollector

uc = collectors.UnlimitedCollector()
tlc = TimeLimitCollector(uc, timelimit=5.8)
try:
    mysearcher.search_with_collector(myquery, tlc)
except collectors.TimeLimit:
    print("The search ran out of time!")

# We can still get partial results from the collector
print(tlc.results())

IMPORTANT: On Unix systems (systems where signal.SIGALRM is defined), the code uses signals to stop searching immediately when the time limit is reached. Windows does not support this functionality, so there the search only checks the time between found documents. If a matcher is slow, the search can therefore exceed the time limit.
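
The check-between-documents behaviour can be modelled in a few lines. This sketch is an interpretation of the description, not Whoosh code: TimeLimit here is a local stand-in exception, collect_with_deadline and fake_clock are hypothetical helpers, and the injected clock exists only to make the example deterministic.

```python
import time


class TimeLimit(Exception):
    """Local stand-in for the exception named in the text."""


def collect_with_deadline(docnums, timelimit, greedy=False, clock=time.monotonic):
    collected = []
    start = clock()
    for docnum in docnums:
        # The time is only checked once per found document.
        expired = clock() - start > timelimit
        if expired and not greedy:
            raise TimeLimit(collected)
        collected.append(docnum)
        if expired:
            # greedy: the most recent hit was added before raising.
            raise TimeLimit(collected)
    return collected


def fake_clock(times):
    # Returns successive values from `times` on each call.
    it = iter(times)
    return lambda: next(it)


try:
    collect_with_deadline([10, 11, 12, 13], timelimit=5,
                          clock=fake_clock([0, 1, 2, 10, 11]))
except TimeLimit as exc:
    partial = exc.args[0]

print(partial)  # [10, 11]
```

Note how the third document arrives at time 10, well past the limit of 5: nothing interrupts the wait itself, which is the weakness the paragraph above describes.
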

Parameters:
  • child – the collector to wrap.
  • timelimit – the maximum amount of time (in seconds) to allow for searching. If the search takes longer than this, it will raise a TimeLimit exception.
  • greedy – if True, the collector will finish adding the most recent hit before raising the TimeLimit exception.
  • use_alarm – if True (the default), the collector will try to use signal.SIGALRM (on UNIX).
class whoosh.collectors.TermsCollector(child, settype=<class 'set'>)

A collector that remembers which terms appeared in each matched document.

This collector is used if you specify terms=True in the whoosh.searching.Searcher.search() method.

If you have a reference to the collector, you can also use TermsCollector.termdocs and TermsCollector.docterms to access the term information directly:

from whoosh import collectors
from whoosh.collectors import TermsCollector

uc = collectors.UnlimitedCollector()
tc = TermsCollector(uc)
mysearcher.search_with_collector(myquery, tc)
# tc.termdocs is a dictionary mapping (fieldname, text) tuples to
# sets of document numbers
print(tc.termdocs)
# tc.docterms is a dictionary mapping docnums to lists of
# (fieldname, text) tuples
print(tc.docterms)
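
The shape of the two mappings described in the comments above can be shown with plain dictionaries. The matched terms below are invented sample data, and this sketch only mirrors the documented structure, not how Whoosh fills it in.

```python
from collections import defaultdict

# Hypothetical matches: docnum -> list of (fieldname, text) terms that matched.
matched = {
    0: [("title", "render"), ("body", "shader")],
    2: [("body", "shader")],
}

termdocs = defaultdict(set)   # (fieldname, text) -> set of docnums
docterms = defaultdict(list)  # docnum -> list of (fieldname, text)
for docnum, terms in matched.items():
    for term in terms:
        termdocs[term].add(docnum)
        docterms[docnum].append(term)

print(sorted(termdocs[("body", "shader")]))  # [0, 2]
print(docterms[0])  # [('title', 'render'), ('body', 'shader')]
```
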