Indexing and searching document hierarchies

Indexing and searching document hierarchies

Indexing and searching document hierarchies

Overview

Whoosh’s full-text index is essentially a flat database of documents. However, Whoosh supports two techniques for simulating the indexing and querying of hierarchical documents, that is, sets of documents that form a parent-child hierarchy, such as “Chapter - Section - Paragraph” or “Module - Class - Method”.

You can specify parent-child relationships at indexing time, by grouping documents in the same hierarchy, and then use the whoosh.query.NestedParent and/or whoosh.query.NestedChildren to find parents based on their children or vice-versa.

Alternatively, you can use query time joins, essentially like external key joins in a database, where you perform one search to find a relevant document, then use a stored value on that document (for example, a parent field) to look up another document.

Both methods have pros and cons.

Using nested document indexing

Indexing

This method works by indexing a “parent” document and all its “child” documents as a “group” so they are guaranteed to end up in the same segment. You can use the context manager returned by IndexWriter.group() to group documents:

with ix.writer() as w:
    with w.group():
        w.add_document(kind="class", name="Index")
        w.add_document(kind="method", name="add document")
        w.add_document(kind="method", name="add reader")
        w.add_document(kind="method", name="close")
    with w.group():
        w.add_document(kind="class", name="Accumulator")
        w.add_document(kind="method", name="add")
        w.add_document(kind="method", name="get result")
    with w.group():
        w.add_document(kind="class", name="Calculator")
        w.add_document(kind="method", name="add")
        w.add_document(kind="method", name="add all")
        w.add_document(kind="method", name="add some")
        w.add_document(kind="method", name="multiply")
        w.add_document(kind="method", name="close")
    with w.group():
        w.add_document(kind="class", name="Deleter")
        w.add_document(kind="method", name="add")
        w.add_document(kind="method", name="delete")

Alternatively you can use the start_group() and end_group() methods:

with ix.writer() as w:
    w.start_group()
    w.add_document(kind="class", name="Index")
    w.add_document(kind="method", name="add document")
    w.add_document(kind="method", name="add reader")
    w.add_document(kind="method", name="close")
    w.end_group()

Each level of the hierarchy should have a query that distinguishes it from other levels (for example, in the above index, you can use kind:class or kind:method to match different levels of the hierarchy).

Once you’ve indexed the hierarchy of documents, you can use two query types to find parents based on children or vice-versa.

(There is currently no support in the default query parser for nested queries.)

NestedParent query

The whoosh.query.NestedParent query type lets you specify a query for child documents, but have the query return an “ancestor” document from higher in the hierarchy:

# First, we need a query that matches all the documents in the "parent"
# level we want of the hierarchy
all_parents = query.Term("kind", "class")

# Then, we need a query that matches the children we want to find
wanted_kids = query.Term("name", "close")

# Now we can make a query that will match documents where "name" is
# "close", but the query will return the "parent" documents of the matching
# children
q = query.NestedParent(all_parents, wanted_kids)
# results = Index, Calculator

Note that in a hierarchy with more than two levels, you can specify a “parents” query that matches any level of the hierarchy, so you can return the top-level ancestors of the matching children, or the second level, third level, etc.

The query works by first building a bit vector representing which documents are “parents”:

Index
|      Calculator
|      |
1000100100000100
    |        |
    |        Deleter
    Accumulator

Then for each match of the “child” query, it calculates the previous parent from the bit vector and returns it as a match (it only returns each parent once no matter how many children match). This parent lookup is very efficient:

1000100100000100
   |
|<-+ close

NestedChildren query

The opposite of NestedParent is whoosh.query.NestedChildren. This query lets you match parents but return their children. This is useful, for example, to search for an album title and return the songs in the album:

# Query that matches all documents in the "parent" level we want to match
# at
all_parents = query.Term("kind", "album")

# Parent documents we want to match
wanted_parents = query.Term("album_title", "heaven")

# Now we can make a query that will match parent documents where "album_title"
# contains "heaven", but the query will return the "child" documents of the
# matching parents
q1 = query.NestedChildren(all_parents, wanted_parents)

You can then combine that query with an AND clause, for example to find songs with “hell” in the song title that occur on albums with “heaven” in the album title:

q2 = query.And([q1, query.Term("song_title", "hell")])

Deleting and updating hierarchical documents

The drawback of the index-time method is updating and deleting. Because the implementation of the queries depends on the parent and child documents being contiguous in the segment, you can’t update/delete just one child document. You can only update/delete an entire top-level document at once (for example, if your hierarchy is “Chapter - Section - Paragraph”, you can only update or delete entire chapters, not a section or paragraph). If the top-level of the hierarchy represents very large blocks of text, this can involve a lot of deleting and reindexing.

Currently Writer.update_document() does not automatically work with nested documents. You must manually delete and re-add document groups to update them.

To delete nested document groups, use the Writer.delete_by_query() method with a NestedParent query:

# Delete the "Accumulator" class
all_parents = query.Term("kind", "class")
to_delete = query.Term("name", "Accumulator")
q = query.NestedParent(all_parents, to_delete)
with myindex.writer() as w:
    w.delete_by_query(q)

Using query-time joins

A second technique for simulating hierarchical documents in Whoosh involves using a stored field on each document to point to its parent, and then using the value of that field at query time to find parents and children.

For example, if we index a hierarchy of classes and methods using pointers to parents instead of nesting:

# Store a pointer to the parent on each "method" document
with ix.writer() as w:
    w.add_document(kind="class", c_name="Index", docstring="...")
    w.add_document(kind="method", m_name="add document", parent="Index")
    w.add_document(kind="method", m_name="add reader", parent="Index")
    w.add_document(kind="method", m_name="close", parent="Index")

    w.add_document(kind="class", c_name="Accumulator", docstring="...")
    w.add_document(kind="method", m_name="add", parent="Accumulator")
    w.add_document(kind="method", m_name="get result", parent="Accumulator")

    w.add_document(kind="class", c_name="Calculator", docstring="...")
    w.add_document(kind="method", m_name="add", parent="Calculator")
    w.add_document(kind="method", m_name="add all", parent="Calculator")
    w.add_document(kind="method", m_name="add some", parent="Calculator")
    w.add_document(kind="method", m_name="multiply", parent="Calculator")
    w.add_document(kind="method", m_name="close", parent="Calculator")

    w.add_document(kind="class", c_name="Deleter", docstring="...")
    w.add_document(kind="method", m_name="add", parent="Deleter")
    w.add_document(kind="method", m_name="delete", parent="Deleter")

# Now do manual joins at query time
with ix.searcher() as s:
    # Tip: Searcher.document() and Searcher.documents() let you look up
    # documents by field values more easily than using Searcher.search()

    # Children to parents:
    # Print the docstrings of classes on which "close" methods occur
    for child_doc in s.documents(m_name="close"):
        # Use the stored value of the "parent" field to look up the parent
        # document
        parent_doc = s.document(c_name=child_doc["parent"])
        # Print the parent document's stored docstring field
        print(parent_doc["docstring"])

    # Parents to children:
    # Find classes with "big" in the docstring and print their methods
    q = query.Term("kind", "class") & query.Term("docstring", "big")
    for hit in s.search(q, limit=None):
        print("Class name=", hit["c_name"], "methods:")
        for child_doc in s.documents(parent=hit["c_name"]):
            print("  Method name=", child_doc["m_name"])

This technique is more flexible than index-time nesting in that you can delete/update individual documents in the hierarchy piece by piece, although it doesn’t support finding different parent levels as easily. It is also slower than index-time nesting (potentially much slower), since you must perform additional searches for each found document.

Future versions of Whoosh may include “join” queries to make this process more efficient (or at least more automatic).