Snippet generator documentation is incorrect #420

@kevinhu

The snippet generator example suggests that the offsets produced by snippet.highlighted() can be used for slicing the text of the corresponding document:

highlights = snippet.highlighted()
first_highlight = highlights[0]
assert first_highlight.start == 93
assert first_highlight.end == 97
assert hit_text[first_highlight.start:first_highlight.end] == "days"

However, the source implementation of to_html shows that these offsets are relative to the snippet's fragment, not the document text: https://docs.rs/tantivy/latest/src/tantivy/snippet/mod.rs.html#149
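
To make the fragment-relative behaviour concrete, here is a rough pure-Python sketch of how a to_html-style renderer consumes the ranges. It is not the library's implementation: `render_html`, the sample fragment, and the hard-coded ranges are all hypothetical, and the HTML escaping is an assumption about what such a renderer would do. The point is only that every range indexes into the fragment string, never into the document.

```python
from html import escape


def render_html(fragment: str, ranges: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) range of `fragment` in <b> tags.

    Hypothetical helper: the ranges are offsets into `fragment`,
    which is how the snippet offsets appear to be used by to_html.
    """
    parts = []
    cursor = 0
    for start, end in ranges:
        # Text between the previous highlight and this one.
        parts.append(escape(fragment[cursor:start]))
        # The highlighted term itself.
        parts.append("<b>" + escape(fragment[start:end]) + "</b>")
        cursor = end
    # Trailing text after the last highlight.
    parts.append(escape(fragment[cursor:]))
    return "".join(parts)


# Made-up fragment and ranges, for illustration only.
fragment = "scars from handling heavy fish"
ranges = [(20, 25), (26, 30)]
print(render_html(fragment, ranges))
```
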

Because the ranges are relative to the fragment and not the document, they only line up with the document text when the fragment starts at offset 0. If the snippet comes from a later portion of the document, slicing the document text with these ranges does not retrieve the highlighted terms:

# %%
from tantivy import (
    Document,
    Index,
    SchemaBuilder,
    SnippetGenerator,
)

doc_schema = SchemaBuilder().add_text_field("text", stored=True).build()
index = Index(doc_schema)
writer = index.writer()

doc_1 = Document()
doc_1.add_text("text", "Teach a man to fish and he will eat for the rest of his life.")
_ = writer.add_document(doc_1)

doc_2 = Document()
doc_2.add_text(
    "text",
    """He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish. In the first forty days a boy had been with him. But after forty days without a fish the boy's parents had told him that the old man was now definitely and finally salao, which is the worst form of unlucky, and the boy had gone at their orders in another boat which caught three good fish the first week. It made the boy sad to see the old man come in each day with his skiff empty and he always went down to help him carry either the coiled lines or the gaff and harpoon and the sail that was furled around the mast. The sail was patched with flour sacks and, furled, it looked like the flag of permanent defeat.

The old man was thin and gaunt with deep wrinkles in the back of his neck. The brown blotches of the benevolent skin cancer the sun brings from its reflection on the tropic sea were on his cheeks. The blotches ran well down the sides of his face and his hands had the deep-creased scars from handling heavy fish on the cords. But none of these scars were fresh. They were as old as erosions in a fishless desert.""",
)
_ = writer.add_document(doc_2)

_ = writer.commit()
_ = writer.wait_merging_threads()
index.reload()


def search(query_string: str) -> None:
    query = index.parse_query(query_string, ["text"])
    searcher = index.searcher()

    doc_results = searcher.search(query, limit=10).hits

    snippet_generator = SnippetGenerator.create(searcher, query, doc_schema, "text")

    for _, doc_address in doc_results:
        doc = searcher.doc(doc_address)

        doc_text = doc.get_first("text")

        if not doc_text:
            raise ValueError("Doc text not found")

        snippet = snippet_generator.snippet_from_doc(doc)

        print("Snippet HTML: ", snippet.to_html())

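        # The ranges are relative to the snippet fragment, but they are used
        # here to slice the full document text, as the documentation suggests.
        for snippet_range in snippet.highlighted():
            print("Highlighted: ", doc_text[snippet_range.start : snippet_range.end])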


search("fish")
"""
Snippet HTML:  Teach a man to <b>fish</b> and he will eat for the rest of his life
Highlighted:  fish
Snippet HTML:  He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a <b>fish</b>. In the first forty days a
Highlighted:  fish
"""

search("heavy fish")
"""
Snippet HTML:  the tropic sea were on his cheeks. The blotches ran well down the sides of his face and his hands had the deep-creased scars from handling <b>heavy</b> <b>fish</b>
Highlighted:  orty 
Highlighted:  ays 
Snippet HTML:  Teach a man to <b>fish</b> and he will eat for the rest of his life
Highlighted:  fish
"""
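
One possible workaround, sketched under several assumptions: if the fragment text is available as a string, the fragment-relative ranges can be shifted by the fragment's position in the document. `to_document_ranges` is a hypothetical helper, not part of tantivy, and how the fragment string is obtained from the binding is left open here. The sketch also assumes the offsets are character offsets; for non-ASCII text they may be byte-based and would need separate handling. It further assumes the fragment occurs verbatim, and only once, in the document.

```python
def to_document_ranges(
    doc_text: str,
    fragment: str,
    fragment_ranges: list[tuple[int, int]],
) -> list[tuple[int, int]]:
    """Shift fragment-relative (start, end) ranges to document offsets.

    Hypothetical helper: locates the first occurrence of `fragment`
    in `doc_text` and adds that position to every range.
    """
    offset = doc_text.find(fragment)
    if offset < 0:
        raise ValueError("fragment not found in document text")
    return [(offset + start, offset + end) for start, end in fragment_ranges]


# Made-up data standing in for a stored document and its snippet.
doc_text = (
    "In the first forty days a boy had been with him. "
    "His hands had scars from handling heavy fish on the cords."
)
fragment = "scars from handling heavy fish"
fragment_ranges = [(20, 25), (26, 30)]  # offsets into `fragment`

# Slicing the document with the raw ranges hits unrelated text.
print([doc_text[start:end] for start, end in fragment_ranges])

# Shifting by the fragment's position recovers the highlighted terms.
shifted = to_document_ranges(doc_text, fragment, fragment_ranges)
print([doc_text[start:end] for start, end in shifted])
```

A cleaner fix would be for the snippet to expose the fragment's offset in the document directly, since searching for the fragment text is ambiguous when the same passage appears more than once.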

Labels: bug (Something isn't working), help wanted (Extra attention is needed)
