Open
Labels: bug, help wanted
Description
The snippet generator example suggests that the offsets produced by snippet.highlighted()
can be used for slicing the text of the corresponding document:
highlights = snippet.highlighted()
first_highlight = highlights[0]
assert first_highlight.start == 93
assert first_highlight.end == 97
assert hit_text[first_highlight.start:first_highlight.end] == "days"
However, the source implementation of to_html shows that these offsets are relative to the snippet's fragment, not to the document text: https://docs.rs/tantivy/latest/src/tantivy/snippet/mod.rs.html#149
As a result, whenever the fragment starts partway into the document, slicing the document text with these ranges returns the wrong text. The second search below reproduces this:
# %%
from tantivy import (
    Document,
    Index,
    SchemaBuilder,
    SnippetGenerator,
)
doc_schema = SchemaBuilder().add_text_field("text", stored=True).build()
index = Index(doc_schema)
writer = index.writer()
doc_1 = Document()
doc_1.add_text("text", "Teach a man to fish and he will eat for the rest of his life.")
_ = writer.add_document(doc_1)
doc_2 = Document()
doc_2.add_text(
"text",
"""He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish. In the first forty days a boy had been with him. But after forty days without a fish the boy's parents had told him that the old man was now definitely and finally salao, which is the worst form of unlucky, and the boy had gone at their orders in another boat which caught three good fish the first week. It made the boy sad to see the old man come in each day with his skiff empty and he always went down to help him carry either the coiled lines or the gaff and harpoon and the sail that was furled around the mast. The sail was patched with flour sacks and, furled, it looked like the flag of permanent defeat.
The old man was thin and gaunt with deep wrinkles in the back of his neck. The brown blotches of the benevolent skin cancer the sun brings from its reflection on the tropic sea were on his cheeks. The blotches ran well down the sides of his face and his hands had the deep-creased scars from handling heavy fish on the cords. But none of these scars were fresh. They were as old as erosions in a fishless desert.""",
)
_ = writer.add_document(doc_2)
_ = writer.commit()
_ = writer.wait_merging_threads()
index.reload()
def search(query_string: str) -> None:
    query = index.parse_query(query_string, ["text"])
    searcher = index.searcher()
    doc_results = searcher.search(query, limit=10).hits
    snippet_generator = SnippetGenerator.create(searcher, query, doc_schema, "text")
    for _, doc_address in doc_results:
        doc = searcher.doc(doc_address)
        doc_text = doc.get_first("text")
        if not doc_text:
            raise ValueError("Doc text not found")
        snippet = snippet_generator.snippet_from_doc(doc)
        print("Snippet HTML: ", snippet.to_html())
        for snippet_range in snippet.highlighted():
            # Bug being reported: these ranges index into the fragment, not doc_text
            print("Highlighted: ", doc_text[snippet_range.start : snippet_range.end])
search("fish")
"""
Snippet HTML: Teach a man to <b>fish</b> and he will eat for the rest of his life
Highlighted: fish
Snippet HTML: He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a <b>fish</b>. In the first forty days a
Highlighted: fish
"""
search("heavy fish")
"""
Snippet HTML: the tropic sea were on his cheeks. The blotches ran well down the sides of his face and his hands had the deep-creased scars from handling <b>heavy</b> <b>fish</b>
Highlighted: orty
Highlighted: ays
Snippet HTML: Teach a man to <b>fish</b> and he will eat for the rest of his life
Highlighted: fish
"""
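Until the offsets are fixed or documented, one possible workaround is to rebuild the fragment from the to_html() output and shift the ranges by the fragment's position in the document. The sketch below is pure Python and does not call tantivy. The helper names highlights_from_html and to_document_ranges are hypothetical, and it assumes the HTML uses plain, non-nested <b>...</b> tags with HTML-escaped text, as in the output above. It also works in Python character offsets, so it may need adjusting if the library's ranges are byte offsets and the text is not ASCII.

```python
import html
import re


def highlights_from_html(snippet_html: str) -> tuple[str, list[tuple[int, int]]]:
    """Recover the plain fragment and fragment-relative highlight ranges.

    Assumes highlights are marked with non-nested <b>...</b> tags and that the
    surrounding text is HTML-escaped (so a literal "<b>" cannot appear in it).
    """
    # Splitting on the tags yields alternating pieces: plain, highlighted, plain, ...
    pieces = re.split(r"</?b>", snippet_html)
    fragment = ""
    ranges: list[tuple[int, int]] = []
    for i, piece in enumerate(pieces):
        text = html.unescape(piece)
        if i % 2 == 1:  # odd-indexed pieces were inside <b>...</b>
            ranges.append((len(fragment), len(fragment) + len(text)))
        fragment += text
    return fragment, ranges


def to_document_ranges(
    doc_text: str, fragment: str, ranges: list[tuple[int, int]]
) -> list[tuple[int, int]]:
    """Shift fragment-relative ranges so they index into doc_text."""
    base = doc_text.find(fragment)  # first occurrence of the fragment
    if base == -1:
        raise ValueError("Fragment not found in document text")
    return [(base + start, base + end) for start, end in ranges]


# Small stand-in for the stored document and the snippet HTML from the report.
doc_text = (
    "But none of these scars were fresh. "
    "His hands had scars from handling heavy fish on the cords."
)
snippet_html = "scars from handling <b>heavy</b> <b>fish</b>"

fragment, fragment_ranges = highlights_from_html(snippet_html)
doc_ranges = to_document_ranges(doc_text, fragment, fragment_ranges)
print([doc_text[start:end] for start, end in doc_ranges])  # ['heavy', 'fish']
```

If the same fragment text occurs more than once in the document, str.find picks the first occurrence. The sliced words are still correct, but the offsets may point at a different occurrence than the one the snippet generator selected.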