Commit a2549a9

doc: Add mktestdocs, and tutorial for snippets. Fixes #219 (#223)
1 parent 61f37e7 commit a2549a9

5 files changed, +108 -6 lines changed


docs/reference.md

+45 -3
@@ -1,12 +1,52 @@
 # Reference

+## Setup
+
+We'll use a test index for the examples that follow.
+
+```python
+import os
+from tantivy import SchemaBuilder, Index, Document
+schema = (
+    SchemaBuilder()
+    .add_integer_field("doc_id", indexed=True, stored=True)
+    .add_text_field("title", stored=True)
+    .add_text_field("body")
+    .build()
+)
+index = Index(schema=schema, path=None)
+writer = index.writer(heap_size=15_000_000, num_threads=1)
+doc = Document()
+doc.add_integer("doc_id", 1)
+doc.add_text("title", "The Old Man and the Sea")
+doc.add_text(
+    "body",
+    (
+        "He was an old man who fished alone in a skiff in "
+        "the Gulf Stream and he had gone eighty-four days "
+        "now without taking a fish."
+    ),
+)
+writer.add_document(doc)
+
+doc = Document()
+doc.add_integer("doc_id", 2)
+doc.add_text("title", "The Old Man and the Sea II")
+doc.add_text("body", "He was an old man who sailed alone.")
+
+writer.add_document(doc)
+writer.commit()
+index.reload()
+```
+
 ## Valid Query Formats

 tantivy-py supports the [query language](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html#method.parse_query) used in tantivy.
 Below a few basic query formats are shown:

 - AND and OR conjunctions.
 ```python
+searcher = index.searcher()
 query = index.parse_query('(Old AND Man) OR Stream', ["title", "body"])
 (best_score, best_doc_address) = searcher.search(query, 3).hits[0]
 best_doc = searcher.doc(best_doc_address)
@@ -29,7 +69,7 @@ best_doc = searcher.doc(best_doc_address)

 - integer search
 ```python
-query = index.parse_query('"eighty-four days"', ["doc_id"])
+query = index.parse_query('1', ["doc_id"])
 (best_score, best_doc_address) = searcher.search(query, 3).hits[0]
 best_doc = searcher.doc(best_doc_address)
 ```
@@ -46,8 +86,10 @@ the search query in additional quotes, as if a phrase query were being used.
 The following will NOT work:

 ```python
-# Raises ValueError
-index.parse_query(r'sea\"', ["title", "body"])
+try:
+    index.parse_query(r'sea\"', ["title", "body"])
+except ValueError as e:
+    assert str(e) == r'Syntax Error: sea\"'
 ```

 However, the following will succeed:

docs/requirements.txt

+1
@@ -1 +1,2 @@
 mkdocs==1.4.3
+mktestdocs==0.2.1

docs/tutorials.md

+49 -3
@@ -3,6 +3,8 @@
 ## Building an index and populating it

 ```python
+import tempfile
+import pathlib
 import tantivy

 # Declaring our schema.
@@ -20,7 +22,10 @@ To have a persistent index, use the path
 parameter to store the index on the disk, e.g:

 ```python
-index = tantivy.Index(schema, path=os.getcwd() + '/index')
+tmpdir = tempfile.TemporaryDirectory()
+index_path = pathlib.Path(tmpdir.name) / "index"
+index_path.mkdir()
+persistent_index = tantivy.Index(schema, path=str(index_path))
 ```

 By default, tantivy offers the following tokenizers
@@ -44,7 +49,8 @@ which can be used in tantivy-py:

 to use the above tokenizers, simply provide them as a parameter to `add_text_field`. e.g.
 ```python
-schema_builder.add_text_field("body", stored=True, tokenizer_name='en_stem')
+schema_builder_tok = tantivy.SchemaBuilder()
+schema_builder_tok.add_text_field("body", stored=True, tokenizer_name='en_stem')
 ```

 ## Adding one document.
@@ -77,6 +83,46 @@ query = index.parse_query("fish days", ["title", "body"])
 (best_score, best_doc_address) = searcher.search(query, 3).hits[0]
 best_doc = searcher.doc(best_doc_address)
 assert best_doc["title"] == ["The Old Man and the Sea"]
-print(best_doc)
 ```

+## Using the snippet generator
+
+```python
+hit_text = best_doc["body"][0]
+print(f"{hit_text=}")
+assert hit_text == (
+    "He was an old man who fished alone in a skiff in the "
+    "Gulf Stream and he had gone eighty-four days now "
+    "without taking a fish."
+)
+
+from tantivy import SnippetGenerator
+snippet_generator = SnippetGenerator.create(
+    searcher, query, schema, "body"
+)
+snippet = snippet_generator.snippet_from_doc(best_doc)
+```
+
+The snippet object provides the hit ranges. These are the marker
+offsets in the text that match the query.
+
+```python
+highlights = snippet.highlighted()
+first_highlight = highlights[0]
+assert first_highlight.start == 93
+assert first_highlight.end == 97
+assert hit_text[first_highlight.start:first_highlight.end] == "days"
+```
+
+The snippet object can also generate a marked-up HTML snippet:
+
+```python
+html_snippet = snippet.to_html()
+assert html_snippet == (
+    "He was an old man who fished alone in a skiff in the "
+    "Gulf Stream and he had gone eighty-four <b>days</b> now "
+    "without taking a <b>fish</b>"
+)
+```
+
+
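
To make the link between the highlight offsets and the marked-up snippet in the hunk above more concrete, here is a minimal pure-Python sketch. `mark_ranges` is a hypothetical helper, not part of tantivy-py, and it assumes the ranges do not overlap. Unlike `snippet.to_html()`, it marks up the whole text rather than a selected fragment, so the trailing period is kept.

```python
import html


def mark_ranges(text, ranges, tag="b"):
    """Wrap each (start, end) character range of `text` in an HTML tag.

    Hypothetical helper for illustration; assumes non-overlapping ranges.
    """
    parts = []
    cursor = 0
    for start, end in sorted(ranges):
        # Escape and copy the unmarked text before this range.
        parts.append(html.escape(text[cursor:start]))
        # Wrap the matched range in the tag.
        parts.append(f"<{tag}>{html.escape(text[start:end])}</{tag}>")
        cursor = end
    # Copy whatever follows the last range.
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


hit_text = (
    "He was an old man who fished alone in a skiff in the "
    "Gulf Stream and he had gone eighty-four days now "
    "without taking a fish."
)

# Locate the two query terms; rindex skips the "fish" inside "fished".
days = hit_text.index("days")
fish = hit_text.rindex("fish")
marked = mark_ranges(hit_text, [(days, days + 4), (fish, fish + 4)])
print(marked)
```

The offset found for `days` matches the `start == 93` asserted in the tutorial, which is a quick way to sanity-check that the ranges index into the original field text.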

requirements-dev.txt

+1
@@ -1,2 +1,3 @@
 maturin
 pytest>=4.0
+mktestdocs==0.2.1

tests/test_docs.py

+12
@@ -0,0 +1,12 @@
+from pathlib import Path
+import pytest
+
+from mktestdocs import check_md_file
+
+def test_hello():
+    assert True
+
+
+@pytest.mark.parametrize("filepath", Path("docs").glob("**/*.md"), ids=str)
+def test_docs(filepath):
+    check_md_file(filepath, memory=True)
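
As a rough illustration of what `check_md_file(filepath, memory=True)` is asked to do in this test, the sketch below pulls the Python fences out of a Markdown string and executes them in order. `run_python_blocks` and its regex are assumptions made for illustration, not mktestdocs's actual implementation. The `memory` flag is modelled as a namespace shared across blocks, which would explain why later snippets in the docs can reuse names such as `index` defined in earlier ones.

```python
import re

# Build the fence marker indirectly so this example can itself sit in a fence.
FENCE = "`" * 3
FENCE_RE = re.compile(
    rf"^{FENCE}python\n(.*?)^{FENCE}", re.DOTALL | re.MULTILINE
)


def run_python_blocks(markdown_text, memory=True):
    """Execute every python fence in `markdown_text`; return the block count.

    Hypothetical sketch: with memory=True all blocks share one namespace,
    otherwise each block starts from an empty one.
    """
    namespace = {}
    blocks = FENCE_RE.findall(markdown_text)
    for block in blocks:
        if not memory:
            namespace = {}
        # Any exception (including AssertionError) propagates to the caller.
        exec(block, namespace)
    return len(blocks)


sample = "\n".join(
    [
        "# Title",
        "",
        FENCE + "python",
        "x = 1",
        FENCE,
        "",
        "Some prose between the snippets.",
        "",
        FENCE + "python",
        "assert x + 1 == 2",
        FENCE,
        "",
    ]
)

# With shared memory the second block can see `x` from the first.
print(run_python_blocks(sample, memory=True))
```

Without shared memory the second block in `sample` fails with a `NameError`, which is the situation the `Setup` section added to `docs/reference.md` avoids by defining `index` before the query examples run.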
