
Commit aa9d368

Add demo of hybrid retrieval.
Signed-off-by: wxywb <[email protected]>
1 parent 0c02d6c commit aa9d368

9 files changed: +325 -0 lines changed
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[theme]
base = "dark"
primaryColor = "#4fc4f9"
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# Hybrid Semantic Search with Milvus

<div style="text-align: center;">
  <figure>
    <img src="./pics/demo.jpg" alt="Hybrid search demo screenshot" width="700"/>
  </figure>
</div>

The Milvus Hybrid Search Demo uses the BGE-M3 model to provide advanced search results. Users can enter queries to receive Dense, Sparse, and Hybrid responses. Dense responses focus on the semantic context, while Sparse responses emphasize keyword matching. The Hybrid approach combines both methods, offering comprehensive results that capture both context and specific keywords. The demo highlights how integrating multiple retrieval strategies improves search relevance by balancing semantic and lexical similarity.

## Features

1. Embed the text as dense and sparse vectors.
2. Set up a Milvus collection to store the dense and sparse vectors.
3. Insert the data into Milvus.
4. Search and inspect the results (see the condensed sketch below).
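
The scripts added in this commit implement these four steps end to end. The snippet below is a condensed, illustrative sketch of the same flow: the collection name, field names, index parameters, and ranker weights mirror the demo code, while the sample documents and query are made up.

```python
from pymilvus import (
    AnnSearchRequest, Collection, CollectionSchema, DataType,
    FieldSchema, WeightedRanker, connections,
)
from pymilvus.model.hybrid import BGEM3EmbeddingFunction

# 1. Embed the text as dense and sparse vectors with BGE-M3.
ef = BGEM3EmbeddingFunction(use_fp16=False, device="cpu")
docs = ["What is the capital of France?", "How do I learn Python quickly?"]
embeddings = ef(docs)

# 2. Set up a collection with one field per vector type (Milvus Lite local file).
connections.connect("default", uri="milvus.db")
fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=ef.dim["dense"]),
]
col = Collection("hybrid_demo", CollectionSchema(fields), consistency_level="Strong")
col.create_index("sparse_vector", {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"})
col.create_index("dense_vector", {"index_type": "FLAT", "metric_type": "IP"})
col.load()

# 3. Insert the texts together with their sparse and dense embeddings.
col.insert([docs, embeddings["sparse"], embeddings["dense"]])
col.flush()

# 4. Search both vector fields and fuse the result lists with a weighted ranker.
q = ef(["capital city of France"])
res = col.hybrid_search(
    [
        AnnSearchRequest(q["sparse"], "sparse_vector", {"metric_type": "IP"}, limit=3),
        AnnSearchRequest(q["dense"], "dense_vector", {"metric_type": "IP"}, limit=3),
    ],
    rerank=WeightedRanker(0.7, 1.0),  # (sparse weight, dense weight)
    limit=3,
    output_fields=["text"],
)
print([hit.fields["text"] for hit in res[0]])
```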

## Quick Deploy

Follow these steps to quickly deploy the application locally:

### Preparation

> Prerequisites: Python 3.8 or higher

**1. Download the Code**

```bash
$ git clone https://github.com/milvus-io/bootcamp.git
$ cd bootcamp/bootcamp/tutorials/quickstart/app/hybrid_demo_with_milvus
```

**2. Installation**

Run the following commands to install the required libraries:

```bash
$ pip install pymilvus
$ pip install pymilvus[model]
```

Then install the remaining dependencies:

```bash
$ pip install -r requirements.txt
```

**3. Data Download**

Download the Quora Duplicate Questions dataset and place it in the same directory:

```bash
$ wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
```

Credit for the dataset: [First Quora Dataset Release: Question Pairs](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs)

### Start Service

Run the Streamlit application:

```bash
$ streamlit run ui.py
```
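
Streamlit prints a local URL when the app starts (http://localhost:8501 by default); open it in a browser to compare the Dense, Sparse, and Hybrid results side by side.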
Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
"""
Hybrid Semantic Search with Milvus

This demo showcases hybrid semantic search using both dense and sparse vectors with Milvus.
You can optionally use the BGE-M3 model to embed text into dense and sparse vectors, or use randomly generated vectors as an example.
Additionally, you can rerank the search results using the BGE CrossEncoder model.

Prerequisites:
- Milvus 2.4.0 or higher (sparse vector search is available only in these versions).
  Follow this guide to set up Milvus: https://milvus.io/docs/install_standalone-docker.md
- pymilvus Python client library to connect to the Milvus server.
- Optional `model` module in pymilvus for the BGE-M3 model.

Installation:
Run the following commands to install the required libraries:
    pip install pymilvus
    pip install pymilvus[model]

Steps:
1. Embed the text as dense and sparse vectors.
2. Set up a Milvus collection to store the dense and sparse vectors.
3. Insert the data into Milvus.
4. Search and inspect the results.
"""

use_bge_m3 = True
use_reranker = True

import random

import numpy as np
import pandas as pd

from pymilvus import (
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
    connections,
)

# 1. prepare a small corpus to search
file_path = "quora_duplicate_questions.tsv"
df = pd.read_csv(file_path, sep="\t")
questions = set()
for _, row in df.iterrows():
    obj = row.to_dict()
    questions.add(obj["question1"][:512])
    questions.add(obj["question2"][:512])
    if len(questions) > 10000:
        break

docs = list(questions)


def random_embedding(texts):
    # Fallback embedding: randomly generated dense and sparse vectors,
    # used only when the BGE-M3 model is not enabled.
    rng = np.random.default_rng()
    return {
        "dense": rng.random((len(texts), 768)),
        "sparse": [
            {
                d: rng.random()
                for d in random.sample(range(1000), random.randint(20, 30))
            }
            for _ in texts
        ],
    }


dense_dim = 768
ef = random_embedding

if use_bge_m3:
    # The BGE-M3 model can embed texts as dense and sparse vectors.
    # It is included in the optional `model` module in pymilvus; to install it,
    # simply run "pip install pymilvus[model]".
    from pymilvus.model.hybrid import BGEM3EmbeddingFunction

    ef = BGEM3EmbeddingFunction(use_fp16=False, device="cuda")
    dense_dim = ef.dim["dense"]

docs_embeddings = ef(docs)

# 2. setup Milvus collection and index
connections.connect("default", uri="milvus.db")

# Specify the data schema for the new Collection.
fields = [
    # Use an auto-generated id as the primary key
    FieldSchema(
        name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100
    ),
    # Store the original text so it can be returned with the search results
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=512),
    # Milvus now supports both sparse and dense vectors,
    # so we can store each in a separate field and conduct hybrid search on both
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
]
schema = CollectionSchema(fields, "")
col_name = "hybrid_demo"
# Now we can create the new collection with the above name and schema.
col = Collection(col_name, schema, consistency_level="Strong")

# We need to create indices for the vector fields. The indices will be loaded
# into memory for efficient search.
sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
col.create_index("sparse_vector", sparse_index)
dense_index = {"index_type": "FLAT", "metric_type": "IP"}
col.create_index("dense_vector", dense_index)
col.load()

# 3. insert text and sparse/dense vector representations into the collection,
# in batches of 50 entities
for i in range(0, len(docs), 50):
    batched_entities = [
        docs[i : i + 50],
        docs_embeddings["sparse"][i : i + 50],
        docs_embeddings["dense"][i : i + 50],
    ]
    col.insert(batched_entities)
col.flush()
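
# --- Illustrative addition, not part of the original script ---
# 4. Search and inspect the results. This minimal sketch mirrors the hybrid
# search performed by ui.py in this commit (same field names, IP metric, and
# WeightedRanker fusion); the query string is just an example. The reranking
# branch assumes pymilvus' optional model module provides a BGE cross-encoder
# reranker (pymilvus.model.reranker.BGERerankFunction); adjust if your
# pymilvus version differs.
from pymilvus import AnnSearchRequest, WeightedRanker

query = "How can I improve my English speaking skills?"
query_embeddings = ef([query])

sparse_req = AnnSearchRequest(
    query_embeddings["sparse"], "sparse_vector", {"metric_type": "IP"}, limit=10
)
dense_req = AnnSearchRequest(
    query_embeddings["dense"], "dense_vector", {"metric_type": "IP"}, limit=10
)
res = col.hybrid_search(
    [sparse_req, dense_req],
    rerank=WeightedRanker(0.7, 1.0),  # (sparse weight, dense weight)
    limit=10,
    output_fields=["text"],
)
hits = [hit.fields["text"] for hit in res[0]]

if use_reranker:
    # Optional cross-encoder reranking, as mentioned in the docstring above.
    from pymilvus.model.reranker import BGERerankFunction

    bge_rf = BGERerankFunction(device="cuda")
    hits = [r.text for r in bge_rf(query, hits, top_k=10)]

for text in hits:
    print(text)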
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
pandas
numpy
pymilvus
pymilvus[model]
streamlit
Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
import streamlit as st
from streamlit import cache_resource
from pymilvus.model.hybrid import BGEM3EmbeddingFunction
from pymilvus import (
    Collection,
    AnnSearchRequest,
    WeightedRanker,
    connections,
)

# Logo
st.image("./pics/Milvus_Logo_Official.png", width=200)


@cache_resource
def get_model():
    ef = BGEM3EmbeddingFunction(use_fp16=False, device="cpu")
    return ef


@cache_resource
def get_collection():
    col_name = "hybrid_demo"
    connections.connect("default", uri="milvus.db")
    col = Collection(col_name)
    return col


def search_from_source(source, query):
    # Placeholder stub, currently unused by the demo UI.
    return [f"{source} Result {i+1} for {query}" for i in range(5)]


st.title("Milvus Hybrid Search Demo")

query = st.text_input("Enter your search query:")
search_button = st.button("Search")


@cache_resource
def get_tokenizer():
    ef = get_model()
    tokenizer = ef.model.tokenizer
    return tokenizer


def doc_text_colorization(query, docs):
    # Highlight, in red, the spans of each document whose tokens also appear
    # in the query, using Streamlit's ":red[...]" markdown syntax.
    tokenizer = get_tokenizer()
    query_tokens_ids = tokenizer.encode(query, return_offsets_mapping=True)
    query_tokens = tokenizer.convert_ids_to_tokens(query_tokens_ids)
    colored_texts = []

    for doc in docs:
        ldx = 0
        landmarks = []
        encoding = tokenizer.encode_plus(doc, return_offsets_mapping=True)
        tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])[1:-1]
        offsets = encoding["offset_mapping"][1:-1]
        # Collect the character offsets of matching tokens, merging adjacent spans.
        for token, (start, end) in zip(tokens, offsets):
            if token in query_tokens:
                if len(landmarks) != 0 and start == landmarks[-1]:
                    landmarks[-1] = end
                else:
                    landmarks.append(start)
                    landmarks.append(end)
        close = False
        color_text = ""
        for i, c in enumerate(doc):
            if ldx == len(landmarks):
                pass
            elif i == landmarks[ldx]:
                if close is True:
                    color_text += "]"
                else:
                    color_text += ":red["
                close = not close
                ldx = ldx + 1
            color_text += c
        if close is True:
            color_text += "]"
        colored_texts.append(color_text)
    return colored_texts


def hybrid_search(query_embeddings, sparse_weight=1.0, dense_weight=1.0):
    # Run ANN search on both vector fields and fuse the two result lists
    # with a weighted ranker.
    col = get_collection()
    sparse_search_params = {"metric_type": "IP"}
    sparse_req = AnnSearchRequest(
        query_embeddings["sparse"], "sparse_vector", sparse_search_params, limit=10
    )
    dense_search_params = {"metric_type": "IP"}
    dense_req = AnnSearchRequest(
        query_embeddings["dense"], "dense_vector", dense_search_params, limit=10
    )
    rerank = WeightedRanker(sparse_weight, dense_weight)
    res = col.hybrid_search(
        [sparse_req, dense_req], rerank=rerank, limit=10, output_fields=["text"]
    )
    if len(res):
        return [hit.fields["text"] for hit in res[0]]
    else:
        return []


# Display search results when the button is clicked
if search_button and query:
    ef = get_model()
    query_embeddings = ef([query])
    col1, col2, col3 = st.columns(3)
    with col1:
        st.header("Dense")
        results = hybrid_search(query_embeddings, sparse_weight=0.0, dense_weight=1.0)
        for result in results:
            st.markdown(result)

    with col2:
        st.header("Sparse")
        results = hybrid_search(query_embeddings, sparse_weight=1.0, dense_weight=0.0)
        colored_results = doc_text_colorization(query, results)
        for result in colored_results:
            st.markdown(result)

    with col3:
        st.header("Hybrid")
        results = hybrid_search(query_embeddings, sparse_weight=0.7, dense_weight=1.0)
        colored_results = doc_text_colorization(query, results)
        for result in colored_results:
            st.markdown(result)
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[theme]
base = "dark"
primaryColor = "#4fc4f9"
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[theme]
base = "dark"
primaryColor = "#4fc4f9"
