Commit
Signed-off-by: wxywb <[email protected]>
Showing 9 changed files with 325 additions and 0 deletions.
3 changes: 3 additions & 0 deletions
bootcamp/tutorials/quickstart/apps/hybrid_demo_with_milvus/.streamlit/config.toml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
[theme] | ||
base = "dark" | ||
primaryColor = "#4fc4f9" |
62 changes: 62 additions & 0 deletions
bootcamp/tutorials/quickstart/apps/hybrid_demo_with_milvus/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
# Hybrid Semantic Search with Milvus | ||
|
||
<div style="text-align: center;"> | ||
<figure> | ||
<img src="./pics/demo.jpg" alt="Description of Image" width="700"/> | ||
</figure> | ||
</div> | ||
|
||
The Milvus Hybrid Search Demo uses the BGE-M3 model to return three kinds of search results. Users enter a query and receive Dense, Sparse, and Hybrid responses. Dense responses focus on semantic context, while Sparse responses emphasize keyword matching. The Hybrid approach combines both methods, returning results that capture both context and specific keywords. This demo shows how integrating multiple retrieval strategies can improve search relevance by balancing semantic and lexical similarity. | ||
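
To make the difference between the three result types concrete, here is a minimal sketch of weighted score fusion in plain Python. It is an illustration only: `weighted_fusion`, the document ids, and the scores are all hypothetical, and the sketch makes no claim about how Milvus's `WeightedRanker` normalizes or combines scores internally. The weights mirror the ones the demo's `ui.py` passes for its Dense, Sparse, and Hybrid columns.

```python
def weighted_fusion(dense_scores, sparse_scores, dense_weight=1.0, sparse_weight=1.0):
    """Rank document ids by a weighted sum of two per-document score maps.

    Hypothetical helper for illustration; a document missing from one
    retriever's results contributes a score of 0.0 for that retriever.
    """
    doc_ids = set(dense_scores) | set(sparse_scores)
    fused = {
        doc_id: dense_weight * dense_scores.get(doc_id, 0.0)
        + sparse_weight * sparse_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest fused score first; ties are broken by document id.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Made-up similarity scores from a dense and a sparse retriever.
dense = {"q1": 0.9, "q2": 0.4, "q3": 0.1}
sparse = {"q2": 0.8, "q3": 0.7}

dense_only = weighted_fusion(dense, sparse, dense_weight=1.0, sparse_weight=0.0)
sparse_only = weighted_fusion(dense, sparse, dense_weight=0.0, sparse_weight=1.0)
hybrid = weighted_fusion(dense, sparse, dense_weight=1.0, sparse_weight=0.7)

print(dense_only)   # ['q1', 'q2', 'q3']
print(sparse_only)  # ['q2', 'q3', 'q1']
print(hybrid)       # ['q2', 'q1', 'q3']
```

In this toy data, `q2` is only mid-ranked by the dense retriever but matches keywords strongly, so it rises to the top once both signals are combined.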
|
||
## Features | ||
1. Embed the text as dense and sparse vectors. | ||
2. Set up a Milvus collection to store the dense and sparse vectors. | ||
3. Insert the data into Milvus. | ||
4. Search and inspect the results. | ||
|
||
## Quick Deploy | ||
|
||
Follow these steps to quickly deploy the application locally: | ||
|
||
### Preparation | ||
|
||
> Prerequisites: Python 3.8 or higher | ||
**1. Download Code** | ||
|
||
```bash | ||
$ git clone https://github.com/milvus-io/bootcamp.git | ||
$ cd bootcamp/bootcamp/tutorials/quickstart/apps/hybrid_demo_with_milvus | ||
``` | ||
|
||
**2. Installation** | ||
|
||
Run the following commands to install the required libraries: | ||
```bash | ||
$ pip install pymilvus | ||
$ pip install "pymilvus[model]" | ||
``` | ||
|
||
And install the dependencies: | ||
```bash | ||
$ pip install -r requirements.txt | ||
``` | ||
|
||
**3. Data Download** | ||
|
||
Download the Quora Duplicate Questions dataset and place it in the same directory: | ||
|
||
```bash | ||
wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv | ||
``` | ||
|
||
Credit for the dataset: [First Quora Dataset Release: Question Pairs](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | ||
|
||
|
||
### Start Service | ||
|
||
Run the Streamlit application: | ||
|
||
```bash | ||
$ streamlit run ui.py | ||
``` |
122 changes: 122 additions & 0 deletions
bootcamp/tutorials/quickstart/apps/hybrid_demo_with_milvus/index.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
""" | ||
Hybrid Semantic Search with Milvus | ||
This demo showcases hybrid semantic search using both dense and sparse vectors with Milvus. | ||
You can optionally use the BGE-M3 model to embed text into dense and sparse vectors, or use randomly generated vectors as an example. | ||
Additionally, you can rerank the search results using the BGE CrossEncoder model. | ||
Prerequisites: | ||
- Milvus 2.4.0 or higher (sparse vector search is available only in these versions). | ||
Follow this guide to set up Milvus: https://milvus.io/docs/install_standalone-docker.md | ||
- pymilvus Python client library to connect to the Milvus server. | ||
- Optional `model` module in pymilvus for BGE-M3 model. | ||
Installation: | ||
Run the following commands to install the required libraries: | ||
pip install pymilvus | ||
pip install pymilvus[model] | ||
Steps: | ||
1. Embed the text as dense and sparse vectors. | ||
2. Set up a Milvus collection to store the dense and sparse vectors. | ||
3. Insert the data into Milvus. | ||
4. Search and inspect the results. | ||
""" | ||
|
||
use_bge_m3 = True | ||
use_reranker = True | ||
|
||
import random | ||
import numpy as np | ||
import pandas as pd | ||
|
||
from pymilvus import ( | ||
    FieldSchema, | ||
    CollectionSchema, | ||
    DataType, | ||
    Collection, | ||
    connections, | ||
) | ||
|
||
# 1. prepare a small corpus to search | ||
file_path = "quora_duplicate_questions.tsv" | ||
df = pd.read_csv(file_path, sep="\t") | ||
questions = set() | ||
for _, row in df.iterrows(): | ||
    obj = row.to_dict() | ||
    questions.add(obj["question1"][:512]) | ||
    questions.add(obj["question2"][:512]) | ||
    if len(questions) > 10000: | ||
        break | ||
|
||
docs = list(questions) | ||
|
||
# Fallback embedding function that returns randomly generated vectors | ||
|
||
|
||
def random_embedding(texts): | ||
    rng = np.random.default_rng() | ||
    return { | ||
        "dense": np.random.rand(len(texts), 768), | ||
        "sparse": [ | ||
            { | ||
                d: rng.random() | ||
                for d in random.sample(range(1000), random.randint(20, 30)) | ||
            } | ||
            for _ in texts | ||
        ], | ||
    } | ||
|
||
|
||
dense_dim = 768 | ||
ef = random_embedding | ||
|
||
# BGE-M3 model can embed texts as dense and sparse vectors. | ||
# It is included in the optional `model` module in pymilvus; to install it, | ||
# simply run "pip install pymilvus[model]". | ||
# Honor the use_bge_m3 flag so the random-vector fallback above stays usable. | ||
if use_bge_m3: | ||
    from pymilvus.model.hybrid import BGEM3EmbeddingFunction | ||
    ef = BGEM3EmbeddingFunction(use_fp16=False, device="cuda") | ||
    dense_dim = ef.dim["dense"] | ||
|
||
docs_embeddings = ef(docs) | ||
|
||
# 2. setup Milvus collection and index | ||
connections.connect("default", uri="milvus.db") | ||
|
||
# Specify the data schema for the new Collection. | ||
fields = [ | ||
    # Use auto generated id as primary key | ||
    FieldSchema( | ||
        name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100 | ||
    ), | ||
    # Store the original text so it can be retrieved based on semantic distance | ||
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=512), | ||
    # Milvus now supports both sparse and dense vectors, | ||
    # we can store each in a separate field to conduct hybrid search on both vectors | ||
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR), | ||
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=dense_dim), | ||
] | ||
schema = CollectionSchema(fields, "") | ||
col_name = "hybrid_demo" | ||
# Now we can create the new collection with above name and schema. | ||
col = Collection(col_name, schema, consistency_level="Strong") | ||
|
||
# We need to create indices for the vector fields. The indices will be loaded | ||
# into memory for efficient search. | ||
sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"} | ||
col.create_index("sparse_vector", sparse_index) | ||
dense_index = {"index_type": "FLAT", "metric_type": "IP"} | ||
col.create_index("dense_vector", dense_index) | ||
col.load() | ||
|
||
# 3. insert text and sparse/dense vector representations into the collection | ||
entities = [docs, docs_embeddings["sparse"], docs_embeddings["dense"]] | ||
for i in range(0, len(docs), 50): | ||
    batched_entities = [ | ||
        docs[i : i + 50], | ||
        docs_embeddings["sparse"][i : i + 50], | ||
        docs_embeddings["dense"][i : i + 50], | ||
    ] | ||
    col.insert(batched_entities) | ||
col.flush() |
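
The sparse field in `index.py` is indexed with the `IP` (inner product) metric, and `random_embedding` represents each sparse vector as a `{dimension: weight}` dict. As a rough illustration of what that metric computes over such dicts, the sketch below takes the inner product of two sparse vectors in plain Python. `sparse_inner_product` and the sample vectors are hypothetical, and this is not a description of how Milvus's `SPARSE_INVERTED_INDEX` is implemented.

```python
def sparse_inner_product(a, b):
    """Inner product of two sparse vectors stored as {dimension: weight} dicts.

    Only dimensions present in both vectors contribute, so the loop runs
    over the smaller dict and looks each dimension up in the larger one.
    """
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)


# Made-up sparse vectors: keys are dimension indices, values are weights.
query = {3: 0.5, 17: 2.0, 42: 1.0}
doc_a = {17: 1.5, 42: 0.5, 99: 4.0}  # shares dimensions 17 and 42 with the query
doc_b = {1: 3.0, 2: 3.0}             # shares no dimension with the query

score_a = sparse_inner_product(query, doc_a)  # 2.0 * 1.5 + 1.0 * 0.5 = 3.5
score_b = sparse_inner_product(query, doc_b)  # 0, nothing overlaps

print(score_a, score_b)
```

A document scores above zero only when it shares at least one non-zero dimension with the query, which is why sparse retrieval behaves like keyword matching.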
Binary file added
BIN
+38 KB
...tutorials/quickstart/apps/hybrid_demo_with_milvus/pics/Milvus_Logo_Official.png
Binary file added
BIN
+138 KB
bootcamp/tutorials/quickstart/apps/hybrid_demo_with_milvus/pics/demo.jpg
5 changes: 5 additions & 0 deletions
bootcamp/tutorials/quickstart/apps/hybrid_demo_with_milvus/requirements.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
pandas | ||
numpy | ||
pymilvus | ||
pymilvus[model] | ||
streamlit |
127 changes: 127 additions & 0 deletions
bootcamp/tutorials/quickstart/apps/hybrid_demo_with_milvus/ui.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,127 @@ | ||
import streamlit as st | ||
from streamlit import cache_resource | ||
from pymilvus.model.hybrid import BGEM3EmbeddingFunction | ||
from pymilvus import ( | ||
    Collection, | ||
    AnnSearchRequest, | ||
    WeightedRanker, | ||
    connections, | ||
) | ||
|
||
# Logo | ||
st.image("./pics/Milvus_Logo_Official.png", width=200) | ||
|
||
|
||
@cache_resource | ||
def get_model(): | ||
    ef = BGEM3EmbeddingFunction(use_fp16=False, device="cpu") | ||
    return ef | ||
|
||
|
||
@cache_resource | ||
def get_collection(): | ||
    col_name = "hybrid_demo" | ||
    connections.connect("default", uri="milvus.db") | ||
    col = Collection(col_name) | ||
    return col | ||
|
||
|
||
def search_from_source(source, query): | ||
    return [f"{source} Result {i+1} for {query}" for i in range(5)] | ||
|
||
|
||
st.title("Milvus Hybrid Search Demo") | ||
|
||
query = st.text_input("Enter your search query:") | ||
search_button = st.button("Search") | ||
|
||
|
||
@cache_resource | ||
def get_tokenizer(): | ||
    ef = get_model() | ||
    tokenizer = ef.model.tokenizer | ||
    return tokenizer | ||
|
||
|
||
def doc_text_colorization(query, docs): | ||
    tokenizer = get_tokenizer() | ||
    query_tokens_ids = tokenizer.encode(query, return_offsets_mapping=True) | ||
    query_tokens = tokenizer.convert_ids_to_tokens(query_tokens_ids) | ||
    colored_texts = [] | ||
|
||
    for doc in docs: | ||
        ldx = 0 | ||
        landmarks = [] | ||
        encoding = tokenizer.encode_plus(doc, return_offsets_mapping=True) | ||
        tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])[1:-1] | ||
        offsets = encoding["offset_mapping"][1:-1] | ||
        for token, (start, end) in zip(tokens, offsets): | ||
            if token in query_tokens: | ||
                if len(landmarks) != 0 and start == landmarks[-1]: | ||
                    landmarks[-1] = end | ||
                else: | ||
                    landmarks.append(start) | ||
                    landmarks.append(end) | ||
        close = False | ||
        color_text = "" | ||
        for i, c in enumerate(doc): | ||
            if ldx == len(landmarks): | ||
                pass | ||
            elif i == landmarks[ldx]: | ||
                if close is True: | ||
                    color_text += "]" | ||
                else: | ||
                    color_text += ":red[" | ||
                close = not close | ||
                ldx = ldx + 1 | ||
            color_text += c | ||
        if close is True: | ||
            color_text += "]" | ||
        colored_texts.append(color_text) | ||
    return colored_texts | ||
|
||
|
||
def hybrid_search(query_embeddings, sparse_weight=1.0, dense_weight=1.0): | ||
    col = get_collection() | ||
    sparse_search_params = {"metric_type": "IP"} | ||
    sparse_req = AnnSearchRequest( | ||
        query_embeddings["sparse"], "sparse_vector", sparse_search_params, limit=10 | ||
    ) | ||
    dense_search_params = {"metric_type": "IP"} | ||
    dense_req = AnnSearchRequest( | ||
        query_embeddings["dense"], "dense_vector", dense_search_params, limit=10 | ||
    ) | ||
    rerank = WeightedRanker(sparse_weight, dense_weight) | ||
    res = col.hybrid_search( | ||
        [sparse_req, dense_req], rerank=rerank, limit=10, output_fields=["text"] | ||
    ) | ||
    if len(res): | ||
        return [hit.fields["text"] for hit in res[0]] | ||
    else: | ||
        return [] | ||
|
||
|
||
# Display search results when the button is clicked | ||
if search_button and query: | ||
    ef = get_model() | ||
    query_embeddings = ef([query]) | ||
    col1, col2, col3 = st.columns(3) | ||
    with col1: | ||
        st.header("Dense") | ||
        results = hybrid_search(query_embeddings, sparse_weight=0.0, dense_weight=1.0) | ||
        for result in results: | ||
            st.markdown(result) | ||
|
||
    with col2: | ||
        st.header("Sparse") | ||
        results = hybrid_search(query_embeddings, sparse_weight=1.0, dense_weight=0.0) | ||
        colored_results = doc_text_colorization(query, results) | ||
        for result in colored_results: | ||
            st.markdown(result) | ||
|
||
    with col3: | ||
        st.header("Hybrid") | ||
        results = hybrid_search(query_embeddings, sparse_weight=0.7, dense_weight=1.0) | ||
        colored_results = doc_text_colorization(query, results) | ||
        for result in colored_results: | ||
            st.markdown(result) |
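
`doc_text_colorization` in `ui.py` collects character offsets of document tokens that also occur in the query, then wraps those ranges in Streamlit's `:red[...]` markdown. The sketch below isolates that wrapping step so it can be tried without a tokenizer or Streamlit. `merge_spans` and `highlight` are hypothetical helpers, the spans are supplied by hand rather than taken from tokenizer offsets, and unlike the original the sketch also merges overlapping spans, not only ones that touch. It does not escape `[` or `]` that may already appear in the text.

```python
def merge_spans(spans):
    """Merge touching or overlapping (start, end) spans; input must be sorted."""
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def highlight(text, spans, color="red"):
    """Wrap each character span of `text` in a :color[...] markdown directive."""
    pieces = []
    cursor = 0
    for start, end in merge_spans(sorted(spans)):
        pieces.append(text[cursor:start])               # unhighlighted gap
        pieces.append(f":{color}[{text[start:end]}]")   # highlighted match
        cursor = end
    pieces.append(text[cursor:])                        # trailing remainder
    return "".join(pieces)


doc = "How do I learn Python fast?"
spans = [(9, 14), (15, 21)]  # hand-picked offsets of "learn" and "Python"
highlighted = highlight(doc, spans)
print(highlighted)  # How do I :red[learn] :red[Python] fast?

# Two touching sub-word spans collapse into a single highlighted run.
print(highlight("unbelievable", [(0, 2), (2, 8)]))  # :red[unbeliev]able
```

Building the output from slices avoids the per-character loop and the open/close flag, at the cost of requiring the spans to be sorted first.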
3 changes: 3 additions & 0 deletions
bootcamp/tutorials/quickstart/apps/image_search_with_milvus/.streamlit/config.toml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
[theme] | ||
base = "dark" | ||
primaryColor = "#4fc4f9" |
3 changes: 3 additions & 0 deletions
bootcamp/tutorials/quickstart/apps/rag_search_with_milvus/.streamlit/config.toml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
[theme] | ||
base = "dark" | ||
primaryColor = "#4fc4f9" |