Update description
codingjaguar committed Feb 27, 2024
1 parent 5890720 commit 5e68c03
Showing 1 changed file with 32 additions and 17 deletions.
49 changes: 32 additions & 17 deletions bootcamp/Integration/bge_m3_embedding.ipynb
@@ -10,13 +10,16 @@
"source": [
"# Using BGE M3-Embedding Model with Milvus\n",
"\n",
"Milvus, as the world's foremost open source vector database, plays a vital role in enhancing semantic search with the use of powerful embedding models. Its scalability and advanced functionalities, such as metadata filtering, further contribute to its significance in this field.\n",
"As Deep Neural Networks continue to advance rapidly, it's increasingly common to employ them for information representation and retrieval. Referred to as embedding models, they can encode information into dense or sparse vector representations within a multi-dimensional space.\n",
"\n",
"\n",
"On January 30, 2024, a new member called BGE-M3 was released as part of the BGE model series. The M3 represents its capabilities in supporting over 100 languages, accommodating input lengths of up to 8192, and incorporating multiple functions such as dense, lexical, and multi-vec/colbert retrieval into a unified system. BGE-M3 holds the distinction of being the first embedding model to offer support for all three retrieval methods, resulting in achieving state-of-the-art performance on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmark tests.\n",
"\n",
"![](../../images/bge_m3.png)\n",
"Milvus, world's first open-source vector database, plays a vital role in semantic search with efficient storage and retrieval for vector embeddings. Its scalability and advanced functionalities, such as metadata filtering, further contribute to its significance in this field. \n",
"\n",
"This tutorial shows how to use **BGE M3 embedding model with Milvus** for semantic similarity search.\n",
"\n",
"This tutorial shows how to use BGE M3 embedding model with Milvus for semantic similarity search."
"![](../../images/bge_m3.png)\n"
]
},
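{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sketch below illustrates how BGE-M3 can produce its three types of representations (dense, lexical/sparse, and multi-vector) in one call. It is a minimal example that assumes the `BGEM3FlagModel` interface of the `FlagEmbedding` package; the sample sentences are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from FlagEmbedding import BGEM3FlagModel\n",
"\n",
"# Load BGE-M3; use_fp16=True trades a little precision for faster inference.\n",
"model = BGEM3FlagModel(\"BAAI/bge-m3\", use_fp16=True)\n",
"\n",
"sentences = [\n",
"    \"M3-Embedding supports more than 100 working languages.\",\n",
"    \"Milvus is an open-source vector database.\",\n",
"]\n",
"\n",
"# Request all three representations in one call.\n",
"output = model.encode(\n",
"    sentences, return_dense=True, return_sparse=True, return_colbert_vecs=True\n",
")\n",
"\n",
"print(output[\"dense_vecs\"].shape)       # dense: one 1024-dim vector per sentence\n",
"print(output[\"lexical_weights\"][0])     # lexical/sparse: token id -> weight mapping\n",
"print(output[\"colbert_vecs\"][0].shape)  # multi-vector: one vector per token"
]
},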
{
@@ -33,7 +36,16 @@
"\n",
"We then search a query by converting the query text into a vector embedding, and perform vector Approximate Nearest Neighbor search to find the text strings with cloest semantic.\n",
"\n",
"To run this demo, be sure you have already [started up a Milvus instance](https://milvus.io/docs/install_standalone-docker.md) and installed python client library with `pip install pymilvus FlagEmbedding`."
"To run this demo, be sure you have already [started up a Milvus instance](https://milvus.io/docs/install_standalone-docker.md) and installed python packages `pymilvus` (Milvus client library) and `FlagEmbedding` (library for BGE models)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install pymilvus FlagEmbedding"
]
},
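{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the packages installed and Milvus running, connect to the instance with the client library. The cell below is a minimal sketch that assumes a standalone deployment listening on the default local address (`localhost:19530`); adjust the host and port for your own setup."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pymilvus import connections\n",
"\n",
"# Connect to a local standalone Milvus instance (default gRPC port 19530).\n",
"connections.connect(\"default\", host=\"localhost\", port=\"19530\")"
]
},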
{
@@ -159,7 +171,7 @@
"source": [
"## Load vectors to Milvus\n",
"\n",
"We set up a collection in Milvus and build index so that we can efficiently search vectors. For more information on how to use Milvus, look [here](https://milvus.io/docs/example_code.md).\n"
"We need creat a collection in Milvus and build index so that we can efficiently search vectors. For more information on how to use Milvus, check out the [documentation](https://milvus.io/docs/example_code.md).\n"
]
},
{
@@ -192,9 +204,9 @@
"\n",
"# Set scheme with 3 fields: id (int), text (string), and embedding (float array).\n",
"fields = [\n",
" FieldSchema(name=\"pk\", dtype=DataType.INT64, is_primary=True, auto_id=False),\n",
" FieldSchema(name=\"id\", dtype=DataType.INT64, is_primary=True, auto_id=False),\n",
" FieldSchema(name=\"text\", dtype=DataType.VARCHAR, max_length=65_535),\n",
" FieldSchema(name=\"embeddings\", dtype=DataType.FLOAT_VECTOR, dim=dimension)\n",
" FieldSchema(name=\"embedding\", dtype=DataType.FLOAT_VECTOR, dim=dimension)\n",
"]\n",
"schema = CollectionSchema(fields, \"Here is description of this collection.\")\n",
"# Create a collection with above schema.\n",
@@ -206,7 +218,7 @@
" \"metric_type\": \"L2\",\n",
" \"params\": {\"nlist\": 128},\n",
"}\n",
"doc_collection.create_index(\"embeddings\", index)"
"doc_collection.create_index(\"embedding\", index)"
]
},
{
@@ -217,7 +229,7 @@
}
},
"source": [
"Here we have prepared a data source, which is crawled from the [M3 paper](https://arxiv.org/pdf/2402.03216.pdf), and its name is `m3_paper.txt`. It stores each sentence as a line, and we convert each line in the document into a vector with `BAAI/bge-m3` and then insert these embeddings into Milvus collection."
"Here we have prepared a data set of text strings from the [M3 paper](https://arxiv.org/pdf/2402.03216.pdf), named `m3_paper.txt`. It stores each sentence as a line, and we convert each line in the document into a dense vector embedding with `BAAI/bge-m3` and then insert these embeddings into Milvus collection."
]
},
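{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to produce the `lines` and `embeddings` used in the next cell is sketched below: read `m3_paper.txt` line by line and encode every non-empty line into a dense vector with `BAAI/bge-m3` via the `FlagEmbedding` package. This is an illustrative sketch rather than the only possible implementation; the dense vectors have 1024 dimensions, matching the dimension used in the collection schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from FlagEmbedding import BGEM3FlagModel\n",
"\n",
"# Read the corpus: one sentence per line, skipping empty lines.\n",
"with open(\"m3_paper.txt\", \"r\", encoding=\"utf-8\") as f:\n",
"    lines = [line.strip() for line in f if line.strip()]\n",
"\n",
"# Encode every line into a dense vector (1024 dimensions for BGE-M3).\n",
"model = BGEM3FlagModel(\"BAAI/bge-m3\", use_fp16=True)\n",
"embeddings = model.encode(lines, return_dense=True)[\"dense_vecs\"]"
]
},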
{
@@ -237,11 +249,12 @@
"entities = [\n",
" list(range(len(lines))), # field id (primary key) \n",
" lines, # field text\n",
" embeddings, #field embeddings\n",
" embeddings, #field embedding\n",
"]\n",
"insert_result = doc_collection.insert(entities)\n",
"\n",
"# After final entity is inserted, it is best to call flush to have no growing segments left in memory\n",
"# In Milvus, it's a best practice to call flush() after all vectors are inserted,\n",
"# so that a more efficient index is built for the just inserted vectors.\n",
"doc_collection.flush()"
]
},
@@ -278,7 +291,7 @@
" \"metric_type\": \"L2\",\n",
" \"params\": {\"nprobe\": 10},\n",
" }\n",
" result = doc_collection.search(vectors_to_search, \"embeddings\", search_params, limit=top_k, output_fields=[\"text\"])\n",
" result = doc_collection.search(vectors_to_search, \"embedding\", search_params, limit=top_k, output_fields=[\"text\"])\n",
" return result[0]"
]
},
@@ -334,7 +347,7 @@
}
},
"source": [
"The smaller the distance, the closer the vector is, that is, semantically more similar. We can see that the top 1 results returned can answer this question.\n",
"The smaller the distance, the closer the vector is, that is, semantically more similar. We can see that the top 1 result returned *\"M3-Embedding...more than 100 world languages...\"* can directly answer the question.\n",
"\n",
"Let's try another question."
]
@@ -380,7 +393,7 @@
}
},
"source": [
"Our semantic retrieval is able to identify the meaning of our queries and return the most semantically similar documents from Milvus collection.\n",
"In this example, the top 2 results have enough information to answer the question. By selecting the top K results, semantic search with embedding model and vector retrieval is able to identify the meaning of queries and return the most semantically similar documents. Plugging this solution with Large Language Model (a pattern referred to as Retrieval Augmented Generation), a more human-readable answer can be crafted.\n",
"\n",
"We can delete this collection to save resources."
]
@@ -403,7 +416,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is how to use BGE M3 embedding model and Milvus to perform semantic search. Milvus has also integrated with other model providers such as Cohere and HuggingFace, you can learn more at https://milvus.io/docs."
"In this notebook, we showed how to generate dense vectors with BGE M3 embedding model and use Milvus to perform semantic search. In the upcoming releases, Milvus will support hybrid search with dense and sparse vectors, which BGE M3 model can produce at the same time.\n",
"\n",
"Milvus has integrated with all major model providers, including OpenAI, HuggingFace and many more. You can learn about Milvus at https://milvus.io/docs."
]
}
],
@@ -423,9 +438,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
