diff --git a/bootcamp/tutorials/quickstart/hdbscan_clustering_with_milvus.ipynb b/bootcamp/tutorials/quickstart/hdbscan_clustering_with_milvus.ipynb new file mode 100644 index 000000000..7f4869b2e --- /dev/null +++ b/bootcamp/tutorials/quickstart/hdbscan_clustering_with_milvus.ipynb @@ -0,0 +1,340 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Open \n", + " \"GitHub" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# HDBSCAN Clustering with Milvus\n", + "Data can be transformed into embeddings using deep learning models, which capture meaningful representations of the original data. By applying an unsupervised clustering algorithm, we can group similar data points together based on their inherent patterns. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a widely used clustering algorithm that efficiently groups data points by analyzing their density and distance. It is particularly useful for discovering clusters of varying shapes and sizes. In this notebook, we will use HDBSCAN with Milvus, a high-performance vector database, to cluster data points into distinct groups based on their embeddings.\n", + "\n", + "HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that relies on calculating distances between data points in embedding space. These embeddings, created by deep learning models, represent data in a high-dimensional form. To group similar data points, HDBSCAN determines their proximity and density, but efficiently computing these distances, especially for large datasets, can be challenging.\n", + "\n", + "Milvus, a high-performance vector database, optimizes this process by storing and indexing embeddings, allowing for fast retrieval of similar vectors. When used together, HDBSCAN and Milvus enable efficient clustering of large-scale datasets in embedding space.\n", + "\n", + "In this notebook, we will use the BGE-M3 embedding model to extract embeddings from a news headline dataset, utilize Milvus to efficiently calculate distances between embeddings to aid HDBSCAN in clustering, and then visualize the results for analysis using the UMAP method. This notebook is a Milvus adapation of [Dylan Castillo's article](https://dylancastillo.co/posts/clustering-documents-with-openai-langchain-hdbscan.html)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preparation\n", + "download news dataset from https://www.kaggle.com/datasets/dylanjcastillo/news-headlines-2024/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install \"pymilvus[model]\"\n", + "!pip install hdbscan\n", + "!pip install plotly\n", + "!pip install umap-learn" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download Data\n", + "Download news dataset from https://www.kaggle.com/datasets/dylanjcastillo/news-headlines-2024/, extract `news_data_dedup.csv` and put it into current directory.\n", + "\n", + "## Extract Embeddings to Milvus\n", + "We will create a collection using Milvus, and extract dense embeddings using BGE-M3 model." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-10-16 11:03:00.418817: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. 
You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", + "2024-10-16 11:03:00.432515: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "2024-10-16 11:03:00.446648: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "2024-10-16 11:03:00.450824: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "2024-10-16 11:03:00.462393: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", + "2024-10-16 11:03:01.128046: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "7cf59571682e432aa4c9e8b1b102012b", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Fetching 30 files: 0%| | 0/30 [00:00 - If you only need a local vector database for small scale data or prototyping, setting the uri as a local file, e.g.`./milvus.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n", + "> - If you have large scale of data, say more than a million vectors, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, please use the server address and port as your uri, e.g.`http://localhost:19530`. If you enable the authentication feature on Milvus, use \":\" as the token, otherwise don't set the token.\n", + "> - If you use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#cluster-details) in Zilliz Cloud." 
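+    "\n",
+    "As a minimal sketch (placeholder values; adapt `uri`/`token` to your deployment as described above), the connection and the imports used by the schema-definition cell below might look like this with the pymilvus ORM API:\n",
+    "\n",
+    "```python\n",
+    "from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection\n",
+    "\n",
+    "# Milvus Lite: keep all data in a local file\n",
+    "connections.connect(uri=\"milvus.db\")\n",
+    "\n",
+    "# Self-hosted server or Zilliz Cloud instead (placeholder endpoint/token):\n",
+    "# connections.connect(uri=\"http://localhost:19530\", token=\"<username>:<password>\")\n",
+    "```"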
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fields = [\n",
+    "    FieldSchema(\n",
+    "        name=\"id\", dtype=DataType.INT64, is_primary=True, auto_id=True\n",
+    "    ),  # Primary ID field\n",
+    "    FieldSchema(\n",
+    "        name=\"embedding\", dtype=DataType.FLOAT_VECTOR, dim=1024\n",
+    "    ),  # Float vector field (embedding)\n",
+    "    FieldSchema(\n",
+    "        name=\"text\", dtype=DataType.VARCHAR, max_length=65535\n",
+    "    ),  # Text field\n",
+    "]\n",
+    "\n",
+    "schema = CollectionSchema(fields=fields, description=\"Embedding collection\")\n",
+    "\n",
+    "collection = Collection(name=\"news_data\", schema=schema)\n",
+    "\n",
+    "for doc, embedding in zip(docs, embeddings):\n",
+    "    collection.insert({\"text\": doc, \"embedding\": embedding})\n",
+    "    print(doc)\n",
+    "\n",
+    "index_params = {\"index_type\": \"FLAT\", \"metric_type\": \"L2\", \"params\": {}}\n",
+    "\n",
+    "collection.create_index(field_name=\"embedding\", index_params=index_params)\n",
+    "\n",
+    "collection.flush()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Construct the Distance Matrix for HDBSCAN\n",
+    "HDBSCAN requires calculating distances between points for clustering, which can be computationally intensive. Since distant points have less influence on clustering assignments, we can improve efficiency by calculating only the top-k nearest neighbors. In this example, we use the FLAT index, but for large-scale datasets, Milvus supports more advanced indexing methods to accelerate the search process.\n",
+    "First, we need to get an iterator to iterate over the Milvus collection we previously created."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import hdbscan\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import plotly.express as px\n",
+    "from umap import UMAP\n",
+    "from pymilvus import Collection\n",
+    "\n",
+    "collection = Collection(name=\"news_data\")\n",
+    "collection.load()\n",
+    "\n",
+    "iterator = collection.query_iterator(\n",
+    "    batch_size=10, expr=\"id > 0\", output_fields=[\"id\", \"embedding\"]\n",
+    ")\n",
+    "\n",
+    "search_params = {\n",
+    "    \"metric_type\": \"L2\",\n",
+    "    \"params\": {\"nprobe\": 10},\n",
+    "}  # L2 is Euclidean distance\n",
+    "\n",
+    "ids = []\n",
+    "dist = {}\n",
+    "\n",
+    "embeddings = []"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will iterate over all embeddings in the Milvus collection. For each embedding, we search for its top-k neighbors in the same collection and collect their IDs and distances. We then build a dictionary that maps each original ID to a continuous index in the distance matrix. When finished, we create a distance matrix initialized with all elements set to infinity and fill in the entries we searched; in this way, the distances between far-apart points are ignored. Finally, we use the HDBSCAN library to cluster the points using the distance matrix we created, setting the metric to 'precomputed' to indicate that the data is a distance matrix rather than the original embeddings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "while True:\n",
+    "    batch = iterator.next()\n",
+    "    if len(batch) == 0:\n",
+    "        break  # stop once the iterator is exhausted\n",
+    "\n",
+    "    batch_ids = [data[\"id\"] for data in batch]\n",
+    "    ids.extend(batch_ids)\n",
+    "\n",
+    "    query_vectors = [data[\"embedding\"] for data in batch]\n",
+    "    embeddings.extend(query_vectors)\n",
+    "\n",
+    "    results = collection.search(\n",
+    "        data=query_vectors,\n",
+    "        limit=50,\n",
+    "        anns_field=\"embedding\",\n",
+    "        param=search_params,\n",
+    "        output_fields=[\"id\"],\n",
+    "    )\n",
+    "    for i, batch_id in enumerate(batch_ids):\n",
+    "        dist[batch_id] = []\n",
+    "        for result in results[i]:\n",
+    "            dist[batch_id].append((result.id, result.distance))\n",
+    "\n",
+    "ids2index = {}\n",
+    "\n",
+    "for id in dist:\n",
+    "    ids2index[id] = len(ids2index)\n",
+    "\n",
+    "dist_metric = np.full((len(ids), len(ids)), np.inf, dtype=np.float64)\n",
+    "\n",
+    "for id in dist:\n",
+    "    for result in dist[id]:\n",
+    "        dist_metric[ids2index[id]][ids2index[result[0]]] = result[1]\n",
+    "\n",
+    "h = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=3, metric=\"precomputed\")\n",
+    "hdb = h.fit(dist_metric)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "At this point, the HDBSCAN clustering is finished. We can pick some data points and check which cluster they were assigned to. Note that some data points will not be assigned to any cluster; these are noise, because they lie in sparse regions."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cluster Visualization Using UMAP\n",
+    "We have already clustered the data with HDBSCAN and obtained a label for each data point. Using a visualization technique, we can get a picture of the clusters as a whole for an intuitive analysis. Here we use UMAP to visualize the clusters. UMAP is an efficient method for dimensionality reduction, preserving the structure of high-dimensional data while projecting it into a lower-dimensional space for visualization or further analysis. With it, we can visualize the original high-dimensional data in 2D or 3D space and see the clusters clearly.\n",
+    "Once again, we iterate over the data points and retrieve the id and text of the original data; we then use Plotly to plot the data points with this metadata in a figure, using different colors to represent the different clusters.\n",
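+    "\n",
+    "Before plotting, it is worth a quick sanity check of the raw clustering output. The short sketch below is an illustrative addition (not part of the original tutorial flow); it only assumes the `hdb` object and the `numpy` import from the previous cell. Points labeled `-1` are noise and are filtered out of the plot below:\n",
+    "\n",
+    "```python\n",
+    "labels = hdb.labels_  # one label per embedding; -1 marks noise\n",
+    "n_clusters = len(set(labels)) - (1 if -1 in labels else 0)\n",
+    "n_noise = int(np.sum(labels == -1))\n",
+    "print(f\"clusters: {n_clusters}, noise points: {n_noise}, total points: {len(labels)}\")\n",
+    "```"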
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import plotly.io as pio\n", + "\n", + "pio.renderers.default = \"notebook\"\n", + "\n", + "umap = UMAP(n_components=2, random_state=42, n_neighbors=80, min_dist=0.1)\n", + "\n", + "df_umap = (\n", + " pd.DataFrame(umap.fit_transform(np.array(embeddings)), columns=[\"x\", \"y\"])\n", + " .assign(cluster=lambda df: hdb.labels_.astype(str))\n", + " .query('cluster != \"-1\"')\n", + " .sort_values(by=\"cluster\")\n", + ")\n", + "iterator = collection.query_iterator(\n", + " batch_size=10, expr=\"id > 0\", output_fields=[\"id\", \"text\"]\n", + ")\n", + "\n", + "ids = []\n", + "texts = []\n", + "\n", + "while True:\n", + " batch = iterator.next()\n", + " if len(batch) == 0:\n", + " break\n", + " batch_ids = [data[\"id\"] for data in batch]\n", + " batch_texts = [data[\"text\"] for data in batch]\n", + " ids.extend(batch_ids)\n", + " texts.extend(batch_texts)\n", + "\n", + "show_texts = [texts[i] for i in df_umap.index]\n", + "\n", + "df_umap[\"hover_text\"] = show_texts\n", + "fig = px.scatter(\n", + " df_umap, x=\"x\", y=\"y\", color=\"cluster\", hover_data={\"hover_text\": True}\n", + ")\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![image](../../../images/hdbscan_clustering_with_milvus.png)\n", + "\n", + "Here, we demonstrate that the data is well clustered, and you can hover over the points to check the text they represent. With this notebook, we hope you learn how to use HDBSCAN to cluster embeddings with Milvus efficiently, which can also be applied to other types of data. Combined with large language models, this approach allows for deeper analysis of your data at a large scale." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/bootcamp/tutorials/quickstart/use_ColPALI_with_milvus.ipynb b/bootcamp/tutorials/quickstart/use_ColPALI_with_milvus.ipynb new file mode 100644 index 000000000..930757980 --- /dev/null +++ b/bootcamp/tutorials/quickstart/use_ColPALI_with_milvus.ipynb @@ -0,0 +1,476 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Open \n", + " \"GitHub" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Use ColPALI for Multi-Modal Retrieval with Milvus\n", + "\n", + "Modern retrieval models typically use a single embedding to represent text or images. ColBERT, however, is a neural model that utilizes a list of embeddings for each data instance and employs a \"MaxSim\" operation to calculate the similarity between two texts. Beyond textual data, figures, tables, and diagrams also contain rich information, which is often disregarded in text-based information retrieval.\n", + "\n", + "$$\n", + "S_{q,d} := \\sum_{i \\in |E_q|} \\max_{j \\in |E_d|} E_{q_i} \\cdot E_{d_j}^T\n", + "$$\n", + "MaxSim function compares a query with a document (what you're searching in) by looking at their token embeddings. 
For each word in the query, it picks the most similar word from the document (using cosine similarity or squared L2 distance) and sums these maximum similarities across all words in the query.\n",
+    "\n",
+    "ColPALI is a method that combines ColBERT's multi-vector representation with PaliGemma (a multimodal large language model) to leverage its strong understanding capabilities. This approach enables a page with both text and images to be represented using a unified multi-vector embedding. The embeddings within this multi-vector representation can capture detailed information, improving the performance of retrieval-augmented generation (RAG) for multimodal data.\n",
+    "\n",
+    "In this notebook, we refer to this kind of multi-vector representation as \"ColBERT embeddings\" for generality. However, the actual model being used is the **ColPALI model**. We will demonstrate how to use Milvus for multi-vector retrieval. Building on that, we will introduce how to use ColPALI for retrieving pages based on a given query.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Preparation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install pdf2image\n",
+    "!pip install pymilvus\n",
+    "!pip install colpali_engine\n",
+    "!pip install tqdm\n",
+    "!pip install pillow"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prepare the data\n",
+    "We will use PDF RAG as our example. You can download the [ColBERT](https://arxiv.org/pdf/2004.12832) paper and put it into `./pdfs`. ColPALI does not process text directly; instead, the entire page is rasterized into an image. The ColPALI model excels at understanding the textual information contained within these images. Therefore, we will convert each PDF page into an image for processing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pdf2image import convert_from_path\n",
+    "\n",
+    "pdf_path = \"pdfs/2004.12832v2.pdf\"\n",
+    "images = convert_from_path(pdf_path)\n",
+    "\n",
+    "for i, image in enumerate(images):\n",
+    "    image.save(f\"pages/page_{i + 1}.png\", \"PNG\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we will initialize a database using Milvus Lite. You can easily switch to a full Milvus instance by setting the uri to the appropriate address where your Milvus service is hosted."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pymilvus import MilvusClient, DataType\n",
+    "import numpy as np\n",
+    "import concurrent.futures\n",
+    "\n",
+    "client = MilvusClient(uri=\"milvus.db\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "> - If you only need a local vector database for small scale data or prototyping, setting the uri as a local file, e.g.`./milvus.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n",
+    "> - If you have large scale of data, say more than a million vectors, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, please use the server address and port as your uri, e.g.`http://localhost:19530`. 
If you enable the authentication feature on Milvus, use \":\" as the token, otherwise don't set the token.\n", + "> - If you use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#cluster-details) in Zilliz Cloud." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will define a MilvusColbertRetriever class to wrap around the Milvus client for multi-vector data retrieval. The implementation flattens ColBERT embeddings and inserts them into a collection, where each row represents an individual embedding from the ColBERT embedding list. It also records the doc_id and seq_id to trace the origin of each embedding.\n", + "\n", + "When searching with a ColBERT embedding list, multiple searches will be conducted—one for each ColBERT embedding. The retrieved doc_ids will then be deduplicated. A reranking process will be performed, where the full embeddings for each doc_id are fetched, and the MaxSim score is calculated to produce the final ranked results.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "class MilvusColbertRetriever:\n", + " def __init__(self, milvus_client, collection_name, dim=128):\n", + " # Initialize the retriever with a Milvus client, collection name, and dimensionality of the vector embeddings.\n", + " # If the collection exists, load it.\n", + " self.collection_name = collection_name\n", + " self.client = milvus_client\n", + " if self.client.has_collection(collection_name=self.collection_name):\n", + " self.client.load_collection(collection_name)\n", + " self.dim = dim\n", + "\n", + " def create_collection(self):\n", + " # Create a new collection in Milvus for storing embeddings.\n", + " # Drop the existing collection if it already exists and define the schema for the collection.\n", + " if self.client.has_collection(collection_name=self.collection_name):\n", + " self.client.drop_collection(collection_name=self.collection_name)\n", + " schema = self.client.create_schema(\n", + " auto_id=True,\n", + " enable_dynamic_fields=True,\n", + " )\n", + " schema.add_field(field_name=\"pk\", datatype=DataType.INT64, is_primary=True)\n", + " schema.add_field(\n", + " field_name=\"vector\", datatype=DataType.FLOAT_VECTOR, dim=self.dim\n", + " )\n", + " schema.add_field(field_name=\"seq_id\", datatype=DataType.INT16)\n", + " schema.add_field(field_name=\"doc_id\", datatype=DataType.INT64)\n", + " schema.add_field(field_name=\"doc\", datatype=DataType.VARCHAR, max_length=65535)\n", + "\n", + " self.client.create_collection(\n", + " collection_name=self.collection_name, schema=schema\n", + " )\n", + "\n", + " def create_index(self):\n", + " # Create an index on the vector field to enable fast similarity search.\n", + " # Releases and drops any existing index before creating a new one with specified parameters.\n", + " self.client.release_collection(collection_name=self.collection_name)\n", + " self.client.drop_index(\n", + " collection_name=self.collection_name, index_name=\"vector\"\n", + " )\n", + " index_params = self.client.prepare_index_params()\n", + " index_params.add_index(\n", + " field_name=\"vector\",\n", + " index_name=\"vector_index\",\n", + " index_type=\"HNSW\", # or any other index type you want\n", + " metric_type=\"IP\", # or the appropriate metric type\n", + " params={\n", + " \"M\": 16,\n", + " 
\"efConstruction\": 500,\n", + " }, # adjust these parameters as needed\n", + " )\n", + "\n", + " self.client.create_index(\n", + " collection_name=self.collection_name, index_params=index_params, sync=True\n", + " )\n", + "\n", + " def create_scalar_index(self):\n", + " # Create a scalar index for the \"doc_id\" field to enable fast lookups by document ID.\n", + " self.client.release_collection(collection_name=self.collection_name)\n", + "\n", + " index_params = self.client.prepare_index_params()\n", + " index_params.add_index(\n", + " field_name=\"doc_id\",\n", + " index_name=\"int32_index\",\n", + " index_type=\"INVERTED\", # or any other index type you want\n", + " )\n", + "\n", + " self.client.create_index(\n", + " collection_name=self.collection_name, index_params=index_params, sync=True\n", + " )\n", + "\n", + " def search(self, data, topk):\n", + " # Perform a vector search on the collection to find the top-k most similar documents.\n", + " search_params = {\"metric_type\": \"IP\", \"params\": {}}\n", + " results = self.client.search(\n", + " self.collection_name,\n", + " data,\n", + " limit=int(50),\n", + " output_fields=[\"vector\", \"seq_id\", \"doc_id\"],\n", + " search_params=search_params,\n", + " )\n", + " doc_ids = set()\n", + " for r_id in range(len(results)):\n", + " for r in range(len(results[r_id])):\n", + " doc_ids.add(results[r_id][r][\"entity\"][\"doc_id\"])\n", + "\n", + " scores = []\n", + "\n", + " def rerank_single_doc(doc_id, data, client, collection_name):\n", + " # Rerank a single document by retrieving its embeddings and calculating the similarity with the query.\n", + " doc_colbert_vecs = client.query(\n", + " collection_name=collection_name,\n", + " filter=f\"doc_id in [{doc_id}, {doc_id + 1}]\",\n", + " output_fields=[\"seq_id\", \"vector\", \"doc\"],\n", + " limit=1000,\n", + " )\n", + " doc_vecs = np.vstack(\n", + " [doc_colbert_vecs[i][\"vector\"] for i in range(len(doc_colbert_vecs))]\n", + " )\n", + " score = np.dot(data, doc_vecs.T).max(1).sum()\n", + " return (score, doc_id)\n", + "\n", + " with concurrent.futures.ThreadPoolExecutor(max_workers=300) as executor:\n", + " futures = {\n", + " executor.submit(\n", + " rerank_single_doc, doc_id, data, client, self.collection_name\n", + " ): doc_id\n", + " for doc_id in doc_ids\n", + " }\n", + " for future in concurrent.futures.as_completed(futures):\n", + " score, doc_id = future.result()\n", + " scores.append((score, doc_id))\n", + "\n", + " scores.sort(key=lambda x: x[0], reverse=True)\n", + " if len(scores) >= topk:\n", + " return scores[:topk]\n", + " else:\n", + " return scores\n", + "\n", + " def insert(self, data):\n", + " # Insert ColBERT embeddings and metadata for a document into the collection.\n", + " colbert_vecs = [vec for vec in data[\"colbert_vecs\"]]\n", + " seq_length = len(colbert_vecs)\n", + " doc_ids = [data[\"doc_id\"] for i in range(seq_length)]\n", + " seq_ids = list(range(seq_length))\n", + " docs = [\"\"] * seq_length\n", + " docs[0] = data[\"filepath\"]\n", + "\n", + " # Insert the data as multiple vectors (one for each sequence) along with the corresponding metadata.\n", + " self.client.insert(\n", + " self.collection_name,\n", + " [\n", + " {\n", + " \"vector\": colbert_vecs[i],\n", + " \"seq_id\": seq_ids[i],\n", + " \"doc_id\": doc_ids[i],\n", + " \"doc\": docs[i],\n", + " }\n", + " for i in range(seq_length)\n", + " ],\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will use the 
[colpali_engine](https://github.com/illuin-tech/colpali) to extract embedding lists for two queries and retrieve the relevant information from the PDF pages.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from colpali_engine.models import ColPali\n", + "from colpali_engine.models.paligemma.colpali.processing_colpali import ColPaliProcessor\n", + "from colpali_engine.utils.processing_utils import BaseVisualRetrieverProcessor\n", + "from colpali_engine.utils.torch_utils import ListDataset, get_torch_device\n", + "from torch.utils.data import DataLoader\n", + "import torch\n", + "from typing import List, cast\n", + "\n", + "device = get_torch_device(\"cpu\")\n", + "model_name = \"vidore/colpali-v1.2\"\n", + "\n", + "model = ColPali.from_pretrained(\n", + " model_name,\n", + " torch_dtype=torch.bfloat16,\n", + " device_map=device,\n", + ").eval()\n", + "\n", + "queries = [\n", + " \"How to end-to-end retrieval with ColBert?\",\n", + " \"Where is ColBERT performance table?\",\n", + "]\n", + "\n", + "processor = cast(ColPaliProcessor, ColPaliProcessor.from_pretrained(model_name))\n", + "\n", + "dataloader = DataLoader(\n", + " dataset=ListDataset[str](queries),\n", + " batch_size=1,\n", + " shuffle=False,\n", + " collate_fn=lambda x: processor.process_queries(x),\n", + ")\n", + "\n", + "qs: List[torch.Tensor] = []\n", + "for batch_query in dataloader:\n", + " with torch.no_grad():\n", + " batch_query = {k: v.to(model.device) for k, v in batch_query.items()}\n", + " embeddings_query = model(**batch_query)\n", + " qs.extend(list(torch.unbind(embeddings_query.to(\"cpu\"))))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Additionally, we will need to extract the embedding list for each page and it shows there are 1030 128-dimensional embeddings for each page." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/10 [00:00