diff --git a/notebooks/llms/langchain/readthedocs_rag_zilliz.ipynb b/notebooks/llms/langchain/readthedocs_rag_zilliz.ipynb
new file mode 100755
index 000000000..e06207565
--- /dev/null
+++ b/notebooks/llms/langchain/readthedocs_rag_zilliz.ipynb
@@ -0,0 +1,869 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "369c3444",
+ "metadata": {},
+ "source": [
+ "# ReadtheDocs Retrieval Augmented Generation (RAG) using Milvus Client"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f6ffd11a",
+ "metadata": {},
+ "source": [
+ "In this notebook, we are going to use Milvus documentation pages to create a chatbot about our product.\n",
+ "\n",
+ "A chatbot is going to follow RAG steps to retrieve chunks of data using Semantic Vector Search, then the Question + Context will be fed as a Prompt to a LLM to generate an answer.\n",
+ "\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "\n",
+ "Many RAG demos use OpenAI for the Embedding Model and ChatGPT for the Generative AI model. In this notebook, we will demo a fully open source RAG stack - open source embedding model available on HuggingFace, Milvus, and an open source LLM.\n",
+ "\n",
+ "Let's get started!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "d7570b2e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# For colab install these libraries in this order:\n",
+ "# !pip install milvus, pymilvus, langchain, torch, transformers, python-dotenv, accelerate\n",
+ "\n",
+ "# Import common libraries.\n",
+ "import time\n",
+ "import pandas as pd\n",
+ "import numpy as np"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e059b674",
+ "metadata": {},
+ "source": [
+ "## Download Milvus documentation to a local directory."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "20dcdaf7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# # Uncomment to download readthedocs page locally.\n",
+ "\n",
+ "# DOCS_PAGE=\"https://pymilvus.readthedocs.io/en/latest/\"\n",
+ "# !echo $DOCS_PAGE\n",
+ "\n",
+ "# # Specify encoding to handle non-unicode characters in documentation.\n",
+ "# !wget -r -A.html -P rtdocs --header=\"Accept-Charset: UTF-8\" $DOCS_PAGE"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8a67e382",
+ "metadata": {},
+ "source": [
+ "## Start up a local Milvus server."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fb844837",
+ "metadata": {},
+ "source": [
+ "Code in this notebook uses fully-managed Milvus on [Ziliz Cloud free trial](https://cloud.zilliz.com/login). Choose the default \"Starter\" option when you provision > Create collection > Give it a name > Create cluster and collection.\n",
+ "- pip install pymilvus\n",
+ "\n",
+ "💡 **For production purposes**, use a local Milvus docker, Milvus clusters, or fully-managed Milvus on Zilliz Cloud.\n",
+ "- [Local Milvus docker](https://milvus.io/docs/install_standalone-docker.md) requires local docker installed and running.\n",
+ "- [Milvus clusters](https://milvus.io/docs/install_cluster-milvusoperator.md) requires a K8s cluster up and running.\n",
+ "- [Milvus client](https://milvus.io/docs/using_milvusclient.md) with [Milvus lite](https://milvus.io/docs/milvus_lite.md), which runs a local server. ⛔️ Milvus lite is only meant for demos and local testing.\n",
+ "\n",
+ "💡 Note: To keep your tokens private, best practice is to use an env variable.\n",
+ "In Jupyter, need .env file (in same dir as notebooks) containing lines like this:\n",
+ "- VARIABLE_NAME=value\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "0806d2db",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Type of server: zilliz_cloud\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pymilvus import connections, utility\n",
+ "\n",
+ "import os\n",
+ "from dotenv import load_dotenv\n",
+ "load_dotenv()\n",
+ "TOKEN = os.getenv(\"ZILLIZ_API_KEY\")\n",
+ "\n",
+ "# Connect to Zilliz cloud.\n",
+ "CLUSTER_ENDPOINT=\"https://in03-e3348b7ab973336.api.gcp-us-west1.zillizcloud.com:443\"\n",
+ "connections.connect(\n",
+ " alias='default',\n",
+ " # Public endpoint obtained from Zilliz Cloud\n",
+ " uri=CLUSTER_ENDPOINT,\n",
+ " # API key or a colon-separated cluster username and password\n",
+ " token=TOKEN,\n",
+ ")\n",
+ "\n",
+ "# Check if the server is ready and get colleciton name.\n",
+ "print(f\"Type of server: {utility.get_server_version()}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b01d6622",
+ "metadata": {},
+ "source": [
+ "## Load the Embedding Model checkpoint and use it to create vector embeddings\n",
+ "**Embedding model:** We will use the open-source [sentence transformers](https://www.sbert.net/docs/pretrained_models.html) available on HuggingFace to encode the documentation text. We will download the model from HuggingFace and run it locally. We'll save the model's generated embeedings to a pandas dataframe and then into the milvus database.\n",
+ "\n",
+ "Two model parameters of note below:\n",
+ "1. EMBEDDING_LENGTH refers to the dimensionality or length of the embedding vector. In this case, the embeddings generated for EACH token in the input text will have the SAME length = 768. This size of embedding is often associated with BERT-based models, where the embeddings are used for downstream tasks such as classification, question answering, or text generation.
\n",
+ "2. MAX_SEQ_LENGTH is the maximum length the encoder model can handle for input sequences. In this case, if sequences longer than 512 tokens are given to the model, everything longer will be (silently!) chopped off. This is the reason why a chunking strategy is needed to segment input texts into chunks with lengths that will fit in the model's input."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "dd2be7fd",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "device: cpu\n",
+ "\n",
+ "SentenceTransformer(\n",
+ " (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel \n",
+ " (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\n",
+ ")\n",
+ "model_name: BAAI/bge-base-en-v1.5\n",
+ "EMBEDDING_LENGTH: 768\n",
+ "MAX_SEQ_LENGTH: 512\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Import torch.\n",
+ "import torch\n",
+ "from torch.nn import functional as F\n",
+ "from sentence_transformers import SentenceTransformer\n",
+ "\n",
+ "# Initialize torch settings\n",
+ "torch.backends.cudnn.deterministic = True\n",
+ "DEVICE = torch.device('cuda:3' if torch.cuda.is_available() else 'cpu')\n",
+ "print(f\"device: {DEVICE}\")\n",
+ "\n",
+ "# Load the model from huggingface model hub.\n",
+ "model_name = \"BAAI/bge-base-en-v1.5\"\n",
+ "encoder = SentenceTransformer(model_name, device=DEVICE)\n",
+ "print(type(encoder))\n",
+ "print(encoder)\n",
+ "\n",
+ "# Get the model parameters and save for later.\n",
+ "MAX_SEQ_LENGTH = encoder.get_max_seq_length() \n",
+ "HF_EOS_TOKEN_LENGTH = 1\n",
+ "EMBEDDING_LENGTH = encoder.get_sentence_embedding_dimension()\n",
+ "\n",
+ "# Inspect model parameters.\n",
+ "print(f\"model_name: {model_name}\")\n",
+ "print(f\"EMBEDDING_LENGTH: {EMBEDDING_LENGTH}\")\n",
+ "print(f\"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Create a Milvus collection\n",
+ "\n",
+ "You can think of a collection in Milvus like a \"table\" in SQL databases. The **collection** will contain the \n",
+ "- **Schema** (or no-schema Milvus Client). \n",
+ "💡 You'll need the vector `EMBEDDING_LENGTH` parameter from your embedding model.\n",
+ "- **Vector index** for efficient vector search\n",
+ "- **Vector distance metric** for measuring nearest neighbor vectors\n",
+ "- **Consistency level**\n",
+ "In Milvus, transactional consistency is possible; however, according to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), some latency must be sacrificed. 💡 Searching movie reviews is not mission-critical, so [`eventually`](https://milvus.io/docs/consistency.md) consistent is fine here.\n",
+ "\n",
+ "Some supported [data types](https://milvus.io/docs/schema.md) for Milvus schemas are:\n",
+ "- INT64 - primary key\n",
+ "- VARCHAR - raw texts\n",
+ "- FLOAT_VECTOR - embedings = list of `numpy.ndarray` of `numpy.float32` numbers"
+ ]
+ },
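+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As an aside, the no-schema Milvus Client mentioned above only needs a collection name and the vector dimension; all other fields are stored dynamically. A minimal sketch (not run here), assuming the same `CLUSTER_ENDPOINT`, `TOKEN`, and `EMBEDDING_LENGTH` defined earlier in this notebook:\n",
+ "\n",
+ "```python\n",
+ "from pymilvus import MilvusClient\n",
+ "\n",
+ "# Schema-less alternative: only the collection name and vector dim are required.\n",
+ "client = MilvusClient(uri=CLUSTER_ENDPOINT, token=TOKEN)\n",
+ "client.create_collection(\n",
+ " collection_name=\"MIlvusDocs\",\n",
+ " dimension=EMBEDDING_LENGTH,\n",
+ ")\n",
+ "```\n",
+ "\n",
+ "This notebook uses the ORM-style `Collection` API below so that the schema and index settings stay explicit."
+ ]
+ },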
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Embedding length: 768\n",
+ "Created collection: MIlvusDocs\n",
+ "Schema: {'auto_id': True, 'description': 'The schema for docs pages', 'fields': [{'name': 'pk', 'description': '', 'type': , 'is_primary': True, 'auto_id': True}, {'name': 'vector', 'description': '', 'type': , 'params': {'dim': 768}}], 'enable_dynamic_field': True}\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pymilvus import (\n",
+ " FieldSchema, DataType, \n",
+ " CollectionSchema, Collection)\n",
+ "\n",
+ "# 1. Name your collection.\n",
+ "COLLECTION_NAME = \"MIlvusDocs\"\n",
+ "\n",
+ "# 2. Use embedding length from the embedding model.\n",
+ "print(f\"Embedding length: {EMBEDDING_LENGTH}\")\n",
+ "\n",
+ "# 3. Define minimum required fields.\n",
+ "fields = [\n",
+ " FieldSchema(name=\"pk\", dtype=DataType.INT64, is_primary=True, auto_id=True),\n",
+ " FieldSchema(name=\"vector\", dtype=DataType.FLOAT_VECTOR, dim=EMBEDDING_LENGTH),\n",
+ "]\n",
+ "\n",
+ "# 4. Create schema with dynamic field enabled.\n",
+ "schema = CollectionSchema(\n",
+ "\t\tfields,\n",
+ "\t\tdescription=\"The schema for docs pages\",\n",
+ "\t\tenable_dynamic_field=True\n",
+ ")\n",
+ "mc = Collection(COLLECTION_NAME, schema, consistency_level=\"Eventually\")\n",
+ "\n",
+ "print(f\"Created collection: {COLLECTION_NAME}\")\n",
+ "print(f\"Schema: {mc.schema}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Add a Vector Index\n",
+ "\n",
+ "The vector index determines the vector **search algorithm** used to find the closest vectors in your data to the query a user submits. Most vector indexes use different sets of parameters depending on whether the database is:\n",
+ "- **inserting vectors** (creation mode) - vs - \n",
+ "- **searching vectors** (search mode) \n",
+ "\n",
+ "Scroll down the [docs page](https://milvus.io/docs/index.md) to see a table listing different vector indexes available on Milvus. For example:\n",
+ "- FLAT - deterministic exhaustive search\n",
+ "- IVF_FLAT or IVF_SQ8 - Hash index (stochastic approximate search)\n",
+ "- HNSW - Graph index (stochastic approximate search)\n",
+ "- AUTOINDEX - Automatically determined by Milvus based on local vs cloud, type of GPU, size of data.\n",
+ "\n",
+ "Besides a search algorithm, we also need to specify a **distance metric**, that is, a definition of what is considered \"close\" in vector space. In the cell below, the [`HNSW`](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) search index is chosen. Its possible distance metrics are one of:\n",
+ "- L2 - L2-norm\n",
+ "- IP - Dot-product\n",
+ "- COSINE - Angular distance\n",
+ "\n",
+ "💡 Most use cases work better with normalized embeddings, in which case L2 is useless (every vector has length=1) and IP and COSINE are the same. Only choose L2 if you plan to keep your embeddings unnormalized."
+ ]
+ },
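+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Since Zilliz Cloud manages the index, the next cell simply uses `AUTOINDEX`. For reference, on open-source Milvus an explicit HNSW index could look like the sketch below; the `M` and `efConstruction` values are illustrative assumptions, not tuned settings.\n",
+ "\n",
+ "```python\n",
+ "# Sketch only: an explicit HNSW index for open-source Milvus.\n",
+ "hnsw_index_params = {\n",
+ " \"index_type\": \"HNSW\",\n",
+ " \"metric_type\": \"COSINE\",\n",
+ " \"params\": {\"M\": 16, \"efConstruction\": 64},\n",
+ "}\n",
+ "# mc.create_index(field_name=\"vector\", index_params=hnsw_index_params)\n",
+ "```"
+ ]
+ },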
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Status(code=0, message=)"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Add a default search index to the collection.\n",
+ "\n",
+ "# Drop the index, in case it already exists.\n",
+ "mc.drop_index()\n",
+ "\n",
+ "index_params = {\n",
+ " \"index_type\": \"AUTOINDEX\",\n",
+ " \"metric_type\": \"COSINE\", \n",
+ " # No params for AUTOINDEX\n",
+ " # \"params\": {}\n",
+ " }\n",
+ "\n",
+ "# Specify column name which contains the vector.\n",
+ "mc.create_index(\n",
+ " field_name=\"vector\", \n",
+ " index_params=index_params)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "6861beb7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "loaded 15 documents\n"
+ ]
+ }
+ ],
+ "source": [
+ "## Read docs into LangChain\n",
+ "#!pip install langchain \n",
+ "from langchain.document_loaders import ReadTheDocsLoader\n",
+ "\n",
+ "loader = ReadTheDocsLoader(\"rtdocs/pymilvus.readthedocs.io/en/latest/\", features=\"html.parser\")\n",
+ "docs = loader.load()\n",
+ "\n",
+ "num_documents = len(docs)\n",
+ "print(f\"loaded {num_documents} documents\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c60423a5",
+ "metadata": {},
+ "source": [
+ "## Chunking\n",
+ "\n",
+ "Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap. In this demo, I will use:\n",
+ "- **Strategy** = Use markdown header hierarchies. Split markdown sections if too long.\n",
+ "- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH`\n",
+ "- **Overlap** = Rule-of-thumb 10-15%\n",
+ "- **Function** = \n",
+ " - Langchain's `HTMLHeaderTextSplitter` to split markdown sections.\n",
+ " - Langchain's `RecursiveCharacterTextSplitter` to split up long reviews recursively.\n"
+ ]
+ },
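+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the numbers concrete: with `MAX_SEQ_LENGTH = 512` and one position reserved for the end-of-sequence token, the code below uses `chunk_size = 512 - 1 = 511` and `chunk_overlap = round(511 * 0.10) = 51`, i.e. roughly 10% overlap between neighboring chunks. Note that the splitter counts characters (`length_function = len`), so 511 characters is a conservative stand-in for the 512-token limit."
+ ]
+ },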
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "chunking time: 0.01832103729248047\n",
+ "docs: 15, split into: 15\n",
+ "split into chunks: 159, type: list of \n",
+ "\n",
+ "Looking at a sample chunk...\n",
+ "{'h1': 'Installation', 'h2': 'Installing via pip', 'source': 'rtdocs/pymilvus.readthedocs.io/en/latest/install.html'}\n",
+ "demonstrate how to install and using PyMilvus in a virtual environment. See virtualenv for more info\n"
+ ]
+ }
+ ],
+ "source": [
+ "from langchain.text_splitter import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter\n",
+ "\n",
+ "# Define the headers to split on for the HTMLHeaderTextSplitter\n",
+ "headers_to_split_on = [\n",
+ " (\"h1\", \"Header 1\"),\n",
+ " (\"h2\", \"Header 2\"),\n",
+ " (\"h3\", \"Header 3\"),\n",
+ "]\n",
+ "# Create an instance of the HTMLHeaderTextSplitter\n",
+ "html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
+ "\n",
+ "# Use the embedding model parameters.\n",
+ "chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH\n",
+ "chunk_overlap = np.round(chunk_size * 0.10, 0)\n",
+ "\n",
+ "# Create an instance of the RecursiveCharacterTextSplitter\n",
+ "child_splitter = RecursiveCharacterTextSplitter(\n",
+ " chunk_size = chunk_size,\n",
+ " chunk_overlap = chunk_overlap,\n",
+ " length_function = len,\n",
+ ")\n",
+ "\n",
+ "# Split the HTML text using the HTMLHeaderTextSplitter.\n",
+ "start_time = time.time()\n",
+ "html_header_splits = []\n",
+ "for doc in docs:\n",
+ " splits = html_splitter.split_text(doc.page_content)\n",
+ " for split in splits:\n",
+ " # Add the source URL and header values to the metadata\n",
+ " metadata = {}\n",
+ " new_text = split.page_content\n",
+ " for header_name, metadata_header_name in headers_to_split_on:\n",
+ " header_value = new_text.split(\"¶ \")[0].strip()\n",
+ " metadata[header_name] = header_value\n",
+ " try:\n",
+ " new_text = new_text.split(\"¶ \")[1].strip()\n",
+ " except:\n",
+ " break\n",
+ " split.metadata = {\n",
+ " **metadata,\n",
+ " \"source\": doc.metadata[\"source\"]\n",
+ " }\n",
+ " # Add the header to the text\n",
+ " split.page_content = split.page_content\n",
+ " html_header_splits.extend(splits)\n",
+ "\n",
+ "# Split the documents further into smaller, recursive chunks.\n",
+ "chunks = child_splitter.split_documents(html_header_splits)\n",
+ "\n",
+ "end_time = time.time()\n",
+ "print(f\"chunking time: {end_time - start_time}\")\n",
+ "print(f\"docs: {len(docs)}, split into: {len(html_header_splits)}\")\n",
+ "print(f\"split into chunks: {len(chunks)}, type: list of {type(chunks[0])}\") \n",
+ "\n",
+ "# Inspect chunks.\n",
+ "print()\n",
+ "print(\"Looking at a sample chunk...\")\n",
+ "print(chunks[1].metadata)\n",
+ "print(chunks[1].page_content[:100])\n",
+ "\n",
+ "# TODO - remove this before saving in github.\n",
+ "# # Print the child splits with their associated header metadata\n",
+ "# print()\n",
+ "# for child in chunks:\n",
+ "# print(f\"Content: {child.page_content}\")\n",
+ "# print(f\"Metadata: {child.metadata}\")\n",
+ "# print()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "512130a3",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'h1': 'Installation', 'h2': 'Installing via pip', 'source': 'https://pymilvus.readthedocs.io/en/latest/install.html'}\n",
+ "Installation¶ Installing via pip¶ PyMilvus is in the Python Package Index. PyMilvus only support pyt\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Clean up the metadata urls\n",
+ "for doc in chunks:\n",
+ " new_url = doc.metadata[\"source\"]\n",
+ " new_url = new_url.replace(\"rtdocs\", \"https:/\")\n",
+ " doc.metadata.update({\"source\": new_url})\n",
+ "\n",
+ "print(chunks[0].metadata)\n",
+ "print(chunks[0].page_content[:100])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d9bd8153",
+ "metadata": {},
+ "source": [
+ "## Insert data into Milvus\n",
+ "\n",
+ "Milvus and Milvus Lite support loading pandas dataframes directly.\n",
+ "\n",
+ "Milvus Client, however, requires conerting pandas df into a list of dictionaries first.\n"
+ ]
+ },
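+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For reference, the ORM-style `Collection.insert()` used below also accepts a pandas DataFrame whose columns match the schema (plus dynamic fields). A minimal sketch (not run here), assuming the `chunk_list` built in the next cell:\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Columns: vector, chunk, source, h1, h2 (dynamic fields are enabled on the schema).\n",
+ "df = pd.DataFrame(chunk_list)\n",
+ "# mc.insert(df)\n",
+ "```\n",
+ "\n",
+ "This notebook sticks with the list-of-dictionaries format since it also works with the newer Milvus Client API."
+ ]
+ },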
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Convert chunks and embeddings to a list of dictionaries.\n",
+ "chunk_list = []\n",
+ "for chunk in chunks:\n",
+ " embeddings = torch.tensor(encoder.encode([chunk.page_content]))\n",
+ " embeddings = F.normalize(embeddings, p=2, dim=1)\n",
+ " converted_values = list(map(np.float32, embeddings))[0]\n",
+ " \n",
+ " # Only use h1, h2. Truncate the metadata in case too long.\n",
+ " try:\n",
+ " h2 = chunk.metadata['h2'][:50]\n",
+ " except:\n",
+ " h2 = \"\"\n",
+ " chunk_dict = {\n",
+ " 'vector': converted_values,\n",
+ " 'chunk': chunk.page_content,\n",
+ " 'source': chunk.metadata['source'],\n",
+ " 'h1': chunk.metadata['h1'][:50],\n",
+ " 'h2': h2,\n",
+ " }\n",
+ " chunk_list.append(chunk_dict)\n",
+ "\n",
+ "# # TODO - remove this before saving in github.\n",
+ "# for chunk in chunk_list[:1]:\n",
+ "# print(chunk)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "b51ff139",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Start inserting entities\n",
+ "Milvus insert time for 159 vectors: 0.9112908840179443 seconds\n",
+ "(insert count: 159, delete count: 0, upsert count: 0, timestamp: 445785021399957506, success count: 159, err count: 0)\n",
+ "[{\"name\":\"_default\",\"collection_name\":\"MIlvusDocs\",\"description\":\"\"}]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Insert a batch of data into the Milvus collection.\n",
+ "\n",
+ "print(\"Start inserting entities\")\n",
+ "start_time = time.time()\n",
+ "insert_result = mc.insert(chunk_list)\n",
+ "\n",
+ "end_time = time.time()\n",
+ "print(f\"Milvus insert time for {len(chunk_list)} vectors: {end_time - start_time} seconds\")\n",
+ "\n",
+ "# After final entity is inserted, call flush to stop growing segments left in memory.\n",
+ "mc.flush() \n",
+ "\n",
+ "# Inspect results.\n",
+ "print(insert_result)\n",
+ "print(mc.partitions) # list[Partition] objects\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4ebfb115",
+ "metadata": {},
+ "source": [
+ "## Run a Semantic Search\n",
+ "\n",
+ "Now we can search all the documentation embeddings to find the `TOP_K` documentation chunks with the closest embeddings to a user's query.\n",
+ "- In this example, we'll ask about AUTOINDEX.\n",
+ "\n",
+ "💡 The same model should always be used for consistency for all the embeddings."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02c589ff",
+ "metadata": {},
+ "source": [
+ "## Ask a question about your data\n",
+ "\n",
+ "So far in this demo notebook: \n",
+ "1. Your custom data has been mapped into a vector embedding space\n",
+ "2. Those vector embeddings have been saved into a vector database\n",
+ "\n",
+ "Next, you can ask a question about your custom data!\n",
+ "\n",
+ "💡 In LLM lingo:\n",
+ "> **Query** is the generic term for user questions. \n",
+ "A query is a list of multiple individual questions, up to maybe 1000 different questions!\n",
+ "\n",
+ "> **Question** usually refers to a single user question. \n",
+ "In our example below, the user question is \"What is AUTOINDEX in Milvus Client?\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "5e7f41f4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "query length: 54\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Define a sample question about your data.\n",
+ "question = \"what is the default distance metric used in AUTOINDEX?\"\n",
+ "query = [question]\n",
+ "\n",
+ "# Inspect the length of the query.\n",
+ "QUERY_LENGTH = len(query[0])\n",
+ "print(f\"query length: {QUERY_LENGTH}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9ea29411",
+ "metadata": {},
+ "source": [
+ "## Execute a vector search\n",
+ "\n",
+ "Search Milvus using [PyMilvus API](https://milvus.io/docs/search.md).\n",
+ "\n",
+ "💡 By their nature, vector searches are \"semantic\" searches. For example, if you were to search for \"leaky faucet\": \n",
+ "> **Traditional Key-word Search** - either or both words \"leaky\", \"faucet\" would have to match some text in order to return a web page or link text to the document.\n",
+ "\n",
+ "> **Semantic search** - results containing words \"drippy\" \"taps\" would be returned as well because these words mean the same thing even though they are different words,"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "89642119",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loaded milvus collection into memory.\n",
+ "Milvus search time: 0.22196269035339355 sec\n",
+ "type: , count: 5\n"
+ ]
+ }
+ ],
+ "source": [
+ "# RETRIEVAL USING MILVUS.\n",
+ "\n",
+ "# Before conducting a search based on a query, you need to load the data into memory.\n",
+ "mc.load()\n",
+ "print(\"Loaded milvus collection into memory.\")\n",
+ "\n",
+ "# Embed the question using the same embedding model.\n",
+ "embedded_question = torch.tensor(encoder.encode([question]))\n",
+ "# Normalize embeddings to unit length.\n",
+ "embedded_question = F.normalize(embedded_question, p=2, dim=1)\n",
+ "# Convert the embeddings to list of list of np.float32.\n",
+ "embedded_question = list(map(np.float32, embedded_question))\n",
+ "\n",
+ "# Return top k results with AUTOINDEX.\n",
+ "TOP_K = 5\n",
+ "\n",
+ "# Run semantic vector search using your query and the vector database.\n",
+ "start_time = time.time()\n",
+ "results = mc.search(\n",
+ " data=embedded_question, \n",
+ " anns_field=\"vector\", \n",
+ " # No params for AUTOINDEX\n",
+ " param={},\n",
+ " # Access dynamic fields in the boolean expression.\n",
+ " # expr=\"\",\n",
+ " output_fields=[\"h1\", \"h2\", \"chunk\", \"source\"], \n",
+ " limit=TOP_K,\n",
+ " consistency_level=\"Eventually\"\n",
+ " )\n",
+ "\n",
+ "elapsed_time = time.time() - start_time\n",
+ "print(f\"Milvus search time: {elapsed_time} sec\")\n",
+ "\n",
+ "# Inspect search result.\n",
+ "print(f\"type: {type(results)}, count: {len(results[0])}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Assemble and inspect the search result\n",
+ "\n",
+ "The search result is in the variable `result[0]` of type `'pymilvus.orm.search.SearchResult'`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2267\n"
+ ]
+ }
+ ],
+ "source": [
+ "# # TODO - remove this before saving in github.\n",
+ "# for n, hits in enumerate(results):\n",
+ "# print(f\"{n}th query result\")\n",
+ "# for hit in hits:\n",
+ "# print(hit)\n",
+ "\n",
+ "# Assemble the context as a stuffed string.\n",
+ "context = \"\"\n",
+ "for r in results[0]:\n",
+ " text = r.entity.chunk\n",
+ " context += f\"{text} \"\n",
+ "print(len(context))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bd6060ce",
+ "metadata": {},
+ "source": [
+ "## Use an LLM to Generate a chat response to the user's question using the Retrieved Context.\n",
+ "\n",
+ "Below, we're using an open, very tiny generative AI model, or LLM. Many demos use OpenAI as the LLM choice instead."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "3e7fa0b6",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Question: what is the default distance metric used in AUTOINDEX?\n",
+ "Answer: lazy dog\n"
+ ]
+ }
+ ],
+ "source": [
+ "# BASELINING THE LLM: ASK A QUESTION WITHOUT ANY RETRIEVED CONTEXT.\n",
+ "\n",
+ "from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline\n",
+ "\n",
+ "# Load the Hugging Face auto-regressive LLM checkpoint.\n",
+ "llm = \"deepset/tinyroberta-squad2\"\n",
+ "tokenizer = AutoTokenizer.from_pretrained(llm)\n",
+ "\n",
+ "# context cannot be empty so just put random text in it.\n",
+ "QA_input = {\n",
+ " 'question': question,\n",
+ " 'context': 'The quick brown fox jumped over the lazy dog'\n",
+ "}\n",
+ "\n",
+ "nlp = pipeline('question-answering', \n",
+ " model=llm, \n",
+ " tokenizer=tokenizer)\n",
+ "\n",
+ "result = nlp(QA_input)\n",
+ "print(f\"Question: {question}\")\n",
+ "print(f\"Answer: {result['answer']}\")\n",
+ "\n",
+ "# The baseline LLM chat is not very helpful."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "a68e87b1",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Question: what is the default distance metric used in AUTOINDEX?\n",
+ "Answer: MetricType.L2\n"
+ ]
+ }
+ ],
+ "source": [
+ "# NOW ASK THE SAME LLM THE SAME QUESTION USING THE RETRIEVED CONTEXT.\n",
+ "QA_input = {\n",
+ " 'question': question,\n",
+ " 'context': context,\n",
+ "}\n",
+ "\n",
+ "nlp = pipeline('question-answering', \n",
+ " model=llm, \n",
+ " tokenizer=tokenizer)\n",
+ "\n",
+ "result = nlp(QA_input)\n",
+ "print(f\"Question: {question}\")\n",
+ "print(f\"Answer: {result['answer']}\")\n",
+ "\n",
+ "# That answer looks a little better!"
+ ]
+ },
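+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you want a free-form generated answer rather than an extracted span, the same retrieved `context` can be stuffed into a prompt for a small open generative model. A minimal sketch (not run here); the model name and `max_new_tokens` value are illustrative assumptions, and the context may need truncating to fit a small model's window:\n",
+ "\n",
+ "```python\n",
+ "from transformers import pipeline\n",
+ "\n",
+ "# Text-generation pipeline with a small open model (illustrative choice).\n",
+ "generator = pipeline(\"text-generation\", model=\"gpt2\")\n",
+ "\n",
+ "# Stuff the retrieved context and the question into a single prompt.\n",
+ "prompt = f\"Answer the question using only the context.\\nContext: {context}\\nQuestion: {question}\\nAnswer:\"\n",
+ "# print(generator(prompt, max_new_tokens=50)[0]['generated_text'])\n",
+ "```"
+ ]
+ },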
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "d0e81e68",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 9. Drop collection\n",
+ "utility.drop_collection(COLLECTION_NAME)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "c777937e",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Author: Christy Bergman\n",
+ "\n",
+ "Python implementation: CPython\n",
+ "Python version : 3.10.12\n",
+ "IPython version : 8.15.0\n",
+ "\n",
+ "torch : 2.0.1\n",
+ "transformers: 4.34.1\n",
+ "milvus : 2.3.3\n",
+ "pymilvus : 2.3.3\n",
+ "langchain : 0.0.322\n",
+ "\n",
+ "conda environment: py310\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Props to Sebastian Raschka for this handy watermark.\n",
+ "# !pip install watermark\n",
+ "\n",
+ "%load_ext watermark\n",
+ "%watermark -a 'Christy Bergman' -v -p torch,transformers,milvus,pymilvus,langchain --conda"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}