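diff --git a/cookbook/RAG/baidu_vectordb/baidu_vectordb_rag.ipynb b/cookbook/RAG/baidu_vectordb/baidu_vectordb_rag.ipynb new file mode 100644 index 00000000..cc113f0e --- /dev/null +++ b/cookbook/RAG/baidu_vectordb/baidu_vectordb_rag.ipynb @@ -0,0 +1,826 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "3b57beb5-b99e-4255-9056-b46a8b06dcd3", + "metadata": {}, + "source": [ + "This notebook builds a simple RAG (Retrieval-Augmented Generation) example on top of Baidu VectorDB.\n", + "\n", + "## Introduction to RAG\n", + "RAG is a natural language processing technique that combines information retrieval with text generation to improve applications such as question-answering systems and chatbots. Its workflow is described below:\n", + "\n", + "### RAG Workflow\n", + "\n", + "1. **Document Loading**\n", + " - Load document data from one or more sources.\n", + " - These documents form the knowledge base used for later retrieval.\n", + "\n", + "2. **Document Splitting**\n", + " - Split the loaded documents into smaller passages (chunks).\n", + " - Smaller chunks improve retrieval accuracy and efficiency.\n", + "\n", + "3. **Embedding Generation**\n", + " - Generate an embedding vector for each document or chunk.\n", + " - The embeddings capture the semantics of the text so that similarity can be compared later.\n", + "\n", + "4. **Writing to Vector Database**\n", + " - Store the generated embeddings in a vector database.\n", + " - The database supports efficient similarity search.\n", + "\n", + "5. **Query Generation**\n", + " - The user asks a question or enters a prompt.\n", + " - The RAG model generates one or more related queries from the input.\n", + "\n", + "6. **Document Retrieval**\n", + " - Use the generated queries to retrieve relevant documents from the vector database.\n", + " - Select the documents most relevant to the query as the information source.\n", + "\n", + "7. **Context Integration**\n", + " - Merge the retrieved document content with the original question or prompt to form an extended context.\n", + "\n", + "8. 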
**Answer Generation**\n", + " - The generation model produces the final answer or text from the merged context.\n", + " - This step combines the original input with the retrieved information." + ] + }, + { + "cell_type": "markdown", + "id": "997d856d-1e57-48f6-a587-74c157d9b2ad", + "metadata": {}, + "source": [ + "## Document Loading" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "746da733-727b-438d-acfa-e6e3f0aae0af", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install langchain\n", + "!pip install pymochow\n", + "!pip install qianfan\n", + "!pip install pdfplumber" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "676226c2-b0ac-43ed-a7a6-fc1fe8b8e0db", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Document(page_content='Feature Article | 特稿\\n大模型发展趋势及国内外研究现状\\n◎ 撰文 | 熊子晗 李雨轩 陈军 陈大北\\n大模型是“大算力+强算法”结合的产物,通常是在大规模无标注数据上进行训练,学习出一种特征\\n和规则。基于大模型进行应用开发时,将大模型进行微调,如在下游特定任务上的小规模有标注数据上进\\n行二次训练,或者不进行微调,就可以完成多个应用场景的任务。\\n业、地球科学、航空航天等领域开始从偏微分方程\\n大模型发展趋势 的方法拓展到AI方法。\\n国际数据公司(IDC)认为,大模型的发展是\\n华为认为,首先,大模型的出现和繁荣既是 大势所趋。首先,未来大小模型会协同进化,推动\\n当前深度学习的顶峰,也代表着深度学习算法的瓶 端侧化发展。大模型负责向小模型输出模型能力,\\n颈。对大模型的需求本质上是对大数据的需求。当 小模型更精确地处理自己“擅长”的任务,再将应\\n前的人工智能算法尚无法高效地建模不同数据之间 用中的数据与结果反哺给大模型,让大模型持续迭\\n的关系,并以此解决模型泛化的问题,取而代之, 代更新,形成大小模型协同应用模式,达到降低能\\n通过收集并处理大量训练数据,人工智能算法能够 耗、提高整体模型精度的效果。其次,大模型通用\\n通过“死记硬背”的方式一定程度上提升泛化能 性持续增强,实现AI开发“大一统”模式。大模\\n力。从这一角度看,大模型对数据的应用依然处于 型由于其泛化性、通用性为人工智能带来了新机\\n比较初级、低效的水平。可以预见,这种方式的边 遇。目前,在通用模型的基础上,各行业正利用精\\n际效应明显,数据集越大模型越大,提升同等精度 调或提示语prompt的方式加入任务间的差异化\\n所需要的代价就越大。要想通过预训练大模型真正 内容,从而极大地提高了模型的利用率,推动AI\\n解决人工智能问题,看来不太现实。其次,除了在 开发走向“统一”。最后,大模型从科研创新走向\\n数据集构建、模型设计乃至评测标准方面持续演 产业落地,通过开放的生态持续释放红利。大模型\\n进,业界首先需要做的是抛弃预训练大模型“参数 最重要的优势是推动AI 进入大规模可复制的产业\\n量至上”的评判标准。因此,参数量并不是评判模 落地阶段,仅需零样本、小样本的学习就可以达到\\n型能力的最好标准——如何用好参数并将模型的鲁 很好的效果,以此大大降低AI开发成本。国际数\\n棒性做得更好才是大模型发展真正应该关注的。 据公司建议各行业尽早拥抱大模型。在合作方面,\\n2020年华为云预判AI发展有两大趋势: 主要关注大模型与自身业务的适配性以及与头部厂\\n①AI会从传统小模型发展到大模型,对应算力需 商联手打造行业标杆;在技术方面,建议大模型供\\n求过去10年增加了40万倍。大模型成为应对AI 应商持续探究大模型的生成可控性;在安全方面,\\n应用碎片化的一种方式,可能收编高度定制化的小 
大模型的技术安全以及伴随着大模型落地所带来的\\n模型,导致市场向大公司集中,产业规则集格局也 伦理问题是关注的重点;在商业化方面,大模型的\\n可能改变。②AI for Science(AI赋能科研), 路径仍不明确,海外市场发展较早,国内厂商可以\\nAI与科学计算交汇。包括传统的气象、海洋、农 重点借鉴。\\n6 2023.6 C-Enterprise Management 通信企业管理\\n', metadata={'source': './example_data/ai-paper.pdf', 'file_path': './example_data/ai-paper.pdf', 'page': 0, 'total_pages': 7, 'Producer': 'TTKN', 'CreationDate': \"D:20230718144804-08'00'\", 'Author': 'CNKI', 'Creator': 'ReaderEx_DIS 2.3.0 Build 4031'})" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_community.document_loaders import PDFPlumberLoader\n", + "loader = PDFPlumberLoader(\"./example_data/ai-paper.pdf\")\n", + "documents = loader.load()\n", + "documents[0]" + ] + }, + { + "cell_type": "markdown", + "id": "b61b7ca0-d083-48a2-96ed-adbe9ef76476", + "metadata": {}, + "source": [ + "## 文档分割(Document Splitting)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "c9920923-6264-4009-a05a-910cd7500046", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Document(page_content='Feature Article | 特稿\\n大模型发展趋势及国内外研究现状\\n◎ 撰文 | 熊子晗 李雨轩 陈军 陈大北\\n大模型是“大算力+强算法”结合的产物,通常是在大规模无标注数据上进行训练,学习出一种特征\\n和规则。基于大模型进行应用开发时,将大模型进行微调,如在下游特定任务上的小规模有标注数据上进\\n行二次训练,或者不进行微调,就可以完成多个应用场景的任务。\\n业、地球科学、航空航天等领域开始从偏微分方程\\n大模型发展趋势 的方法拓展到AI方法。\\n国际数据公司(IDC)认为,大模型的发展是\\n华为认为,首先,大模型的出现和繁荣既是 大势所趋。首先,未来大小模型会协同进化,推动\\n当前深度学习的顶峰,也代表着深度学习算法的瓶 端侧化发展。大模型负责向小模型输出模型能力,\\n颈。对大模型的需求本质上是对大数据的需求。当 小模型更精确地处理自己“擅长”的任务,再将应', metadata={'source': './example_data/ai-paper.pdf', 'file_path': './example_data/ai-paper.pdf', 'page': 0, 'total_pages': 7, 'Producer': 'TTKN', 'CreationDate': \"D:20230718144804-08'00'\", 'Author': 'CNKI', 'Creator': 'ReaderEx_DIS 2.3.0 Build 4031'})" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from langchain.text_splitter import 
RecursiveCharacterTextSplitter\n", + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=384, chunk_overlap=0, separators=[\"\\n\\n\", \"\\n\", \"。\", \",\", \" \", \"\"])\n", + "all_splits = text_splitter.split_documents(documents)\n", + "all_splits[0]" + ] + }, + { + "cell_type": "markdown", + "id": "fc4dd1d8-d86a-4034-b9a6-a7442cb2ad07", + "metadata": {}, + "source": [ + "## Embedding Generation\n", + "Embeddings are generated with the Baidu Qianfan platform; see the [Qianfan documentation](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/hlmokk9qn) for details." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "36045280-095c-4ab6-be65-7c4f4a98880c", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[INFO] [03-07 00:02:09] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:10] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:11] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:12] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:13] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:14] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:16] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:17] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:18] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:19] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] 
[03-07 00:02:20] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:21] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:22] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:23] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:25] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:26] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:27] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:28] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:29] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:30] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:31] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:32] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:34] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:35] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:36] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:37] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: 
/embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:38] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:39] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:40] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n", + "[INFO] [03-07 00:02:41] openapi_requestor.py:316 [t:8618713344]: requesting llm api endpoint: /embeddings/embedding-v1\n" + ] + }, + { + "data": { + "text/plain": [ + "[0.11224345862865448,\n", + " 0.054152220487594604,\n", + " -0.00048503236030228436,\n", + " -0.02061440609395504,\n", + " -0.03978653624653816,\n", + " -0.13321490585803986,\n", + " 0.019213810563087463,\n", + " -0.08260218054056168,\n", + " -0.07853612303733826,\n", + " 0.0272421445697546,\n", + " 0.012822103686630726,\n", + " 0.007456572260707617,\n", + " 0.021469425410032272,\n", + " -0.046562984585762024,\n", + " -0.16177448630332947,\n", + " -0.02001659944653511,\n", + " 0.018230443820357323,\n", + " -0.0613895058631897,\n", + " -0.09145896136760712,\n", + " 0.04688601940870285,\n", + " 0.0007361111929640174,\n", + " 0.00711675314232707,\n", + " -0.06621534377336502,\n", + " -0.05942794308066368,\n", + " -0.05141240730881691,\n", + " 0.04894018545746803,\n", + " 0.010030743665993214,\n", + " -0.03037424385547638,\n", + " 0.1294194459915161,\n", + " 0.03385620191693306,\n", + " -0.06916307657957077,\n", + " -0.10393933951854706,\n", + " 0.025252867490053177,\n", + " -0.08275408297777176,\n", + " 0.029970388859510422,\n", + " 0.030179088935256004,\n", + " 0.06368114054203033,\n", + " -0.00975083839148283,\n", + " 0.04710584133863449,\n", + " -0.031163427978754044,\n", + " -0.00028807061607949436,\n", + " -0.11335715651512146,\n", + " -0.03873395919799805,\n", + " -0.030331065878272057,\n", + " 0.08304671943187714,\n", + " -0.17482027411460876,\n", + " -0.10135161131620407,\n", + " 
0.0984276607632637,\n", + " -0.06822502613067627,\n", + " -0.028058750554919243,\n", + " -0.04743753746151924,\n", + " -0.02738681249320507,\n", + " 0.09087447077035904,\n", + " 0.009101110510528088,\n", + " 0.07940004765987396,\n", + " 0.02902618795633316,\n", + " -0.0631762221455574,\n", + " -0.016593553125858307,\n", + " 0.029825875535607338,\n", + " 0.04605714604258537,\n", + " -0.03910074010491371,\n", + " -0.10579654574394226,\n", + " -0.05382489785552025,\n", + " 0.001830346998758614,\n", + " 0.1394398808479309,\n", + " -0.0649682953953743,\n", + " 0.028537852689623833,\n", + " 0.023779558017849922,\n", + " 0.08007601648569107,\n", + " 0.024423006922006607,\n", + " -0.021172164008021355,\n", + " 0.026453617960214615,\n", + " 0.004543804097920656,\n", + " 0.055978693068027496,\n", + " 0.027443373575806618,\n", + " 0.049529388546943665,\n", + " -0.013236327096819878,\n", + " 0.006071871146559715,\n", + " 0.025335561484098434,\n", + " 0.060000184923410416,\n", + " -0.06660202145576477,\n", + " -0.0396239310503006,\n", + " 0.08042364567518234,\n", + " 0.03039279766380787,\n", + " -0.12600959837436676,\n", + " -0.0017387226689606905,\n", + " -0.0347987562417984,\n", + " -0.039931267499923706,\n", + " 0.05629347637295723,\n", + " 0.011686760932207108,\n", + " -0.05441810190677643,\n", + " -0.09911958873271942,\n", + " 0.05383096635341644,\n", + " 0.013121864758431911,\n", + " 0.02292998507618904,\n", + " 0.011427725665271282,\n", + " -0.05158079415559769,\n", + " 0.06754251569509506,\n", + " 0.08275103569030762,\n", + " -0.06495284289121628,\n", + " -0.10055118799209595,\n", + " 0.006182626821100712,\n", + " -0.0030862735584378242,\n", + " -0.02022726461291313,\n", + " -0.0017674606060609221,\n", + " -0.06863682717084885,\n", + " 0.004017315339297056,\n", + " -0.04689915478229523,\n", + " 0.0033409802708774805,\n", + " -0.1089789867401123,\n", + " 0.01768055185675621,\n", + " 0.055484022945165634,\n", + " 0.016321523115038872,\n", + " -0.022504126653075218,\n", + 
" -0.02515234611928463,\n", + " 0.008612191304564476,\n", + " -0.00819480698555708,\n", + " 0.04645538702607155,\n", + " 0.005330824758857489,\n", + " 0.00171668641269207,\n", + " 0.08213366568088531,\n", + " 0.02664787322282791,\n", + " -0.0699603259563446,\n", + " 0.11877693980932236,\n", + " -0.05024316906929016,\n", + " 0.013361001387238503,\n", + " -0.038008466362953186,\n", + " 0.004596003796905279,\n", + " 0.015852754935622215,\n", + " 0.05736970901489258,\n", + " 7.392980478471145e-05,\n", + " -0.015487557277083397,\n", + " -0.0453178733587265,\n", + " 0.007471819408237934,\n", + " 0.0092796441167593,\n", + " -0.0904003158211708,\n", + " 0.005929036997258663,\n", + " -0.06673946976661682,\n", + " 0.030203664675354958,\n", + " 0.045923277735710144,\n", + " 0.012864544987678528,\n", + " 0.18166065216064453,\n", + " 0.023307323455810547,\n", + " 0.030799556523561478,\n", + " -0.05303889140486717,\n", + " -0.01546456292271614,\n", + " 0.016258440911769867,\n", + " 0.13253480195999146,\n", + " 0.02059279941022396,\n", + " 0.023894652724266052,\n", + " 0.04016369581222534,\n", + " 0.1710713803768158,\n", + " -0.0343390516936779,\n", + " -0.003215887350961566,\n", + " -0.05439172685146332,\n", + " 0.031531039625406265,\n", + " -0.003879532217979431,\n", + " -0.027501625940203667,\n", + " -0.019186951220035553,\n", + " 0.0320759080350399,\n", + " -0.05226020887494087,\n", + " 0.08840324729681015,\n", + " -0.0683160126209259,\n", + " -0.06384953111410141,\n", + " 0.04635944962501526,\n", + " -0.05846914276480675,\n", + " -0.06673721224069595,\n", + " 0.02712375298142433,\n", + " 0.014060401357710361,\n", + " 0.006596860010176897,\n", + " -0.06854859739542007,\n", + " 0.018177030608057976,\n", + " 0.018230661749839783,\n", + " -0.0558541938662529,\n", + " -0.004618027247488499,\n", + " 0.065196692943573,\n", + " 0.037749312818050385,\n", + " 0.07401712238788605,\n", + " -0.07244758307933807,\n", + " -0.03186663240194321,\n", + " 0.018105151131749153,\n", + " 
-0.005602388177067041,\n", + " 0.05210825428366661,\n", + " -0.0639311894774437,\n", + " -0.030425596982240677,\n", + " -0.019569534808397293,\n", + " -0.1272461712360382,\n", + " 0.001648408593609929,\n", + " 0.04702368378639221,\n", + " 0.03325974568724632,\n", + " -0.01878782920539379,\n", + " 0.01634359546005726,\n", + " 0.15590398013591766,\n", + " 0.05701915919780731,\n", + " 0.027158871293067932,\n", + " 0.06302422285079956,\n", + " 0.004601453430950642,\n", + " -0.054110657423734665,\n", + " 0.04899121820926666,\n", + " 0.06389059871435165,\n", + " 0.0737939104437828,\n", + " 0.03459013253450394,\n", + " 0.02035161480307579,\n", + " 0.09582310914993286,\n", + " 0.01974046789109707,\n", + " 0.03663746267557144,\n", + " 0.046277422457933426,\n", + " -0.0022836043499410152,\n", + " 0.053253524005413055,\n", + " 0.10230868309736252,\n", + " 0.057966750115156174,\n", + " 0.020251475274562836,\n", + " -0.15618188679218292,\n", + " -8.565132884541526e-05,\n", + " -0.03259008005261421,\n", + " 0.14324982464313507,\n", + " -0.021642353385686874,\n", + " 0.013389227911829948,\n", + " 0.0461769625544548,\n", + " 0.07587601244449615,\n", + " 0.06393904238939285,\n", + " 0.061978962272405624,\n", + " 0.09427467733621597,\n", + " -0.010711690410971642,\n", + " -0.02439178340137005,\n", + " 0.07483454048633575,\n", + " 0.031289998441934586,\n", + " 0.01893763616681099,\n", + " 0.02570437453687191,\n", + " 0.017265215516090393,\n", + " -0.023270810022950172,\n", + " -0.004660835489630699,\n", + " -0.08049830049276352,\n", + " -0.020157603546977043,\n", + " -0.05831923708319664,\n", + " 0.02693042904138565,\n", + " 0.0062797944992780685,\n", + " -0.07370108366012573,\n", + " -0.01424528006464243,\n", + " 0.047536224126815796,\n", + " -0.040188636630773544,\n", + " -0.14466917514801025,\n", + " 0.11420592665672302,\n", + " -0.11133081465959549,\n", + " 0.018494227901101112,\n", + " -0.1342700570821762,\n", + " 0.0797385647892952,\n", + " 0.019303850829601288,\n", + " 
-0.0589497908949852,\n", + " -0.00040900157182477415,\n", + " -0.07192771136760712,\n", + " -0.052940160036087036,\n", + " -0.06884238123893738,\n", + " 0.018189897760748863,\n", + " -0.11377942562103271,\n", + " -0.04913673177361488,\n", + " -0.01739814504981041,\n", + " 0.012818627990782261,\n", + " 0,\n", + " 0.009366332553327084,\n", + " 0.02585402876138687,\n", + " -0.058853067457675934,\n", + " 0,\n", + " -0.009884338825941086,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0.027464674785733223,\n", + " 0,\n", + " 0.051964305341243744,\n", + " -0.018081365153193474,\n", + " 0,\n", + " -0.01711001992225647,\n", + " 0,\n", + " -0.020913679152727127,\n", + " 0,\n", + " 0,\n", + " -0.029994091019034386,\n", + " 0.049706120043992996,\n", + " 0,\n", + " 0,\n", + " -0.006257336121052504,\n", + " 0.0011860616505146027,\n", + " 0.027352364733815193,\n", + " 0,\n", + " -0.014587893150746822,\n", + " 0,\n", + " 0.031199641525745392,\n", + " 0.033114004880189896,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0.01806504651904106,\n", + " -0.03615168482065201,\n", + " 0,\n", + " 0,\n", + " -0.01640968956053257,\n", + " -0.01602284424006939,\n", + " 0,\n", + " -0.04212148115038872,\n", + " 0,\n", + " -0.04131178930401802,\n", + " 0,\n", + " 0.02799217589199543,\n", + " -0.003469156799837947,\n", + " 0.02487725019454956,\n", + " 0.04201517254114151,\n", + " 0,\n", + " 0,\n", + " 0.015195129439234734,\n", + " -0.020232301205396652,\n", + " 0.003725948743522167,\n", + " 0,\n", + " 0.025728506967425346,\n", + " 0,\n", + " -0.03505311161279678,\n", + " 0.012839777395129204,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0.03088497184216976,\n", + " -0.01844422146677971,\n", + " -0.01879795268177986,\n", + " 0,\n", + " 0,\n", + " -0.036270562559366226,\n", + " -0.04669174179434776,\n", + " 0.016902903094887733,\n", + " -0.024774691089987755,\n", + " 0,\n", + " 0,\n", + " 0.015529601834714413,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 
-0.05565505102276802,\n", + " -0.01461542584002018,\n", + " 0,\n", + " -0.02815856784582138,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0.014100479893386364,\n", + " 0.033713482320308685,\n", + " 0.033084094524383545,\n", + " 0,\n", + " 0.03735937923192978,\n", + " 0,\n", + " 0.02364158257842064,\n", + " -0.03173264488577843,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0.026736106723546982,\n", + " 0.015014618635177612,\n", + " -0.051845721900463104,\n", + " -0.0030715828761458397,\n", + " 0,\n", + " -0.0360470786690712,\n", + " 0,\n", + " -0.04701394960284233,\n", + " 0,\n", + " -0.05722227692604065,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0.024855952709913254,\n", + " -0.05081732198596001,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " -0.016477391123771667,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " -0.008843593299388885]" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import os\n", + "import qianfan\n", + "import time\n", + "\n", + "# Authenticate with an IAM access key / secret key (AK/SK) supplied through environment variables; replace your_ak with your Access Key and your_sk with your Secret Key\n", + "os.environ[\"QIANFAN_ACCESS_KEY\"] = \"your_ak\"\n", + "os.environ[\"QIANFAN_SECRET_KEY\"] = \"your_sk\"\n", + "\n", + "emb = qianfan.Embedding()\n", + "\n", + "embeddings = []\n", + "for chunk in all_splits:\n", + " resp = emb.do(texts=[ # when model is omitted, the default model Embedding-V1 is used\n", + " chunk.page_content\n", + " ])\n", + " embeddings.append(resp['data'][0]['embedding'])\n", + " time.sleep(1)\n", + "embeddings[0]" + ] + }, + { + "cell_type": "markdown", + "id": "6c852eae-1100-405c-a4d6-b2a84fe62743", + "metadata": {}, + "source": [ + "## Writing to Vector Database\n", + "\n", + "Writing the scalar and vector data produced from the source document into the vector database takes three steps:\n", + "1. **Purchase a [Baidu VectorDB instance](https://cloud.baidu.com/doc/VDB/s/hlrsoazuf)**\n", + "2. **Create the database and table**\n", + "3. 
**Write the data**" + ] + }, + { + "cell_type": "markdown", + "id": "7ce217d7-65bf-4f61-849e-0b031542116e", + "metadata": {}, + "source": [ + "### Create the database and table" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42c4f7f1-ff44-4122-9128-ea06737be774", + "metadata": {}, + "outputs": [], + "source": [ + "import pymochow\n", + "from pymochow.configuration import Configuration\n", + "from pymochow.auth.bce_credentials import BceCredentials\n", + "account = 'root'\n", + "api_key = 'your api key'\n", + "endpoint = 'your endpoint' #example:http://127.0.0.1:8511\n", + "\n", + "config = Configuration(credentials=BceCredentials(account, api_key), endpoint=endpoint)\n", + "client = pymochow.MochowClient(config)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9cad43f2-79c6-48ea-9ae3-d169ae4be152", + "metadata": {}, + "outputs": [], + "source": [ + "db = client.create_database(\"document\")\n", + "database_list = client.list_databases()\n", + "for db_item in database_list:\n", + " print(\"database: {}\".format(db_item.database_name))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac8aa1c7-e022-47b3-b874-e0d6f5bcda67", + "metadata": {}, + "outputs": [], + "source": [ + "from pymochow.model.schema import Schema, Field, VectorIndex, SecondaryIndex, HNSWParams\n", + "from pymochow.model.enum import FieldType, IndexType, MetricType\n", + "from pymochow.model.table import Partition\n", + "\n", + "fields = []\n", + "fields.append(\n", + " Field(\n", + " \"id\", \n", + " FieldType.UINT64, \n", + " primary_key=True,\n", + " partition_key=True, \n", + " auto_increment=False, \n", + " not_null=True\n", + " )\n", + ")\n", + "fields.append(Field(\"text\", FieldType.STRING))\n", + "fields.append(Field(\"metadata\", FieldType.STRING))\n", + "fields.append(Field(\"source\", FieldType.STRING))\n", + "fields.append(Field(\"author\", FieldType.STRING, not_null=True))\n", + "fields.append(\n", + " Field(\n", + " \"vector\", \n", + " FieldType.FLOAT_VECTOR,\n", + " dimension=len(embeddings[0]),\n", + " 
not_null=True\n", + " )\n", + ")\n", + "\n", + "indexes = []\n", + "indexes.append(\n", + " VectorIndex(\n", + " index_name=\"vector_idx\", field=\"vector\",\n", + " index_type=IndexType.HNSW,\n", + " metric_type=MetricType.L2,\n", + " params=HNSWParams(m=32, efconstruction=200)\n", + " )\n", + ")\n", + "indexes.append(SecondaryIndex(index_name=\"author_idx\", field=\"author\"))\n", + "\n", + "table = db.create_table(\n", + " table_name=\"chunks\",\n", + " replication=3,\n", + " partition=Partition(partition_num=1),\n", + " schema=Schema(fields=fields, indexes=indexes)\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "d63d1815-03bf-46bc-9743-a55a85e968e0", + "metadata": {}, + "source": [ + "### Write the data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d46e99c-e1f1-40e5-95b5-2060f4dab892", + "metadata": {}, + "outputs": [], + "source": [ + "from pymochow.model.table import Row\n", + "import json\n", + "rows = []\n", + "for index, chunk in enumerate(all_splits):\n", + " metadata = \"{}\"\n", + " if chunk.metadata is not None:\n", + " metadata = json.dumps(chunk.metadata)\n", + " row = Row(\n", + " id=index,\n", + " text=chunk.page_content,\n", + " metadata=metadata,\n", + " source=chunk.metadata[\"source\"],\n", + " author=chunk.metadata[\"Creator\"],\n", + " vector=embeddings[index]\n", + " )\n", + " rows.append(row)\n", + "rows[0].to_dict()\n", + "table.upsert(rows=rows)" + ] + }, + { + "cell_type": "markdown", + "id": "0c92bcc9-7afc-4683-9b31-7f39de942708", + "metadata": {}, + "source": [ + "### Build the vector index" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f804cc8-5bbb-4ce1-aa2e-354c9f9ae9df", + "metadata": {}, + "outputs": [], + "source": [ + "table.rebuild_index(\"vector_idx\")" + ] + }, + { + "cell_type": "markdown", + "id": "929cef3c-2b7f-4458-abcb-2a634fc6e700", + "metadata": {}, + "source": [ + "## RAG Question-Answering Example" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2d2ed7f-1921-4ac0-a614-58166f8cdecd", + 
"metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.vectorstores import BaiduVectorDB\n", + "from langchain_community.vectorstores.baiduvectordb import ConnectionParams, TableParams\n", + "from langchain_community.embeddings import QianfanEmbeddingsEndpoint\n", + "from langchain_community.chat_models import QianfanChatEndpoint\n", + "from langchain.chains import RetrievalQA\n", + "\n", + "# 初始化向量嵌入和连接参数\n", + "embeddings = QianfanEmbeddingsEndpoint()\n", + "conn_params = ConnectionParams(\n", + " endpoint=config.endpoint,\n", + " account=config.account,\n", + " api_key=config.api_key\n", + ")\n", + "\n", + "# 初始化百度云向量数据库\n", + "vector_db = BaiduVectorDB(\n", + " embedding=embeddings,\n", + " connection_params=conn_params,\n", + " table_params=TableParams(384),\n", + " database_name=\"document\",\n", + " table_name=\"chunks\",\n", + " drop_old=False,\n", + ")\n", + "\n", + "# 初始化检索器和对话模型\n", + "retriever = vector_db.as_retriever(search_type=\"similarity\")\n", + "qianfan_chat_model = QianfanChatEndpoint(model=\"ERNIE-Bot\", temperature=0.1)\n", + "\n", + "# 初始化问答模块\n", + "qa = RetrievalQA.from_chain_type(llm=qianfan_chat_model, chain_type=\"refine\", retriever=retriever, return_source_documents=True)\n", + "\n", + "# 接收用户输入的问题\n", + "query = input(\"\\nYour question: \")\n", + "\n", + "# 处理用户问题并获取答案和相关文档\n", + "res = qa(query)\n", + "answer, docs = res['result'], res['source_documents']\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/cookbook/RAG/baidu_vectordb/exmaple_data/ai-paper.pdf b/cookbook/RAG/baidu_vectordb/exmaple_data/ai-paper.pdf 
new file mode 100644 index 00000000..b100c27f Binary files /dev/null and b/cookbook/RAG/baidu_vectordb/exmaple_data/ai-paper.pdf differ
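
The notebook's table declares its vector index with `MetricType.L2`, so retrieval ranks chunks by Euclidean distance to the query embedding. As a rough illustration of what that ranking means, here is a minimal brute-force sketch in plain Python. It is not how Baidu VectorDB works internally (HNSW is an approximate graph index), and the names `l2_distance` and `top_k` as well as the toy two-dimensional vectors are hypothetical, introduced only for this example.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, vectors, k=2):
    """Return the indices of the k stored vectors closest to the query.

    Brute force: every stored vector is compared with the query, which is
    fine for a handful of chunks but does not scale like an HNSW index.
    """
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]


if __name__ == "__main__":
    # Toy stand-ins for chunk embeddings (assumption: real ones have 384 dimensions).
    chunk_vectors = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
    query_vector = [1.0, 0.0]
    print(top_k(query_vector, chunk_vectors, k=2))
```

In the notebook itself this step is delegated to `vector_db.as_retriever(search_type="similarity")`; the sketch only makes the ranking criterion concrete.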