Offline Tools
You can use insert.py to load data from a local project folder. The project information can be your own private repo files, or data you crawled yourself with a crawler.
Running this script generates a number of intermediate files along the way, which is helpful when importing large projects or large amounts of information.
First, a CSV file is generated that stores all the doc chunk information. The chunks in this CSV are then converted to embeddings and saved as an npy file. Finally, the npy file is imported into the vector database.
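Conceptually, the pipeline looks like the sketch below. The helper implementations here are simplified stand-ins for illustration, not the actual code in insert.py:

```python
import csv
from pathlib import Path

import numpy as np

def split_into_chunks(path, max_lines=20):
    # Naive stand-in chunker: group lines into fixed-size blocks.
    lines = Path(path).read_text().splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def embed_texts(texts, dim=768):
    # Stand-in embedder; the real script calls an embedding model here.
    rng = np.random.default_rng(0)
    return rng.random((len(texts), dim)).astype(np.float32)

def build_embeddings(doc_files, csv_path, npy_path):
    # Stage 1: chunk each doc and write the chunk metadata to a CSV.
    rows = [(str(p), chunk) for p in doc_files for chunk in split_into_chunks(p)]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "doc_chunk"])
        writer.writerows(rows)
    # Stage 2: embed every chunk and save the vectors as an npy file.
    np.save(npy_path, embed_texts([chunk for _, chunk in rows]))
    # Stage 3 (not shown): import the CSV rows and the npy vectors
    # into the vector database.
```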
For example, if you want to load your local project named langchain, you can use this script:
python insert.py --project_root_or_file my_path/langchain_doc_dir --project_name langchain --mode project
You can see the help for each argument with:
python insert.py -h
usage: insert.py [-h] --project_root_or_file PROJECT_ROOT_OR_FILE [--project_name PROJECT_NAME] --mode {project,github,stackoverflow,custom}
[--url_domain URL_DOMAIN] [--emb_batch_size EMB_BATCH_SIZE] [--load_batch_size LOAD_BATCH_SIZE] [--enable_qa ENABLE_QA]
[--qa_num_parallel QA_NUM_PARALLEL]
optional arguments:
-h, --help show this help message and exit
--project_root_or_file PROJECT_ROOT_OR_FILE
It can be a folder or file path containing your project information.
--project_name PROJECT_NAME
It is your project name. When mode is `stackoverflow`, project_name is not required.
--mode {project,github,stackoverflow,custom}
When mode == 'project', `project_root_or_file` is a repo root which contains **/*.md files. When mode == 'github', `project_root_or_file`
can be a repo folder whose name contains '|', in the form "(namespace)|(repo_name)", or a root directory containing such repo folders.
--url_domain URL_DOMAIN
When the mode is project, you can specify a URL domain; each file's URL is then the domain followed by the file's path relative to the
project root. When the mode is github, there is no need to specify a URL: the URL is that of your GitHub repo. When the mode is
stackoverflow, there is no need to specify a URL, because the URL can be obtained from the answer JSON.
--emb_batch_size EMB_BATCH_SIZE
Batch size when extracting embedding.
--load_batch_size LOAD_BATCH_SIZE
Batch size when loading to vector db.
--enable_qa ENABLE_QA
Whether to use question generation mode, which uses an LLM to generate questions related to each doc chunk and matches queries against
those questions instead of the doc chunks.
--qa_num_parallel QA_NUM_PARALLEL
The number of concurrent requests when generating questions. If your OpenAI account does not support high request rates, I suggest you
set this value very small, such as 1; otherwise you can use a higher number, such as 8 or 16.
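To make the --url_domain behavior in project mode concrete: the URL for each file is, roughly, the domain with the file's path relative to the project root appended. A minimal sketch of that mapping (illustrative, not the exact logic in insert.py):

```python
from pathlib import PurePosixPath

def build_url(url_domain: str, project_root: str, file_path: str) -> str:
    # Append the file's path, relative to the project root, to the domain.
    rel = PurePosixPath(file_path).relative_to(project_root)
    return url_domain.rstrip("/") + "/" + str(rel)

# build_url("https://docs.example.com",
#           "/my_workspace/chatbot/data/langchain.fake.docs",
#           "/my_workspace/chatbot/data/langchain.fake.docs/v1/glossary/content.md")
# -> "https://docs.example.com/v1/glossary/content.md"
```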
Here is a practical usage example.
We crawled or downloaded a small set of langchain documents into the directory /my_workspace/chatbot/data/langchain.fake.docs.
First, let's look at the directory structure.
cd /my_workspace/chatbot/data/langchain.fake.docs
tree
.
├── installation
│   └── content.md
└── v1
    ├── agents
    │   └── content.md
    └── glossary
        └── content.md
cd /my_workspace/akcio/src/offline_tools
python insert.py --project_root_or_file /my_workspace/chatbot/data/langchain.fake.docs --project_name langchain --mode project --enable_qa 1 --qa_num_parallel 1
len of pattern_files = 3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 24385.49it/s]
Total number of lines in pattern_files: 166
finished_files num = 0
2023-06-02 15:56:10
Start for /my_workspace/chatbot/data/langchain.fake.docs/v1/glossary/content.md...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:05<00:00, 16.28s/it]
Writing questions to csv ...
Done for /my_workspace/chatbot/data/langchain.fake.docs/v1/glossary/content.md
Start for /my_workspace/chatbot/data/langchain.fake.docs/v1/agents/content.md...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:47<00:00, 15.86s/it]
Writing questions to csv ...
Done for /my_workspace/chatbot/data/langchain.fake.docs/v1/agents/content.md
Start for /my_workspace/chatbot/data/langchain.fake.docs/installation/content.md...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:19<00:00, 9.88s/it]
Writing questions to csv ...
Done for /my_workspace/chatbot/data/langchain.fake.docs/installation/content.md
Finish one try, len of finished_files = 3
2023-06-02 15:58:22
Finish one try, len of finished_files = 3
df.columns = Index(['file', 'question', 'doc_chunk'], dtype='object')
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:00<00:00, 3808.52it/s]
final len of df = 44
len of df = 44
original_col = ['file', 'question', 'doc_chunk', 'url']
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:04<00:00, 9.27it/s]
time = 4.747671604156494
combined_array.shape = (44, 5)
finish embed_questions, output_npy =
/my_workspace/chatbot/data/langchain.fake.docs_embedding.npy
finish load_to_vector_db
total time = 140.70229268074036 (s) = 0.03908397018909454 (h).
If we do not use QA generation, we can set enable_qa to 0:
python insert.py --project_root_or_file /my_workspace/chatbot/data/langchain.fake.docs --project_name langchain --mode project --enable_qa 0
len of pattern_files = 3
100%|██████████████████████████████████████████| 3/3 [00:00<00:00, 26159.90it/s]
Total number of lines in pattern_files: 166
df.columns = Index(['file', 'doc_chunk'], dtype='object')
100%|███████████████████████████████████████████| 9/9 [00:00<00:00, 5193.83it/s]
final len of df = 9
len of df = 9
original_col = ['file', 'doc_chunk', 'url']
100%|█████████████████████████████████████████████| 9/9 [00:03<00:00, 2.40it/s]
time = 3.7467992305755615
combined_array.shape = (9, 4)
finish embed_questions, output_npy =
/my_workspace/chatbot/data/langchain.fake.docs_embedding.npy
finish load_to_vector_db
total time = 6.9640936851501465 (s) = 0.001934470468097263 (h).
Using question generation can produce better retrieval results, but it also increases overhead. For more information, refer to Question Generator.
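The idea behind enable_qa is to ask an LLM for questions that each doc chunk answers, embed those questions, and match user queries against the question embeddings instead of the raw chunks. Below is a rough sketch of the generation step, assuming the official openai Python client; the prompt and model are illustrative, not necessarily what insert.py uses:

```python
from openai import OpenAI  # assumes the official openai package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_questions(doc_chunk: str, n: int = 3) -> list[str]:
    # Ask the model for n questions this chunk can answer, one per line.
    prompt = (
        f"Write {n} short questions that the following text answers, "
        f"one per line:\n\n{doc_chunk}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Each generated question is stored in the CSV alongside its source chunk, which is why the QA-enabled run above shows a 'question' column that the non-QA run lacks.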
In the above example, the mode is project.
When mode == 'project', project_root_or_file is a repo root which contains **/*.md files.
When mode == 'github', project_root_or_file can be a repo folder whose name contains '|', in the form "(namespace)|(repo_name)", or a root directory containing such repo folders.
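A sketch of how such an input could be resolved, following the rule above (hypothetical helper, for illustration):

```python
from pathlib import Path

def resolve_github_repos(project_root_or_file: str) -> dict[str, str]:
    # Map each "namespace|repo_name" folder to its GitHub URL.
    root = Path(project_root_or_file)
    repo_dirs = [root] if "|" in root.name else [
        d for d in root.iterdir() if d.is_dir() and "|" in d.name
    ]
    return {
        str(d): "https://github.com/{}/{}".format(*d.name.split("|", 1))
        for d in repo_dirs
    }
```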
Next, I will demonstrate how to use the github mode.
Suppose we have a directory crawled from GitHub named 'hwchase17|langchain', structured like this:
cd hwchase17\|langchain/
tree
.
├── discussions
│   └── 632_to_3500.json
├── issues
│   └── 26_to_3523.json
├── LICENSE
├── README.md
└── summary.json
In fact, only the README files are ingested, because the other files cannot guarantee a consistent format and content.
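A sketch of that selection (illustrative; the exact filtering in insert.py may differ):

```python
from pathlib import Path

def find_readmes(repo_dir: str) -> list[Path]:
    # Keep only README files; crawled issues/discussions JSON dumps are skipped.
    return [p for p in Path(repo_dir).rglob("*")
            if p.is_file() and p.name.lower().startswith("readme")]
```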
python insert.py --project_root_or_file "/my_workspace/chatbot/data/hwchase17|langchain" --project_name github_langchain --mode github --enable_qa 0
len of pattern_files = 1
100%|█████████████████████████████████████████████| 1/1 [00:02<00:00, 2.23s/it]
Total number of lines in pattern_files: 81
df.columns = Index(['repo', 'doc_chunk'], dtype='object')
100%|████████████████████████████████████████████| 8/8 [00:00<00:00, 788.31it/s]
final len of df = 8
len of df = 8
original_col = ['repo', 'doc_chunk', 'url']
100%|█████████████████████████████████████████████| 8/8 [00:03<00:00, 2.11it/s]
time = 3.8008766174316406
combined_array.shape = (8, 4)
finish embed_questions, output_npy =
/my_workspace/chatbot/data/hwchase17-langchain_embedding.npy
finish load_to_vector_db
total time = 73.88799715042114 (s) = 0.020524443652894762 (h).
If there are multiple repo directories, project_root_or_file can be set to their root directory.
For example, the following root structure:
.
├── hwchase17|langchain
│   ├── discussions
│   │   └── 632_to_3500.json
│   ├── issues
│   │   └── 26_to_3523.json
│   ├── LICENSE
│   ├── README.md
│   └── summary.json
└── my_namespace|my_project
    └── README.md
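With a layout like this, you can point the script at the root directory and ingest both repos in one run (the root path and project name below are placeholders):

python insert.py --project_root_or_file /my_workspace/chatbot/data/github_repos --project_name github_projects --mode github --enable_qa 0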