Offline Tools
You can use insert.py to load data from a local project folder. The project information can be your own private repo files, or data you crawled yourself with a crawler.
Running this script generates a number of intermediate files along the way, which is helpful when importing large projects or large amounts of information.
First, a CSV file is generated that stores all the doc chunk information. The chunks in this CSV are then converted to embeddings and saved as an npy file. Finally, the npy file is imported into the vector database.
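Conceptually, the pipeline looks like the sketch below. The helper implementations here are simplified stand-ins for illustration, not the actual code in insert.py:

```python
import csv
from pathlib import Path

import numpy as np

def split_into_chunks(path, max_lines=20):
    # Naive stand-in chunker: group lines into fixed-size blocks.
    lines = Path(path).read_text().splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def embed_texts(texts, dim=768):
    # Stand-in embedder; the real script calls an embedding model here.
    rng = np.random.default_rng(0)
    return rng.random((len(texts), dim)).astype(np.float32)

def build_embeddings(doc_files, csv_path, npy_path):
    # Stage 1: chunk each doc and write the chunk metadata to a CSV.
    rows = [(str(p), chunk) for p in doc_files for chunk in split_into_chunks(p)]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "doc_chunk"])
        writer.writerows(rows)
    # Stage 2: embed every chunk and save the vectors as an npy file.
    np.save(npy_path, embed_texts([chunk for _, chunk in rows]))
    # Stage 3 (not shown): import the CSV rows and the npy vectors
    # into the vector database.
```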
For example, if you want to load your local project named langchain, you can use this script:
python insert.py --project_root_or_file my_path/langchain_doc_dir --project_name langchain --mode project
You can see the help for each argument with:
python insert.py -h
usage: insert.py [-h] --project_root_or_file PROJECT_ROOT_OR_FILE [--project_name PROJECT_NAME] --mode {project,github,stackoverflow,custom}
[--url_domain URL_DOMAIN] [--emb_batch_size EMB_BATCH_SIZE] [--load_batch_size LOAD_BATCH_SIZE] [--enable_qa ENABLE_QA]
[--qa_num_parallel QA_NUM_PARALLEL]
optional arguments:
-h, --help show this help message and exit
--project_root_or_file PROJECT_ROOT_OR_FILE
It can be a folder or file path containing your project information.
--project_name PROJECT_NAME
It is your project name. When mode is `stackoverflow`, project_name is not required.
--mode {project,github,stackoverflow,custom}
When mode == 'project', `project_root_or_file` is a repo root which contains **/*.md files. When mode == 'github', `project_root_or_file`
can be a repo folder whose name contains '|', in the form "(namespace)|(repo_name)", or a root directory containing such repo folders.
--url_domain URL_DOMAIN
When the mode is project, you can specify a URL domain; each file's URL is then the domain followed by the file's path relative to the
project root. When the mode is github, there is no need to specify a URL: the URL is that of your GitHub repo. When the mode is
stackoverflow, there is no need to specify a URL, because the URL can be obtained from the answer JSON.
--emb_batch_size EMB_BATCH_SIZE
Batch size when extracting embedding.
--load_batch_size LOAD_BATCH_SIZE
Batch size when loading to vector db.
--enable_qa ENABLE_QA
Whether to use question generation mode, which uses an LLM to generate questions related to each doc chunk and matches queries against
those questions instead of the doc chunks.
--qa_num_parallel QA_NUM_PARALLEL
The number of concurrent requests when generating questions. If your OpenAI account does not support high request rates, I suggest you
set this value very small, such as 1; otherwise you can use a higher number, such as 8 or 16.
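To make the --url_domain behavior in project mode concrete: the URL for each file is, roughly, the domain with the file's path relative to the project root appended. A minimal sketch of that mapping (illustrative, not the exact logic in insert.py):

```python
from pathlib import PurePosixPath

def build_url(url_domain: str, project_root: str, file_path: str) -> str:
    # Append the file's path, relative to the project root, to the domain.
    rel = PurePosixPath(file_path).relative_to(project_root)
    return url_domain.rstrip("/") + "/" + str(rel)

# build_url("https://docs.example.com",
#           "/my_workspace/chatbot/data/langchain.fake.docs",
#           "/my_workspace/chatbot/data/langchain.fake.docs/v1/glossary/content.md")
# -> "https://docs.example.com/v1/glossary/content.md"
```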
Here is a practical usage example.
We crawled or downloaded a small set of langchain documents into the directory /my_workspace/chatbot/data/langchain.fake.docs.
First, let's look at the directory structure.
cd /my_workspace/chatbot/data/langchain.fake.docs
tree
.
├── installation
│   └── content.md
└── v1
    ├── agents
    │   └── content.md
    └── glossary
        └── content.md
cd /my_workspace/akcio/src/offline_tools
python insert.py --project_root_or_file /my_workspace/chatbot/data/langchain.fake.docs --project_name langchain --mode project --enable_qa 1 --qa_num_parallel 1
len of pattern_files = 3
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 24385.49it/s]
Total number of lines in pattern_files: 166
finished_files num = 0
2023-06-02 15:56:10
Start for /my_workspace/chatbot/data/langchain.fake.docs/v1/glossary/content.md...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:05<00:00, 16.28s/it]
Writing questions to csv ...
Done for /my_workspace/chatbot/data/langchain.fake.docs/v1/glossary/content.md
Start for /my_workspace/chatbot/data/langchain.fake.docs/v1/agents/content.md...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:47<00:00, 15.86s/it]
Writing questions to csv ...
Done for /my_workspace/chatbot/data/langchain.fake.docs/v1/agents/content.md
Start for /my_workspace/chatbot/data/langchain.fake.docs/installation/content.md...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:19<00:00, 9.88s/it]
Writing questions to csv ...
Done for /my_workspace/chatbot/data/langchain.fake.docs/installation/content.md
Finish one try, len of finished_files = 3
2023-06-02 15:58:22
Finish one try, len of finished_files = 3
df.columns = Index(['file', 'question', 'doc_chunk'], dtype='object')
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:00<00:00, 3808.52it/s]
final len of df = 44
len of df = 44
original_col = ['file', 'question', 'doc_chunk', 'url']
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:04<00:00, 9.27it/s]
time = 4.747671604156494
combined_array.shape = (44, 5)
finish embed_questions, output_npy =
/my_workspace/chatbot/data/langchain.fake.docs_embedding.npy
finish load_to_vector_db
total time = 140.70229268074036 (s) = 0.03908397018909454 (h).
If we do not use QA generation, we can set enable_qa to 0:
python insert.py --project_root_or_file /my_workspace/chatbot/data/langchain.fake.docs --project_name langchain --mode project --enable_qa 0
len of pattern_files = 3
100%|██████████████████████████████████████████| 3/3 [00:00<00:00, 26159.90it/s]
Total number of lines in pattern_files: 166
df.columns = Index(['file', 'doc_chunk'], dtype='object')
100%|███████████████████████████████████████████| 9/9 [00:00<00:00, 5193.83it/s]
final len of df = 9
len of df = 9
original_col = ['file', 'doc_chunk', 'url']
100%|█████████████████████████████████████████████| 9/9 [00:03<00:00, 2.40it/s]
time = 3.7467992305755615
combined_array.shape = (9, 4)
finish embed_questions, output_npy =
/my_workspace/chatbot/data/langchain.fake.docs_embedding.npy
finish load_to_vector_db
total time = 6.9640936851501465 (s) = 0.001934470468097263 (h).
Using question generation can produce better retrieval results, but it also increases overhead. For more information, refer to Question Generator.
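The idea behind enable_qa is to ask an LLM for questions that each doc chunk answers, embed those questions, and match user queries against the question embeddings instead of the raw chunks. Below is a rough sketch of the generation step, assuming the official openai Python client; the prompt and model are illustrative, not necessarily what insert.py uses:

```python
from openai import OpenAI  # assumes the official openai package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_questions(doc_chunk: str, n: int = 3) -> list[str]:
    # Ask the model for n questions this chunk can answer, one per line.
    prompt = (
        f"Write {n} short questions that the following text answers, "
        f"one per line:\n\n{doc_chunk}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Each generated question is stored in the CSV alongside its source chunk, which is why the QA-enabled run above shows a 'question' column that the non-QA run lacks.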
In the above example, the mode is project.
When mode == 'project', project_root_or_file is a repo root which contains **/*.md files.
When mode == 'github', project_root_or_file can be a repo folder whose name contains '|', in the form "(namespace)|(repo_name)", or a root directory containing such repo folders.
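A sketch of how such an input could be resolved, following the rule above (hypothetical helper, for illustration):

```python
from pathlib import Path

def resolve_github_repos(project_root_or_file: str) -> dict[str, str]:
    # Map each "namespace|repo_name" folder to its GitHub URL.
    root = Path(project_root_or_file)
    repo_dirs = [root] if "|" in root.name else [
        d for d in root.iterdir() if d.is_dir() and "|" in d.name
    ]
    return {
        str(d): "https://github.com/{}/{}".format(*d.name.split("|", 1))
        for d in repo_dirs
    }
```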
Next, I will demonstrate how to use the github mode.
Suppose we have a directory crawled from GitHub named 'hwchase17|langchain', structured like this:
cd hwchase17\|langchain/
tree
.
├── discussions
│   └── 632_to_3500.json
├── issues
│   └── 26_to_3523.json
├── LICENSE
├── README.md
└── summary.json
In fact, only the README files are ingested, because the other files cannot guarantee a consistent format and content.
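A sketch of that selection (illustrative; the exact filtering in insert.py may differ):

```python
from pathlib import Path

def find_readmes(repo_dir: str) -> list[Path]:
    # Keep only README files; crawled issues/discussions JSON dumps are skipped.
    return [p for p in Path(repo_dir).rglob("*")
            if p.is_file() and p.name.lower().startswith("readme")]
```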
python insert.py --project_root_or_file "/my_workspace/chatbot/data/hwchase17|langchain" --project_name github_langchain --mode github --enable_qa 0
len of pattern_files = 1
100%|█████████████████████████████████████████████| 1/1 [00:02<00:00, 2.23s/it]
Total number of lines in pattern_files: 81
df.columns = Index(['repo', 'doc_chunk'], dtype='object')
100%|████████████████████████████████████████████| 8/8 [00:00<00:00, 788.31it/s]
final len of df = 8
len of df = 8
original_col = ['repo', 'doc_chunk', 'url']
100%|█████████████████████████████████████████████| 8/8 [00:03<00:00, 2.11it/s]
time = 3.8008766174316406
combined_array.shape = (8, 4)
finish embed_questions, output_npy =
/my_workspace/chatbot/data/hwchase17-langchain_embedding.npy
finish load_to_vector_db
total time = 73.88799715042114 (s) = 0.020524443652894762 (h).
If there are multiple repo directories, project_root_or_file can be set to their root directory.
For example, the following root structure:
.
├── hwchase17|langchain
│   ├── discussions
│   │   └── 632_to_3500.json
│   ├── issues
│   │   └── 26_to_3523.json
│   ├── LICENSE
│   ├── README.md
│   └── summary.json
└── my_namespace|my_project
    └── README.md
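With a layout like this, you can point the script at the root directory and ingest both repos in one run (the root path and project name below are placeholders):

python insert.py --project_root_or_file /my_workspace/chatbot/data/github_repos --project_name github_projects --mode github --enable_qa 0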