
gpt-doc2data

[ English | 中文 ]

You are welcome to join our WeChat group chat!

When collecting data for LLM fine-tuning, obtaining formatted data from existing documents can be costly. Fortunately, gpt-doc2data comes to the rescue! It is a GPT-based data generator: feed it your private documents, and large language models will create high-quality question and answer pairs, which can then be used for fine-tuning or prompt-tuning your own model.
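
For intuition, here is a minimal Python sketch of the core loop. This is not the repository's actual implementation; it assumes the openai package (v1+) and an OPENAI_API_KEY environment variable, and the prompt wording and model name are illustrative only.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Generate question and answer pairs from the following text. "
    'Reply with a JSON list of {"question": ..., "answer": ...} objects.\n\n'
)

def chunk_to_qa_pairs(chunk, model="gpt-3.5-turbo"):
    # One API call per document chunk; prompt and model name are illustrative.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT + chunk}],
    )
    return json.loads(response.choices[0].message.content)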

Example

Below are 10 QA pairs generated after feeding the LoRA: Low-Rank Adaptation of Large Language Models paper into the system:

[
    {
        "question": "What is LoRA?",
        "answer": "LoRA is a method that freezes pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture to reduce the number of trainable parameters."
    },
    {
        "question": "What are the advantages of LoRA?",
        "answer": "The advantages of LoRA include reduced number of trainable parameters, lower GPU memory requirement, higher training throughput, no additional inference latency, and the ability to switch tasks efficiently."
    },
    {
        "question": "How does LoRA compare to fine-tuning in terms of model quality?",
        "answer": "LoRA performs on-par or better than fine-tuning in model quality on various language models, despite having fewer trainable parameters and higher training throughput."
    },
    {
        "question": "Which weight matrices in the Transformer architecture should be adapted with LoRA?",
        "answer": "LoRA should be applied to the weight matrices in the self-attention module, specifically Wq and Wv, for optimal performance."
    },
    {
        "question": "What is the optimal rank for LoRA?",
        "answer": "A low rank, such as 1 or 2, is sufficient for LoRA to achieve competitive performance on downstream tasks."
    },
    {
        "question": "What is the advantage of few-shot learning?",
        "answer": "Few-shot learning is advantageous when we only have a handful of training samples."
    },
    {
        "question": "What is the difference between adapter layers and LoRA?",
        "answer": "Adapter layers are computed in addition to the base model, introducing additional latency, while LoRA is added in a parallel manner."
    },
    {
        "question": "What is the GLUE Benchmark?",
        "answer": "The GLUE Benchmark is a collection of natural language understanding tasks used to evaluate NLU models."
    },
    {
        "question": "What is the purpose of the E2E NLG Challenge dataset?",
        "answer": "The E2E NLG Challenge dataset is used for training end-to-end, data-driven natural language generation systems."
    },
    {
        "question": "What is the amplification factor for task-specific directions in LoRA?",
        "answer": "The amplification factor for task-specific directions in LoRA is around 20."
    }
]

Getting Started

Install requirements

git clone https://github.com/codewangg/gpt-doc2data.git
cd gpt-doc2data
pip install -r requirements.txt

Prepare your documents

Currently supported file formats:

  • PDF
  • Markdown
  • TXT

All files should be placed under the gpt-doc2data/data directory.
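
As a rough illustration, a loader for these three formats could look like the sketch below. The actual reading logic lives in the repository; the pypdf dependency here is an assumption.

from pathlib import Path

from pypdf import PdfReader  # assumption: pypdf handles the PDF case

def read_document(path):
    # PDFs need text extraction; .md and .txt files are already plain text.
    if path.suffix.lower() == ".pdf":
        reader = PdfReader(str(path))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return path.read_text(encoding="utf-8")

docs = [
    read_document(p)
    for p in Path("gpt-doc2data/data").iterdir()
    if p.suffix.lower() in {".pdf", ".md", ".txt"}
]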

config.yaml

Rename example_config.yaml to config.yaml, modify it to suit your requirements, and provide your own OpenAI API key.
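
A filled-in config.yaml might look like the lines below. The key names are purely illustrative (consult example_config.yaml for the actual schema); only the need for an OpenAI API key is certain.

# Illustrative keys only -- see example_config.yaml for the real schema.
openai_api_key: "sk-..."   # your own OpenAI API key
model: "gpt-3.5-turbo"     # hypothetical key: which chat model to use
data_dir: "data"           # hypothetical key: where the input documents live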

Generate QA pairs

python3 gpt-doc2data/gpt-doc2data.py
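
The generated JSON (like the example above) can then be reshaped for whatever fine-tuning format you need. Below is a hedged sketch that converts it to chat-style JSONL; the input and output file names are assumptions, not the tool's documented paths.

import json

# File names below are assumptions, not the tool's documented output paths.
with open("qa_pairs.json") as f:
    pairs = json.load(f)

with open("train.jsonl", "w") as f:
    for pair in pairs:
        record = {"messages": [
            {"role": "user", "content": pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ]}
        f.write(json.dumps(record) + "\n")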

TODO

Low-hanging Fruits

  • Add an "id" field in the output JSON.
  • Improve the README.md for clarity and ease of use.
  • Add a Chinese README page.
  • Clean up and add comments and type specifiers to the codebase (currently over 50% generated by LLM).

Medium-hanging Fruits

  • Improve the method for estimating how many tokens the generated QA pairs will use, as the current approach may waste tokens on each API call (see the token-counting sketch after this list).
  • Add support to configure the output JSON key's name.
  • Add a rate limiter to avoid overloading the OpenAI API.
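
For the token-estimation item above, one possible direction (an assumption, not the project's stated plan) is to count tokens exactly with OpenAI's tiktoken library instead of estimating:

import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    # Exact token count under the given model's tokenizer.
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("What is LoRA?"))  # a handful of tokens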

High-hanging Fruits

  • Integrate the tool with locally or privately served open-source models to reduce the cost of high-throughput OpenAI API usage.
  • Extend support for more file types, such as audio and video, to serve as useful information sources.
  • Broaden the tool's capabilities to generate different output formats for fine-tuning, not just QA pairs.
  • Implement a human judge mechanism to ensure high-quality data generation when needed.
