[ English | 中文 ]
You are welcome to join our WeChat group chat!
When collecting data for LLM fine-tuning, obtaining formatted training data from existing documents can be costly. Fortunately, gpt-doc2data comes to the rescue! It is a GPT-based data generator: feed it your private documents, and a large language model will create high-quality question-and-answer pairs that you can then use for fine-tuning or prompt-tuning your own model.
Below are 10 QA pairs generated after feeding the paper *LoRA: Low-Rank Adaptation of Large Language Models* into the system:
[
{
"question": "What is LoRA?",
"answer": "LoRA is a method that freezes pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture to reduce the number of trainable parameters."
},
{
"question": "What are the advantages of LoRA?",
"answer": "The advantages of LoRA include reduced number of trainable parameters, lower GPU memory requirement, higher training throughput, no additional inference latency, and the ability to switch tasks efficiently."
},
{
"question": "How does LoRA compare to fine-tuning in terms of model quality?",
"answer": "LoRA performs on-par or better than fine-tuning in model quality on various language models, despite having fewer trainable parameters and higher training throughput."
},
{
"question": "Which weight matrices in the Transformer architecture should be adapted with LoRA?",
"answer": "LoRA should be applied to the weight matrices in the self-attention module, specifically Wq and Wv, for optimal performance."
},
{
"question": "What is the optimal rank for LoRA?",
"answer": "A low rank, such as 1 or 2, is sufficient for LoRA to achieve competitive performance on downstream tasks."
},
{
"question": "What is the advantage of few-shot learning?",
"answer": "Few-shot learning is advantageous when we only have a handful of training samples."
},
{
"question": "What is the difference between adapter layers and LoRA?",
"answer": "Adapter layers are computed in addition to the base model, introducing additional latency, while LoRA is added in a parallel manner."
},
{
"question": "What is the GLUE Benchmark?",
"answer": "The GLUE Benchmark is a collection of natural language understanding tasks used to evaluate NLU models."
},
{
"question": "What is the purpose of the E2E NLG Challenge dataset?",
"answer": "The E2E NLG Challenge dataset is used for training end-to-end, data-driven natural language generation systems."
},
{
"question": "What is the amplification factor for task-specific directions in LoRA?",
"answer": "The amplification factor for task-specific directions in LoRA is around 20."
}
]
To get started, clone the repository and install the dependencies:
git clone https://github.com/codewangg/gpt-doc2data.git
cd gpt-doc2data
pip install -r requirements.txt
Currently supported file formats:
- Markdown
- TXT
All files should be placed under the gpt-doc2data/data directory.
Rename example_config.yaml to config.yaml, then modify it to suit your requirements and provide your own OpenAI API key.
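As a rough sketch of what such a config might contain (the key names below are hypothetical placeholders; the authoritative schema is whatever example_config.yaml ships with):

```yaml
# Hypothetical sketch -- consult example_config.yaml for the real key names.
openai_api_key: "sk-..."     # your own OpenAI API key
model: "gpt-3.5-turbo"       # chat model used to generate QA pairs
data_dir: "data"             # directory holding the Markdown/TXT sources
output_file: "qa_pairs.json" # where the generated QA pairs are written
```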
Then run the generator:
python3 gpt-doc2data/gpt-doc2data.py
TODO:
- Add an "id" field to the output JSON.
- Improve the README.md for better understanding and usage.
- Add a Chinese README page.
- Clean up the codebase and add comments and type annotations (over 50% of the code is currently LLM-generated).
- Improve the method for estimating the token count of the generated QA pairs, since the current approach may waste tokens on each API call (see the token-counting sketch after this list).
- Add support for configuring the output JSON key names.
- Add a rate limiter to avoid overloading the OpenAI API (see the rate-limiter sketch after this list).
- Integrate the tool with locally or privately served open-source models to reduce the cost of high-throughput OpenAI API usage.
- Extend support to more file types, such as audio and video, to serve as additional information sources.
- Broaden the tool's capabilities to generate fine-tuning data in formats beyond QA pairs.
- Implement a human-in-the-loop review mechanism to ensure high-quality data generation when needed.
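For the token-estimation item above, here is a minimal sketch using the tiktoken library; the helper name and the budgeting numbers are assumptions for illustration, not the tool's current implementation:

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Count tokens with the model's own tokenizer instead of guessing from character length."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a widely used encoding.
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Example: check that a prompt leaves room in the context window for the output.
prompt = "Generate question-answer pairs from the following text: ..."
max_context = 4096          # assumed context window of the model above
reserved_for_output = 1024  # assumed budget for the generated QA pairs
assert count_tokens(prompt) <= max_context - reserved_for_output
```

Counting tokens exactly up front lets each API call request the largest safe completion instead of over-reserving tokens.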
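For the rate-limiter item, a minimal sketch of client-side throttling with a sliding window; the 60-requests-per-minute figure is a placeholder, not an actual OpenAI quota:

```python
import threading
import time

class RateLimiter:
    """Thread-safe sliding-window limiter: at most max_calls per period seconds."""

    def __init__(self, max_calls: int = 60, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls: list[float] = []
        self.lock = threading.Lock()

    def wait(self) -> None:
        # Sleeping while holding the lock intentionally serializes callers.
        with self.lock:
            now = time.monotonic()
            # Keep only timestamps that are still inside the window.
            self.calls = [t for t in self.calls if now - t < self.period]
            if len(self.calls) >= self.max_calls:
                # Sleep until the oldest call ages out of the window.
                time.sleep(self.period - (now - self.calls[0]))
            self.calls.append(time.monotonic())

limiter = RateLimiter(max_calls=60, period=60.0)  # placeholder limit
# Call limiter.wait() immediately before each OpenAI API request.
```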