This repository contains a convenient wrapper for fine-tuning and inference of Large Language Models (LLMs) in memory-constrained environments. Two major components that democratize the training of LLMs are Parameter-Efficient Fine-Tuning (PEFT) methods (e.g. LoRA, Adapter) and quantization techniques (8-bit, 4-bit). However, there are many quantization techniques and corresponding implementations, which makes it hard to compare and test different training configurations effectively. This repo aims to provide a common fine-tuning pipeline for LLMs that helps researchers quickly try the most common quantization methods and build compute-optimized training pipelines.
This repo is built upon these materials:
- alpaca-lora for the original training script.
- GPTQ-for-LLaMa for the efficient GPTQ quantization method.
- exllama for the high-performance inference engine.
Features:
- Memory-efficient fine-tuning of LLMs on consumer GPUs (<16GiB) by utilizing LoRA (Low-Rank Adapter) and quantization techniques.
- Support for the most popular quantization techniques: 8-bit and 4-bit quantization from bitsandbytes, and GPTQ.
- Correct PEFT checkpoint saving at regular intervals to minimize the risk of losing progress during long training runs.
- Correct checkpoint resume for all quantization methods.
- Support for distributed training on multiple GPUs (with examples).
- Support for gradient checkpointing with both GPTQ and bitsandbytes.
- Switchable prompt templates to fit different pretrained LLMs.
- An evaluation loop to ensure the LoRA is correctly loaded after training.
- Inference and deployment examples.
- Fast inference with exllama for GPTQ models.
See the notebook, or run it on Colab.
- Install the default dependencies:
pip install -r requirements.txt
- If bitsandbytes doesn't work, install it from source. Windows users can follow these instructions.
- To use the efficient 4-bit CUDA kernels from ExLlama and GPTQ for training and inference:
pip install -r cuda_quant_requirements.txt
Note that installing the packages above requires CUDA to be installed, because the custom kernels are compiled during installation. If you run into issues, look for help in the original repos (GPTQ, exllama).
Prepare the instruction data to fine-tune the model in the following JSON format.
[
  {
    "instruction": "do something with the input",
    "input": "input string",
    "output": "output string"
  }
]
You can supply a single JSON file as training data and let the script split off a validation set automatically. Alternatively, prepare two separate files, train.json and test.json, in the same directory to supply as training and validation data.
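If you prefer to create the two files yourself, the split can be sketched as follows. This is a minimal illustration rather than the repository's own code: the helper names `load_records`, `split_records`, and `write_split`, the 20% validation fraction, and the fixed seed are all assumptions made for the example.

```python
import json
import random
from pathlib import Path

# Fields taken from the JSON format shown above.
REQUIRED_KEYS = ("instruction", "input", "output")


def load_records(path):
    """Load a JSON list of records and check each one has the expected fields."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of records")
    for i, rec in enumerate(records):
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
    return records


def split_records(records, val_fraction=0.2, seed=42):
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def write_split(records, out_dir, val_fraction=0.2, seed=42):
    """Write train.json and test.json into out_dir (hypothetical helper)."""
    train, val = split_records(records, val_fraction, seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "train.json").write_text(json.dumps(train, indent=2), encoding="utf-8")
    (out / "test.json").write_text(json.dumps(val, indent=2), encoding="utf-8")
    return len(train), len(val)
```

Calling `write_split(load_records("data.json"), "data/")` would then produce a directory that can be passed as the training data path. A fixed seed keeps the validation set stable between runs, which makes evaluation numbers comparable.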
You should also take a look at templates to see the different prompt templates that combine each instruction, input, output triple into a single text. During training, the model is trained with a CausalLM objective (text completion) on the combined text. The prompt template must be compatible with the base LLM to maximize performance. Read the model card on HF (example) to get this information.
The prompt template can be switched via command-line parameters at both the training and inference steps.
We also support raw text file input and ShareGPT conversation-style input. See templates.
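To make the idea of a prompt template concrete, here is a small sketch of how a record might be turned into one training text. The template wording below is the widely used Alpaca-style prompt and is only an assumption here; the template files shipped in this repository may use different keys and wording, and `build_text` is a hypothetical helper.

```python
# Hypothetical Alpaca-style template; the repository's template files may differ.
TEMPLATE = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n"
        "### Input:\n{input}\n\n"
        "### Response:\n"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    ),
}


def build_text(record, template=TEMPLATE):
    """Combine instruction, input and output into a single training text."""
    if record.get("input"):
        prompt = template["prompt_input"].format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        # Records with an empty input use the shorter variant.
        prompt = template["prompt_no_input"].format(
            instruction=record["instruction"]
        )
    # The expected output is appended so the model learns to complete the prompt.
    return prompt + record.get("output", "")


example = {
    "instruction": "do something with the input",
    "input": "input string",
    "output": "output string",
}
print(build_text(example))
```

At inference time the same prompt is built without the output, and the model generates the response part. This is why training and inference should use the same template.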
finetune.py contains a straightforward application of PEFT to the LLaMA model, as well as some code related to prompt construction and tokenization. We use the standard HF Trainer to ensure compatibility with other libraries such as accelerate.
Simple usage:
bash scripts/train.sh
# OR
python finetune.py \
--base_model 'decapoda-research/llama-7b-hf' \
--data_path 'yahma/alpaca-cleaned' \
--output_dir './lora-output'
where data_path is the path to a JSON file or to a directory containing train.json and test.json, and base_model is the model name on the HF model hub or the path to a local model on disk.
We can also tweak other hyperparameters (see example in train.sh):
python finetune.py \
--base_model 'decapoda-research/llama-7b-hf' \
--data_path 'yahma/alpaca-cleaned' \
--output_dir './lora-output' \
--mode 4 \
--batch_size 128 \
--micro_batch_size 4 \
--num_epochs 3 \
--learning_rate 1e-4 \
--cutoff_len 512 \
--val_set_size 0.2 \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--lora_target_modules '[q_proj,v_proj]' \
--resume_from_checkpoint checkpoint-29/adapter_model/
Some notable parameters:
micro_batch_size: size of the batch on each GPU; greatly affects VRAM usage
batch_size: actual batch size after gradient accumulation
cutoff_len: maximum length of the input sequence; greatly affects VRAM usage
gradient_checkpointing: use gradient checkpointing to save memory, at the cost of lower training speed
mode: quantization mode to use, acceptable values [4, 8, 16, "gptq"]
resume_from_checkpoint: resume training from an existing LoRA checkpoint
You can use the helper script python download_model.py <model_name> to download a model from the HF model hub and store it locally. By default it saves the model to the models folder under the current path. Make sure to create this folder, or change the output location with --output.
For the effect of each quantization mode on training time and memory usage, see the note. Generally, modes 16 and gptq train fastest and should be selected to reduce training time. However, with larger models you will usually hit the GPU memory limit first, and in that case modes 4 and gptq save the most memory. Overall, gptq mode offers the best balance between memory saving and training speed.
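As a rough guide to why the lower-bit modes matter, the memory needed just to hold the model weights scales linearly with the number of bits per parameter. The sketch below is back-of-the-envelope arithmetic, not a measurement from this repository: `weight_memory_gib` is a hypothetical helper, and it ignores activations, LoRA parameters, optimizer state, and the small overhead of quantization metadata.

```python
def weight_memory_gib(n_params, bits):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits / 8 / 2**30


# A 7B-parameter model at the bit widths discussed above.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(7e9, bits):.1f} GiB")
```

For a 7B model this gives roughly 13 GiB at 16-bit and a little over 3 GiB at 4-bit, which is why a 4-bit format is usually needed to fit larger models on a consumer GPU with less than 16 GiB.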
NOTE: To use gptq mode, you must install the required packages in cuda_quant_requirements. Also, since GPTQ is a post-hoc quantization technique, only GPTQ-quantized models can be used for training. Look for model names that contain gptq on the HF model hub, such as TheBloke/orca_mini_v2_7B-GPTQ. To load the checkpoint correctly, a GPTQ model must first be downloaded to disk with the helper script described above.
If you don't use wandb and want to disable the prompt at the start of every training run, run wandb disabled.
By default, in a multi-GPU environment, the training script loads the model weights and splits the layers across the GPUs. This reduces per-GPU VRAM usage and allows loading a model larger than a single GPU can hold. However, it wastes most of the compute of the additional GPUs, because only one GPU is computing at any given time, so training time is similar to a single-GPU run.
To run training on multiple GPUs in parallel, launch distributed training with torchrun or accelerate. See the examples in train_torchrun.sh and train_accelerate.sh. Training time will be much lower. Note that batch_size should be divisible by the number of GPUs.
bash scripts/train_torchrun.sh
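The relationship between batch_size, micro_batch_size, and the number of GPUs can be sketched as below. This follows the common convention in which each GPU processes micro_batch_size samples per step and gradients are accumulated until batch_size samples have been seen in total. Whether finetune.py computes it exactly this way is an assumption, `accumulation_steps` is a hypothetical helper, and the check is slightly stricter than the rule stated above.

```python
def accumulation_steps(batch_size, micro_batch_size, world_size=1):
    """Gradient accumulation steps per GPU for a given effective batch size.

    world_size is the number of GPUs (processes) training in parallel.
    """
    samples_per_step = micro_batch_size * world_size
    if batch_size % samples_per_step != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by "
            f"micro_batch_size * world_size = {samples_per_step}"
        )
    return batch_size // samples_per_step


# Values from the example command: batch_size 128, micro_batch_size 4.
print(accumulation_steps(128, 4))                # single GPU
print(accumulation_steps(128, 4, world_size=4))  # four GPUs in parallel
```

With more GPUs, each process needs fewer accumulation steps to reach the same effective batch size, which is where the speed-up comes from.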
Simply add --eval and --resume_from_checkpoint to perform evaluation on the validation data.
python finetune.py \
--base_model 'decapoda-research/llama-7b-hf' \
--data_path 'yahma/alpaca-cleaned' \
--resume_from_checkpoint output/checkpoint-29/adapter_model/ \
--eval
inference.py loads the fine-tuned LoRA checkpoint together with the base model and performs inference on the selected dataset. Output is printed to the terminal and stored in sample_output.txt.
Example usage:
python inference.py \
--base models/TheBloke_vicuna-13b-v1.3.0-GPTQ/ \
--delta lora-output \
--mode exllama \
--type local \
--data data/test.json
Important parameters:
base: model id or path to the base model
delta: path to the fine-tuned LoRA checkpoint (optional)
data: path to the evaluation dataset
mode: quantization mode used to load the model, acceptable values [4, 8, 16, "gptq", "exllama"]
type: inference type to use, acceptable values ["local", "api", "guidance"]
Note that the gptq and exllama modes are only compatible with GPTQ models. exllama currently provides the best inference speed and is therefore recommended.
Inference type local is the default option (the model is loaded locally). To use inference type api, we need a running instance of the text-generation-inference server, as described in deployment.
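For orientation, a request to a text-generation-inference server can be sketched with the standard library alone. This is not the client used by inference.py, which may differ. The address `http://localhost:8080`, the helper names, and the sampling parameters are assumptions for the example. The `/generate` route and the `inputs` / `parameters` payload shape follow the text-generation-inference HTTP API.

```python
import json
import urllib.request


def build_request(prompt, max_new_tokens=128, temperature=0.7):
    """Build the JSON payload for the server's /generate route."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(prompt, base_url="http://localhost:8080", **params):
    """Send one prompt to a running server and return the generated text.

    base_url is an assumed address; point it at your own deployment.
    """
    payload = json.dumps(build_request(prompt, **params)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))["generated_text"]
```

The prompt passed to `generate` should already be formatted with the same prompt template that was used for training, since the server only sees raw text.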
Inference type guidance is an advanced method that enforces the structure of the text output (such as JSON). Check the command-line help with inference.py --help, and see guidance for more information.
This file contains scripts that merge the LoRA weights back into the base model for export to Hugging Face format. It should help users who want to run inference in projects like llama.cpp or text-generation-inference.
Currently, merging LoRA weights into a GPTQ base model is not supported because of an incompatibility with the quantized weights.
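Conceptually, merging a LoRA adapter adds the scaled low-rank update to each adapted weight matrix: W' = W + (lora_alpha / r) * B A. The sketch below shows that arithmetic on tiny nested lists so it runs without any dependency. It illustrates the standard LoRA formula rather than the merge script itself, and `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A).

    weight: d_out x d_in, lora_b: d_out x r, lora_a: r x d_in.
    """
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Tiny example with d_out = d_in = 2 and rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
print(merge_lora(W, A, B, lora_alpha=2, r=1))
```

The update is a dense floating-point matrix. One plausible reading of the GPTQ limitation above is that a quantized base model stores its weights as packed low-bit integers, so the sum cannot be written back without dequantizing and quantizing again.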
See deployment.
To convert a normal HF checkpoint to a GPTQ checkpoint, we need a conversion script. See GPTQ-for-LLaMa and AutoGPTQ for more information.
This document provides a comprehensive summary of different quantization methods and some suggestions for efficient training & inference.
Recommended models to start:
- 7B: TheBloke/vicuna-7B-v1.3-GPTQ, lmsys/vicuna-7b-v1.3
- 13B: TheBloke/vicuna-13b-v1.3.0-GPTQ, lmsys/vicuna-13b-v1.3
- 33B: TheBloke/airoboros-33B-gpt4-1.4-GPTQ
- https://github.com/ggerganov/llama.cpp: highly portable Llama inference based on C++
- https://github.com/huggingface/text-generation-inference: production-level LLM serving
- https://github.com/microsoft/guidance: enforce structure to LLM output
- https://github.com/turboderp/exllama/: high-performance GPTQ inference
- https://github.com/qwopqwop200/GPTQ-for-LLaMa: GPTQ quantization
- https://github.com/oobabooga/text-generation-webui: a flexible Web UI with support for multiple LLMs back-end
- https://github.com/vllm-project/vllm/: high throughput LLM serving
- @disarmyouwitha exllama_fastapi
- @turboderp exllama
- @johnsmith0031 alpaca_lora_4bit
- @TimDettmers bitsandbytes
- @tloen alpaca-lora
- @oobabooga text-generation-webui