A complete pipeline for fine-tuning 7B language models using QLoRA and converting them to GGUF format for efficient inference.
- QLoRA Fine-tuning: Memory-efficient fine-tuning using 4-bit quantization
- GPU/CPU Support: Automatic fallback to CPU if CUDA unavailable
- Model Merging: Merge LoRA adapters with base models
- GGUF Conversion: Convert to GGUF format with configurable quantization
- Local Inference: Test models locally with interactive chat
- Complete Pipeline: One-command execution of the entire workflow
Setup Environment
python setup.py
Create Sample Data
python finetune_pipeline.py --create_sample_data
Run Complete Pipeline
python finetune_pipeline.py --data_path data/sample_train.jsonl
Install Python Dependencies
pip install -r requirements.txt
Install llama.cpp (Optional, for GGUF conversion)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Training data should be in JSONL format with a "text" field containing formatted prompts:
{"text": "<s>[INST] What is machine learning? [/INST] Machine learning is a subset of artificial intelligence...</s>"}
{"text": "<s>[INST] Explain neural networks. [/INST] Neural networks are computational models...</s>"}
Then point the pipeline at your file:
python finetune_pipeline.py --data_path my_data.jsonl
To customize training from the command line:
python finetune_pipeline.py \
--model_name microsoft/DialoGPT-medium \
--data_path my_data.jsonl \
--epochs 5 \
--batch_size 2 \
--learning_rate 1e-4
To skip GGUF conversion:
python finetune_pipeline.py --data_path my_data.jsonl --skip_gguf
Fine-tuning Only
python finetune.py --data_path my_data.jsonl
Merge LoRA Adapter
python merge_model.py --adapter_path ./lora_adapters --output_path ./merged_model
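Under the hood, merging typically amounts to a few `peft` calls. A minimal sketch, assuming the adapter was saved by the fine-tuning step and the base model matches `model_name` in config.py:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in half precision, attach the LoRA adapter, then fold
# the adapter weights into the base weights and save a standalone model.
base = AutoModelForCausalLM.from_pretrained(
    "microsoft/DialoGPT-medium", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "./lora_adapters")
merged = model.merge_and_unload()

merged.save_pretrained("./merged_model")
AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium").save_pretrained("./merged_model")
```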
Convert to GGUF
python gguf_converter.py --model_path ./merged_model
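If you prefer to drive llama.cpp directly, the conversion looks roughly like this (a sketch only; the converter script and quantize binary have been renamed across llama.cpp versions, so check your checkout):

```python
import subprocess

LLAMA_CPP = "/path/to/llama.cpp"  # adjust to your llama.cpp checkout

# 1) Convert the merged Hugging Face model to a full-precision GGUF file
#    (the script may be convert_hf_to_gguf.py or convert-hf-to-gguf.py depending on version).
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py", "./merged_model",
     "--outfile", "./gguf_models/model-f16.gguf"],
    check=True,
)

# 2) Quantize to the configured level (binary is llama-quantize in recent builds,
#    quantize in older make-based builds).
subprocess.run(
    [f"{LLAMA_CPP}/llama-quantize",
     "./gguf_models/model-f16.gguf",
     "./gguf_models/model-Q4_K_M.gguf",
     "Q4_K_M"],
    check=True,
)
```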
Test Model
python inference.py --model_path ./merged_model --interactive
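The merged model can also be smoke-tested directly with `transformers` (a sketch, not the repo's inference.py):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("./merged_model")
model = AutoModelForCausalLM.from_pretrained(
    "./merged_model",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
model.eval()

prompt = "<s>[INST] What is machine learning? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```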
Edit config.py to customize training parameters:
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    model_name: str = "microsoft/DialoGPT-medium"
    max_seq_length: int = 512
    num_train_epochs: int = 3
    learning_rate: float = 2e-4
    lora_r: int = 64
    lora_alpha: int = 16
    quantization_level: str = "Q4_K_M"
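Roughly, these values map onto the `bitsandbytes` and `peft` configuration used for QLoRA. A sketch under the assumption that `TrainingConfig` is importable from config.py; the dropout value is illustrative and not part of the config shown above:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

from config import TrainingConfig

cfg = TrainingConfig()

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    cfg.model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Small trainable low-rank adapters on top of the quantized weights
lora_config = LoraConfig(
    r=cfg.lora_r,
    lora_alpha=cfg.lora_alpha,
    lora_dropout=0.05,  # illustrative default
    task_type="CAUSAL_LM",
    # target_modules can be set explicitly if peft cannot infer them
    # from the base architecture.
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```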
The pipeline writes its artifacts to the following layout:
outputs/
├── lora_adapters/ # LoRA adapter files
├── merged_model/ # Merged PyTorch model
└── gguf_models/ # GGUF quantized models
data/
└── sample_train.jsonl # Sample training data
Hardware requirements:
- Minimum: 8GB VRAM for 7B model fine-tuning with QLoRA
- Recommended: 16GB+ VRAM for optimal performance
- CPU Fallback: Available but significantly slower (a quick VRAM check is sketched below)
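A quick way to check the available VRAM before starting a run (plain PyTorch, no project code assumed):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device found; training will fall back to CPU.")
```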
The pipeline works with most causal language models on Hugging Face:
- microsoft/DialoGPT-medium (default)
- microsoft/DialoGPT-large
- EleutherAI/gpt-neo-1.3B
- EleutherAI/gpt-neo-2.7B
- And many others...
Available GGUF quantization levels:
- Q4_0, Q4_1: 4-bit quantization
- Q5_0, Q5_1: 5-bit quantization
- Q8_0: 8-bit quantization
- Q4_K_M, Q5_K_M: K-quantization (recommended)
CUDA Out of Memory
- Reduce per_device_train_batch_size
- Increase gradient_accumulation_steps (see the sketch below)
- Reduce max_seq_length
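For example, the effective batch size can be preserved while lowering per-step memory (these are standard transformers TrainingArguments fields; whether config.py exposes these exact names is an assumption):

```python
from transformers import TrainingArguments

# Effective batch size stays at 1 * 16 = 16, but each step holds far fewer
# activations in memory than per_device_train_batch_size=16 would.
args = TrainingArguments(
    output_dir="outputs/lora_adapters",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    learning_rate=2e-4,
)
```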
GGUF Conversion Fails
- Install llama.cpp from source
- Ensure the conversion scripts are in PATH
- Use --skip_gguf to bypass conversion
Model Quality Issues
- Increase training epochs
- Adjust the learning rate
- Improve training data quality
- Increase the LoRA rank (lora_r)
This project is licensed under the MIT License. See individual model licenses for usage restrictions.