|
# LLM Fine-Tuning with Axolotl - Pod Deployment

Pod-based LLM fine-tuning using [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) on RunPod.

> **Serverless Version**: See [llm-fine-tuning](https://github.com/runpod-workers/llm-fine-tuning) for API-based deployments.
|
## 🚀 Quick Start
|
**Image**: `runpod/llm-finetuning:latest`

### Required Environment Variables
|
```bash
# Required
HF_TOKEN=your-huggingface-token
WANDB_API_KEY=your-wandb-key

# Training config (examples; quote the JSON value when exporting in a shell)
AXOLOTL_BASE_MODEL=TinyLlama/TinyLlama_v1.1
AXOLOTL_DATASETS='[{"path":"mhenrichsen/alpaca_2k_test","type":"alpaca"}]'
AXOLOTL_ADAPTER=lora
```
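
Any Axolotl config key can be set the same way with an `AXOLOTL_` prefix; the startup script converts these variables into `config.yaml`. Common optional settings (values are illustrative):

```bash
# LoRA shape and schedule
AXOLOTL_LORA_R=16
AXOLOTL_LORA_ALPHA=32
AXOLOTL_NUM_EPOCHS=1
AXOLOTL_MICRO_BATCH_SIZE=2
AXOLOTL_GRADIENT_ACCUMULATION_STEPS=2
AXOLOTL_OUTPUT_DIR=./outputs/my_training

# Memory-constrained GPUs
AXOLOTL_LOAD_IN_8BIT=true
AXOLOTL_GRADIENT_CHECKPOINTING=true
```

For full fine-tuning, leave `AXOLOTL_ADAPTER` unset.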
|
### ⚠️ Critical: Volume Mounting
|
```bash
# ❌ NEVER mount to /workspace - overwrites everything!
# ✅ Mount to /workspace/data only
```
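
For example, when starting the container yourself, a correct mount looks like this (the host path is illustrative):

```bash
# ✅ Correct: the image's /workspace contents stay intact
docker run --gpus all -v "$(pwd)/data:/workspace/data" llm-finetuning-pod

# ❌ Wrong: this hides everything the image ships under /workspace
# docker run --gpus all -v "$(pwd):/workspace" llm-finetuning-pod
```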
|
### Training
|
```bash
# On startup, autorun.sh validates the required tokens, generates config.yaml
# from the AXOLOTL_* variables (via configure.py), and starts training
# automatically. To run manually:
axolotl train config.yaml
```
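
If you change `AXOLOTL_*` variables afterwards, regenerate the config and retrain. A minimal sketch, assuming `configure.py` can be run directly with no arguments (the invocation is an assumption, not documented behavior):

```bash
# Hypothetical manual re-run: rebuild config.yaml from the current
# environment, then train again
python scripts/configure.py
axolotl train config.yaml
```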
|
### Inference (after training)
|
```bash
# Create a vLLM config from the example
cp vllm_config_example.yaml my_config.yaml
# Edit my_config.yaml with your trained model path, then start the server
./start_vllm.sh my_config.yaml
```
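
`vllm_config_example.yaml` provides a template with common settings:

```yaml
# Model and performance
model: ./outputs/my-model
max_model_len: 32768
gpu_memory_utilization: 0.95

# Server settings
port: 8000
host: 0.0.0.0
served_model_name: my-model

# LoRA support (if needed)
# lora_modules:
#   - name: lora_adapter
#     path: ./outputs/lora-out
```

The server exposes an OpenAI-compatible API at `http://localhost:8000`. A quick smoke test (the `model` value must match `served_model_name`):

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": "Hello, how are you?",
    "max_tokens": 100
  }'
```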
|
## 🏗️ Local Development
|
```bash
# Build and test
docker build -t llm-finetuning-pod .
docker-compose up
```
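
To run a quick training test against the locally built image (tokens are placeholders):

```bash
docker run -it --gpus all \
  -e HF_TOKEN="your-token" \
  -e WANDB_API_KEY="your-key" \
  -e AXOLOTL_BASE_MODEL="TinyLlama/TinyLlama_v1.1" \
  -e AXOLOTL_DATASETS='[{"path":"mhenrichsen/alpaca_2k_test","type":"alpaca"}]' \
  llm-finetuning-pod
```

The Makefile also provides convenience targets:

```bash
make setup    # set up the local development environment
make install  # install dependencies
make test     # test the autorun script
```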
|
## 📚 Documentation
|
- **[Development Conventions](docs/conventions.md)** - Development guide and best practices
- **[Axolotl Documentation](https://axolotl-ai-cloud.github.io/axolotl/docs/config.html)** - Complete configuration reference
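
For reference, the repository layout:

```
llm-finetuning-axolotl/
├── Dockerfile                   # Container definition
├── requirements.txt             # Python dependencies
└── scripts/                     # Initialization scripts
    ├── autorun.sh               # Main startup script
    ├── configure.py             # Environment-to-YAML converter
    ├── config_template.yaml     # Base configuration template
    ├── start_vllm.sh            # vLLM server startup script
    ├── vllm_config_example.yaml # vLLM configuration example
    └── WELCOME                  # Welcome message
```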
|
## 🔧 Troubleshooting
|
### Volume Mount Issues
|
```bash
# Symptoms: "No such file or directory" errors, infinite loops
# Cause: mounting to /workspace overwrites the container structure
# Solution: mount to /workspace/data/ subdirectories only
```
|
### Environment Variables Not Loading
|
```bash
# Variables must be set before the container starts;
# restart the container if they were added after startup
env | grep AXOLOTL_
```
|
### Authentication Issues
|
```bash
# Verify tokens are set
echo $HF_TOKEN
echo $WANDB_API_KEY

# Test the HuggingFace login
huggingface-cli whoami
```
|
## 🏷️ Available Images
|
| Tag                            | Description           | Use Case             |
| ------------------------------ | --------------------- | -------------------- |
| `runpod/llm-finetuning:latest` | Latest stable release | Production pods      |
| `runpod/llm-finetuning:dev`    | Development build     | Testing new features |
|
---
|
|