Commit ecf4408

fix: make sure that configure is working
1 parent: a3aaa8f

File tree: 2 files changed, +54 −215 lines


README.md

Lines changed: 37 additions & 198 deletions
@@ -1,252 +1,91 @@
 # LLM Fine-Tuning with Axolotl - Pod Deployment
 
-**Interactive LLM fine-tuning environment using Axolotl on RunPod Pods**
+Pod-based LLM fine-tuning using [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) on RunPod.
 
-This repository provides a Pod-based deployment for LLM fine-tuning using [Axolotl](https://github.com/axolotl-ai-cloud/axolotl). It's designed for interactive development, experimentation, and debugging.
-
-## 🎯 Purpose
-
-This is the **Pod deployment** version of the LLM fine-tuning infrastructure. For serverless/API-based deployments, see the main [llm-fine-tuning](https://github.com/runpod-workers/llm-fine-tuning) repository.
+> **Serverless Version**: See [llm-fine-tuning](https://github.com/runpod-workers/llm-fine-tuning) for API-based deployments.
 
 ## 🚀 Quick Start
 
-### Deploy as RunPod Pod
-
-1. **Use the pre-built image**: `runpod/llm-finetuning:latest`
-2. **Set environment variables** for your training configuration:
-
-```bash
-# Required
-export HF_TOKEN="your-huggingface-token"
-export WANDB_API_KEY="your-wandb-key"
-
-# Training Configuration (examples)
-export AXOLOTL_BASE_MODEL="TinyLlama/TinyLlama_v1.1"
-export AXOLOTL_DATASETS='[{"path":"mhenrichsen/alpaca_2k_test","type":"alpaca"}]'
-export AXOLOTL_OUTPUT_DIR="./outputs/my_training"
-export AXOLOTL_ADAPTER="lora"
-export AXOLOTL_LORA_R="8"
-export AXOLOTL_LORA_ALPHA="16"
-export AXOLOTL_NUM_EPOCHS="1"
-```
+**Image**: `runpod/llm-finetuning:latest`
 
-3. **Start training**:
+### Required Environment Variables
 
 ```bash
-# The autorun.sh script will automatically configure and start training
-# Or manually run:
-axolotl train config.yaml
-```
-
-4. **Optional - Start vLLM server** (after training):
-
-```bash
-# Create your vLLM config based on the example
-cp vllm_config_example.yaml my_vllm_config.yaml
-# Edit my_vllm_config.yaml with your trained model path and settings
-./start_vllm.sh my_vllm_config.yaml
-```
-
-## 🏗️ Local Development
-
-### Build and Test Locally
-
-```bash
-# Build the container
-docker build -t llm-finetuning-pod .
-
-# Run with test configuration
-docker run -it --gpus all \
-  -e HF_TOKEN="your-token" \
-  -e WANDB_API_KEY="your-key" \
-  -e AXOLOTL_BASE_MODEL="TinyLlama/TinyLlama_v1.1" \
-  -e AXOLOTL_DATASETS='[{"path":"mhenrichsen/alpaca_2k_test","type":"alpaca"}]' \
-  llm-finetuning-pod
-```
-
-### Using the Makefile
-
-```bash
-# Set up local development environment
-make setup
-
-# Install dependencies
-make install
-
-# Test the autorun script
-make test
-```
-
-## ⚙️ Configuration
-
-Configuration is done entirely through environment variables prefixed with `AXOLOTL_`:
-
-### Required Variables
+HF_TOKEN=your-huggingface-token
+WANDB_API_KEY=your-wandb-key
 
-- `HF_TOKEN`: HuggingFace access token
-- `WANDB_API_KEY`: Weights & Biases API key
-
-### Common Configuration Examples
-
-#### Basic LoRA Training
-
-```bash
-export AXOLOTL_BASE_MODEL="NousResearch/Llama-3.2-1B"
-export AXOLOTL_DATASETS='[{"path":"teknium/GPT4-LLM-Cleaned","type":"alpaca"}]'
-export AXOLOTL_ADAPTER="lora"
-export AXOLOTL_LORA_R="16"
-export AXOLOTL_LORA_ALPHA="32"
-export AXOLOTL_NUM_EPOCHS="1"
-export AXOLOTL_MICRO_BATCH_SIZE="2"
-export AXOLOTL_GRADIENT_ACCUMULATION_STEPS="2"
+# Training config (examples)
+AXOLOTL_BASE_MODEL=TinyLlama/TinyLlama_v1.1
+AXOLOTL_DATASETS=[{"path":"mhenrichsen/alpaca_2k_test","type":"alpaca"}]
+AXOLOTL_ADAPTER=lora
 ```
 
-#### Memory-Optimized Settings
+### ⚠️ Critical: Volume Mounting
 
 ```bash
-export AXOLOTL_LOAD_IN_8BIT="true"
-export AXOLOTL_GRADIENT_CHECKPOINTING="true"
-export AXOLOTL_MICRO_BATCH_SIZE="1"
-export AXOLOTL_GRADIENT_ACCUMULATION_STEPS="8"
+# ❌ NEVER mount to /workspace - overwrites everything!
+# ✅ Mount to /workspace/data only
 ```
 
-#### Full Fine-Tuning
+### Training
 
 ```bash
-export AXOLOTL_BASE_MODEL="microsoft/DialoGPT-small"
-# Don't set AXOLOTL_ADAPTER for full fine-tuning
-export AXOLOTL_LEARNING_RATE="0.00001"
-export AXOLOTL_WARMUP_STEPS="100"
-```
-
-## 📁 Repository Structure
-
-```
-llm-finetuning-axolotl/
-├── Dockerfile                   # Container definition
-├── requirements.txt             # Python dependencies
-└── scripts/                     # Initialization scripts
-    ├── autorun.sh               # Main startup script
-    ├── configure.py             # Environment-to-YAML converter
-    ├── config_template.yaml     # Base configuration template
-    ├── start_vllm.sh            # vLLM server startup script
-    ├── vllm_config_example.yaml # vLLM configuration example
-    └── WELCOME                  # Welcome message
+# Training starts automatically, or manually:
+axolotl train config.yaml
 ```
 
-## 🔄 How It Works
-
-1. **Container starts** → `autorun.sh` is executed
-2. **Environment check** → Validates required tokens
-3. **Configuration generation** → `configure.py` converts env vars to `config.yaml`
-4. **Training starts** → `axolotl train config.yaml`
-
-## 🚀 vLLM Inference Server
-
-After training, you can serve your model using the built-in vLLM server:
-
-### Quick Start vLLM
+### Inference (after training)
 
 ```bash
-# 1. Copy and customize the example config
-cp vllm_config_example.yaml my_vllm_config.yaml
-# 2. Edit my_vllm_config.yaml with your trained model path and settings
-# 3. Start vLLM with your config
-./start_vllm.sh my_vllm_config.yaml
+# Create vLLM config from example
+cp vllm_config_example.yaml my_config.yaml
+# Edit with your model path
+./start_vllm.sh my_config.yaml
 ```
 
-### vLLM Features
-
-- **OpenAI-compatible API** at `http://localhost:8000`
-- **Automatic LoRA support** for trained adapters
-- **Optimized inference** with Flash Attention ≤ 2.8.0
-- **GPU memory management** with configurable utilization
-- **Not started automatically** - run when needed
-
-### YAML Configuration
-
-The `vllm_config_example.yaml` provides a template with common settings:
-
-```yaml
-# Model and performance
-model: ./outputs/my-model
-max_model_len: 32768
-gpu_memory_utilization: 0.95
-
-# Server settings
-port: 8000
-host: 0.0.0.0
-served_model_name: my-model
-# LoRA support (if needed)
-# lora_modules:
-#   - name: lora_adapter
-#     path: ./outputs/lora-out
-```
-
-### API Usage
+## 🏗️ Local Development
 
 ```bash
-# Test the server
-curl http://localhost:8000/v1/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "your-model",
-    "prompt": "Hello, how are you?",
-    "max_tokens": 100
-  }'
+# Build and test
+docker build -t llm-finetuning-pod .
+docker-compose up
 ```
 
-## 🤝 Development Workflow
-
-1. **Set environment variables** for your experiment
-2. **Deploy pod** or run locally
-3. **Monitor training** via Weights & Biases
-4. **Iterate** by updating environment variables and restarting
-
 ## 📚 Documentation
 
 - **[Development Conventions](docs/conventions.md)** - Development guide and best practices
 - **[Axolotl Documentation](https://axolotl-ai-cloud.github.io/axolotl/docs/config.html)** - Complete configuration reference
 
 ## 🔧 Troubleshooting
 
-### Common Issues
-
-#### Environment Variables Not Loading
+### Volume Mount Issues
 
 ```bash
-# Check if variables are set
-env | grep AXOLOTL_
-
-# Restart the container if variables were added after startup
+# Symptoms: "No such file or directory" errors, infinite loops
+# Cause: Mounting to /workspace overwrites container structure
+# Solution: Mount to /workspace/data/ subdirectories only
 ```
 
-#### Memory Issues
+### Environment Variables Not Loading
 
 ```bash
-export AXOLOTL_LOAD_IN_8BIT="true"
-export AXOLOTL_GRADIENT_CHECKPOINTING="true"
-export AXOLOTL_MICRO_BATCH_SIZE="1"
+# Variables must be set before container starts
+env | grep AXOLOTL_
 ```
 
-#### Authentication Issues
+### Authentication Issues
 
 ```bash
-# Verify tokens
 echo $HF_TOKEN
 echo $WANDB_API_KEY
-
-# Test HuggingFace login
-huggingface-cli whoami
 ```
 
 ## 🏷️ Available Images
 
-| Tag                             | Description           | Use Case             |
-| ------------------------------- | --------------------- | -------------------- |
-| `runpod/llm-finetuning:latest`  | Latest stable release | Production pods      |
-| `runpod/llm-finetuning:dev`     | Development build     | Testing new features |
-| `runpod/llm-finetuning:preview` | Preview release       | Early access         |
+| Tag                            | Description           | Use Case             |
+| ------------------------------ | --------------------- | -------------------- |
+| `runpod/llm-finetuning:latest` | Latest stable release | Production pods      |
+| `runpod/llm-finetuning:dev`    | Development build     | Testing new features |
 
 ---
 
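The README diff above warns that mounting a volume directly to `/workspace` overwrites the container's baked-in files and causes "No such file or directory" loops. A startup script could guard against that failure mode with a check like the following sketch; the `REQUIRED` paths are assumptions about what the image ships, not confirmed contents:

```python
import os

# Hypothetical paths -- assumed, not confirmed, contents of the image
REQUIRED = ["scripts/autorun.sh", "scripts/configure.py"]


def workspace_intact(root: str) -> bool:
    """Return False if a volume mounted over `root` wiped the baked-in files."""
    return all(os.path.exists(os.path.join(root, p)) for p in REQUIRED)
```

If the check fails at startup, the pod can exit early with a clear message instead of looping on missing-file errors.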

scripts/configure.py

Lines changed: 17 additions & 17 deletions
@@ -3,7 +3,9 @@
 import os
 import json
 import yaml
-from axolotl.utils.config.models.input.v0_4_1 import AxolotlInputConfig
+
+# Note: AxolotlInputConfig is no longer available in current Axolotl versions
+# We'll do simple YAML processing instead
 
 """
 Example:
@@ -45,45 +47,43 @@ def get_env_override(key: str, prefix: str = "") -> Optional[Any]:
 
 def load_config_with_overrides(
     config_path: str, env_prefix: str = DEFAULT_PREFIX
-) -> AxolotlInputConfig:
+) -> dict:
     """
     Load and parse the YAML config file, applying any environment variable overrides.
-    Uses the Pydantic AxolotlInputConfig for validation and parsing.
+    Simple version without Pydantic validation.
 
     Args:
         config_path: Path to the YAML config file
         env_prefix: Prefix for environment variables to override config values
 
     Returns:
-        AxolotlInputConfig object with merged configuration
+        dict with merged configuration
     """
     # Load base config from YAML
     if not config_path.startswith("/"):
-        # absolute path
         config_path = os.path.join(os.path.dirname(__file__), config_path)
 
     with open(config_path, "r") as f:
         print(f"🛠️ Generating from template: {config_path}")
         config_dict = yaml.safe_load(f)
 
-    # Get all fields from the Pydantic model
-    model_fields = AxolotlInputConfig.model_fields
-
-    # Apply environment overrides
-    for field_name in model_fields:
-        if env_value := get_env_override(field_name, env_prefix):
-            config_dict[field_name] = env_value
+    # Apply environment overrides for any AXOLOTL_ prefixed variables
+    for env_key, env_value in os.environ.items():
+        if env_key.startswith(env_prefix):
+            # Convert AXOLOTL_BASE_MODEL to base_model
+            config_key = env_key[len(env_prefix) :].lower()
+            config_dict[config_key] = parse_env_value(env_value)
+            print(f"    Override: {config_key} = {config_dict[config_key]}")
 
-    # Create and validate the config
-    return AxolotlInputConfig.model_validate(config_dict)
+    return config_dict
 
 
-def save_config(config: AxolotlInputConfig, output_path: str) -> None:
+def save_config(config: dict, output_path: str) -> None:
     """
     Save the configuration to a YAML file.
     """
-    # Convert to dict and remove null values
-    config_dict = config.model_dump(mode="json", exclude_none=True)
+    # Remove null values
+    config_dict = {k: v for k, v in config.items() if v is not None}
 
     if not output_path.startswith("/"):
         # absolute path
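The override loop this commit adds replaces Pydantic-model introspection with a plain scan of `os.environ`. The behavior can be sketched standalone as below; note that `parse_env_value` is not shown in this diff, so the JSON-decode-with-string-fallback parsing here is an assumption, as are the example values:

```python
import json

PREFIX = "AXOLOTL_"


def parse_env_value(raw: str):
    """Assumed behavior: try JSON (lists, numbers, booleans), else keep the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_overrides(config: dict, environ: dict) -> dict:
    """Mirror the commit's loop: AXOLOTL_BASE_MODEL overrides the base_model key."""
    for key, value in environ.items():
        if key.startswith(PREFIX):
            config[key[len(PREFIX):].lower()] = parse_env_value(value)
    return config


config = apply_overrides(
    {"base_model": "template-default", "num_epochs": 4},
    {
        "AXOLOTL_BASE_MODEL": "TinyLlama/TinyLlama_v1.1",
        "AXOLOTL_DATASETS": '[{"path":"mhenrichsen/alpaca_2k_test","type":"alpaca"}]',
        "HOME": "/root",  # non-prefixed variables are ignored
    },
)
print(config["base_model"])  # TinyLlama/TinyLlama_v1.1
```

One consequence of dropping the Pydantic model: any `AXOLOTL_`-prefixed variable now lands in the config, including misspelled keys that the old field-list approach would have silently skipped.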
