Description
It would be cool if there was an official finetuning script.
I have tried Qwen2.5-Coder of various sizes, but only the 32B model was barely usable quality-wise. The latency with an RTX 3090 was amazing with all models. 🚀
I then finetuned `unsloth/Qwen2.5-Coder-7B` on my own code, and the resulting model was good enough for the code I usually write. If I did not have a free Copilot student subscription, I'd use this model from now on. The biggest advantage is that the context became much less important, since most of it now resides in the model.
However, my finetuning script could probably use tons of improvement, so I wanted to suggest an official finetuning script from someone with more experience in finetuning or with llama.vscode.
Some uncertainties I've had:
- Which data format should be used for training? Currently, I am finetuning with the FIM template (`f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}<|endoftext|>" + eos_token`), but maybe I should finetune on raw code instead and learn the FIM template format afterwards?
- I have skipped the global context/extra chunks entirely for now. I skimmed over the technical description, but was unsure how to best sample that data. Not sure how important it is, since the model has already seen the context during finetuning, but maybe it helps?
- How should the `prefix`/`suffix`/`middle` parts be sampled? For now, I chose them like this:
  - Choose a random file.
  - Choose a random character index within that file.
  - Choose the rest of the line (or the next line if the next character is EOL) as the `middle` to be predicted, or up to 256 characters if the line is longer.
    - Completing only the current/next line is sufficient for me. I'd rather get lines one by one and press TAB if I am happy with the predicted line. But maybe other people have different tastes. This could probably be controlled in the extension instead of being baked into the model. I wanted to have at least one line all the time, even if the cursor is at the end of a line.
  - Choose a `prefix` of random length, starting directly before the `middle`, with up to 2048 characters (is that a good length?).
  - Skip a random number of characters (up to 512, number is made up) from the end of the `middle` until the `suffix` starts. The idea is that the model should not be forced to complete a function in just one line; it is fine to do it across multiple lines.
  - Start the `suffix` of random length after the random offset behind the `middle`. I chose up to 1024 characters because the suffix is probably less important than the prefix, but again, the number is entirely made up.
The spans for a training sample are then cut from the file like this (the `offset` characters are skipped and do not appear in the final sample):

```
[prefix][middle][offset][suffix]

len(prefix) <  2048
len(middle) <= 257
offset      <  512
len(suffix) <  1024
```
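
For example, assuming the base Qwen2.5-Coder tokenizer's eos token is `<|endoftext|>`, a single rendered training sample in this FIM format would look roughly like the following (the prefix/middle/suffix strings here are made up and much shorter than the limits above):

```python
# Hypothetical spans cut from a file; far below the length limits above.
prefix = "def greet(name):\n    "
middle = 'return f"Hello, {name}!"'
suffix = '\n\nprint(greet("world"))\n'

eos_token = "<|endoftext|>"  # assumption: eos token of the base Qwen2.5-Coder tokenizer

# Same template string as in dev_dataset.apply_fim_template below
sample = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}<|endoftext|>" + eos_token
print(sample)
```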
I have not done any overly scientific ablation studies to validate these choices, except for a small test with Qwen2.5-Coder-0.5B, which was not great when finetuned: it could recite samples from the training data, but did not generalize well.
- How long should one train, and with which context size? I trained the 7B model overnight for about 10 hours on 60k samples on a V100, which seemed to work okay, but maybe more or less training would be better.
- Should the training data be filtered by some advanced criteria? I keep most of my code in a large repository, about 2000 files with a total size of around 11MB. I excluded unsloth files, some autogenerated files and very tiny files, but did no filtering otherwise.
- Which rank to choose for the LoRA? I chose 64 because it seemed like a nice number.
- Should `lora_alpha` be adjusted?
- How do I make `packing` in `SFTTrainer` work? It only seemed to make training much slower.
- Most of these questions could be answered with a validation dataset, but I have not checked whether one exists, and I do not have the compute to check all possible variations anyway. (A rough sketch of how a validation split could be wired in follows below.)
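
Here is what that could look like, as an untested sketch against the attached train.py: it reuses `model`, `tokenizer`, `max_seq_length`, and `dev_dataset` from the script below, and the eval sample count, `eval_steps`, and other numbers are made up. Note that `yield_samples` always uses the same fixed seed, so a real validation split would additionally need a seed parameter or a separate file list to avoid overlapping with the training samples.

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

def gen_train():
    yield from dev_dataset.yield_samples(n=60_000, eos_token=tokenizer.eos_token)

def gen_eval():
    # Made-up size. As written, these samples overlap with the training set,
    # because yield_samples uses a fixed RNG seed; a clean split would draw
    # them from held-out files or a different seed.
    yield from dev_dataset.yield_samples(n=1_000, eos_token=tokenizer.eos_token)

train_dataset = Dataset.from_generator(gen_train)
eval_dataset = Dataset.from_generator(gen_eval)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        output_dir = "outputs",
        per_device_train_batch_size = 2,
        per_device_eval_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        eval_strategy = "steps",   # "evaluation_strategy" on older transformers versions
        eval_steps = 500,
        logging_steps = 50,
        report_to = "none",
    ),
)
trainer.train()
```

With something like this in place, the eval loss reported every `eval_steps` steps would at least give a cheap signal for comparing LoRA rank, `lora_alpha`, packing, and the sampling hyperparameters.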
I'll attach my training code here as a starting point, but it requires some modifications to be usable. Probably still better than nothing. It is mostly copied from https://colab.research.google.com/drive/1Kose-ucXO1IBaZq5BvbwWieuubP7hxvQ
`dev_dataset.py`:

```python
from pathlib import Path
import random

# adjust this to the directory with the Python files you want to train on
train_dir = "../.."

texts = []
for path in Path(train_dir).rglob("*.py"):
    if "unsloth" in str(path): continue
    if path.name.startswith("__"): continue
    text = path.read_text()
    # Small files are probably not interesting
    if len(text) < 100: continue
    texts.append(text)

def apply_fim_template(prefix, suffix, middle=None, eos_token=None):
    text = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
    if middle is not None:
        text += f"{middle}<|endoftext|>" + eos_token
    return text

def yield_samples(n, eos_token, debug=False):
    rng = random.Random(0)
    for _ in range(n):
        text = rng.choice(texts)
        start_middle = rng.randrange(len(text) - 1)
        # Complete current line
        end_middle = start_middle + 1
        for _ in range(256):
            if end_middle >= len(text) or text[end_middle] == "\n": break
            end_middle += 1
        # make prefix start up to 2048 characters before middle
        start_prefix = max(0, start_middle - rng.randrange(2048))
        # make suffix start way after middle
        suffix_offset = rng.randrange(512)
        start_suffix = min(end_middle + suffix_offset, len(text))
        end_suffix = min(start_suffix + rng.randrange(1024), len(text))
        prefix = text[start_prefix:start_middle]
        middle = text[start_middle:end_middle]
        suffix = text[start_suffix:end_suffix]
        text = apply_fim_template(prefix, suffix, middle, eos_token)
        if debug:
            print("#" * 80)
            print(red(prefix) + green(middle) + yellow(suffix))
            print()
        yield {"text": text}

def red(text):
    return f"\x1b[31m{text}\x1b[0m"

def green(text):
    return f"\x1b[32m{text}\x1b[0m"

def yellow(text):
    return f"\x1b[33m{text}\x1b[0m"

if __name__ == "__main__":
    for _ in yield_samples(5, eos_token="<TODO eos token>", debug=True):
        pass
    print("files:", len(texts))
    print(sum(len(t) for t in texts))
```
`train.py`:

```python
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

import dev_dataset
from dev_dataset import red, green, yellow

def gen():
    if 1:
        yield from dev_dataset.yield_samples(n=60_000, eos_token=tokenizer.eos_token)
    else:
        # for quick testing, should run this first to ensure that saving works
        # (llama.cpp failed to compile automatically on some of my computers)
        yield from dev_dataset.yield_samples(n=10, eos_token=tokenizer.eos_token)

max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    # More models: https://huggingface.co/unsloth
    model_name = "unsloth/Qwen2.5-Coder-7B",
    max_seq_length = max_seq_length,
    #dtype = torch.bfloat16, # bfloat16 not supported by V100
    dtype = torch.float16,
    load_in_4bit = True,
)

# The dataset must be built after the tokenizer is loaded, because gen() reads tokenizer.eos_token.
dataset = Dataset.from_generator(gen)

model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # choose either
        num_train_epochs = 1, # Set this for 1 full training run.
        #max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

trainer_stats = trainer.train()

FastLanguageModel.for_inference(model) # Enable inference

# Test on some random strings which occur in my code to see whether the model can correctly predict them
for prefix, suffix in [
    ['os.path.expanduser("../../../data/rh', "def"],
    ['print(f"{gb', "def"],
    ['extern "C"\n__glob', "}"],
    ['from cup', "def"],
]:
    input_text = dev_dataset.apply_fim_template(prefix, suffix)
    inputs = tokenizer([input_text], return_tensors = "pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens; slicing the decoded string by len(input_text)
    # would be off because skip_special_tokens drops the FIM tokens from the prompt part.
    new_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(red(prefix) + green(new_text) + yellow(suffix))
    print()

model.save_pretrained_gguf("model_quantized", tokenizer, quantization_method = "q4_k_s")
```
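
For completeness, here is a rough sketch of how the exported GGUF could then be tried the way llama.vscode would use it, by pointing a local llama-server at it and querying the infill endpoint. The endpoint path and JSON field names are my assumptions from the llama.cpp server documentation, and the exact GGUF filename depends on what `save_pretrained_gguf` wrote, so both may need adjusting:

```python
import requests

# Assumes a server was started with something like:
#   llama-server -m model_quantized/<generated .gguf file> --port 8012
resp = requests.post(
    "http://127.0.0.1:8012/infill",
    json={
        "input_prefix": 'os.path.expanduser("../../../data/',
        "input_suffix": "\ndef main():\n",
        "n_predict": 64,
    },
    timeout=60,
)
print(resp.json().get("content"))
```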