[Feature request] Finetuning script for Qwen2.5-Coder FIM #40

@99991

Description

It would be cool if there was an official finetuning script.

I have tried Qwen2.5-Coder at various sizes, but only the 32B model was even barely usable quality-wise. The latency on an RTX 3090 was amazing with all models. 🚀

I then finetuned unsloth/Qwen2.5-Coder-7B on my own code and the resulting model was good enough for the code I usually write. If I did not have a free Copilot student subscription, I'd use this model from now on. The biggest advantage is that the context became much less important since most of it resides in the model now.

However, my finetuning script can probably be improved a lot, so I wanted to suggest that someone with more experience with finetuning or llama.vscode provide an official one.

Some uncertainties I've had:

  • Which data format should be used for training? Currently, I am finetuning with the FIM template (f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}<|endoftext|>" + eos_token), but maybe I should finetune on raw code first and have the model learn the FIM template format afterwards?
  • I have skipped the global context/extra chunks entirely for now. I skimmed the technical description, but was unsure how best to sample that data. I am also not sure how important it is, since the model has already seen the context during finetuning, but maybe it helps?
  • How should the prefix/suffix/middle parts be sampled? For now, I chose them like this:
    1. Choose a random file.
    2. Choose a random character index within that file.
    3. Choose the rest of the line (or the next line if the next character is EOL) as the middle to be predicted, or up to 256 characters if the line is longer.
    • Completing only the current/next line is sufficient for me. I'd rather get lines one by one and press TAB if I am happy with the predicted line. But maybe other people have different tastes. This could probably be controlled in the extension instead of being baked into the model. I wanted to have at least one line all the time, even if the cursor is at the end of a line.
    4. Choose a prefix of random length, up to 2048 characters, ending directly before the middle (is that a good length?).
    5. Skip a random number of characters (up to 512, number is made up) from the end of the middle until the suffix starts. The idea is that the model should not be forced to complete a function in just one line. It is fine to do it in multiple lines.
    6. Start the suffix, of random length, after the random offset behind the middle. I chose up to 1024 characters because the suffix is probably less important than the prefix, but again, the number is entirely made up.

The parts are cut out of the file in this order (the offset span is skipped and does not appear in the training sample):

[prefix][middle][offset][suffix]

len(prefix) < 2048
len(middle) <= 257
offset < 512
len(suffix) < 1024
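
To make the format concrete, a single rendered training sample could look roughly like this (made-up code snippet; the tokenizer's EOS token is appended after <|endoftext|>, and the skipped offset characters, here the line defining n, simply do not appear):

<|fim_prefix|>def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    xs = sor<|fim_suffix|>    return xs[n // 2]<|fim_middle|>ted(xs)<|endoftext|>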

I have not done any overly scientific ablation studies to validate the choices I've made, except for a small test with Qwen2.5-Coder-0.5B, which was not great when finetuned: it was able to recite samples from the training data, but could not generalize very well.

  • How long should I train, and at which context size? I trained the 7B model for about 10 hours on 60k samples on a V100 overnight, which seemed to work okay, but maybe more or less training would be better.
  • Should the training data be filtered by some more advanced criteria? I keep most of my code in a large repository, about 2000 files with a total size of around 11 MB. I excluded unsloth files, some autogenerated files, and very tiny files, but did no filtering otherwise.
  • Which rank to choose for the LoRA? I chose 64 because it seemed like a nice number.
  • Should lora_alpha be adjusted?
  • How can packing in SFTTrainer be made to work? It only seemed to make training much slower.
  • Most of these questions could be answered with a validation dataset, but I have not checked whether one exists, and I do not have the compute to check all possible variations anyway (a rough sketch of wiring up a held-out split follows right after this list).
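
For that last point, something along these lines could be bolted onto the train.py below (untested sketch; it reuses dataset, model, tokenizer and max_seq_length from train.py, and the evaluation_strategy argument may be called eval_strategy in newer transformers releases):

split = dataset.train_test_split(test_size = 0.05, seed = 3407)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = split["train"],
    eval_dataset = split["test"],  # held-out samples, reported as eval loss
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        evaluation_strategy = "steps",
        eval_steps = 200,
        logging_steps = 50,
        output_dir = "outputs",
        report_to = "none",
    ),
)

A cleaner setup would hold out whole files rather than individual samples, since samples cut from the same file overlap heavily and make the validation loss look better than it is.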

I'll attach my training code here as a starting point, but it requires some modifications to be usable. Probably still better than nothing. It is mostly copied from https://colab.research.google.com/drive/1Kose-ucXO1IBaZq5BvbwWieuubP7hxvQ

dev_dataset.py

from pathlib import Path
import random

# adjust this to the directory with the Python files you want to train on
train_dir = "../.."

texts = []
for path in Path(train_dir).rglob("*.py"):
    if "unsloth" in str(path): continue
    if path.name.startswith("__"): continue

    text = path.read_text()

    # Small files are probably not interesting
    if len(text) < 100: continue

    texts.append(text)

def apply_fim_template(prefix, suffix, middle=None, eos_token=None):
    # Without middle: builds a FIM prompt for inference. With middle: a full training sample.
    text = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

    if middle is not None:
        text += f"{middle}<|endoftext|>" + eos_token

    return text

def yield_samples(n, eos_token, debug=False):
    # Yields n dicts of the form {"text": <FIM-formatted training sample>}.
    rng = random.Random(0)

    for _ in range(n):
        text = rng.choice(texts)

        start_middle = rng.randrange(len(text) - 1)

        # Complete current line
        end_middle = start_middle + 1
        for _ in range(256):
            if end_middle >= len(text) or text[end_middle] == "\n": break
            end_middle += 1

        # make prefix start up to 2048 characters before middle
        start_prefix = max(0, start_middle - rng.randrange(2048))

        # make suffix start way after middle
        suffix_offset = rng.randrange(512)

        start_suffix = min(end_middle + suffix_offset, len(text))
        end_suffix = min(start_suffix + rng.randrange(1024), len(text))

        prefix = text[start_prefix:start_middle]
        middle = text[start_middle:end_middle]
        suffix = text[start_suffix:end_suffix]

        text = apply_fim_template(prefix, suffix, middle, eos_token)

        if debug:
            print("#" * 80)
            print(red(prefix) + green(middle) + yellow(suffix))
            print()

        yield {"text": text}

def red(text):
    return f"\x1b[31m{text}\x1b[0m"

def green(text):
    return f"\x1b[32m{text}\x1b[0m"

def yellow(text):
    return f"\x1b[33m{text}\x1b[0m"

if __name__ == "__main__":
    for _ in yield_samples(5, eos_token="<TODO eos token>", debug=True):
        pass
    print("files:", len(texts))
    print(sum(len(t) for t in texts))

train.py

from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
import dev_dataset
from dev_dataset import red, green, yellow

def gen():
    quick_test = False
    if quick_test:
        # For quick testing; run this first to ensure that saving at the end works
        # (llama.cpp failed to compile automatically on some of my computers).
        yield from dev_dataset.yield_samples(n=10, eos_token=tokenizer.eos_token)
    else:
        yield from dev_dataset.yield_samples(n=60_000, eos_token=tokenizer.eos_token)

max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    # More models: https://huggingface.co/unsloth
    model_name = "unsloth/Qwen2.5-Coder-7B",
    max_seq_length = max_seq_length,
    #dtype = torch.bfloat16, # bfloat16 not supported by V100
    dtype = torch.float16,
    load_in_4bit = True,
)

# Build the dataset only after the tokenizer is loaded, since gen() needs tokenizer.eos_token.
dataset = Dataset.from_generator(gen)

model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16, # effective LoRA scaling factor is lora_alpha / r, i.e. 16/64 here
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,

        # choose either
        num_train_epochs = 1, # Set this for 1 full training run.
        #max_steps = 60,

        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

trainer_stats = trainer.train()

FastLanguageModel.for_inference(model) # Enable inference

# Test on some strings which occur in my code to see whether the model completes them correctly
for prefix, suffix in [
    ['os.path.expanduser("../../../data/rh', "def"],
    ['print(f"{gb', "def"],
    ['extern "C"\n__glob', "}"],
    ['from cup', "def"],
]:
    input_text = dev_dataset.apply_fim_template(prefix, suffix)

    inputs = tokenizer([input_text], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens=64)

    # Decode only the newly generated tokens; slicing the decoded full sequence by
    # len(input_text) can be off when the decoder strips special tokens such as the FIM markers.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    new_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    print(red(prefix) + green(new_text) + yellow(suffix))
    print()

model.save_pretrained_gguf("model_quantized", tokenizer, quantization_method = "q4_k_s")
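
In case the GGUF export at the end fails (llama.cpp did not compile automatically on some of my computers), the LoRA adapter can still be saved and reloaded separately; a rough sketch with made-up directory names:

# Save only the LoRA adapter and tokenizer (small, no llama.cpp needed)
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Later: reload the adapter on top of the base model for inference
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = max_seq_length,
    dtype = torch.float16,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)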
