[Feature request] Finetuning script for Qwen2.5-Coder FIM #40

@99991

Description

It would be cool if there was an official finetuning script.

I have tried Qwen2.5-Coder at various sizes, but only the 32B model was even barely usable quality-wise. The latency on an RTX 3090 was amazing with all models. 🚀

I then finetuned unsloth/Qwen2.5-Coder-7B on my own code and the resulting model was good enough for the code I usually write. If I did not have a free Copilot student subscription, I'd use this model from now on. The biggest advantage is that the context became much less important since most of it resides in the model now.

However, my finetuning script can probably be improved a lot, so I wanted to suggest that someone with more experience with finetuning or llama.vscode provide an official one.

Some uncertainties I've had:

  • Which data format should be used for training? Currently, I am finetuning with the FIM template (f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}<|endoftext|>" + eos_token), but maybe I should finetune on raw code first and have the model learn the FIM template format afterwards?
  • I have skipped the global context/extra chunks entirely for now. I skimmed the technical description, but was unsure how best to sample that data. I am also not sure how important it is, since the model has already seen the context during finetuning, but maybe it helps?
  • How should the prefix/suffix/middle parts be sampled? For now, I chose them like this:
    1. Choose a random file.
    2. Choose a random character index within that file.
    3. Choose the rest of the line (or the next line if the next character is EOL) as the middle to be predicted, or up to 256 characters if the line is longer.
    • Completing only the current/next line is sufficient for me. I'd rather get lines one by one and press TAB if I am happy with the predicted line. But maybe other people have different tastes. This could probably be controlled in the extension instead of being baked into the model. I wanted to have at least one line all the time, even if the cursor is at the end of a line.
    4. Choose a prefix of random length, up to 2048 characters, ending directly before the middle (is that a good length?).
    5. Skip a random number of characters (up to 512, number is made up) from the end of the middle until the suffix starts. The idea is that the model should not be forced to complete a function in just one line. It is fine to do it in multiple lines.
    6. Start the suffix, of random length, after the random offset behind the middle. I chose up to 1024 characters because the suffix is probably less important than the prefix, but again, the number is entirely made up.

The parts are cut out of the file in this order (the offset span is skipped and does not appear in the training sample):

[prefix][middle][offset][suffix]

len(prefix) < 2048
len(middle) <= 257
offset < 512
len(suffix) < 1024
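
To make the format concrete, a single rendered training sample could look roughly like this (made-up code snippet; the tokenizer's EOS token is appended after <|endoftext|>, and the skipped offset characters, here the line defining n, simply do not appear):

<|fim_prefix|>def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    xs = sor<|fim_suffix|>    return xs[n // 2]<|fim_middle|>ted(xs)<|endoftext|>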

I have not done any overly scientific ablation studies to validate the choices I've made, except for a small test with Qwen2.5-Coder-0.5B, which was not great when finetuned: it was able to recite samples from the training data, but could not generalize very well.

  • How long should I train, and at which context size? I trained the 7B model for about 10 hours on 60k samples on a V100 overnight, which seemed to work okay, but maybe more or less training would be better.
  • Should the training data be filtered by some more advanced criteria? I keep most of my code in a large repository, about 2000 files with a total size of around 11 MB. I excluded unsloth files, some autogenerated files, and very tiny files, but did no filtering otherwise.
  • Which rank to choose for the LoRA? I chose 64 because it seemed like a nice number.
  • Should lora_alpha be adjusted?
  • How can packing in SFTTrainer be made to work? It only seemed to make training much slower.
  • Most of these questions could be answered with a validation dataset, but I have not checked whether one exists, and I do not have the compute to check all possible variations anyway (a rough sketch of wiring up a held-out split follows right after this list).
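
For that last point, something along these lines could be bolted onto the train.py below (untested sketch; it reuses dataset, model, tokenizer and max_seq_length from train.py, and the evaluation_strategy argument may be called eval_strategy in newer transformers releases):

split = dataset.train_test_split(test_size = 0.05, seed = 3407)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = split["train"],
    eval_dataset = split["test"],  # held-out samples, reported as eval loss
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        evaluation_strategy = "steps",
        eval_steps = 200,
        logging_steps = 50,
        output_dir = "outputs",
        report_to = "none",
    ),
)

A cleaner setup would hold out whole files rather than individual samples, since samples cut from the same file overlap heavily and make the validation loss look better than it is.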

I'll attach my training code here as a starting point, but it requires some modifications to be usable. Probably still better than nothing. It is mostly copied from https://colab.research.google.com/drive/1Kose-ucXO1IBaZq5BvbwWieuubP7hxvQ

dev_dataset.py

from pathlib import Path
import random

# adjust this to the directory with the Python files you want to train on
train_dir = "../.."

texts = []
for path in Path(train_dir).rglob("*.py"):
    if "unsloth" in str(path): continue
    if path.name.startswith("__"): continue

    text = path.read_text()

    # Small files are probably not interesting
    if len(text) < 100: continue

    texts.append(text)

def apply_fim_template(prefix, suffix, middle=None, eos_token=None):
    # Without middle: builds a FIM prompt for inference. With middle: a full training sample.
    text = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

    if middle is not None:
        text += f"{middle}<|endoftext|>" + eos_token

    return text

def yield_samples(n, eos_token, debug=False):
    # Yields n dicts of the form {"text": <FIM-formatted training sample>}.
    rng = random.Random(0)

    for _ in range(n):
        text = rng.choice(texts)

        start_middle = rng.randrange(len(text) - 1)

        # Complete current line
        end_middle = start_middle + 1
        for _ in range(256):
            if end_middle >= len(text) or text[end_middle] == "\n": break
            end_middle += 1

        # make prefix start up to 2048 characters before middle
        start_prefix = max(0, start_middle - rng.randrange(2048))

        # make suffix start way after middle
        suffix_offset = rng.randrange(512)

        start_suffix = min(end_middle + suffix_offset, len(text))
        end_suffix = min(start_suffix + rng.randrange(1024), len(text))

        prefix = text[start_prefix:start_middle]
        middle = text[start_middle:end_middle]
        suffix = text[start_suffix:end_suffix]

        text = apply_fim_template(prefix, suffix, middle, eos_token)

        if debug:
            print("#" * 80)
            print(red(prefix) + green(middle) + yellow(suffix))
            print()

        yield {"text": text}

def red(text):
    return f"\x1b[31m{text}\x1b[0m"

def green(text):
    return f"\x1b[32m{text}\x1b[0m"

def yellow(text):
    return f"\x1b[33m{text}\x1b[0m"

if __name__ == "__main__":
    for _ in yield_samples(5, eos_token="<TODO eos token>", debug=True):
        pass
    print("files:", len(texts))
    print(sum(len(t) for t in texts))

train.py

from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
import dev_dataset
from dev_dataset import red, green, yellow

def gen():
    quick_test = False
    if quick_test:
        # For quick testing; run this first to ensure that saving at the end works
        # (llama.cpp failed to compile automatically on some of my computers).
        yield from dev_dataset.yield_samples(n=10, eos_token=tokenizer.eos_token)
    else:
        yield from dev_dataset.yield_samples(n=60_000, eos_token=tokenizer.eos_token)

max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    # More models: https://huggingface.co/unsloth
    model_name = "unsloth/Qwen2.5-Coder-7B",
    max_seq_length = max_seq_length,
    #dtype = torch.bfloat16, # bfloat16 not supported by V100
    dtype = torch.float16,
    load_in_4bit = True,
)

# Build the dataset only after the tokenizer is loaded, since gen() needs tokenizer.eos_token.
dataset = Dataset.from_generator(gen)

model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16, # effective LoRA scaling factor is lora_alpha / r, i.e. 16/64 here
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,

        # choose either
        num_train_epochs = 1, # Set this for 1 full training run.
        #max_steps = 60,

        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

trainer_stats = trainer.train()

FastLanguageModel.for_inference(model) # Enable inference

# Test on some strings which occur in my code to see whether the model completes them correctly
for prefix, suffix in [
    ['os.path.expanduser("../../../data/rh', "def"],
    ['print(f"{gb', "def"],
    ['extern "C"\n__glob', "}"],
    ['from cup', "def"],
]:
    input_text = dev_dataset.apply_fim_template(prefix, suffix)

    inputs = tokenizer([input_text], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens=64)

    # Decode only the newly generated tokens; slicing the decoded full sequence by
    # len(input_text) can be off when the decoder strips special tokens such as the FIM markers.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    new_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    print(red(prefix) + green(new_text) + yellow(suffix))
    print()

model.save_pretrained_gguf("model_quantized", tokenizer, quantization_method = "q4_k_s")
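
In case the GGUF export at the end fails (llama.cpp did not compile automatically on some of my computers), the LoRA adapter can still be saved and reloaded separately; a rough sketch with made-up directory names:

# Save only the LoRA adapter and tokenizer (small, no llama.cpp needed)
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Later: reload the adapter on top of the base model for inference
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = max_seq_length,
    dtype = torch.float16,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)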
