Releases · huggingface/trl

02 Oct 05:20

qgallouedec

v0.23.1

4529a1c

v0.23.1 Latest

Latest

What's Changed

♨️ [GRPO] Fix potential hang in get_high_entropy_mask by @akakakakakaa in #4041
Aux loss is already included in the loss returned by Transformers by @pramodith in #4078
Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in #4081
🐯 fix: use_liger_kernel with IterableDataset by @jue-jue-zi in #4087
[SFTrainer]: Fix DFT Loss by @pramodith in #4112
⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in #4170

New Contributors

@Hoesu made their first contribution in #4081

Full Changelog: v0.23.0...v0.23.1

Contributors

akakakakakaa, pramodith, and 3 other contributors

Assets 2

10 Sep 04:39

qgallouedec

v0.23.0

6adfd13

v0.23.0

Major

🥓 Context Parallelism

SFT now supports Context Parallelism (CP) for training large language models on very large sequences. You can now train with an arbitrarily long sequence length.

by @kashif in #3994

🧨 Dynamic Fine-Tuning

Dynamic Fine-Tuning (DFT) is a nnow supported in TRL.

from trl import SFTConfig

training_args = SFTConfig(
    loss_type="dft",
    ...
)

by @qgallouedec in #4042

🪵 Truncated Importance Sampling (TIS) to address rollout-training mismatch

Different implementations are used for rollout generation (vLLM) and model training. The implementation gap implicitly turns the on-policy RL to be off-policy. Truncated Importance Sampling (TIS) a simple yet effective importance sampling technique for handling such discrepancy. This is now implemented in GRPO.

from trl import GRPOConfig

training_args = GRPOConfig(
    ...
    use_vllm=True,
    vllm_importance_sampling_correction=True, # default True
    vllm_importance_sampling_cap=2.0, # hyper-parameter C
)

by @LeonEricsson in #3867

🥣 [SFTTrainer]: Add Aux Loss for MoE models

Mixture of Experts (MoE) models require an auxiliary loss to ensure that the different experts are used evenly. This auxiliary loss is now supported in SFTTrainer.

training_args = SFTConfig(
    model_init_kwargs={"output_router_logits": True},
    ...
)

by @pramodith in #4012

💤 [GRPO/RLOO] Adds an option to sleep vllm when running in colocated mode

When running GRPO (or RLOO) with vLLM in colocated mode, the vLLM server consume VRAM during optimization while not being used. We now have an option to put the vLLM server to sleep during optimization to free up VRAM.

from trl import GRPOConfig

training_args = GRPOConfig(..., vllm_sleep_enabled=True)

by @edbeeching in #3968

⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer

You can now use vLLM server mode with OnlineDPOTrainer. Additionally, VLM models are now supported.

by @vaelev in #3783

Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations

The paper index has been significantly enhanced with the addition of 9+ new algorithm implementations, providing a more comprehensive resource for users.

by @behroozazarkhalili in #3990

Other Notable Changes

👷 Added Kernels on the Hub x TRL guide by @sergiopaniego in #3969
🌵 Refactor entropy_from_logits for memory efficiency by @qgallouedec in #4013

What's Changed

⬆️ Bump dev version by @qgallouedec in #3978
👮 Fix GRPO CLI by setting parameters for get_soft_overlong_punishment by @qgallouedec in #3972
🪃 args.gradient_checkpointing = False instead of args = dataclasses.replace(args, gradient_checkpointing=False) by @qgallouedec in #3981
[GRPO] Adds an option to sleep vllm when running in colocated mode by @edbeeching in #3968
🎯 Add Trackio integration documentation and update TOC by @qgallouedec in #3971
⚖️ Fix scale_rewards issue in GRPO by @Peter-Chou in #3992
⏰ fix: add return to shift_tokens_right by @ginkyenglee in #3987
Add pre-commit and hf-doc-builder as dev dependencies by @albertvillanova in #3993
[GRPO] Truncated Importance Sampling to address rollout-training mismatch by @LeonEricsson in #3867
Fixed tags shown problem in memory usage docs by @sergiopaniego in #3999
✖️ Support pad-to-multiple-of and padding-free by @qgallouedec in #3996
💾 [bugfix] fix PPO save_checkpoint by @hjh0119 in #3998
[GRPO]: Fix Multi-GPU training for Entropy based masking of tokens. by @pramodith in #3964
📏 torch_dype to dtype everywhere by @sergiopaniego in #4000
Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations by @behroozazarkhalili in #3990
[SFT] fix: collator docstring by @LeonEricsson in #4011
👷 Added Kernels on the Hub x TRL guide by @sergiopaniego in #3969
🌵 Refactor entropy_from_logits for memory efficiency by @qgallouedec in #4013
[SFTTrainer]: Add Aux Loss for MoE models. by @pramodith in #4012
Add missing doc strings in SFTrainer by @pramodith in #4003
⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer by @vaelev in #3783
Fix typo in GRPO quickstart by @dwisdom0 in #4020
Align docstring parameters with function definitions by @albertvillanova in #4017
Fix formatting errors in docstrings by @albertvillanova in #4025
[doc] Paper index for Truncated Importance Sampling by @LeonEricsson in #4026
[doc] Group paper index by trainer by @LeonEricsson in #4027
Add missing trainer docstrings by @albertvillanova in #4030
Add autodoc for AlignPropTrainer and AlignPropConfig by @albertvillanova in #4033
🥓 [docs] add CP docs by @kashif in #3994
⚖️ Remove average_tokens_across_devices default replacement by @qgallouedec in #4039
CI hotfix: xfail test_training_with_transformers_paged by @albertvillanova in #4046
Update transformers minimum version to 4.56.1 by @albertvillanova in #4047
🧨 DFT by @qgallouedec in #4042
Update VLM arch check to AutoModelForImageTextToText for DPO and Online DPO by @sergiopaniego in #4049
🏂 Fix label shifting logic in SFTTrainer for compatibility with CP by @qgallouedec in #4038
Add autodoc for BestOfNSampler and improve docstrings by @albertvillanova in #4034
✨ Improve SFT doc by @qgallouedec in #4005
💬 Remove setting chat template in sft script by @qgallouedec in #4037
🪪 Update SFTTrainer to handle labels correctly and add configuration example in paper index by @qgallouedec in #4051
🗜 Hotfix: avoid passing quantization_config=None by @qgallouedec in #4019
Release: 0.23 by @qgallouedec in #4053

New Contributors

@Peter-Chou made their first contribution in #3992
@ginkyenglee made their first contribution in #3987
@albertvillanova made their first contribution in #3993
@hjh0119 made their first contribution in #3998
@vaelev made their first contribution in #3783
@dwisdom0 made their first contribution in #4020

Full Changelog: v0.22.0...v0.23.0

Contributors

kashif, dwisdom0, and 11 other contributors

Assets 2

03 Sep 14:44

qgallouedec

v0.22.2

2d597e4

v0.22.2

What's Changed

⚖️ Fix scale_rewards issue in GRPO by @Peter-Chou in #3992
⏰ fix: add return to shift_tokens_right by @ginkyenglee in #3987
✖️ Support pad-to-multiple-of and padding-free by @qgallouedec in #3996

New Contributors

@Peter-Chou made their first contribution in #3992

Full Changelog: v0.22.1...v0.22.2

Contributors

Peter-Chou, ginkyenglee, and qgallouedec

Assets 2

29 Aug 22:11

qgallouedec

v0.22.1

1366bac

v0.22.1

What changed

Refactor version retrieval to use importlib.metadata by @qgallouedec
Release: 0.22.1 by @qgallouedec

Full Changelog: v0.22.0...v0.22.1

Contributors

qgallouedec

Assets 2

29 Aug 22:07

qgallouedec

v0.22.0

3a6b365

v0.22.0

Major

🔮 Native VLM support for `SFTTrainer`

SFTTrainer now natively supports Vision-Language Models (VLMs). This includes support for both languauge modeling, prompt-completion data.
It also supports training on completion-only.

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None),
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()

by @qgallouedec in #3862, #3907 and #3908

🔥 `RLOOTrainer` refactor

RLOOTrainer has been refactored to align with the design principles of other other trainers in the library. You can now use this trainer exactly like GRPO.

from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer

dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]

trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_letters,
    train_dataset=dataset,
)
trainer.train()

by @shirinyamani in #3801

🧭 HF jobs x TRL guide

You can now levarage Hugging Face Jobs to easily train and deploy your models with TRL.

hf jobs uv run --flavor a100-large --secrets HF_TOKEN "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara

A guide is available in the docs.

by @sergiopaniego in #3890

🏌️ DAPO loss type

GRPOTrainer now supports DAPO loss type, which aggregates token-level losses by normalizing with the number of active token in the global accumulated batch. This method was introduced to eliminate length bias. Simply use

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    loss_type="dapo",
    ...
)

by @qgallouedec in #3938

🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch

The authors of Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO) find that the combination of:

scaling rewards by the standard deviation computed over the entire batch and
aggregating loss over the total number of tokens

can unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.

TRL supports using these learnings to train a GRPO model by:

from trl import GRPOConfig

training_args = GRPOConfig(
    scale_rewards="batch",
    loss_type="dapo",
    ...
)

by @pramodith in #3935

🎢 [Callbacks] BEMA

Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can now use the [BEMACallback]:

from trl import BEMACallback, SFTTrainer

trainer = SFTTrainer(
    ...
    callbacks=[BEMACallback()],
)

by @kashif in #3855

Minor

🎀 New defaults: gradient_checkpointing=True by @qgallouedec in #3510
🎚️ Add dataset mixer by @lewtun in #3791
💇 Add soft overlong punishment reward function and update documentation by @qgallouedec in #3804
🗿 [CPO] Add AlphaPO method via CPOTrainer by @kashif in #3824
🗳️ Extend BCO Trainer dataset format support by @reihig-ut in #3134
🐯 Support assistant-only training and Liger by @qgallouedec in #3914
🎆 Add entropy logging in SFT by @qgallouedec in #3940
📸 Return position_ids for flash_attention_3 by @jue-jue-zi in #3942

Deprecations

🗑️ Deprecate setup_chat_format by @qgallouedec in #3929
🗑 Deprecate IterativeSFTTrainer by @qgallouedec in #3905

What's Changed

⬆️ Bump dev version by @qgallouedec in #3850
🔗 Fix collection link in doc by @qgallouedec in #3852
Typo fix in new model description by @sergiopaniego in #3854
Small style fix in README by @qgallouedec in #3861
[GRPO] 👁️ Fix vLLM server mode for VLM GRPO training incompatibility for certain AutoProcessors by @ghubnerr in #3832
👁️ From AutoModelForVision2Seq to AutoModelForImageTextToText by @qgallouedec in #3836
👋 Remove --bf16 value in scripts by @sergiopaniego in #3869
🎀 New defaults: gradient_checkpointing=True by @qgallouedec in #3510
🦦 Validate vllm_mode param in GRPO by @sergiopaniego in #3866
🎚️ Add dataset mixer by @lewtun in #3791
✨ Integrate PEFT model preparation across trainers and utilities by @qgallouedec in #3882
⌨️ Add py.typed by @cyyever in #3841
💇 Add soft overlong punishment reward function and update documentation by @qgallouedec in #3804
🕹️ [GRPO] Fix vllm mode validation in distributed setting by @Kirill-Kravtsov in #3886
⏳ Replaced unittest.TestCase with TrlTestCase that handles tmp dir by @qgallouedec in #3863
🔮 Native VLM support for SFTTrainer by @qgallouedec in #3862
Minor optimizations in SFT. by @pramodith in #3884
🧩 Fix reward_processing_classes validation in GRPOTrainer by @chi2liu in #3876
🎢 [Callbacks] BEMA by @kashif in #3855
👁️ VLM blog by @qgallouedec in #3899
🪄 Improve quickstart documentation with updated API examples by @behroozazarkhalili in #3873
👔 HF Doc Builder style by @qgallouedec in #3498
✏️ Fix SFTTrainer token accuracy computation with PromptEncoder by @zk-quantum in #3821
☑️ Check eval batch size in grpo by @jp1924 in #3889
⚔️ Optimize truncate_with_protected_tokens to use vectorized operations by @chi2liu in #3875
Add tests for get_position_ids_from_packed_seq_lengths by @pramodith in #3883
🌳 Enhance segment tree implementation for non-power-of-2 values by @MengAiDev in #3888
⚡ Optimize completion_ids list conversion in GRPO trainer by @chi2liu in #3874
🗿 [CPO] Add AlphaPO method via CPOTrainer by @kashif in #3824
🗳️ Extend BCO Trainer dataset format support by @reihig-ut in #3134
🐯 Support assistant-only training and Liger by @qgallouedec in #3914
🗑 Deprecate IterativeSFTTrainer by @qgallouedec in #3905
♻️ use_cache should be set in the forward pass by @qgallouedec in #3891
🌓 SFTTrainer for VLM: Support for prompt-completion data by @qgallouedec in #3907
➡️ SFTTrainer for VLM: support completion-only loss by @qgallouedec in #3908
📚 Update BEMACallback documentation to ignore docstyle and fix lag parameter description by @qgallouedec in #3917
✏️ Fix typos by @cyyever in #3921
🧹 Clean SFT tests by @qgallouedec in #3922
🤹‍♂️ Multi-image testing dataset by @qgallouedec in #3916
🧾 Use logger.warning instead of warnings.warn by @qgallouedec in #3923
♻️ Reuse multimodal message preparation from SFTTrainer in GRPOTrainer by @MengAiDev in #3919
🗑️ Deprecate setup_chat_format by @qgallouedec in #3929
🗞 bugfix 'TrainerState' object is not subscriptable by @ErezYosef in https://github.com/huggingf...

Contributors

kashif, yao-matrix, and 20 other contributors

Assets 2

05 Aug 17:01

qgallouedec

v0.21.0

46d09bd

v0.21.0

Major and breaking

🌺 OpenAI GPT OSS & Harmony support

Open AI GPT OSS models are here! Check out the OpenAI Cookbook to see an example of how to SFT these models.

by @qgallouedec in #3848

Add vLLM transformers backend to online methods

You can now pass vllm_model_impl to the TRL vLLM server.
Example, for transformers backend:

trl vllm-serve ... --vllm_model_impl transformers

by @merveenoyan in #3773

What's Changed

⬆️ Bump dev version by @qgallouedec in #3793
Fix broken PEFT+TRL docs link in using_llama_models.md by @bwook00 in #3794
🐙 Add MPO VLM example script by @sergiopaniego in #3799
Examples list updated in docs by @sergiopaniego in #3806
Add vLLM transformers backend to online methods by @merveenoyan in #3773
Correction parameter description by @1787648106 in #3803
Add GSPO script examples (VLM/LLM) by @sergiopaniego in #3810
add xpu support for mergekit by @yao-matrix in #3800
GSPO parameters update from v2 by @BounharAbdelaziz in #3798
fix CI docs and grpo slow test by @kashif in #3814
Performance optimization: Replace list comprehensions with tensor operations in BCO and KTO trainers by @chi2liu in #3813
Improve trainer doc by @qgallouedec in #3818
Add 'Post training a VLM for reasoning with GRPO using TRL' recipe to Community tutorials by @sergiopaniego in #3843
[GRPO]: Fix Entropy Mask Threshold Calculation when using Multi-GPU training by @pramodith in #3833
🪦 Remove deprecated by @qgallouedec in #3817
🌺 OpenAI GPT OSS & Harmony support by @qgallouedec in #3848
Release: v0.21 by @qgallouedec in #3849

New Contributors

@bwook00 made their first contribution in #3794
@merveenoyan made their first contribution in #3773
@1787648106 made their first contribution in #3803
@BounharAbdelaziz made their first contribution in #3798
@chi2liu made their first contribution in #3813

Full Changelog: v0.20.0...v0.21.0

Contributors

kashif, yao-matrix, and 8 other contributors

Assets 2

29 Jul 04:59

qgallouedec

v0.20.0

30576d2

v0.20.0

Breaking and major changes

🎞️ GSPO

GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token.

📜 Paper: https://huggingface.co/papers/2507.18071

To reproduce the paper's setting, use this configuration:

from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    steps_per_generation=...,
    beta=0.04,  # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
    epsilon=3e-4,  # https://x.com/ChujieZheng/status/1948933507696525392
)

by @qgallouedec in #3775

👁️ [GRPO] Add VLM training capabilities to the GRPO trainer

The GRPOTrainer can now be used for VLM training. Give a try with this dummy example:

from trl import GRPOTrainer
from datasets import load_dataset

# Dummy vision-language dataset
dataset = load_dataset("trl-internal-testing/zen-image", "conversational_prompt_only", split="train")

# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c[0]["content"])) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    reward_funcs=[reward_num_unique_chars],
    train_dataset=dataset,
)

trainer.train()

by @CompN3rd and @kashif in #3072 in #3760

🐙 MPO

The DPO trainer supports combining multiple loss functions with different weights, enabling more sophisticated optimization strategies. This is particularly useful for implementing algorithms like MPO (Mixed Preference Optimization). MPO is a training approach that combines multiple optimization objectives, as described in the paper Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization.

To combine multiple losses, specify the loss types and corresponding weights as lists:

from trl import DPOConfig

# MPO: Combines DPO (sigmoid) for preference and BCO (bco_pair) for quality
training_args = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],  # Loss types to combine
    loss_weights=[0.8, 0.2, 1.0]  # Corresponding weights, as used in the MPO paper
)

by @qgallouedec in #2544

Add support for CB with native transformers

Continuous Batching allows for faster generation using the transformers backend. You can now use it with the GRPOTrainer by setting use_transformers_paged=True in the config.

use_transformers_paged = True
from trl import GRPOConfig
training_args = GRPOConfig(
    # ... other args
    use_transformers_paged=Ture,
)

by @ArthurZucker in #3471

Add entropy based filtering inside the GRPOTrainer

In Beyond the 80/20 Rule: High-Entropy Minority Tokens
Drive Effective Reinforcement Learning for LLM Reasoning, it is shown that utilizing only 20% of the highest entropy tokens leads to similar performance as using all tokens. You can now enable this feature in the GRPOTrainer by setting entropy_filtering=True in the config.

from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    top_entropy_quantile=0.2,  # Use only the top 20% of tokens based on entropy
)

by @pramodith in #3563

👐 FSDP2+GRPO

GRPO now supports FSDP2 training. Just run your script with an FSDP2 config:

accelerate launch --config_file examples/accelerate_configs/fsdp2.yaml run_grpo.py

by @SalmanMohammadi in #3687

What's Changed

⬆️ Bump dev version by @qgallouedec in #3626
fix grpo generation_kwargs by @ahatamiz in #3634
fixing num_processes by @shirinyamani in #3637
env var for vllm colocate exp added by @shirinyamani in #3638
Update dpo_vlm.py by @Clement25 in #3629
☕️ GRPO script reward_funcs error by @tcapelle in #3639
🤝 validate gradient_accumulation_steps vs steps_per_generation for on-policy GRPO by @HarryHsing in #3493
Add entropy based filtering inside the GRPOTrainer. by @pramodith in #3563
Make sure chat template isn't lost when truncating prompt. by @pramodith in #3651
Add paranthesis to correct the check. by @pramodith in #3658
Add support for CB with native transformers by @ArthurZucker in #3471
feat: Pass trainer state to reward functions by @seungduk-yanolja in #3669
Enable completion-only loss in SFTTrainer when using Liger Kernel by @kswhitecross in #3674
Add mlflow support for generate_during_eval DPOTrainer by @dhruvmullick in #3660
[SFT] drop attention_mask if we have position ids for fa2 by @kashif in #3673
Faster position_ids computation for FFD packing by @mariosasko in #3649
Support datasets 4 by @lhoestq in #3688
Update steps_per_generation default description grpo_config.py by @wa008 in #3685
Fix non-serializable torch.dtype bug in VLLM weight sync by @CarlosArguilar in #3690
fix: support dict access in SFT Trainer by @jannisborn in #3677
[fix] type error of quantile by @gitabtion in #3667
[CI] Fix slow grpo CI by @kashif in #3693
Restore the effect of liger_kernel's monkey_patch on global modules in UT. by @YangKai0616 in #3680
Add type hints to dpo_trainer.py by @bvantuan in #3631
Fix mislabeling: "First-fit decreasing" is actually "Best-fit-decreasing" by @LeonEricsson in #3696
✂️ [BUG when vllm and prompt_truncation are used]: Strip out pad tokens in truncated prompt text by @pramodith in #3698
📣 Use explicit version for checking datasets version by @qgallouedec in #3702
🔭 Fix package discovery configuration in setup.cfg by @qgallouedec in #3703
[SFT] Add seq_lengths to signature columns by @LeonEricsson in #3699
⚗️ Tiny MoE for test by @qgallouedec in #3712
BUG: Disregard pad token entropies for entropy threshold calculation by @pramodith in #3715
Fix ORPOTrainer loss scaling with gradient accumulation by @Aratako in #3716
[Online DPO] Safeguard logit slice against empty prompt by @LeonEricsson in #3719
Remove deprecated processor.tokenizer by @Tavish9 in #3720
👋 Remove --bf16 flag from training scripts by @qgallouedec in #3724
↔️ Fix CB in GRPO by @qgallouedec in #3722
📥 Set environment variables for vLLM distributed training in GRPOTrainer by @qgallouedec in #3723
[GRPO] remove common activation offloading substring in all cases by @winglian in #3738
🔧 Fix GRPO sampling logic by @qgallouedec in #3725
🕸 Use wandb.run.url instead of wandb.run.get_url() (deprecated) by @qgallouedec in #3726
Updated processing_class docs for trainers by @sergiopaniego in #3737
Updated missing processing_class docs for rest of trainers by @sergiopaniego in #3745
Add comment for average_tokens_across_devices by @qgallouedec in #3746
uses steps_per_generation in vllm max_num_seqs by @akakakakakaa in #3747
🏗️ Refactor top-entropy in GRPO by @qgallouedec in #3727
[GRPO] Fix: Processing ref logprobs in batches by @idanshen in #3740
Add Object detection grounding recipe to Community tutorials by @sergiopaniego in #3752
🐙 MPO by @qga...

Contributors

kashif, winglian, and 28 other contributors

Assets 2

08 Jul 01:07

qgallouedec

v0.19.1

accf738

v0.19.1

What's Changed

fix grpo generation_kwargs by @ahatamiz in #3634
Make sure chat template isn't lost when truncating prompt. by @pramodith in #3651
Add paranthesis to correct the check. by @pramodith in #3658
[SFT] drop attention_mask if we have position ids for fa2 by @kashif in #3673
Support datasets 4 by @lhoestq in #3688
📣 Use explicit version for checking datasets version by @qgallouedec in #3702
Fix non-serializable torch.dtype bug in VLLM weight sync by @CarlosArguilar in #3690
✂️ [BUG when vllm and prompt_truncation are used]: Strip out pad tokens in truncated prompt text by @pramodith in #3698

New Contributors

@ahatamiz made their first contribution in #3634
@lhoestq made their first contribution in #3688
@CarlosArguilar made their first contribution in #3690

Full Changelog: v0.19.0...v0.19.1

Contributors

kashif, pramodith, and 4 other contributors

Assets 2

21 Jun 14:04

qgallouedec

v0.19.0

5b3ea9d

v0.19.0

Breaking and major changes

🧰 [SFT] Tool support

SFTTrainer now supports training with tools! You just have to add a column tools to your dataset, which contains a list of tool definitions as json schemas. The tools will be automatically registered and can be used in the training process.

from datasets import Dataset
from transformers.utils import get_json_schema
from trl import SFTTrainer

# Fictitious functions to simulate tool calls
def start_timer(duration: int) -> int:
    """
    Starts a timer for the specified duration in seconds.

    Args:
        duration: Duration in seconds to set the timer for.

    Returns:
        The duration set for the timer.
    """
    return duration

def create_reminder(time: str, note: str) -> str:
    """
    Creates a reminder for the specified time and note.

    Args:
        time: The time for the reminder.
        note: The note for the reminder.

    Returns:
        A confirmation message indicating that the reminder has been set.
    """
    return "I'll remind you to call mom at 7 PM."

# Define the JSON schemas for the tools
start_timer = get_json_schema(start_timer)
create_reminder = get_json_schema(create_reminder)

dataset = Dataset.from_dict({
    "messages": [
        [
            {"role": "user", "content": "Set a timer for 10 minutes."},
            {"role": "assistant", "tool_calls": [{"type": "function", "function": {"name": "start_timer", "arguments": {"duration": 600}}}]},
            {"role": "tool", "name": "start_timer", "content": "600"},
            {"role": "assistant", "content": "Timer set for 10 minutes."},
        ],
        ...,
    ],
    "tools": [
        [start_timer, create_reminder],
        ...,
    ]
})

# Initialize the trainer
trainer = SFTTrainer(model="Qwen3-0.6B", train_dataset=dataset)

# Train the model
trainer.train()

by @qgallouedec in #3597

📉 FFD packing

We introduce a new packing method: FFD (First Fit Decreasing) packing. This method is designed to optimize the packing of sequences in a way that more efficiently reduces the size of the training dataset by grouping examples more effectively. Previously, we used a wrapped packing method, which often truncated sequences even when they were not longer than the maximum sequence length. The new FFD packing method avoids unnecessary truncation by grouping sequences more intelligently. This new packing strategy is now the default when packing is enabled.

training_args = SFTConfig(..., packing=True)

by @qgallouedec in #3521 and accelerated by @mariosasko in #3537

[Liger] liger DPO support

The DPOTrainer now supports the Liger-powered DPO loss, enabling faster training with lower memory usage.

training_args = DPOConfig(..., use_liger_loss=True)

by @kashif in #2568

💬 Fix `setup_chat_format` and add `clone_chat_template`

We introduce clone_chat_template, a more convenient and flexible function for setting up chat templates from any tokenizer that already includes one. It handles EOS tokens and copies all added tokens from the source tokenizer, preserving their "special" status.
You can either use this function directly:

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import clone_chat_template

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-4B")

or use the chat_template_path parameter in SFTConfig to specify a chat template, which will be automatically cloned when the SFTTrainer is initialized.

from trl import SFTConfig

training_args = SFTConfig(chat_template_path="Qwen/Qwen3-4B")

by @qgallouedec in #3404 and #3599

📚 SFTTrainer support chat template kwargs

SFTTrainer now supports passing additional keyword arguments to the chat template. This allows for more flexibility in customizing the chat format during training. To enable it, just add a chat_template_kwargs column to your your dataset.

example = {'messages': [{'content': 'What is better than ugly?', 'role': 'user'},
                        {'content': 'Beautiful.', 'role': 'assistant'}]
           'chat_template_kwargs': {'my_template_arg': 'my_value'}}

by @qgallouedec in #3609

🤵‍♂️ SFT on assistant messages only

The SFTTrainer now supports training on assistant messages only

example = {'messages': [
    {'role': 'user', 'content': 'What is better than ugly?'},          # masked in the loss
    {'role': 'assistant', 'content': 'Beautiful.'},                    # used in the loss
    {'role': 'user', 'content': 'And what is better than implicit?'},  # masked in the loss
    {'role': 'assistant', 'content': 'Explicit.'},                     # used in the loss
]}

by @qgallouedec in #3586

🧬 Add `generation_kwargs` as a property of `GRPOConfig` to support additional generation arguments

The GRPOConfig now includes a generation_kwargs property, allowing users to specify additional generation arguments for the GRPOTrainer. This allows for further customization of the generation behavior, such as setting suppress_tokens, num_beams, etc.
Depending on the generation backend used (transformers or vLLM), this property will be passed either to transformers.GenerationConfig (if using transformers) or vllm.SamplingParams (if using vLLM).

from trl import GRPOConfig

training_args = GRPOConfig(..., generation_kwargs={"length_penalty": -0.1})

by @pramodith in #3617

New defaults

🎀 New default: beta=0.0 for GRPO by @qgallouedec in #3516
🎀 New defaults: preparing the new structure by @qgallouedec in #3530
🎀 New defaults: logging_steps=10 by @qgallouedec in #3514
🎀 [SFT][Bugfix] sets average_tokens_across_devices to true in SFTConfig by @edbeeching in #3538
🎀 New defaults: bf16=True by @qgallouedec in #3515

Minor changes

Add support for IterableDataset in DPO Trainer by @h-tonywu in #3559
🔖 Fix: ensure user-provided labels are retained in self._signature_columns by @sxndqc in #3589
⭐ Add vllm_gpu_memory_utilization recommendation script by @toslali-ibm in #3554

What's Changed

⬆️ Bump dev version by @qgallouedec in #3505
📎 Fix clip ratio logging by @qgallouedec in #3506
📚 Fix doc building by removing vLLM from dev dependencies in setup.cfg by @qgallouedec in #3511
🧭 Patch release guide by @qgallouedec in #3512
🎀 New default: beta=0.0 for GRPO by @qgallouedec in #3516
Add "🐯 Liger GRPO meets TRL" by @qgallouedec in #3525
📉 FFD packing by @qgallouedec in #3521
🎀 New defaults: preparing the new structure by @qgallouedec in #3530
🪦 RIP trl chat by @shirinyamani in #3531
🎀 New defaults: logging_steps=10 by @qgallouedec in #3514
📰 Add blog "No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" by @qgallouedec in #3527
🎯 Don't use getattr to get gradient_checkpointing by @qgallouedec in #3535
🧭 Remove useless transformers version checks by @qgallouedec in #3534
🐳 Add DeepseekV3 model configurations and update tests for new models by @qgallouedec in #3536
💭 [Data] Fix DeepSeek-R1 case by @kashif in #3522
🎀 [SFT][Bugfix] sets average_tokens_across_devices to true in SFTConfig by @edbeeching in #3538
⚡ Faster FFD packing by @mariosasko in #3537
📦 Packing with flash attn kwargs to avoid cross-contamination by @thepowerfuldeez in #3526
💽 [TRLParser] Fail when unknown args are provided in the config file. by @edbeeching in #3543
🛋️ Fix CI and bump accelerate by @qgallouedec in #3551
🧮 Rearrange DPOTrainer by @DaizeDong in #3501
🆙 Bump transformers to 4.51 and use _VALID_DICT_FIELDS by @qgallouedec in #3553
Update tests_latest.yml by @qgallouedec in #3558
ℹ️ Unify autocast behavior to torch.autocast and make it cover XPU by @yao-matrix in #3541
Fix dev version by @Tavish9 in #3570
[Lig...

Contributors

kashif, ajtejankar, and 21 other contributors

Assets 2

15 Jun 22:15

qgallouedec

v0.18.2

a21a925

v0.18.2

What's Changed

🏗️ Add test for training with multiple dataloader workers and update worker initialization for compatibility with transformers 4.52.0 by @qgallouedec in #3568

Full Changelog: v0.18.1...v0.18.2

Contributors

qgallouedec

Assets 2

Releases: huggingface/trl

v0.23.1

What's Changed

New Contributors

Contributors

Uh oh!

v0.23.0

Major

🥓 Context Parallelism

🧨 Dynamic Fine-Tuning

🪵 Truncated Importance Sampling (TIS) to address rollout-training mismatch

🥣 [SFTTrainer]: Add Aux Loss for MoE models

💤 [GRPO/RLOO] Adds an option to sleep vllm when running in colocated mode

⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer

Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations

Other Notable Changes

What's Changed

New Contributors

Contributors

Uh oh!

v0.22.2

What's Changed

New Contributors

Contributors

Uh oh!

v0.22.1

What changed

Contributors

Uh oh!

v0.22.0

Major

🔮 Native VLM support for SFTTrainer

🔥 RLOOTrainer refactor

🧭 HF jobs x TRL guide

🏌️ DAPO loss type

🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch

🎢 [Callbacks] BEMA

Minor

Deprecations

What's Changed

Contributors

Uh oh!

v0.21.0

Major and breaking

🌺 OpenAI GPT OSS & Harmony support

Add vLLM transformers backend to online methods

What's Changed

New Contributors

Contributors

Uh oh!

v0.20.0

Breaking and major changes

🎞️ GSPO

👁️ [GRPO] Add VLM training capabilities to the GRPO trainer

🐙 MPO

Add support for CB with native transformers

Add entropy based filtering inside the GRPOTrainer

👐 FSDP2+GRPO

What's Changed

Contributors

Uh oh!

v0.19.1

What's Changed

New Contributors

Contributors

Uh oh!

v0.19.0

Breaking and major changes

🧰 [SFT] Tool support

📉 FFD packing

[Liger] liger DPO support

💬 Fix setup_chat_format and add clone_chat_template

📚 SFTTrainer support chat template kwargs

🤵‍♂️ SFT on assistant messages only

🧬 Add generation_kwargs as a property of GRPOConfig to support additional generation arguments

New defaults

Minor changes

What's Changed

Contributors

Uh oh!

🔮 Native VLM support for `SFTTrainer`

🔥 `RLOOTrainer` refactor

💬 Fix `setup_chat_format` and add `clone_chat_template`

🧬 Add `generation_kwargs` as a property of `GRPOConfig` to support additional generation arguments