Releases: huggingface/trl
v0.23.1
What's Changed
- ♨️ [GRPO] Fix potential hang in get_high_entropy_mask by @akakakakakaa in #4041
- Aux loss is already included in the loss returned by Transformers by @pramodith in #4078
- Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in #4081
- 🐯 fix: use_liger_kernel with IterableDataset by @jue-jue-zi in #4087
- [SFTrainer]: Fix DFT Loss by @pramodith in #4112
- ⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in #4170
New Contributors
Full Changelog: v0.23.0...v0.23.1
v0.23.0
Major
🥓 Context Parallelism
SFT now supports Context Parallelism (CP) for training large language models on very large sequences. You can now train with an arbitrarily long sequence length.
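Context parallelism itself is enabled at launch time through Accelerate rather than through SFTConfig, so the training script stays a regular SFT script. A minimal sketch of the script side, assuming it is launched with an Accelerate configuration that sets a context-parallel degree (the model, dataset, and sequence length below are illustrative):
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Plain SFT script; context parallelism shards each long sequence across GPUs
# according to the Accelerate launch configuration.
trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",
    args=SFTConfig(max_length=65536),  # illustrative long sequence length
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()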

🧨 Dynamic Fine-Tuning
Dynamic Fine-Tuning (DFT) is now supported in TRL.
from trl import SFTConfig
training_args = SFTConfig(
loss_type="dft",
...
)

by @qgallouedec in #4042
🪵 Truncated Importance Sampling (TIS) to address rollout-training mismatch
Different implementations are used for rollout generation (vLLM) and model training. This implementation gap implicitly turns on-policy RL into off-policy RL. Truncated Importance Sampling (TIS) is a simple yet effective importance sampling technique for handling this discrepancy. It is now implemented in GRPO.
from trl import GRPOConfig
training_args = GRPOConfig(
...
use_vllm=True,
vllm_importance_sampling_correction=True, # default True
vllm_importance_sampling_cap=2.0, # hyper-parameter C
)
by @LeonEricsson in #3867
🥣 [SFTTrainer]: Add Aux Loss for MoE models
Mixture of Experts (MoE) models require an auxiliary loss to ensure that the different experts are used evenly. This auxiliary loss is now supported in SFTTrainer.
from trl import SFTConfig
training_args = SFTConfig(
model_init_kwargs={"output_router_logits": True},
...
)
by @pramodith in #4012
💤 [GRPO/RLOO] Adds an option to sleep vllm when running in colocated mode
When running GRPO (or RLOO) with vLLM in colocated mode, the vLLM engine consumes VRAM during the optimization step while not being used. We now have an option to put vLLM to sleep during optimization to free up VRAM.
from trl import GRPOConfig
training_args = GRPOConfig(..., vllm_sleep_enabled=True)
by @edbeeching in #3968
⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer
You can now use vLLM server mode with OnlineDPOTrainer. Additionally, VLM models are now supported.
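A minimal sketch of enabling server mode, assuming OnlineDPOConfig mirrors GRPOConfig's vLLM options (treat the use_vllm and vllm_mode names as assumptions for your TRL version):
from trl import OnlineDPOConfig

# Assumption: "server" mode connects to a vLLM instance launched separately
# with `trl vllm-serve`, instead of colocating vLLM with the trainer.
training_args = OnlineDPOConfig(
    use_vllm=True,
    vllm_mode="server",
)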
Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations
The paper index has been significantly enhanced with the addition of 9+ new algorithm implementations, providing a more comprehensive resource for users.
by @behroozazarkhalili in #3990
Other Notable Changes
- 👷 Added Kernels on the Hub x TRL guide by @sergiopaniego in #3969
- 🌵 Refactor entropy_from_logits for memory efficiency by @qgallouedec in #4013
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #3978
- 👮 Fix GRPO CLI by setting parameters for get_soft_overlong_punishment by @qgallouedec in #3972
- 🪃 args.gradient_checkpointing = False instead of args = dataclasses.replace(args, gradient_checkpointing=False) by @qgallouedec in #3981
- [GRPO] Adds an option to sleep vllm when running in colocated mode by @edbeeching in #3968
- 🎯 Add Trackio integration documentation and update TOC by @qgallouedec in #3971
- ⚖️ Fix scale_rewards issue in GRPO by @Peter-Chou in #3992
- ⏰ fix: add return to shift_tokens_right by @ginkyenglee in #3987
- Add pre-commit and hf-doc-builder as dev dependencies by @albertvillanova in #3993
- [GRPO] Truncated Importance Sampling to address rollout-training mismatch by @LeonEricsson in #3867
- Fixed tags shown problem in memory usage docs by @sergiopaniego in #3999
- ✖️ Support pad-to-multiple-of and padding-free by @qgallouedec in #3996
- 💾 [bugfix] fix PPO save_checkpoint by @hjh0119 in #3998
- [GRPO]: Fix Multi-GPU training for Entropy based masking of tokens. by @pramodith in #3964
- 📏 torch_dtype to dtype everywhere by @sergiopaniego in #4000
- Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations by @behroozazarkhalili in #3990
- [SFT] fix: collator docstring by @LeonEricsson in #4011
- 👷 Added Kernels on the Hub x TRL guide by @sergiopaniego in #3969
- 🌵 Refactor entropy_from_logits for memory efficiency by @qgallouedec in #4013
- [SFTTrainer]: Add Aux Loss for MoE models. by @pramodith in #4012
- Add missing doc strings in SFTrainer by @pramodith in #4003
- ⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer by @vaelev in #3783
- Fix typo in GRPO quickstart by @dwisdom0 in #4020
- Align docstring parameters with function definitions by @albertvillanova in #4017
- Fix formatting errors in docstrings by @albertvillanova in #4025
- [doc] Paper index for Truncated Importance Sampling by @LeonEricsson in #4026
- [doc] Group paper index by trainer by @LeonEricsson in #4027
- Add missing trainer docstrings by @albertvillanova in #4030
- Add autodoc for AlignPropTrainer and AlignPropConfig by @albertvillanova in #4033
- 🥓 [docs] add CP docs by @kashif in #3994
- ⚖️ Remove average_tokens_across_devices default replacement by @qgallouedec in #4039
- CI hotfix: xfail test_training_with_transformers_paged by @albertvillanova in #4046
- Update transformers minimum version to 4.56.1 by @albertvillanova in #4047
- 🧨 DFT by @qgallouedec in #4042
- Update VLM arch check to AutoModelForImageTextToText for DPO and Online DPO by @sergiopaniego in #4049
- 🏂 Fix label shifting logic in SFTTrainer for compatibility with CP by @qgallouedec in #4038
- Add autodoc for BestOfNSampler and improve docstrings by @albertvillanova in #4034
- ✨ Improve SFT doc by @qgallouedec in #4005
- 💬 Remove setting chat template in sft script by @qgallouedec in #4037
- 🪪 Update SFTTrainer to handle labels correctly and add configuration example in paper index by @qgallouedec in #4051
- 🗜 Hotfix: avoid passing quantization_config=None by @qgallouedec in #4019
- Release: 0.23 by @qgallouedec in #4053
New Contributors
- @Peter-Chou made their first contribution in #3992
- @ginkyenglee made their first contribution in #3987
- @albertvillanova made their first contribution in #3993
- @hjh0119 made their first contribution in #3998
- @vaelev made their first contribution in #3783
- @dwisdom0 made their first contribution in #4020
Full Changelog: v0.22.0...v0.23.0
v0.22.2
What's Changed
- ⚖️ Fix scale_rewards issue in GRPO by @Peter-Chou in #3992
- ⏰ fix: add return to shift_tokens_right by @ginkyenglee in #3987
- ✖️ Support pad-to-multiple-of and padding-free by @qgallouedec in #3996
New Contributors
- @Peter-Chou made their first contribution in #3992
Full Changelog: v0.22.1...v0.22.2
v0.22.1
What changed
- Refactor version retrieval to use importlib.metadata by @qgallouedec
- Release: 0.22.1 by @qgallouedec
Full Changelog: v0.22.0...v0.22.1
v0.22.0
Major
🔮 Native VLM support for SFTTrainer
SFTTrainer now natively supports Vision-Language Models (VLMs). This includes support for both language modeling and prompt-completion data, as well as completion-only training.

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
trainer = SFTTrainer(
model="Qwen/Qwen2.5-VL-3B-Instruct",
args=SFTConfig(max_length=None),
train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()
by @qgallouedec in #3862, #3907 and #3908
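For the completion-only case, here is a minimal sketch, assuming SFTConfig's completion_only_loss flag and a prompt-completion-formatted VLM dataset (the dataset name is a placeholder):
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder prompt-completion VLM dataset; substitute your own.
dataset = load_dataset("my-org/vlm-prompt-completion", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(
        max_length=None,
        completion_only_loss=True,  # assumption: loss computed on completion tokens only
    ),
    train_dataset=dataset,
)
trainer.train()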
🔥 RLOOTrainer refactor
RLOOTrainer has been refactored to align with the design principles of the other trainers in the library. You can now use this trainer exactly like GRPO.
from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer
dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
"""Reward function that rewards completions with more unique letters."""
completion_contents = [completion[0]["content"] for completion in completions]
return [float(len(set(content))) for content in completion_contents]
trainer = RLOOTrainer(
model="Qwen/Qwen2-0.5B-Instruct",
reward_funcs=reward_num_unique_letters,
train_dataset=dataset,
)
trainer.train()
by @shirinyamani in #3801
🧭 HF jobs x TRL guide
You can now leverage Hugging Face Jobs to easily train and deploy your models with TRL.
hf jobs uv run --flavor a100-large --secrets HF_TOKEN "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara
A guide is available in the docs.
by @sergiopaniego in #3890
🏌️ DAPO loss type
GRPOTrainer now supports the DAPO loss type, which aggregates token-level losses by normalizing with the number of active tokens in the global accumulated batch. This method was introduced to eliminate length bias. Simply use:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
loss_type="dapo",
...
)
by @qgallouedec in #3938
🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch
The authors of Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO) find that the combination of:
- scaling rewards by the standard deviation computed over the entire batch and
- aggregating loss over the total number of tokens
can unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.
You can apply these findings in TRL when training with GRPO:
from trl import GRPOConfig
training_args = GRPOConfig(
scale_rewards="batch",
loss_type="dapo",
...
)
by @pramodith in #3935
🎢 [Callbacks] BEMA
Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can now use the BEMACallback:
from trl import BEMACallback, SFTTrainer
trainer = SFTTrainer(
...
callbacks=[BEMACallback()],
)
Minor
- 🎀 New defaults: gradient_checkpointing=True by @qgallouedec in #3510
- 🎚️ Add dataset mixer by @lewtun in #3791
- 💇 Add soft overlong punishment reward function and update documentation by @qgallouedec in #3804
- 🗿 [CPO] Add AlphaPO method via CPOTrainer by @kashif in #3824
- 🗳️ Extend BCO Trainer dataset format support by @reihig-ut in #3134
- 🐯 Support assistant-only training and Liger by @qgallouedec in #3914
- 🎆 Add entropy logging in SFT by @qgallouedec in #3940
- 📸 Return position_ids for flash_attention_3 by @jue-jue-zi in #3942
Deprecations
- 🗑️ Deprecate setup_chat_format by @qgallouedec in #3929
- 🗑 Deprecate IterativeSFTTrainer by @qgallouedec in #3905
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #3850
- 🔗 Fix collection link in doc by @qgallouedec in #3852
- Typo fix in new model description by @sergiopaniego in #3854
- Small style fix in README by @qgallouedec in #3861
- [GRPO] 👁️ Fix vLLM server mode for VLM GRPO training incompatibility for certain AutoProcessors by @ghubnerr in #3832
- 👁️ From AutoModelForVision2Seq to AutoModelForImageTextToText by @qgallouedec in #3836
- 👋 Remove --bf16 value in scripts by @sergiopaniego in #3869
- 🎀 New defaults: gradient_checkpointing=True by @qgallouedec in #3510
- 🦦 Validate vllm_mode param in GRPO by @sergiopaniego in #3866
- 🎚️ Add dataset mixer by @lewtun in #3791
- ✨ Integrate PEFT model preparation across trainers and utilities by @qgallouedec in #3882
- ⌨️ Add py.typed by @cyyever in #3841
- 💇 Add soft overlong punishment reward function and update documentation by @qgallouedec in #3804
- 🕹️ [GRPO] Fix vllm mode validation in distributed setting by @Kirill-Kravtsov in #3886
- ⏳ Replaced unittest.TestCase with TrlTestCase that handles tmp dir by @qgallouedec in #3863
- 🔮 Native VLM support for SFTTrainer by @qgallouedec in #3862
- Minor optimizations in SFT. by @pramodith in #3884
- 🧩 Fix reward_processing_classes validation in GRPOTrainer by @chi2liu in #3876
- 🎢 [Callbacks] BEMA by @kashif in #3855
- 👁️ VLM blog by @qgallouedec in #3899
- 🪄 Improve quickstart documentation with updated API examples by @behroozazarkhalili in #3873
- 👔 HF Doc Builder style by @qgallouedec in #3498
- ✏️ Fix SFTTrainer token accuracy computation with PromptEncoder by @zk-quantum in #3821
- ☑️ Check eval batch size in grpo by @jp1924 in #3889
- ⚔️ Optimize truncate_with_protected_tokens to use vectorized operations by @chi2liu in #3875
- Add tests for get_position_ids_from_packed_seq_lengths by @pramodith in #3883
- 🌳 Enhance segment tree implementation for non-power-of-2 values by @MengAiDev in #3888
- ⚡ Optimize completion_ids list conversion in GRPO trainer by @chi2liu in #3874
- 🗿 [CPO] Add AlphaPO method via CPOTrainer by @kashif in #3824
- 🗳️ Extend BCO Trainer dataset format support by @reihig-ut in #3134
- 🐯 Support assistant-only training and Liger by @qgallouedec in #3914
- 🗑 Deprecate IterativeSFTTrainer by @qgallouedec in #3905
- ♻️ use_cache should be set in the forward pass by @qgallouedec in #3891
- 🌓 SFTTrainer for VLM: Support for prompt-completion data by @qgallouedec in #3907
- ➡️ SFTTrainer for VLM: support completion-only loss by @qgallouedec in #3908
- 📚 Update BEMACallback documentation to ignore docstyle and fix lag parameter description by @qgallouedec in #3917
- ✏️ Fix typos by @cyyever in #3921
- 🧹 Clean SFT tests by @qgallouedec in #3922
- 🤹♂️ Multi-image testing dataset by @qgallouedec in #3916
- 🧾 Use logger.warning instead of warnings.warn by @qgallouedec in #3923
- ♻️ Reuse multimodal message preparation from SFTTrainer in GRPOTrainer by @MengAiDev in #3919
- 🗑️ Deprecate setup_chat_format by @qgallouedec in #3929
- 🗞 bugfix 'TrainerState' object is not subscriptable by @ErezYosef in https://github.com/huggingf...
v0.21.0
Major and breaking
🌺 OpenAI GPT OSS & Harmony support

OpenAI GPT OSS models are here! Check out the OpenAI Cookbook to see an example of how to SFT these models.
by @qgallouedec in #3848
Add vLLM transformers backend to online methods
You can now pass vllm_model_impl to the TRL vLLM server. For example, for the transformers backend:
trl vllm-serve ... --vllm_model_impl transformers
by @merveenoyan in #3773
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #3793
- Fix broken PEFT+TRL docs link in using_llama_models.md by @bwook00 in #3794
- 🐙 Add MPO VLM example script by @sergiopaniego in #3799
- Examples list updated in docs by @sergiopaniego in #3806
- Add vLLM transformers backend to online methods by @merveenoyan in #3773
- Correction parameter description by @1787648106 in #3803
- Add GSPO script examples (VLM/LLM) by @sergiopaniego in #3810
- add xpu support for mergekit by @yao-matrix in #3800
- GSPO parameters update from v2 by @BounharAbdelaziz in #3798
- fix CI docs and grpo slow test by @kashif in #3814
- Performance optimization: Replace list comprehensions with tensor operations in BCO and KTO trainers by @chi2liu in #3813
- Improve trainer doc by @qgallouedec in #3818
- Add 'Post training a VLM for reasoning with GRPO using TRL' recipe to Community tutorials by @sergiopaniego in #3843
- [GRPO]: Fix Entropy Mask Threshold Calculation when using Multi-GPU training by @pramodith in #3833
- 🪦 Remove deprecated by @qgallouedec in #3817
- 🌺 OpenAI GPT OSS & Harmony support by @qgallouedec in #3848
- Release: v0.21 by @qgallouedec in #3849
New Contributors
- @bwook00 made their first contribution in #3794
- @merveenoyan made their first contribution in #3773
- @1787648106 made their first contribution in #3803
- @BounharAbdelaziz made their first contribution in #3798
- @chi2liu made their first contribution in #3813
Full Changelog: v0.20.0...v0.21.0
v0.20.0
Breaking and major changes
🎞️ GSPO
GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token.

📜 Paper: https://huggingface.co/papers/2507.18071
To reproduce the paper's setting, use this configuration:
from trl import GRPOConfig
training_args = GRPOConfig(
importance_sampling_level="sequence",
loss_type="grpo",
steps_per_generation=...,
beta=0.04, # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
epsilon=3e-4, # https://x.com/ChujieZheng/status/1948933507696525392
)
by @qgallouedec in #3775
👁️ [GRPO] Add VLM training capabilities to the GRPO trainer

The GRPOTrainer can now be used for VLM training. Give a try with this dummy example:
from trl import GRPOTrainer
from datasets import load_dataset
# Dummy vision-language dataset
dataset = load_dataset("trl-internal-testing/zen-image", "conversational_prompt_only", split="train")
# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
return [len(set(c[0]["content"])) for c in completions]
trainer = GRPOTrainer(
model="Qwen/Qwen2.5-VL-3B-Instruct",
reward_funcs=[reward_num_unique_chars],
train_dataset=dataset,
)
trainer.train()
by @CompN3rd and @kashif in #3072 and #3760
🐙 MPO

The DPO trainer supports combining multiple loss functions with different weights, enabling more sophisticated optimization strategies. This is particularly useful for implementing algorithms like MPO (Mixed Preference Optimization). MPO is a training approach that combines multiple optimization objectives, as described in the paper Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization.
To combine multiple losses, specify the loss types and corresponding weights as lists:
from trl import DPOConfig
# MPO: Combines DPO (sigmoid) for preference and BCO (bco_pair) for quality
training_args = DPOConfig(
loss_type=["sigmoid", "bco_pair", "sft"], # Loss types to combine
loss_weights=[0.8, 0.2, 1.0] # Corresponding weights, as used in the MPO paper
)
by @qgallouedec in #2544
Add support for CB with native transformers
Continuous Batching allows for faster generation using the transformers backend. You can now use it with the GRPOTrainer by setting use_transformers_paged=True in the config.
from trl import GRPOConfig
training_args = GRPOConfig(
# ... other args
use_transformers_paged=True,
)
by @ArthurZucker in #3471
Add entropy based filtering inside the GRPOTrainer

In Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, it is shown that using only the top 20% highest-entropy tokens leads to performance similar to using all tokens. You can now enable this feature in the GRPOTrainer by setting top_entropy_quantile in the config.
from trl import GRPOConfig
training_args = GRPOConfig(
# ... other args
top_entropy_quantile=0.2, # Use only the top 20% of tokens based on entropy
)
by @pramodith in #3563
👐 FSDP2+GRPO
GRPO now supports FSDP2 training. Just run your script with an FSDP2 config:
accelerate launch --config_file examples/accelerate_configs/fsdp2.yaml run_grpo.py
by @SalmanMohammadi in #3687
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #3626
- fix grpo generation_kwargs by @ahatamiz in #3634
- fixing num_processes by @shirinyamani in #3637
- env var for vllm colocate exp added by @shirinyamani in #3638
- Update dpo_vlm.py by @Clement25 in #3629
- ☕️ GRPO script reward_funcs error by @tcapelle in #3639
- 🤝 validate gradient_accumulation_steps vs steps_per_generation for on-policy GRPO by @HarryHsing in #3493
- Add entropy based filtering inside the GRPOTrainer. by @pramodith in #3563
- Make sure chat template isn't lost when truncating prompt. by @pramodith in #3651
- Add paranthesis to correct the check. by @pramodith in #3658
- Add support for CB with native transformers by @ArthurZucker in #3471
- feat: Pass trainer state to reward functions by @seungduk-yanolja in #3669
- Enable completion-only loss in SFTTrainer when using Liger Kernel by @kswhitecross in #3674
- Add mlflow support for generate_during_eval DPOTrainer by @dhruvmullick in #3660
- [SFT] drop attention_mask if we have position ids for fa2 by @kashif in #3673
- Faster position_ids computation for FFD packing by @mariosasko in #3649
- Support datasets 4 by @lhoestq in #3688
- Update steps_per_generation default description grpo_config.py by @wa008 in #3685
- Fix non-serializable torch.dtype bug in VLLM weight sync by @CarlosArguilar in #3690
- fix: support dict access in SFT Trainer by @jannisborn in #3677
- [fix] type error of quantile by @gitabtion in #3667
- [CI] Fix slow grpo CI by @kashif in #3693
- Restore the effect of liger_kernel's monkey_patch on global modules in UT. by @YangKai0616 in #3680
- Add type hints to dpo_trainer.py by @bvantuan in #3631
- Fix mislabeling: "First-fit decreasing" is actually "Best-fit-decreasing" by @LeonEricsson in #3696
- ✂️ [BUG when vllm and prompt_truncation are used]: Strip out pad tokens in truncated prompt text by @pramodith in #3698
- 📣 Use explicit version for checking datasets version by @qgallouedec in #3702
- 🔭 Fix package discovery configuration in setup.cfg by @qgallouedec in #3703
- [SFT] Add seq_lengths to signature columns by @LeonEricsson in #3699
- ⚗️ Tiny MoE for test by @qgallouedec in #3712
- BUG: Disregard pad token entropies for entropy threshold calculation by @pramodith in #3715
- Fix ORPOTrainer loss scaling with gradient accumulation by @Aratako in #3716
- [Online DPO] Safeguard logit slice against empty prompt by @LeonEricsson in #3719
- Remove deprecated processor.tokenizer by @Tavish9 in #3720
- 👋 Remove --bf16 flag from training scripts by @qgallouedec in #3724
- ↔️ Fix CB in GRPO by @qgallouedec in #3722
- 📥 Set environment variables for vLLM distributed training in GRPOTrainer by @qgallouedec in #3723
- [GRPO] remove common activation offloading substring in all cases by @winglian in #3738
- 🔧 Fix GRPO sampling logic by @qgallouedec in #3725
- 🕸 Use wandb.run.url instead of wandb.run.get_url() (deprecated) by @qgallouedec in #3726
- Updated processing_class docs for trainers by @sergiopaniego in #3737
- Updated missing processing_class docs for rest of trainers by @sergiopaniego in #3745
- Add comment for average_tokens_across_devices by @qgallouedec in #3746
- uses steps_per_generation in vllm max_num_seqs by @akakakakakaa in #3747
- 🏗️ Refactor top-entropy in GRPO by @qgallouedec in #3727
- [GRPO] Fix: Processing ref logprobs in batches by @idanshen in #3740
- Add Object detection grounding recipe to Community tutorials by @sergiopaniego in #3752
- 🐙 MPO by @qga...
v0.19.1
What's Changed
- fix grpo generation_kwargs by @ahatamiz in #3634
- Make sure chat template isn't lost when truncating prompt. by @pramodith in #3651
- Add paranthesis to correct the check. by @pramodith in #3658
- [SFT] drop attention_mask if we have position ids for fa2 by @kashif in #3673
- Support datasets 4 by @lhoestq in #3688
- 📣 Use explicit version for checking datasets version by @qgallouedec in #3702
- Fix non-serializable torch.dtype bug in VLLM weight sync by @CarlosArguilar in #3690
- ✂️ [BUG when vllm and prompt_truncation are used]: Strip out pad tokens in truncated prompt text by @pramodith in #3698
New Contributors
- @ahatamiz made their first contribution in #3634
- @lhoestq made their first contribution in #3688
- @CarlosArguilar made their first contribution in #3690
Full Changelog: v0.19.0...v0.19.1
v0.19.0
Breaking and major changes
🧰 [SFT] Tool support
SFTTrainer now supports training with tools! You just have to add a column tools to your dataset, which contains a list of tool definitions as JSON schemas. The tools will be automatically registered and can be used in the training process.
from datasets import Dataset
from transformers.utils import get_json_schema
from trl import SFTTrainer
# Fictitious functions to simulate tool calls
def start_timer(duration: int) -> int:
"""
Starts a timer for the specified duration in seconds.
Args:
duration: Duration in seconds to set the timer for.
Returns:
The duration set for the timer.
"""
return duration
def create_reminder(time: str, note: str) -> str:
"""
Creates a reminder for the specified time and note.
Args:
time: The time for the reminder.
note: The note for the reminder.
Returns:
A confirmation message indicating that the reminder has been set.
"""
return "I'll remind you to call mom at 7 PM."
# Define the JSON schemas for the tools
start_timer = get_json_schema(start_timer)
create_reminder = get_json_schema(create_reminder)
dataset = Dataset.from_dict({
"messages": [
[
{"role": "user", "content": "Set a timer for 10 minutes."},
{"role": "assistant", "tool_calls": [{"type": "function", "function": {"name": "start_timer", "arguments": {"duration": 600}}}]},
{"role": "tool", "name": "start_timer", "content": "600"},
{"role": "assistant", "content": "Timer set for 10 minutes."},
],
...,
],
"tools": [
[start_timer, create_reminder],
...,
]
})
# Initialize the trainer
trainer = SFTTrainer(model="Qwen3-0.6B", train_dataset=dataset)
# Train the model
trainer.train()
by @qgallouedec in #3597
📉 FFD packing
We introduce a new packing method: FFD (First Fit Decreasing) packing. This method is designed to optimize the packing of sequences in a way that more efficiently reduces the size of the training dataset by grouping examples more effectively. Previously, we used a wrapped packing method, which often truncated sequences even when they were not longer than the maximum sequence length. The new FFD packing method avoids unnecessary truncation by grouping sequences more intelligently. This new packing strategy is now the default when packing is enabled.
training_args = SFTConfig(..., packing=True)
by @qgallouedec in #3521 and accelerated by @mariosasko in #3537
[Liger] liger DPO support
The DPOTrainer now supports the Liger-powered DPO loss, enabling faster training with lower memory usage.
training_args = DPOConfig(..., use_liger_loss=True)
💬 Fix setup_chat_format and add clone_chat_template
We introduce clone_chat_template, a more convenient and flexible function for setting up chat templates from any tokenizer that already includes one. It handles EOS tokens and copies all added tokens from the source tokenizer, preserving their "special" status.
You can either use this function directly:
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import clone_chat_template
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-4B")
or use the chat_template_path parameter in SFTConfig to specify a chat template, which will be automatically cloned when the SFTTrainer is initialized.
from trl import SFTConfig
training_args = SFTConfig(chat_template_path="Qwen/Qwen3-4B")
by @qgallouedec in #3404 and #3599
📚 SFTTrainer supports chat template kwargs
SFTTrainer now supports passing additional keyword arguments to the chat template. This allows for more flexibility in customizing the chat format during training. To enable it, just add a chat_template_kwargs column to your dataset.
example = {'messages': [{'content': 'What is better than ugly?', 'role': 'user'},
                        {'content': 'Beautiful.', 'role': 'assistant'}],
           'chat_template_kwargs': {'my_template_arg': 'my_value'}}
by @qgallouedec in #3609
🤵♂️ SFT on assistant messages only
The SFTTrainer now supports training on assistant messages only:
example = {'messages': [
{'role': 'user', 'content': 'What is better than ugly?'}, # masked in the loss
{'role': 'assistant', 'content': 'Beautiful.'}, # used in the loss
{'role': 'user', 'content': 'And what is better than implicit?'}, # masked in the loss
{'role': 'assistant', 'content': 'Explicit.'}, # used in the loss
]}
by @qgallouedec in #3586
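A minimal sketch of turning this on, assuming the assistant_only_loss flag in SFTConfig (treat the flag name as an assumption for your TRL version):
from trl import SFTConfig

# Assumption: with assistant_only_loss=True, user and tool turns are masked
# out of the loss and only assistant messages contribute to the gradient.
training_args = SFTConfig(assistant_only_loss=True)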
🧬 Add generation_kwargs as a property of GRPOConfig to support additional generation arguments
The GRPOConfig now includes a generation_kwargs property, allowing users to specify additional generation arguments for the GRPOTrainer. This allows for further customization of the generation behavior, such as setting suppress_tokens, num_beams, etc.
Depending on the generation backend used (transformers or vLLM), this property will be passed either to transformers.GenerationConfig (if using transformers) or vllm.SamplingParams (if using vLLM).
from trl import GRPOConfig
training_args = GRPOConfig(..., generation_kwargs={"length_penalty": -0.1})
by @pramodith in #3617
New defaults
- 🎀 New default: beta=0.0 for GRPO by @qgallouedec in #3516
- 🎀 New defaults: preparing the new structure by @qgallouedec in #3530
- 🎀 New defaults: logging_steps=10 by @qgallouedec in #3514
- 🎀 [SFT][Bugfix] sets average_tokens_across_devices to true in SFTConfig by @edbeeching in #3538
- 🎀 New defaults: bf16=True by @qgallouedec in #3515
Minor changes
- Add support for IterableDataset in DPO Trainer by @h-tonywu in #3559
- 🔖 Fix: ensure user-provided labels are retained in self._signature_columns by @sxndqc in #3589
- ⭐ Add vllm_gpu_memory_utilization recommendation script by @toslali-ibm in #3554
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #3505
- 📎 Fix clip ratio logging by @qgallouedec in #3506
- 📚 Fix doc building by removing vLLM from dev dependencies in setup.cfg by @qgallouedec in #3511
- 🧭 Patch release guide by @qgallouedec in #3512
- 🎀 New default: beta=0.0 for GRPO by @qgallouedec in #3516
- Add "🐯 Liger GRPO meets TRL" by @qgallouedec in #3525
- 📉 FFD packing by @qgallouedec in #3521
- 🎀 New defaults: preparing the new structure by @qgallouedec in #3530
- 🪦 RIP trl chat by @shirinyamani in #3531
- 🎀 New defaults: logging_steps=10 by @qgallouedec in #3514
- 📰 Add blog "No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" by @qgallouedec in #3527
- 🎯 Don't use getattr to get gradient_checkpointing by @qgallouedec in #3535
- 🧭 Remove useless transformers version checks by @qgallouedec in #3534
- 🐳 Add DeepseekV3 model configurations and update tests for new models by @qgallouedec in #3536
- 💭 [Data] Fix DeepSeek-R1 case by @kashif in #3522
- 🎀 [SFT][Bugfix] sets average_tokens_across_devices to true in SFTConfig by @edbeeching in #3538
- ⚡ Faster FFD packing by @mariosasko in #3537
- 📦 Packing with flash attn kwargs to avoid cross-contamination by @thepowerfuldeez in #3526
- 💽 [TRLParser] Fail when unknown args are provided in the config file. by @edbeeching in #3543
- 🛋️ Fix CI and bump accelerate by @qgallouedec in #3551
- 🧮 Rearrange DPOTrainer by @DaizeDong in #3501
- 🆙 Bump transformers to 4.51 and use _VALID_DICT_FIELDS by @qgallouedec in #3553
- Update tests_latest.yml by @qgallouedec in #3558
- ℹ️ Unify autocast behavior to torch.autocast and make it cover XPU by @yao-matrix in #3541
- Fix dev version by @Tavish9 in #3570
- [Lig...
v0.18.2
What's Changed
- 🏗️ Add test for training with multiple dataloader workers and update worker initialization for compatibility with transformers 4.52.0 by @qgallouedec in #3568
Full Changelog: v0.18.1...v0.18.2