
TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization

This repository contains the code for our paper TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization.

Overview

This work introduces a framework for incorporating token-level reward guidance into preference optimization. Experimental results demonstrate that TGDPO achieves substantial performance improvements over DPO and SimPO, with win-rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard.
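For reference, TGDPO builds on the standard sequence-level DPO objective shown below; the token-level reward guidance enters at the per-token level, and the exact TGDPO loss is derived in the paper.

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]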

Installation

Environment preparation:

conda env create -f environment.yml
conda activate tgdpo
pip install -e ".[torch,metrics]"
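
Optionally, you can sanity-check that GPU-enabled PyTorch was installed correctly (a generic check, not a step documented by the repository):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"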

Dataset Preparation

We provide the training data at the following links:

After downloading the training data, update the corresponding dataset paths in data/dataset_info.json.
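
For illustration, a preference-dataset entry in data/dataset_info.json typically follows LLaMA-Factory's conventions, along the lines of the sketch below; the dataset name, file name, and column names here are placeholders, so match them to the files you actually downloaded:

"my_preference_data": {
  "file_name": "my_preference_data.json",
  "ranking": true,
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}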

Token-level Reward Model Preparation

You can use models trained with DPO, SimPO, or other RLHF algorithms on the datasets above as the token-level reward models. You can also leverage any off-the-shelf open-source token-level reward models as guidance.

Training Scripts

The example training script is in examples/llama3_8b_instruct_tgdpo.yaml. The training config is set for 8x80GB GPUs. You will need to adjust model_name_or_path and ref_model to specify the base model (e.g., meta-llama/Meta-Llama-3-8B-Instruct), and set the path of the token-level reward model in tgdpo_reward_model.
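
As a rough sketch, the relevant part of such a config might look like the excerpt below. Only model_name_or_path, ref_model, and tgdpo_reward_model are named in this README; the remaining keys are typical LLaMA-Factory-style options and may differ from the actual examples/llama3_8b_instruct_tgdpo.yaml.

# illustrative excerpt -- consult examples/llama3_8b_instruct_tgdpo.yaml for the real config
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct   # policy model to be trained
ref_model: meta-llama/Meta-Llama-3-8B-Instruct            # frozen reference model
tgdpo_reward_model: /path/to/token-level-reward-model     # e.g., a DPO- or SimPO-trained checkpoint
dataset: my_preference_data                               # assumed name; must match data/dataset_info.json
template: llama3                                          # assumed chat template
output_dir: saves/llama3-8b-instruct-tgdpo                # assumed output path
per_device_train_batch_size: 1                            # assumed; tune for 8x80GB GPUs
gradient_accumulation_steps: 16                           # assumed
learning_rate: 5.0e-7                                     # assumed
bf16: true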

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ./examples/accelerate/fsdp_config.yaml \
    ./src/train.py \
    ./examples/llama3_8b_instruct_tgdpo.yaml

Acknowledgements

We would like to thank the authors of LLaMA-Factory for their excellent code base.

Citation

If you find this work useful, please consider citing:

@inproceedings{zhu2025tgdpo,
  title={{TGDPO}: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization},
  author={Mingkang Zhu and Xi Chen and Zhongdao Wang and Bei Yu and Hengshuang Zhao and Jiaya Jia},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=TKHWvyzR1t}
}
