This repository contains the code for our paper TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization.
This work introduces a framework for incorporating token-level reward guidance into preference optimization. Experimental results demonstrate that TGDPO achieves substantial improvements over DPO and SimPO, with win-rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard.
Environment preparation:

```bash
conda env create -f environment.yml
conda activate tgdpo
pip install -e ".[torch,metrics]"
```
We provide the training data in the following links:
After downloading the training data, update the corresponding dataset paths in `data/dataset_info.json`, as shown in the sketch below.
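For reference, a local dataset entry in `data/dataset_info.json` might look like the following. This is a minimal sketch assuming LLaMA-Factory's pairwise-preference schema; the entry name, file path, and column names are placeholders, not values shipped with this repo.

```json
{
  "tgdpo_preference_data": {
    "file_name": "/path/to/downloaded/preference_data.json",
    "ranking": true,
    "columns": {
      "prompt": "prompt",
      "chosen": "chosen",
      "rejected": "rejected"
    }
  }
}
```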
You can use models trained with DPO, SimPO, or other RLHF algorithms on the datasets above as token-level reward models, or leverage any off-the-shelf open-source token-level reward model as guidance.
The example training script is in `examples/llama3_8b_instruct_tgdpo.yaml`. The training config is set for 8x80GB GPUs. You will need to adjust `model_name_or_path` and `ref_model` to specify the base model (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`), and set the path of the token-level reward model in `tgdpo_reward_model`; see the sketch below.
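For orientation, the relevant fields of the YAML config might look like the following. Only `model_name_or_path`, `ref_model`, and `tgdpo_reward_model` are named above; the exact layout of `examples/llama3_8b_instruct_tgdpo.yaml` may differ, and the reward-model path is a placeholder you should replace with your own.

```yaml
# model settings (illustrative values; replace the reward-model path with your own)
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct   # base policy model
ref_model: meta-llama/Meta-Llama-3-8B-Instruct            # reference model for the preference loss
tgdpo_reward_model: /path/to/token_level_reward_model     # e.g., a DPO/SimPO-trained checkpoint
```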
Run training with:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ./examples/accelerate/fsdp_config.yaml \
    ./src/train.py ./examples/llama3_8b_instruct_tgdpo.yaml
```
We would like to thank the authors of LLaMA-Factory for their excellent codebase.
If you find this work useful, please consider citing:
```bibtex
@inproceedings{zhu2025tgdpo,
  title={{TGDPO}: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization},
  author={Mingkang Zhu and Xi Chen and Zhongdao Wang and Bei Yu and Hengshuang Zhao and Jiaya Jia},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=TKHWvyzR1t}
}
```