Credit: the above image was generated by Sora
Paper link: arXiv
The success of DeepSeek-R1 has spotlighted GRPO (Group Relative Policy Optimization) as a key reinforcement learning method for large reasoning models. However, GRPO suffers from several key limitations, including entropy collapse and difficulty bias.
How can we design more effective optimization methods for reinforcing large reasoning models in a principled manner without inheriting the limitations of GRPO?
We analyzed GRPO and its variants (Dr. GRPO, DAPO, etc.) under a binary reward setting and uncovered two core insights:
- 🔍 GRPO has a surprising connection to discriminative learning techniques, particularly AUC maximization (a generic form of such an objective is shown below)
- ⚠️ GRPO suffers from question-level difficulty bias stemming from its underlying discriminative objective
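For reference, a generic pairwise AUC-maximization surrogate has the form below. This is textbook notation, not the exact objective derived in the paper: for a given question, let $\mathcal{P}$ and $\mathcal{N}$ be the generated answers with reward 1 and reward 0, and let $s_{\theta}(o)$ be a score the policy assigns to an answer $o$. One then minimizes

$$
\frac{1}{|\mathcal{P}|\,|\mathcal{N}|}\sum_{o^{+}\in\mathcal{P}}\;\sum_{o^{-}\in\mathcal{N}}\ell\big(s_{\theta}(o^{-})-s_{\theta}(o^{+})\big),
$$

where $\ell$ is a margin-based surrogate loss (e.g., a squared hinge), so the model is rewarded for scoring every positive answer above every negative one.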
DisCO is a new RL framework grounded in discriminative learning. It trains models by increasing scores for positive answers while decreasing those for negatives (a schematic sketch follows the list below), enabling:
- ⚡ Faster convergence
- 🔒 More stable optimization
- 🔁 Long-lasting training dynamics
- ❌ No more difficulty bias – replaces group-relative objective with discriminative objectives
- 🔄 No clipping operations – uses non-clipping scoring functions (e.g., log-likelihood, likelihood ratio) for smoother learning
- 📉 Stable training – via simple constrained optimization to keep KL divergence in check
- ⚖️ Handles sparse rewards – robust to imbalanced data with advanced discriminative approaches
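To make these bullets concrete, here is a minimal schematic sketch of a discriminative objective that uses a log-likelihood score and enforces the KL constraint via a penalty. This is our illustrative reading of the description above, not the trainer shipped in this repository (see the code under verl/ for that); the function name, the quadratic penalty, and the hyperparameter values are assumptions.

```python
# Schematic sketch only -- NOT the repository's implementation. The function name,
# the quadratic KL penalty, and the default hyperparameters are illustrative assumptions.
import torch


def disco_style_loss(logp_new, logp_old, is_positive, mask,
                     kl_budget=0.01, penalty_coef=10.0):
    """
    logp_new:    (B, T) per-token log-probs of sampled answers under the current policy
    logp_old:    (B, T) per-token log-probs of the same tokens under the old (sampling) policy
    is_positive: (B,)   1.0 if an answer received reward 1, else 0.0
    mask:        (B, T) 1.0 for valid (non-padding) answer tokens
    """
    lengths = mask.sum(dim=1).clamp(min=1.0)

    # Length-normalized log-likelihood as the scoring function (one of the
    # non-clipping choices named above; a likelihood ratio would be another).
    score = (logp_new * mask).sum(dim=1) / lengths

    # Discriminative term: push scores of positive answers up and negative answers down.
    pos, neg = is_positive, 1.0 - is_positive
    disc = ((score * pos).sum() / pos.sum().clamp(min=1.0)
            - (score * neg).sum() / neg.sum().clamp(min=1.0))

    # Monte Carlo estimate of KL(old || new) on the sampled tokens, kept under a
    # budget with a simple quadratic penalty (one way to realize the constraint).
    kl = ((logp_old - logp_new) * mask).sum() / mask.sum().clamp(min=1.0)
    return -disc + penalty_coef * torch.clamp(kl - kl_budget, min=0.0) ** 2
```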
On six math reasoning benchmarks with a 1.5B model, DisCO outperforms GRPO and its variants:
- +7% vs GRPO
- +6% vs DAPO
DisCO with an 8k response length is on par with, or even better than, GRPO with a 32k response length.
- DisCO (Log-L) finetuned DeepSeek-R1-Distill-Qwen-1.5B Model: DisCO-1.5B-logL
- DisCO (L-Ratio) finetuned DeepSeek-R1-Distill-Qwen-1.5B Model: DisCO-1.5B-Lratio
- DisCO (Log-L) finetuned DeepSeek-R1-Distill-Qwen-7B Model: DisCO-7B-logL
- DisCO (L-Ratio) finetuned DeepSeek-R1-Distill-Qwen-7B Model: DisCO-7B-Lratio
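For a quick start, the released checkpoints can be loaded like any Hugging Face causal LM. Below is a minimal sketch; the repo id is taken from the evaluation example further down, and the prompt and generation settings are placeholders.

```python
# Minimal quick-start sketch for a released checkpoint. The repo id comes from the
# evaluation example below; generation settings are placeholders, and device_map="auto"
# additionally requires the accelerate package. For best results with reasoning models,
# you will typically want to format the prompt with tokenizer.apply_chat_template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ganglii/DisCO-1.5B-logL"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "What is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```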
Comparison with baseline models and baseline methods for fine-tuning 1.5B models. OpenAI-o1-preview is included as a reference. MRL denotes the Max Response Length used in training/testing. The shaded models are trained by other works, and the shaded numbers are reported in their original works or in DeepScaleR. All other results are evaluated either on existing models or on the models trained by us using different approaches. Methods in the bottom area all fine-tune the DeepSeek-R1-Distill-Qwen-1.5B model on the same DeepScaleR dataset. DS is short for DeepSeek-R1; DSR is short for DeepScaleR.
Comparison with baseline models and baseline methods for fine-tuning 7B models. Methods in the bottom area all fine-tune the DeepSeek-R1-Distill-Qwen-7B model on the same DeepScaleR dataset.
Training dynamics of different methods: the left two panels are for fine-tuning the 1.5B model and the right two are for fine-tuning the 7B model. Panels (a) and (c) plot the training reward (averaged over generated outputs for the questions used in each step) vs. the number of training steps; panels (b) and (d) plot the generation entropy vs. training steps.
# Recommend Python 3.10.
conda create -n disco python=3.10
conda activate disco
git clone https://github.com/Optimization-AI/DisCO.git
cd DisCO
pip install -e ./verl
pip install -e ./deepscaler
pip install wandb
Datasets utilized in our training are included in the datasets folder. Feel free to adapt the file scripts/data/deepscaler_dataset.py to generate your own datasets.
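If you do build your own data, a low-effort first step is to inspect one of the provided files and match its columns. The sketch below assumes the files are parquet (as in typical verl/DeepScaleR pipelines); the path is a placeholder, and scripts/data/deepscaler_dataset.py remains the authoritative reference for the schema.

```python
# Sketch: inspect a provided dataset file to see which columns your own data must supply.
# The parquet format and the placeholder path are assumptions; adjust to the actual
# files in the datasets folder.
import pandas as pd

df = pd.read_parquet("datasets/<some_provided_file>.parquet")  # replace with a real file
print(df.columns.tolist())
print(df.iloc[0].to_dict())
```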
We provide training scripts for both single-node and multi-node setups in scripts/train/.
We start with one node for training 1.5B Qwen models with an 8k context, using 8 A100-80GB GPUs. For example, let's run the DisCO algorithm with log likelihood as the score function:
bash ./scripts/train/run_disco_logL_1.5b_8k.sh #### DisCO with `log likelihood`
# bash ./scripts/train/run_disco_Lratio_1.5b_8k.sh #### DisCO with `likelihood ratio`
# bash ./scripts/train/run_discob_logL_1.5b_8k.sh #### DisCO-b with `log likelihood`
# bash ./scripts/train/run_discob_Lratio_1.5b_8k.sh #### DisCO-b with `likelihood ratio`
To train with longer context or larger models, multi-node training is necessary. To achieve this, follow these steps:
- On the head node:
# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS
# Start Ray head node
ray start --head
- On each worker node:
# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS
# Connect to head node (replace with your head node's address)
ray start --address=[RAY_ADDRESS]
- Finally, on the head node, run the training script, such as:
bash ./scripts/train/run_disco_logL_7b_8k.sh
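Before launching the training script, you can verify that all worker nodes have joined the cluster by running `ray status` on the head node; it lists the nodes and the CPU/GPU resources Ray can see.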
Our evaluation script automatically runs vLLM to generate 16 samples for each problem. To run it, use:
./scripts/eval/eval_model.sh --model [CHECKPOINT_PATH] --datasets [DATASET1] [DATASET2] --output-dir [OUTPUT_DIR]
We report Pass@1 accuracy averaged over 16 samples for each problem. To replicate our reported numbers, for example, run:
./scripts/eval/eval_model.sh --model ganglii/DisCO-1.5B-logL --datasets aime aime25 math amc minerva olympiad_bench --output-dir ./val_results/DisCO-1.5B-logL
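To spell the metric out: each problem contributes the fraction of its 16 samples judged correct, and the benchmark score is the mean of these fractions over problems. The tiny illustration below (not part of the evaluation code) shows the arithmetic.

```python
# Illustration of "Pass@1 averaged over 16 samples" as described above: each problem
# contributes the fraction of its 16 samples judged correct, and the benchmark score
# is the mean of these fractions over problems.
import numpy as np

correct = np.array([
    [1] * 12 + [0] * 4,  # problem 1: 12/16 samples correct -> 0.75
    [1] * 16,            # problem 2: 16/16 correct         -> 1.00
    [0] * 16,            # problem 3:  0/16 correct         -> 0.00
])
print(correct.mean(axis=1).mean())  # (0.75 + 1.00 + 0.00) / 3 ≈ 0.583
```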
- Our training pipeline is built on the GitHub repository deepscaler. We thank the authors for open-sourcing their code.
If you find DisCO useful in your research, please consider citing the following paper:
@article{li2025disco,
title={DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization},
author={Li, Gang and Lin, Ming and Galanti, Tomer and Tu, Zhengzhong and Yang, Tianbao},
journal={arXiv preprint arXiv:2505.12366},
year={2025}
}