Awesome RLVR — Reinforcement Learning with Verifiable Rewards


A curated collection of surveys, tutorials, codebases, and papers on
Reinforcement Learning with Verifiable Rewards (RLVR), a rapidly emerging
paradigm that aligns both LLMs and other agents through objective,
externally verifiable signals.


An overview of how Reinforcement Learning with Verifiable Rewards (RLVR) works. (Figure taken from “Tülu 3: Pushing Frontiers in Open Language Model Post-Training”)

Why RLVR?

RLVR couples reinforcement learning with objective, externally verifiable signals, yielding a training paradigm that is simultaneously powerful and trustworthy:

  • Ground-truth rewards – unit tests, formal proofs, or fact-checkers provide binary, tamper-proof feedback.
  • Intrinsic safety & auditability – every reward can be traced back to a transparent verifier run, simplifying debugging and compliance.
  • Strong generalization – models trained on verifiable objectives tend to extrapolate to unseen tasks with minimal extra data.
  • Emergent “aha-moments” – sparse, high-precision rewards encourage systematic exploration that often yields sudden surges in capability when the correct strategy is discovered.
  • Self-bootstrapping improvement – the agent can iteratively refine or even generate new verifiers, compounding its own learning signal.
  • Domain-agnostic applicability – the same recipe works for code generation, theorem proving, robotics, games, and more.
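The first bullet can be made concrete with a minimal sketch of a unit-test verifier that emits a binary, ground-truth reward. Everything here is illustrative: the `solution` entry-point name and the `(args, expected)` test format are assumptions for the example, not a convention of any particular framework.

```python
def unit_test_verifier(candidate_src: str, tests) -> float:
    """Binary reward: 1.0 only if the candidate passes every unit test.

    candidate_src: source code expected to define a function named `solution`
                   (an assumed entry-point name for this sketch).
    tests:         iterable of (args, expected) pairs.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        fn = namespace["solution"]
        for args, expected in tests:
            if fn(*args) != expected:
                return 0.0               # any failing test zeroes the reward
        return 1.0
    except Exception:
        return 0.0                       # crashes and syntax errors earn nothing


reward = unit_test_verifier("def solution(a, b): return a + b", [((2, 3), 5)])
print(reward)  # 1.0
```

Because the reward is computed by re-running the tests, it is binary and tamper-proof in the sense above: a completion cannot earn credit without actually passing the checks.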

How does it work?

  1. Sampling. Draw one or more candidate completions \( a_1, \dots, a_k \) from a policy model \( \pi_\theta \) given a prompt \( s \).
  2. Verification. A deterministic function \( r(s, a) \) checks each completion for correctness.
  3. Rewarding.
    • If a completion \( a \) is verifiably correct, it receives the reward \( r = \gamma \).
    • Otherwise the reward is \( r = 0 \).
  4. Policy update. Using the rewards, update the policy parameters \( \theta \) via RL (e.g., PPO).
  5. (Optional) Verifier refinement. The verifier itself can be trained, hardened, or expanded to cover new edge cases.

Through repeated iterations of this loop, the policy learns to maximise the externally verifiable reward while maintaining a clear audit trail for every decision it makes.
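The loop can be sketched end-to-end on a toy task. Everything below is an illustrative assumption rather than the recipe of any particular paper: a single arithmetic prompt, a ten-digit action space standing in for completions, \( \gamma = 1 \), and a plain REINFORCE update with a per-batch mean baseline in place of PPO.

```python
import math
import random

random.seed(0)

# Toy setup: the "prompt" asks for 2+3; candidate completions are the digits 0-9.
PROMPT = "2+3"
ACTIONS = list(range(10))

def verifier(prompt, answer):
    """Deterministic check: reward gamma (here 1.0) if correct, else 0."""
    return 1.0 if answer == eval(prompt) else 0.0

# Softmax policy pi_theta over the candidate completions.
theta = [0.0] * len(ACTIONS)

def probs():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [x / s for x in z]

def train(steps=800, lr=0.5, k=4):
    for _ in range(steps):
        p = probs()
        # 1. Sampling: draw k candidate completions from pi_theta.
        samples = random.choices(ACTIONS, weights=p, k=k)
        # 2-3. Verification and rewarding.
        rewards = [verifier(PROMPT, a) for a in samples]
        baseline = sum(rewards) / k  # batch-mean baseline for variance reduction
        # 4. Policy update: REINFORCE ascent on the verifiable reward,
        #    using d log p(a) / d theta_i = 1{i == a} - p_i.
        for a, r in zip(samples, rewards):
            adv = r - baseline
            for i in range(len(ACTIONS)):
                grad = (1.0 if i == a else 0.0) - p[i]
                theta[i] += lr * adv * grad / k

train()
best = max(ACTIONS, key=lambda a: probs()[a])
print("learned answer:", best)  # the policy should concentrate on 5
```

Step 5 of the loop (verifier refinement) is omitted here since the arithmetic checker is already exact; in practice that step matters most when the verifier itself has coverage gaps.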


Pull requests are welcome 🎉 — see Contributing for guidelines.

[2025-07-03] New! Initial public release of Awesome-RLVR 🎉

Table of Contents

format:
- [title](paper link) (presentation type)
  - main authors or main affiliations
  - Key: key problems and insights
  - ExpEnv: experiment environments

Surveys & Tutorials

Codebases

| Project | Description |
|---------|-------------|
| open-r1 | Fully open reproduction of the DeepSeek-R1 pipeline (SFT → GRPO → evaluation). |
| OpenRLHF | Easy-to-use, scalable, high-performance RLHF framework built on Ray (PPO, GRPO, REINFORCE++, vLLM, dynamic sampling, async agentic RL). |
| verl | Volcano Engine reinforcement-learning framework; supports APPO, GRPO, TPPO. |
| TinyZero | ~200-line minimal reproduction of DeepSeek R1-Zero; 4 × RTX 4090 suffice for a 0.5 B LLM. |
| PRIME | Efficient RLVR (value/reward) training stack for reasoning LLMs. |
| simpleRL-reason | Minimal, didactic RLVR trainer for reasoning research. |
| rllm | General-purpose “RL-for-LLM” toolbox (algorithms, logging, evaluators). |
| OpenR | Modular framework for advanced reasoning (process supervision, MCTS, verifier RMs). |
| Open-Reasoner-Zero | Self-play “Zero-RL” framework for advanced mathematical/coding reasoning; includes process-supervision data pipelines, verifier-based dense rewards, and async multi-agent RL training scripts. |
| ROLL | Efficient, user-friendly scaling library for reinforcement learning with large language models. |

Papers

2025

2024 & Earlier

Other Awesome Lists

Contributing

  1. Fork this repo.
  2. Add a paper/tool entry under the correct section (keep reverse-chronological order and follow the three-line format above).
  3. Open a Pull Request briefly describing your changes.

License

Awesome-RLVR © 2025 OpenDILab & Contributors, released under the Apache 2.0 License.
