A curated collection of surveys, tutorials, codebases and papers on
Reinforcement Learning with Verifiable Rewards (RLVR)—
a rapidly emerging paradigm that aligns both LLMs and other agents through
objective, externally verifiable signals.
An overview of how Reinforcement Learning with Verifiable Rewards (RLVR) works. (Figure taken from “Tülu 3: Pushing Frontiers in Open Language Model Post-Training”.)
RLVR couples reinforcement learning with objective, externally verifiable signals, yielding a training paradigm that is simultaneously powerful and trustworthy:
- Ground-truth rewards – unit tests, formal proofs, or fact-checkers provide binary, tamper-proof feedback.
- Intrinsic safety & auditability – every reward can be traced back to a transparent verifier run, simplifying debugging and compliance.
- Strong generalization – models trained on verifiable objectives tend to extrapolate to unseen tasks with minimal extra data.
- Emergent “aha-moments” – sparse, high-precision rewards encourage systematic exploration that often yields sudden surges in capability when the correct strategy is discovered.
- Self-bootstrapping improvement – the agent can iteratively refine or even generate new verifiers, compounding its own learning signal.
- Domain-agnostic applicability – the same recipe works for code generation, theorem proving, robotics, games, and more.
Concretely, each RLVR iteration proceeds as follows:
- Sampling. We draw one or more candidate completions \( a_1, \dots, a_k \) from a policy model \( \pi_\theta \) given a prompt \( s \).
- Verification. A deterministic function \( r(s, a) \) checks each completion for correctness.
- Rewarding.
  - If a completion is verifiably correct, it receives a reward \( r = \gamma \).
  - Otherwise the reward is \( r = 0 \).
- Policy update. Using the rewards, we update the policy parameters \( \theta \) via RL (e.g., PPO).
- (Optional) Verifier refinement. The verifier itself can be trained, hardened, or expanded to cover new edge cases.
Through repeated iterations of this loop, the policy learns to maximise the externally verifiable reward while maintaining a clear audit trail for every decision it makes.
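For readers who want to see the loop end to end, here is a minimal, purely illustrative Python sketch. The arithmetic task, the random stand-in for the policy \( \pi_\theta \), and the GRPO-style group-normalized advantages are assumptions made only for demonstration; a real pipeline would sample completions from an LLM and hand the rewards to an RL trainer such as the frameworks in the table below.

```python
# Purely illustrative sketch of the RLVR loop on a toy arithmetic task.
# The task and the random "policy" are hypothetical stand-ins; a real
# pipeline samples completions from an LLM pi_theta and updates it with PPO/GRPO.
import random
import statistics


def verify(prompt: tuple[int, int], completion: str) -> bool:
    """Deterministic verifier r(s, a): check a completion against ground truth."""
    a, b = prompt
    try:
        return int(completion.strip()) == a + b
    except ValueError:
        return False


def reward(prompt: tuple[int, int], completion: str, gamma: float = 1.0) -> float:
    """Binary verifiable reward: gamma if the verifier accepts, 0 otherwise."""
    return gamma if verify(prompt, completion) else 0.0


def sample_completions(prompt: tuple[int, int], k: int = 8) -> list[str]:
    """Stand-in for sampling a_1..a_k from pi_theta(. | s)."""
    a, b = prompt
    return [str(a + b + random.choice([-1, 0, 0, 1])) for _ in range(k)]


if __name__ == "__main__":
    prompt = (17, 25)                                    # prompt s
    completions = sample_completions(prompt)             # 1. sampling
    rewards = [reward(prompt, c) for c in completions]   # 2-3. verification + rewarding

    # 4. Policy update (GRPO-style): normalize rewards within the group to get
    #    advantages; these would weight the policy-gradient step on theta.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    advantages = [(r - mean) / std for r in rewards]
    for c, r, adv in zip(completions, rewards, advantages):
        print(f"completion={c:>4}  reward={r:.1f}  advantage={adv:+.2f}")
```

Swapping `verify` for a unit-test runner, proof checker, or schema validator turns the same skeleton into an RLVR setup for code, math, or data-validation tasks.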
Pull requests are welcome 🎉 — see Contributing for guidelines.
[2025-07-03] New! Initial public release of Awesome-RLVR 🎉
format:
- [title](paper link) (presentation type)
- main authors or main affiliations
- Key: key problems and insights
- ExpEnv: experiment environments
- Inference-Time Techniques for LLM Reasoning (Berkeley Lecture 2025)
- DeepMind & UC Berkeley (Xinyun Chen)
- Key: decoding-time search, self-consistency, verifier pipelines
- ExpEnv: code/math reasoning benchmarks
- Learning to Self-Improve & Reason with LLMs (Berkeley Talk 2025)
- Meta AI & NYU (Jason Weston)
- Key: continual self-improvement loops, alignment interplay
- ExpEnv: open-ended dialogue & retrieval tasks
- LLM Reasoning: Key Ideas and Limitations (Tutorial Slides 2024)
- DeepMind (Denny Zhou)
- Key: theoretical foundations & failure modes of reasoning
- ExpEnv: slide examples, classroom demos
- Can LLMs Reason & Plan? (ICML Tutorial 2024)
- Arizona State University (Subbarao Kambhampati)
- Key: planning-oriented reasoning, agent integration
- ExpEnv: symbolic + LLM planning tasks
- Towards Reasoning in Large Language Models (ACL Tutorial 2023)
- UIUC (Jie Huang)
- Key: survey of reasoning techniques & benchmarks
- ExpEnv: academic tutorial datasets
- From System 1 to System 2: A Survey of Reasoning Large Language Models (arXiv 2025)
- CAS & MBZUAI
- Key: cognitive-style taxonomy (fast vs. deliberative reasoning)
- ExpEnv: logical, mathematical, commonsense datasets
- Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models (arXiv 2025)
- Chinese Univ. of Hong Kong
- Key: efficient reasoning, test-time-compute scaling
- ExpEnv: math & code reasoning benchmarks
- What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models (arXiv 2025)
- City University of Hong Kong
- Key: methods for scaling inference-time compute (CoT, search, self-consistency)
- ExpEnv: diverse reasoning datasets
- A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond (arXiv 2025)
- Shanghai AI Lab et al.
- Key: lifecycle-wide efficiency (pre-training → inference) for LRMs
- ExpEnv: language + vision reasoning tasks
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models (arXiv 2025)
- Rice University
- Key: “overthinking” phenomenon, length-control techniques
- ExpEnv: GSM8K, MATH-500, AIME-24
- A Visual Guide to Reasoning LLMs (Newsletter 2025)
- Maarten Grootendorst
- Key: illustrated test-time-compute concepts, DeepSeek-R1 case study
- ExpEnv: graphical explanations & code demos
- Understanding Reasoning LLMs – Methods and Strategies for Building and Refining Reasoning Models (Blog 2025)
- Sebastian Raschka
- Key: practical tutorial on data, architectures, evaluation
- ExpEnv: Jupyter notebooks & open-source models
- An Illusion of Progress? Assessing the Current State of Web Agents (arXiv 2025)
- Ohio State & UC Berkeley
- Key: empirical audit of LLM-based web agents, evaluation protocols
- ExpEnv: autonomous web-navigation tasks
- Agentic Large Language Models, A Survey (arXiv 2025)
- Leiden University
- Key: taxonomy of agentic LLM architectures & planning mechanisms
- ExpEnv: multi-step reasoning / tool-use agents
- A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More (arXiv 2024)
- Salesforce AI
- Key: reward modeling & preference-optimization pipelines
- ExpEnv: alignment benchmarks, safety tasks
- Self-Improvement of LLM Agents through Reinforcement Learning at Scale (MIT Scale-ML Talk 2024)
- MIT CSAIL & collaborators
- Key: large-scale RL for autonomous agent refinement
- ExpEnv: simulated dialogue & tool-use agents
- Reinforcement Learning from Verifiable Rewards (Blog 2025)
- Key: Uses binary, verifiable reward functions to inject precise, unbiased learning signals into RL pipelines for math, code, and other accuracy-critical tasks.
- ExpEnv: Easily reproducible in Jupyter notebooks or any RL setup by plugging in auto-grading tools such as compilers, unit tests, or schema validators.
| Project | Description |
| --- | --- |
| open-r1 | Fully open reproduction of the DeepSeek-R1 pipeline (SFT → GRPO → evaluation). |
| OpenRLHF | Easy-to-use, scalable, high-performance RLHF framework built on Ray (PPO, GRPO, REINFORCE++, vLLM, dynamic sampling, async agentic RL). |
| verl | Volcano Engine reinforcement-learning framework; supports APPO, GRPO, TPPO. |
| TinyZero | ~200-line minimal reproduction of DeepSeek R1-Zero; 4 × RTX 4090 is enough for a 0.5 B LLM. |
| PRIME | Efficient RLVR (value/reward) training stack for reasoning LLMs. |
| simpleRL-reason | Minimal, didactic RLVR trainer for reasoning research. |
| rllm | General-purpose “RL-for-LLM” toolbox (algorithms, logging, evaluators). |
| OpenR | Modular framework for advanced reasoning (process supervision, MCTS, verifier RMs). |
| Open-Reasoner-Zero | Self-play “Zero-RL” framework focused on advanced mathematical/coding reasoning; includes process-supervision data pipelines, verifier-based dense rewards, and async multi-agent RL training scripts. |
| ROLL | Efficient and user-friendly scaling library for reinforcement learning with large language models. |
- Brain Bandit: A Biologically Grounded Neural Network for Efficient Control of Exploration
- Chen Jiang, Jiahui An, Yating Liu, Ni Ji
- Key: explore-exploit, stochastic Hopfield net, Thompson sampling, brain-inspired RL
- ExpEnv: MAB tasks, MDP tasks
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Daya Guo, Dejian Yang, Haowei Zhang et al. (DeepSeek-AI)
- Key: GRPO, pure-RL reasoning, distillation to 1.5 B–70 B, open checkpoints
- ExpEnv: AIME-2024, MATH-500, Codeforces, LiveCodeBench, GPQA-Diamond, SWE-Bench
- Demystifying Long Chain-of-Thought Reasoning in LLMs
- IN.AI Research Team
- Key: cosine length-scaling reward, repetition penalty, stable long CoT
- ExpEnv: GSM8K, MATH, mixed STEM sets
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
- Shanghai AI Lab
- Key: outcome-only reward, sparse-signal RL, math-centric limits
- ExpEnv: MATH-Benchmark, GSM8K, AIME, proof datasets
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- The University of Hong Kong & UC Berkeley
- Key: SFT vs RLHF/RLVR, memorization-generalization trade-off
- ExpEnv: held-out reasoning & knowledge shift tests
- Kimi k1.5: Scaling Reinforcement Learning with LLMs
- Moonshot AI
- Key: curriculum RL, large-batch PPO, scalable infra
- ExpEnv: multi-domain reasoning, long-context writing, agent benchmarks
- S²R: Teaching LLMs to Self-Verify and Self-Correct via Reinforcement Learning
- Tencent AI Lab
- Key: self-verification & correction loops, dual-reward, safety alignment
- ExpEnv: math QA, code generation, natural-language inference
- Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (arXiv)
- Tsinghua University
- Key: compute-optimal scaling, small-vs-large model trade-offs
- ExpEnv: reasoning benchmarks, test-time compute scaling
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (arXiv)
- UCLA (Yizhou Sun Lab)
- Key: Q-guided stepwise search, agent inference efficiency
- ExpEnv: web-agent tasks, reasoning QA
- Solving Math Word Problems with Process- and Outcome-Based Feedback (NeurIPS 2023)
- DeepMind
- Key: process & outcome rewards, verifier feedback for math
- ExpEnv: GSM8K, MATH
- Process Reward Models That Think (arXiv)
- University of Michigan
- Key: process reward modelling, reasoning guidance
- ExpEnv: reasoning QA, code tasks
- Learning to Reason under Off-Policy Guidance (arXiv)
- Shanghai AI Lab
- Key: off-policy guidance for reasoning RL
- ExpEnv: math and code benchmarks
- THINKPRUNE: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning (arXiv)
- Anonymous
- Key: CoT pruning through RL, latency reduction
- ExpEnv: GSM8K, assorted reasoning sets
- GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning (arXiv)
- TBD
- Key: lightweight RL baseline, strong reasoning gains
- ExpEnv: diverse reasoning benchmarks
-
- Google DeepMind
- Key: dynamic solve-vs-verify decision, compute optimality
- ExpEnv: math & code tasks
- SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (arXiv)
- Meta, UC Berkeley
- Key: multi-turn agent RL, collaborative reasoning
- ExpEnv: agent task suites
- L1: Controlling How Long a Reasoning Model Thinks With Reinforcement Learning (arXiv)
- Carnegie Mellon University
- Key: explicit control of reasoning steps via RL
- ExpEnv: GSM8K, MATH
- Scaling Test-Time Compute Without Verification or RL is Suboptimal (arXiv)
- CMU, UC Berkeley
- Key: verifier-based vs verifier-free compute scaling
- ExpEnv: reasoning benchmarks
- DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models (arXiv)
- Unicom Data Intelligence
- Key: difficulty-adaptive thinking length
- ExpEnv: reasoning sets
- Reasoning with Reinforced Functional Token Tuning (arXiv)
- Zhejiang University, Alibaba Cloud Computing
- Key: functional token tuning, RL-aided reasoning
- ExpEnv: reasoning QA, code
- Provably Optimal Distributional RL for LLM Post-Training (arXiv)
- Cornell & Harvard
- Key: distributional RL theory for LLM post-training
- ExpEnv: synthetic reasoning, math tasks
- On the Emergence of Thinking in LLMs I: Searching for the Right Intuition (arXiv)
- MIT
- Key: self-play RL, emergent reasoning patterns
- ExpEnv: reasoning games, maths puzzles
- STP: Self-Play LLM Theorem Provers with Iterative Conjecturing and Proving (arXiv)
- Stanford (Tengyu Ma)
- Key: theorem proving via self-play, sparse-reward tackling
- ExpEnv: proof assistant datasets
- A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility (arXiv)
- University of Cambridge, University of Tübingen
- Key: evaluation pitfalls, reproducibility guidelines
- ExpEnv: multiple reasoning benchmarks
- Recitation over Reasoning: How Cutting-Edge LMs Fail on Elementary Reasoning Problems (arXiv)
- ByteDance Seed
- Key: fragility to minor perturbations, arithmetic reasoning
- ExpEnv: elementary school-level arithmetic tasks
- Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad (arXiv)
- ETH Zurich, INSAIT
- Key: Olympiad-level evaluation, zero-score phenomenon
- ExpEnv: 2025 USAMO problems
- REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models (arXiv)
- Jian Hu et al.
- Key: REINFORCE++ algorithm, stability vs PPO/GRPO
- ExpEnv: RLHF alignment suites
- ReFT v3: Reasoning with Reinforced Fine-Tuning (ACL 2025 Long)
- Trung Le, Jiaqi Zhang et al.
- Key: single-stage RLFT, low-cost math alignment
- ExpEnv: GSM8K, MATH, SVAMP
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Technical Report)
- DeepSeek-AI
- Key: GRPO, math-only RL, verifier-guided sampling
- ExpEnv: MATH-500, AIME-2024, CNMO-2024
- SimPO: Simple Preference Optimization with a Reference-Free Reward (arXiv)
- Shanghai AI Lab
- Key: reference-free preference optimisation, KL-free objective
- ExpEnv: AlpacaEval, helpful/harmless RLHF sets
- DeepSeek-Prover v1.5: Harnessing Proof Assistant Feedback for RL and MCTS (arXiv)
- DeepSeek-AI
- Key: proof-assistant feedback, Monte-Carlo Tree Search
- ExpEnv: Lean theorem-proving benchmarks
- Tülu 3: Pushing Frontiers in Open Language Model Post-Training (arXiv)
- Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Øyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi
- Key: post-training, supervised finetuning (SFT), Direct Preference Optimization (DPO), RLVR, open LLMs
- ExpEnv: multi-task language-model benchmarks (Tülu 3 Eval, decontaminated standard suites)
- Kimi k1.5: Scaling Reinforcement Learning with LLMs (arXiv)
- Kimi Team – Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, … , Zongyu Lin
- Key: RL with LLMs, long-context scaling, policy optimization, long2short CoT, multi-modal reasoning
- ExpEnv: AIME, MATH 500, Codeforces, MathVista, LiveCodeBench
- Model Alignment as Prospect Theoretic Optimization (arXiv)
- Stanford University, Contextual AI.
- Key: prospect-theoretic objective for alignment
- ExpEnv: alignment evaluation suites
- UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
- Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, Hongsheng Li
- Key: rule-based rewards, GRPO, multimodal LLM, GUI grounding & action, data-efficient RFT (136 samples)
- ExpEnv: ScreenSpot, ScreenSpot-Pro, AndroidControl
- GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents
- Run Luo, Lu Wang, Wanwei He, Xiaobo Xia
- Key: unified action space, GRPO, high-level GUI tasks, cross-platform (Win/Linux/Mac/Android/Web), data-efficient RFT (3 K samples)
- ExpEnv: ScreenSpot, ScreenSpot-Pro, GUI-Act-Web, OmniAct-Web, OmniAct-Desktop, AndroidControl-Low/High, GUI-Odyssey
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (NeurIPS 2023)
- Rafael Rafailov et al. (Stanford)
- Key: preference optimisation without RL, DPO objective
- ExpEnv: summarisation, dialogue alignment
- Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations (NeurIPS 2023)
- Peking University, DeepSeek-AI
- Key: step-checker, verifier RL, zero human labels
- ExpEnv: GSM8K-Step, MATH-Step
- Let’s Verify Step by Step (ICML 2023)
- OpenAI
- Key: verifier prompts, iterative self-improvement
- ExpEnv: GSM8K, ProofWriter
- Solving Olympiad Geometry without Human Demonstrations (Nature 2024)
- DeepMind
- Key: formal geometry solving, RL without human demos
- ExpEnv: geometry proof tasks
- Training Language Models to Follow Instructions with Human Feedback (NeurIPS 2022)
- OpenAI
- Key: PPO-based RLHF, instruction-following alignment
- ExpEnv: broad instruction-following tasks (InstructGPT)
- Fork this repo.
- Add a paper/tool entry under the correct section (keep reverse-chronological order, follow the three-line format).
- Open a Pull Request and briefly describe your changes.
Awesome-RLVR © 2025 OpenDILab & Contributors Apache 2.0 License