A curated collection covering models, datasets, reward designs, optimization methods, hyperparameters, empirical findings, theoretical insights, and everything about reasoning with reinforcement learning.
- [2025-05-27]: 🔥 We are very excited to release MARTI: A Framework for LLM-based Multi-Agent Reinforced Training and Inference. Check it out: GitHub.
- [2025-04-23]: 🔥 Introducing TTRL, an open-source solution for online RL on data without ground-truth labels, especially test data. Check it out: GitHub and Paper.
⚠️ ⚠️ ⚠️ For the most recent updates, please scroll to the bottom of the table:
This collection covers recent progress in reinforcement learning for large language model reasoning, starting from 2025 in the timeline.
Date | Project | Org | Intro | HF Model | HF Dataset | Takeaway Messages |
---|---|---|---|---|---|---|
2025.0102 | PRIME-RL | THU & UIUC & Shanghai AI Lab |
Paper GitHub More |
Eurus-2-7B-PRIME Eurus-2-7B-PRIME-Zero |
Eurus-2-RL-Data | ClickPRIME offers scalable Reinforcement Learning by using dense, token-level implicit rewards derived only from final outcomes. This bypasses costly step-by-step annotations, providing fine-grained feedback to improve sample efficiency and reasoning. |
2025.0122 | DeepSeek-R1 | DeepSeek | Paper GitHub More |
DeepSeek-R1 DeepSeek-R1-Zero |
—— | ClickDeepSeek-R1's core contribution is demonstrating large-scale RL from scratch on a 600B+ base model without SFT, achieving emergent "aha moments" (self-reflective reasoning) and matching OpenAI o1's performance at roughly 1/30 of the cost. |
2025.0122 | Kimi k1.5 | Kimi | Paper GitHub More |
—— | —— | ClickKimi 1.5 introduces a simplified RL framework that leverages long-context scaling (128k tokens) and improved policy optimization (e.g., online mirror descent) to enhance reasoning and multimodal performance. |
2025.0124 | TinyZero | Berkeley | Twitter GitHub More |
—— | Countdown-Tasks-3to4 | ClickTinyZero's core contribution is demonstrating that smaller language models (e.g., 1.5B-3B parameters) can develop complex reasoning, search, and self-verification abilities through Reinforcement Learning, replicating capabilities of larger models like DeepSeek R1-Zero at extremely low cost (<$30). |
2025.0124 | Open-R1 | Huggingface | GitHub |
OpenR1-Qwen-7B OlympicCoder-7B OlympicCoder-32B |
OpenR1-Math-220k codeforces |
ClickOpen-R1's core contribution is providing the first fully open-source replication and release of the DeepSeek R1-Zero Reinforcement Learning training pipeline. Its main insight or goal is to democratize access to these advanced RL techniques for enhancing LLM reasoning and planning. |
2025.0125 | simpleRL-reason | HKUST | Paper GitHub More |
Qwen-2.5-Math-7B-SimpleRL-Zero Qwen-2.5-Math-7B-SimpleRL |
MATH | ClickResearchers replicated the DeepSeek-R1-Zero and DeepSeek-R1 training using a 7B model with only 8K MATH examples, achieving strong results on complex mathematical reasoning. |
2025.0205 | Demystify-long-cot | CMU | Paper GitHub More |
—— | —— | ClickThe paper elucidates the role of RL in stabilizing and enhancing long CoT reasoning in LLMs, highlighting the necessity of reward shaping and verifiable reward signals for complex reasoning tasks. |
2025.0207 | No-aha-moment | Sea AI Lab | Blog GitHub |
—— | Countdown-Tasks-3to4 | ClickThis is the first public critique of the 'aha moment' associated with DeepSeek-R1-Zero-style training, suggesting that changes in response length are an intrinsic part of the reinforcement learning dynamics. |
2025.0210 | DeepScaler | Agentica-Org | Blog GitHub More |
DeepScaleR-1.5B-Preview | DeepScaleR-Preview-Dataset | ClickDeepScaleR's core contribution is demonstrating that a small 1.5B parameter model, fine-tuned using scaled Reinforcement Learning (RL) and an iterative context lengthening scheme, can surpass the reasoning performance of larger, state-of-the-art models like OpenAI's O1-Preview on complex benchmarks (e.g., AIME math problems). |
2025.0210 | Logic-RL | MSRA & Ubiquant | Paper GitHub More |
—— | knights-and-knaves knights-and-knaves-ZH | ClickThe paper introduces Logic-RL, a rule-based reinforcement learning framework that enables large language models to develop o3-mini-level reasoning skills through training on logic puzzles. The reasoning capabilities can also be transferred to other domains like math. |
2025.0210 | OREAL | Shanghai AI Lab & SJTU & CUHK |
Paper GitHub More |
OREAL-32B OREAL-7B OREAL-DeepSeek-R1-Distill-Qwen-7B OREAL-32B-SFT OREAL-7B-SFT |
OREAL-RL-Prompts | ClickThe paper introduces OREAL, a reinforcement learning framework for mathematical reasoning with binary feedback. It proves that behavior cloning on positive samples is sufficient for optimal learning and proposes reward reshaping for negative samples. A token-level reward model addresses sparse rewards in long reasoning chains. OREAL achieves state-of-the-art results on math benchmarks. |
2025.0217 | LIMR | SJTU | Paper GitHub More |
LIMR | LIMR | ClickThe paper challenges the assumption that scaling up RL training data inherently improves performance in language models, instead finding that a strategically selected subset of 1,389 samples can outperform a full 8,523-sample dataset. |
2025.0218 | Open-Reasoner-Zero | StepFun & THU | Paper GitHub More |
Open-Reasoner-Zero-7B Open-Reasoner-Zero-32B |
ORZ-Math-57k | ClickThe Open-Reasoner-Zero model has achieved notable performance, with Open-Reasoner-Zero-32B outperforming DeepSeek-R1-Zero-Qwen-32B on the GPQA Diamond benchmark while requiring significantly fewer training steps. |
2025.0225 | SWE-RL | FAIR at Meta | Paper GitHub More |
—— | —— | ClickSWE-RL enhances LLMs' code reasoning through RL using open-source software evolution data, achieving state-of-the-art results in software engineering tasks and demonstrating generalized reasoning capabilities beyond coding. |
2025.0227 | Med-RLVR | Microsoft Research | Paper More |
—— | —— | ClickThe Med-RLVR framework demonstrates emergent medical reasoning via RL, achieving performance parity with SFT on in-distribution tasks and improving out-of-distribution generalization, all without explicit reasoning supervision, showcasing RL's potential in medicine. |
2025.0303 | VC-PPO | Bytedance | Paper More |
—— | —— | ClickVC-PPO (Value-Calibrated PPO) diagnoses PPO's collapse in long CoT tasks as stemming from value function inaccuracies (initialization bias and reward signal decay in long sequences). Its core contribution is modifying PPO with value pretraining and decoupled GAE for actor and critic. |
2025.0306 | LCPO-L1 | CMU | Paper GitHub More |
L1-Qwen-1.5B-Max L1-Qwen-1.5B-Exact |
—— | ClickL1 introduces Length Controlled Policy Optimization (LCPO), an RL method enabling precise control over a reasoning model's thinking time (output length) via prompt instructions. It shows that RL effectively controls reasoning duration and unexpectedly enhances even short-chain reasoning capabilities. |
2025.0310 | MRT | CMU | Paper Project GitHub |
—— | —— | ClickMRT (Mixed-Reality Trajectory Preferences) introduces a novel method for fine-tuning cooperative LLM agents. It effectively blends human preferences on real interaction trajectories with AI preferences on synthetic variations, improving data efficiency. This mixed-reality approach surpasses purely AI-driven feedback (RLAIF), especially for complex, multi-turn collaborative tasks. |
2025.0318 | TOPR | Mila & Reliant AI | Paper More |
—— | —— | ClickTOPR (Tapered Off-Policy REINFORCE) introduces a novel RL algorithm for fine-tuning LLMs. Its core contribution is using asymmetric, tapered importance sampling to modify REINFORCE, enabling stable and efficient off-policy learning. This allows reusing past data effectively without the instability often seen in other methods and without needing explicit KL regularization. |
2025.0318 | DAPO | ByteDance & THU |
Paper GitHub More |
—— | DAPO-Math-17k | ClickDAPO algorithm introduces four key techniques (Clip-Higher, Dynamic Sampling, Token-Level Loss, Overlong Shaping) for stable and efficient long-chain-of-thought RL training, surpassing previous SoTA results efficiently. |
2025.0320 | Open RS | VNU University of Science & Knovel Engineering Lab | Paper GitHub More |
Open-RS1 Open-RS2 Open-RS3 |
open-s1 open-deepscaler open-rs |
ClickThe study investigates the potential of RL to improve reasoning in small LLMs. The results demonstrate rapid reasoning gains, with accuracy improvements on mathematical reasoning benchmarks, and highlight the efficacy of RL-based fine-tuning for small LLMs as a cost-effective alternative to large-scale approaches, using high-quality training data. |
2025.0321 | Dr. GRPO | Sea AI Lab | Paper GitHub More |
Qwen2.5-Math-7B-Oat-Zero Qwen2.5-Math-1.5B-Oat-Zero Llama-3.2-3B-Oat-Zero |
MATH | ClickThis work critically analyzes R1-Zero-like RL training. It reveals that base model properties and GRPO algorithm biases (e.g., length bias) significantly impact outcomes. It contributes the efficient, unbiased Dr. GRPO algorithm and an open-source recipe/codebase for better understanding and reproduction (see the advantage-computation sketch after this table). |
2025.0321 | FastCuRL | Tencent Hunyuan | Paper GitHub |
FastCuRL-1.5B-Preview | FastCuRL | ClickFastCuRL introduces a simple, efficient Curriculum RL method for LLMs. Its core contribution uses target perplexity to dynamically scale the standard RL loss (like PPO), creating an effective curriculum without complex reward models or auxiliary components, enabling faster, more stable training. |
2025.0328 | AGRO | Meta | Paper |
—— | —— | ClickThis paper derives Any-Generation Reward Optimization (AGRO) from a consistency condition that holds across all possible generations of the model. AGRO achieves better convergence than KL-regularized policy gradient methods. |
2025.0401 | Z1 | THU | Paper GitHub |
Z1-7B | Z1-Code-Reasoning-107K | ClickThis paper proposes training LLMs on code-related reasoning trajectories using a curated dataset and a "Shifted Thinking Window" technique. This allows models to reduce excessive thinking tokens, achieving efficient test-time scaling and generalizing reasoning abilities. |
2025.0401 | VAPO | ByteDance Seed | Paper |
—— | —— | ClickVAPO offers an integrated solution that effectively alleviates value-model bias, handles heterogeneous sequence lengths, and mitigates reward-signal sparsity. |
2025.0407 | ConciseRL | Wand AI | Paper | —— | —— | ClickThis work challenges the idea that longer reasoning chains in LLMs inherently mean better accuracy. It uses mathematical analysis of RL principles, particularly PPO, to show that lengthier responses often arise from the optimization process itself, not necessarily improved reasoning. |
2025.0409 | AdaRFT | USC LIME Lab | Paper GitHub |
—— | DeepScaleR_Difficulty | ClickAdaRFT proposes Adaptive Curriculum Reinforcement Finetuning to improve LLM reasoning training efficiency. It dynamically adjusts task difficulty based on recent reward signals, accelerating learning by keeping challenges optimally balanced. Experiments on competition math benchmarks show up to 2x fewer steps and improved accuracy, using standard PPO with minimal changes. |
2025.0410 | Seed-Thinking-v1.5 | ByteDance Seed | GitHub |
—— | —— | ClickSeed-Thinking-v1.5 is a high-performing reasoning model that combines curated chain-of-thought data, stable reinforcement learning, and advanced infrastructure to achieve strong results across math, coding, and logic tasks. |
2025.0410 | d1 & diffu-GRPO | UCLA & Meta | Paper GitHub Project |
—— | —— | ClickThis paper proposes d1, which adapts pre-trained masked dLLMs into reasoning models via a combination of SFT and RL. The RL method is named diffu-GRPO. |
2025.0413 | Skywork-OR1 | Skywork AI | Paper Blog GitHub |
Skywork-OR1-32B-Preview Skywork-OR1-7B-Preview Skywork-OR1-Math-7B |
Skywork-OR1-RL-Data | ClickSkywork-OR1 is a series of robust open-source models trained on carefully curated math and code data. The training process incorporates several modifications to the original GRPO, including offline and online data filtering, multi-stage training, and adaptive entropy control. |
2025.0415 | DeepMath | Tencent & SJTU | Paper GitHub |
zwhe99/DeepMath-Zero-7B zwhe99/DeepMath-Zero-Math-7B zwhe99/DeepMath-1.5B zwhe99/DeepMath-Omn-1.5B |
zwhe99/DeepMath-103K | ClickDeepMath-103K is a large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. Trained on DeepMath-103K, DeepMath series models achieve state-of-the-art performance on many math benchmarks. |
2025.0421 | LUFFY | Shanghai AI Lab | Paper GitHub |
LUFFY-Qwen-Math-7B-Zero LUFFY-Qwen-Math-1.5B-Zero |
Openr1-Math-46k-8192 | ClickThis paper introduces LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. |
2025.0423 | TTRL | THU&Shanghai AI Lab | Paper GitHub |
—— | —— | ClickThis paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). |
2025.0430 | Phi-4-reasoning | Microsoft | Paper | Phi-4-reasoning | —— | ClickThis paper introduces Phi-4-reasoning, a 14-billion-parameter reasoning model that achieves strong performance on complex reasoning tasks. |
2025.0511 | BLEUBERI | Maryland | Paper GitHub |
—— | —— | ClickDemonstrates that BLEU, a simple string-matching metric, can effectively serve as a reward function for instruction-following tasks, rivaling complex reward models. |
2025.0512 | INTELLECT-2 | PrimeIntellect-ai | Paper GitHub |
INTELLECT-2 | —— | ClickINTELLECT-2 is a 32 billion parameter language model trained through a reinforcement learning run leveraging globally distributed, permissionless GPU resources contributed by the community. |
2025.0514 | Qwen3 | Alibaba Qwen | Paper GitHub |
Qwen3 | —— | ClickQwen3 unifies thinking and non-thinking modes in a single model family with a controllable thinking budget, trained via a multi-stage post-training pipeline that includes reasoning-focused RL. |
2025.0516 | Subnetwork RL | UIUC | Paper | —— | —— | Clickinsights and contributions about RL for reasoning within 30 words. |
2025.0516 | Data Synthesis RL | PKU&MIT | Paper GitHub |
—— | —— | Clickinsights and contributions about RL for reasoning within 30 words. |
2025.0519 | AR-Lopti | CUHK | Paper GitHub |
—— | —— | ClickThis paper identifies a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. |
2025.0519 | AnytimeReasoner | Sea AI Lab | Paper GitHub |
—— | DeepScaleR-Preview-Dataset | ClickThis paper proposes a framework for optimizing anytime reasoning under arbitrary token budgets, featuring decoupled optimization of thinking and summarization, dense verifiable rewards, and budget relative policy optimization. |
2025.0521 | EM-PT | UIUC | Paper GitHub |
—— | —— | ClickThis paper shows that an entropy-minimization objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. |
2025.0521 | NOVER | KCL&SJTU | Paper GitHub |
—— | —— | ClickThis paper presents verifier-free R1-Zero-like training, which enables training on arbitrary data beyond math and coding. |
2025.0522 | AceReason-Nemotron | Nvidia | Paper | AceReason-Nemotron-14B | —— | ClickThis paper demonstrates that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. |
2025.0522 | KTAE | CAS | Paper GitHub |
KTAE-7B/1.5B | —— | ClickThis paper improves GRPO's advantage calculation by providing more fine-grained token-level advantages, effectively reducing generation length. |
2025.0523 | QwenLong-L1 | Qwen-Doc | Paper GitHub |
QwenLong-L1-32B | —— | Clickinsights and contributions about RL for reasoning within 30 words. |
2025.0523 | Trinity-RFT | Alibaba Group | Paper GitHub |
—— | —— | ClickTrinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models. |
2025.0524 | LlamaRL | Meta | Paper | —— | —— | ClickDistributed async RL framework for LLMs, achieving 10× training speed over DeepSpeed; scales to 405B parameters. |
2025.0525 | SeRL | ZJU | Paper GitHub |
—— | —— | ClickThis paper proposes Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. |
2025.0525 | BRIDGE | CMU | Paper GitHub |
—— | —— | ClickThe paper proposes behavior injection, a task-agnostic data augmentation method that enhances the effectiveness of reinforcement fine-tuning for language models by improving rollout accuracy and data co-influence, leading to consistently better post-RL performance. |
2025.0526 | REA-RL | HIT | Paper GitHub |
—— | —— | ClickIntroduces REA-RL, which enhances the efficiency of LRMs by introducing a reflection model for efficient online scaling and a reflection reward to prevent non-reflective responses. |
2025.0527 | ConciseR | Tencent Hunyuan | Paper GitHub |
—— | —— | ClickThis paper proposes a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in LLMs, named ConciseR. |
2025.0527 | VeriFree | Sea AI Lab | Paper GitHub |
—— | —— | ClickThis paper proposes a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. |
2025.0527 | One-Shot-EM | Ubiquant | Paper GitHub |
—— | —— | ClickThis paper trained 13,440 large language models and found that entropy minimization requires only a single unlabeled example and 10 optimization steps to achieve performance improvements greater than those obtained using thousands of examples and carefully designed rewards in rule-based reinforcement learning. |
2025.0528 | Entropy-RL | Shanghai AI Lab & THU | Paper GitHub |
—— | —— | ClickThis paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. |
2025.0528 | RENT-RL | CMU | Paper GitHub |
—— | —— | ClickRENT: Reinforcement Learning via Entropy Minimization is a fully unsupervised reinforcement learning method that improves reasoning performance by using the model's own confidence as a reward. |
2025.0528 | SynLogic | MiniMax-AI | Paper GitHub |
—— | —— | ClickThis paper presents SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. |
2025.0530 | ProRL | Nvidia | Paper | Nemotron-Qwen-1.5B | —— | ClickThis paper challenges prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. |
2025.0530 | ReasoningGym | OpenThought | Paper GitHub |
—— | —— | ClickThis paper introduces Reasoning Gym, a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. |
2025.0530 | AReaL | InclusionAI | Paper GitHub |
—— | —— | ClickAReaL introduces a fully asynchronous reinforcement learning system for language reasoning tasks, decoupling rollout generation from model training to significantly improve GPU utilization and training speed—achieving up to 2.57× speedup over synchronous systems—while maintaining or improving model performance. |
2025.0602 | HighEntropyRL | Qwen & THU | Paper Project |
—— | —— | ClickHigh-entropy minority tokens play an outsized role in RLVR training. This paper provides actionable insights into which tokens drive effective policy-gradient updates. |
2025.0602 | RLVR-Decomposed | Princeton | Paper GitHub |
—— | —— | ClickShows that penalizing incorrect answers alone can significantly boost LLM reasoning via PPO—challenging conventional RLHF approaches. |
2025.0602 | Writing-Zero | Star Writing | Paper | —— | —— | ClickApplies RLVR to creative tasks like story writing by converting non-verifiable tasks into verifiable subgoals. |
2025.0602 | SRPO | ByteDance Seed & OSU | Paper | —— | —— | ClickProposes a two-stage RL framework combining self-reflection and Group Relative Policy Optimization to boost multimodal reasoning. |
2025.0603 | KDRL | HIT&Huawei | Paper | —— | —— | ClickPresents KDRL, a unified framework combining knowledge distillation and RL to enhance LLM reasoning post-training, improving sample efficiency and generalization. |
2025.0603 | TRePO | Amazon | Paper | —— | —— | ClickProposes that response-level rewards suffice for effective online RL in LLMs, offering a mathematical foundation for this approach. |
2025.0603 | Critique-GRPO | CUHK | Paper GitHub |
—— | —— | ClickCombines natural language critiques with numerical rewards in RL to overcome performance plateaus in LLM reasoning tasks. |
2025.0603 | Unlikeliness Rewards | CMU | Paper | —— | —— | ClickThe paper introduces an unlikeliness reward mechanism to address biases in Group Relative Policy Optimization (GRPO), enhancing the diversity and accuracy of large language models on structured tasks like formal theorem proving. |
2025.0604 | RewardAnything | PKU&WeChatAI | Paper GitHub |
RewardAnything-8B-v1 | —— | ClickIntroduces principle-following reward models that generalize across tasks by adhering to natural language specifications, improving alignment without retraining. |
2025.0605 | ALP | Stanford | Paper | —— | —— | ClickIntroduces adaptive length penalties in reinforcement learning to encourage concise reasoning in large language models, enhancing efficiency without sacrificing performance. |
2025.0605 | PatternSelection | HKU | Paper | —— | —— | ClickExplores mechanisms for selecting reasoning patterns in reinforcement learning for language models, aiming to enhance decision-making processes. |
2025.0605 | LogicPuzzleRL | PKU | Paper GitHub |
—— | —— | ClickUtilizes reinforcement learning on custom logic puzzles to cultivate robust mathematical reasoning in large language models. |
2025.0605 | DOTS | UIUC&NYU | Paper | —— | —— | ClickProposes methods to improve data efficiency in reinforcement fine-tuning of LLMs through difficulty-targeted online data selection and rollout replay. |
2025.0605 | ether0 | FutureHouse | Paper GitHub |
—— | —— | ClickA 24B parameter model trained for scientific reasoning in chemistry, capable of generating molecular structures from natural language prompts. |
2025.0605 | Writing-RL | Alibaba | Paper | —— | —— | ClickCurriculum-based RL improves long-form narrative coherence through structured rewards. |
2025.0606 | Confidence | Moscow | Paper | —— | —— | ClickConfidence-driven few-shot RL fine-tuning improves sample efficiency without reward supervision. |
2025.0607 | Thinking vs. Doing | CMU | Paper GitHub |
—— | —— | ClickLLMs learn test-time interaction, deciding when to think and when to act, which enhances reasoning efficiency. |
2025.0607 | OptimalReasoning | THU | Paper | —— | —— | ClickTheoretical study on RL-optimality gap for chain-of-thought reasoning. |
2025.0608 | YouronMath | Keio Univ | Paper GitHub |
—— | —— | ClickGamified interface improves LLM math performance via reward shaping and iterative gameplay. |
2025.0608 | Play to Generalize | Rice | Paper GitHub |
—— | —— | ClickTrains reasoning via gameplay to transfer skills across tasks. |
2025.0608 | RPT | Microsoft | Paper | —— | —— | ClickUses RL objectives during pretraining to equip LLMs with better downstream reasoning capabilities. |
2025.0608 | MARL | WM Univ. | Paper | —— | —— | ClickLLMs critique each other in a reflective multi-agent framework to iteratively refine reasoning chains. |
2025.0608 | RLT | Alibaba | Paper | —— | —— | ClickRL teachers dynamically allocate thinking-time during inference to balance latency and accuracy. |
2025.0608 | SwS | Microsoft | Paper | —— | —— | ClickLLM self-assesses its weaknesses, then generates challenging tasks to improve via RL. |
2025.0608 | RuleReasoner | UCLA | Paper | —— | —— | ClickBlends rule-based logic with RL-driven dynamic sampling to solve structured reasoning problems. |
2025.0608 | Bingo | Microsoft | Paper | —— | —— | ClickRL method improves reasoning by amplifying attention on critical intermediate steps. |
2025.0609 | CoRT | Qwen | Paper GitHub |
—— | —— | ClickTool-augmented RL trains LLMs to reason via code synthesis and self-refinement loops. |
2025.0609 | VerIF | THU | Paper GitHub |
—— | —— | ClickVerification-first RL training: modularly verifies and rewrites faulty LLM outputs during policy updates. |
2025.0609 | Router-R1 | UIUC | Paper GitHub |
—— | —— | ClickRL-based routing policies optimize multi-round tool use and answer aggregation. |
2025.0609 | RePO | CUHK + AILab | Paper GitHub |
—— | —— | ClickReplay-Enhanced Policy Optimization: improves sample efficiency and stability of reasoning training. |
2025.0609 | SSA | CUNY | Paper | —— | —— | ClickPromotes consistency by aligning reasoning traces across training samples with shared structure. |
2025.0609 | ComfyUI-R1 | HIT & Alibaba | Paper GitHub |
—— | —— | ClickReasoning-powered LLM agent for UI pipeline automation inspired by ComfyUI workflows. |
2025.0609 | Learning to Clarify | Adobe | Paper | —— | —— | ClickLLMs learn when and how to ask clarification questions via reward-weighted fine-tuning. |
2025.0610 | Magistral | Mistral AI | Paper | Magistral-Small-2506 | —— | ClickFirst RL-trained reasoning LLM from Europe. Strong multilingual chain-of-thought and tool use. Open-source (Apache 2.0). |
2025.0610 | FastEasy & Deep Hard | FDU | Paper | —— | —— | ClickApplies dynamic penalty on output length to focus model effort on harder inputs. |
2025.0610 | PAG | ByteDance | Paper GitHub |
—— | —— | ClickLLMs generate, verify, and correct responses in multi-turn RL framework inspired by verifier-agent loops. |
2025.0610 | SAL | MIT | Paper | —— | —— | ClickExplores self-adjusting LLM behaviors at inference using RL-inspired introspection and elicitability measures. |
2025.0610 | Unsupervised Elicitation | Anthropic | Paper | —— | —— | ClickReveals hidden reasoning capacities without supervision—implications for reward-free training. |
2025.0611 | LearnAlign | CUHK | Paper | —— | —— | ClickGradient-alignment-driven reasoning data selection for better RL fine-tuning of LLMs. |
2025.0611 | Continue-Thinking Token | CTK | Paper | —— | —— | ClickNew token inserted at inference to trigger deeper reasoning steps with zero-shot generalization. |
2025.0611 | TreeRL | THU-DM | Paper GitHub |
—— | —— | ClickCombines on-policy RL and tree search for interpretable decision traces in reasoning tasks. |
2025.0x0x |
Paper GitHub |
hf models | hf datasets | Clickinsights and contributions about RL for reasoning within 30 words. |
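
Several entries in the table above (GRPO in DeepSeek-R1, Dr. GRPO, DAPO, KTAE) differ mainly in how advantages are computed from group-wise outcome rewards and in how the policy-gradient ratio is clipped. Below is a minimal, illustrative sketch of those two pieces; the function names, the `unbiased` flag, and the clip constants are illustrative shorthand, not code from any of the listed repositories.

```python
import numpy as np

def group_relative_advantages(rewards, unbiased=False, eps=1e-6):
    """Group-relative advantage for rollouts of a single prompt.
    unbiased=False -> vanilla GRPO: (r - mean) / (std + eps).
    unbiased=True  -> Dr. GRPO-style: drop the std normalization and the
                      bias it introduces."""
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = rewards - rewards.mean()
    if not unbiased:
        adv = adv / (rewards.std() + eps)
    return adv

def clipped_ratio(ratio, eps_low=0.2, eps_high=0.28):
    """DAPO-style 'Clip-Higher': a wider upper clip range than lower
    (the 0.28 upper bound follows the value reported by DAPO; treat it as illustrative)."""
    return np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)

# Four rollouts sampled for one math prompt, rewarded 1 if the final answer is correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))                 # GRPO-style
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0], unbiased=True))  # Dr. GRPO-style
```

Dr. GRPO additionally removes the per-response length normalization in the token-level loss, which is omitted here for brevity.
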
Date | Project | Org | Intro | HF Model | HF Dataset | Takeaway Messages |
---|---|---|---|---|---|---|
2025.0128 | Open-R1-MultiModal | LLMs Lab | GitHub More |
Qwen2-VL-2B-GRPO-8k Qwen2-VL-7B-GRPO-8k |
multimodal-open-r1-8k-verified | ClickOpen-R1-MultiModal provides an open-source replication of R1-Zero-like RL for Multimodal LLMs, aiming to enhance complex visual reasoning. It demonstrates the effectiveness of these RL techniques for boosting multimodal performance and promotes reproducibility in the field. |
2025.0202 | R1-V | Deep Agent | Blog GitHub More |
—— | Clevr_CoGenT_TrainA_R1 | ClickR1-V applies RL, specifically RLV-Instruct, to fine-tune VLMs. It enhances complex visual reasoning and instruction-following capabilities in VLMs beyond standard supervised fine-tuning. |
2025.0215 | VLM-R1 | OmAI Lab | Blog GitHub More |
OVD Math REC |
—— | ClickVLM-R1 applies R1-style RL to VLMs, improving stability and generalization on visual reasoning tasks. It shows that RL enhances VLM generalization beyond standard fine-tuning, achieving SOTA results, particularly on complex or out-of-domain benchmarks. |
2025.0303 | Visual-RFT | SJTU & Shanghai AI Lab & CUHK | Paper GitHub More |
Reasoning Grounding | COCO_base65 COCO COCO_8_classes_4_shot LVIS_few_shot Flower_4_shot FGVC_Aircraft_4_shot Car196_4_shot Pets37_4_shot |
ClickVisual-RFT introduces Visual Reinforcement Fine-tuning, which extends reinforcement learning with verified rewards on visual perception tasks that are effective with limited data for fine-tuning. |
2025.0306 | R1-VLM | GroundLight | Blog GitHub More |
—— | —— | ClickR1-VLM enhances VLMs using RL, contributing significantly improved performance on complex visual reasoning tasks (spatial, counting, logic) where standard models falter. It shows that RL effectively unlocks advanced, multi-step reasoning capabilities specifically for vision-language understanding. |
2025.0310 | VisualThinker-R1-Zero | TurningPoint | Paper GitHub More |
VisualThinker-R1-Zero | —— | ClickVisualThinker-R1-Zero adapts the R1-Zero RL paradigm (no supervised fine-tuning) to VLMs, achieving SOTA visual reasoning. It shows that complex visual reasoning can be effectively cultivated directly via RL on a base VLM, bypassing supervised data needs. |
2025.0310 | MM-EUREKA | USTC & ZTE & NEU | Paper Github More |
MM-Eureka-Qwen-7B | MM-Eureka-Dataset | ClickMM-EUREKA reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space, which demonstrates that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, showing superior data efficiency compared to alternative approaches. |
2025.0310 | Curr-ReFT | Shanghai AI Lab & SJTU & HKU | Paper GitHub More |
3B-Curr-ReFT 7B-Curr-ReFT |
Curr-ReFT-data | ClickCurr-ReFT proposes a Curriculum Reinforcement Finetuning strategy to enhance out-of-distribution generalization and reasoning abilities. The curriculum paradigm ensures steady progression. Moreover, a rejection sampling-based self-improvement step is proposed to maintain the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples. |
2025.0311 | LLM-R1 | CUHK & Ant Group | Paper GitHub |
—— | —— | ClickLLM-R1 contributes the RMAVO algorithm to stably enhance LLM reasoning using RL, preventing reward hacking and achieving SOTA results with smaller models via an open-source implementation. It shows that reward model assistance in value optimization is key for stable RL. |
2025.0311 | Vision-R1 | ECNU & Xiaohongshu | Paper GitHub |
—— | Vision-R1-cold | ClickVision-R1 adapts the R1-Zero RL paradigm for VLMs, training them on visual reasoning chains. Its contribution is significantly boosting complex multimodal reasoning performance. It shows that RL applied to explicit reasoning steps effectively enhances VLM capabilities. |
2025.0311 | MMR1 | NTU & SUTD & LASA | GitHub | MMR1-Math-v0-7B | MMR1-Math-RL-Data-v0 | ClickMMR1-Math-v0 achieves state-of-the-art performance among open-source 7B multimodal models, competing effectively even against proprietary models with significantly larger parameter sizes—all trained using only 6k carefully curated data instances. |
2025.0315 | MetaSpatial | Northwestern University | Paper Project GitHub |
—— | 3D_Reasoning | ClickMetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, realistic, and adaptive scene generation for applications in the metaverse, AR/VR, and game development. |
2025.0327 | Reason-RFT | PKU & BAAI & CASIA & School of Artificial Intelligence, University of Chinese Academy of Sciences | Paper GitHub Project |
—— | tanhuajie2001/Reason-RFT-CoT-Dataset | ClickReason-RFT introduces a two-phase training paradigm: (1) SFT with CoT data to activate reasoning potential, followed by (2) GRPO-based reinforcement learning to enhance generalization, with further potential applications in Embodied AI. |
2025.0404 | MAYE | SJTU & GAIR | Paper GitHub |
—— | ManTle/MAYE | ClickMAYE is a transparent, reproducible framework and a comprehensive evaluation scheme for applying reinforcement learning (RL) to vision-language models (VLMs). Its codebase is developed entirely from scratch without relying on any existing RL toolkits. |
2025.0408 | Step-R1-V-Mini | StepFun | Website | —— | —— | ClickStep-R1-V-Mini excels in the domain of visual reasoning, while also demonstrating top-tier performance in mathematical, code, and textual reasoning tasks. It supports a context length of 100k. |
2025.0409 | Kimi-VL-Thinking | Kimi Team | Technical Report GitHub |
moonshotai/Kimi-VL-A3B-Thinking | —— | ClickKimi-VL-Thinking is designed to enhance long-horizon reasoning capabilities in vision-language tasks. Built on a foundation of long CoT SFT and RL, with only 2.8B activated parameters, Kimi-VL-Thinking achieves strong performance across a range of tasks requiring long-term reasoning. It excels in domains such as MMMU, MathVision, and MathVista, achieving impressive scores of 61.7, 36.8, and 71.3, respectively. |
2025.0409 | VideoChat-R1 | Shanghai AI Lab & NJU & ZJU & USTC & Shanghai Innovation Institute & SIAT | Paper GitHub |
—— | —— | ClickVideoChat-R1 provides a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, exhibiting remarkable performance on spatio-temporal perception tasks without sacrificing chat ability, while showing emerging spatio-temporal reasoning abilities. |
2025.0410 | Perception-R1 | HUST & BUPT & StepFun & JHU & Tsinghua University | Paper GitHub |
Perception-R1 | Perception-R1 | ClickPerception-R1 explores the effects of RL on different perception tasks; the researchers observe that perceptual perplexity is a major factor in determining the effectiveness of RL. The scalable Perception-R1 achieves remarkable performance on perception tasks. |
2025.0410 | VL-Rethinker | TIGER-Lab | Paper GitHub |
TIGER-Lab/VL-Rethinker-7B TIGER-Lab/VL-Rethinker-72B |
—— | ClickVL-Rethinker proposes Selective Sample Replay (SSR) and Forced Rethinking to enhance fast-thinking models. The model achieves remarkable performance on multi-disciplinary benchmarks. |
2025.0501 | T2I-R1 | CUHK MMLab & CUHK MiuLar Lab & Shanghai AI Lab | Paper GitHub |
CaraJ/ORM-T2I-R1 | —— | ClickT2I-R1 is a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. The semantic-level CoT is utilized for high-level planning of the prompt, and the token-level CoT is designed for low-level pixel processing during patch-by-patch generation. |
2025.0516 | VisualPlanning | Cambridge & UCL & Google | Paper GitHub |
—— | —— | ClickVisualPlanning enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. |
2025.0521 | GRIT | UCSC & eBay | Paper GitHub Project Demo |
yfan1997/GRIT-20-InternVL-2B yfan1997/GRIT-20-Qwen2.5-VL-3B |
yfan1997/GRIT_data | ClickGRIT proposes grounded reasoning with images and text for training MLLMs to think with images. The models generate reasoning chains that interleave natural language and explicit bounding box coordinates. Moreover, built upon the GRPO algorithm, GRIT eliminates the need for annotated reasoning chains or explicit bounding box labels, requiring as few as 20 image-question-answer triplets to train the model. |
2025.0522 | GoT-R1 | HKU MMLab & CUHK MMLab & Sensetime & BUAA | Paper GitHub |
gogoduan/GoT-R1-1B gogoduan/GoT-R1-7B |
—— | ClickGoT-R1 applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. To achieve this, a dual-stage multi-dimensional reward framework is proposed that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. |
2025.0529 | Jigsaw-R1 | ESAT-PSI | Paper GitHub |
—— | —— | ClickThis paper provides a comprehensive study of rule-based visual RL using jigsaw puzzles as a structured experimental framework, revealing several key findings. |
2025.0603 | SynthRL | NUS&CUHK | Paper GitHub |
SynthRL | —— | ClickIntroduces SynthRL, a pipeline that synthesizes verifiable data to train vision-language models, boosting performance on visual math reasoning tasks. |
2025.0603 | Cell-o1 | UIUC | Paper GitHub |
—— | —— | ClickPresents Cell-o1, an LLM trained via RL to annotate single-cell RNA sequencing data, achieving expert-level reasoning in batch-level contexts. |
2025.0604 | MiMo-VL | XiaomiMimo | Paper GitHub |
MiMo-VL-7B | —— | ClickDetails MiMo-VL-7B models achieving state-of-the-art performance in visual understanding and multimodal reasoning through mixed on-policy RL. |
2025.0604 | ReVisual-R1 | ZJU&FDU | Paper GitHub |
—— | —— | ClickIntroduces ReVisual-R1, a staged RL approach enhancing MLLM reasoning by combining optimized cold starts with text-only RL fine-tuning. |
2025.0604 | LaF-GRPO | PolyU | Paper GitHub |
—— | —— | ClickDevelops an LLM-as-Follower reward mechanism to generate in-situ navigation instructions for the visually impaired, enhancing instruction usability. |
2025.0611 | Visual PTRL | UC Berkeley | Paper | —— | —— | ClickTrains visual backbones on raw image data with reinforcement rewards—unsupervised and scalable. |
2025.0x0x |
Paper GitHub |
hf models | hf datasets | Clickinsights and contributions about RL for reasoning within 30 words. |
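
Several multimodal entries above (e.g., Visual-RFT, VLM-R1, UI-R1) rely on verifiable visual rewards such as box IoU for grounding/REC-style tasks. The sketch below shows that kind of reward, assuming (x1, y1, x2, y2) box coordinates; the actual reward shaping and thresholds differ across projects.

```python
def box_iou(pred, gt):
    """IoU between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / (union + 1e-6)

def grounding_reward(pred_box, gt_box, threshold=0.5):
    """Binary verifiable reward for a grounding rollout: 1 if the predicted box
    overlaps the ground truth above the threshold, else 0.
    A dense alternative is to return the IoU value itself."""
    return 1.0 if box_iou(pred_box, gt_box) >= threshold else 0.0

print(grounding_reward((10, 10, 60, 60), (15, 15, 60, 60)))  # heavily overlapping boxes -> 1.0
```
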
Date | Project | Org | Intro | HF Model | HF Dataset | Takeaway Messages |
---|---|---|---|---|---|---|
2025.0126 | RAGEN | RAGEN-AI | Paper GitHub |
—— | —— | ClickRAGEN introduces a RL framework to train reasoning-capable LLM agents for interactive, stochastic environments. Its core contribution is the Reasoning-Interaction Chain Optimization (RICO) algorithm, which jointly optimizes reasoning and action strategies by reinforcing entire trajectories. |
2025.0203 | Verifiers | Independent | GitHub |
—— | —— | ClickThis repository contains a set of tools for reinforcement learning with LLMs in verifiable environments, and can be used for LLM-agent RL in such settings. |
2025.0207 | AgenticReasoning | Univ. of Oxford | Paper GitHub |
—— | —— | ClickThis framework introduces the Mind Map agent, which constructs a structured knowledge graph to track logical relationships, improving deductive reasoning. |
2025.0303 | ReSearch | Agent-RL | GitHub More |
—— | —— | ClickThe project trains LLMs from scratch, utilizing RL with GRPO to learn to reason via search operations, without reliance on pre-existing reasoning frameworks or supervised data. |
2025.0312 | Search-R1 | UIUC & UMass Amherst | Paper GitHub More |
Search-R1 | 2018 Wikipedia | ClickThe paper introduces Search-R1, a novel RL framework that enables LLMs to interact with search engines in an interleaved manner with their own reasoning. The framework is shown to be effective, with experiments demonstrating average relative improvements of 41% and 20% over RAG baselines, and providing insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. |
2025.0318 | R1-Searcher | RUC | Paper GitHub |
Llama-3.1-8B-instruct-RAG-RL Qwen-2.5-7B-base-RAG-RL |
RAG-RL-Hotpotqa | ClickR1-Searcher enhances LLM reasoning via RL by training the model to perform adaptive model-based search during generation. This integration enables flexible thinking depth, improving reasoning efficiency and performance compared to fixed-step methods like R1-Zero. |
2025.0319 | SWEET-RL | Meta AI | Paper GitHub |
—— | collaborative_agent_bench | ClickSweet-RL introduces a novel RL algorithm for multi-turn collaborative reasoning LLM agents. Its core contribution is improving credit assignment across long interactions by using an asymmetric actor-critic structure where the critic leverages additional training-time information for step-wise evaluation. |
2025.0327 | UI-R1 | Vivo AI Lab & CUHK | Paper GitHub |
Qwen2.5-VL-3B-UI-R1 | UI-R1-3B-Train | ClickThis paper proposes UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. |
2025.0404 | DeepResearcher | SJTU | Paper GitHub |
DeepResearcher-7b | —— | ClickThis paper introduces DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. |
2025.0414 | ART | OpenPipe | Blog GitHub |
—— | —— | ClickThis release is an early alpha focused on best-in-class training efficiency and agentic multi-turn support. |
2025.0414 | GUI-R1 | CAS & NUS | Paper GitHub |
—— | GUI-R1 | ClickThis paper proposes GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. |
2025.0415 | ReTool | ByteDance | Paper GitHub More |
ReTool-Qwen-32B | ReTool-SFT | ClickReTool is a reinforcement learning framework that integrates code interpreter execution into the reasoning loop of large language models (LLMs) to improve their mathematical reasoning capabilities. The framework consists of two primary stages: cold-start supervised fine-tuning and reinforcement learning with interleaved code execution rollout, allowing the model to learn when and how to invoke tools based on outcome feedback. |
2025.0428 | ARTIST | Microsoft | Paper | —— | —— | ClickARTIST enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains, leveraging outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision. |
2025.0430 | WebThinker | RUC | Paper GitHub More |
WebThinker-QwQ-32B WebThinker-R1-7B WebThinker-R1-14B WebThinker-R1-32B |
—— | ClickWebThinker is a deep research agent that empowers large reasoning models (LRMs) to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. It integrates a Deep Web Explorer module and employs an Autonomous Think-Search-and-Draft strategy, allowing for real-time report writing and information gathering. |
2025.0506 | SkyRL-v0 | NovaSky-AI | blog GitHub |
SkyRL-Agent-7B-v0 SkyRL-Agent-8B-v0 SkyRL-Agent-14B-v0 |
SkyRL-v0-293-data | ClickThis paper introduces SkyRL, the RL training pipeline for multi-turn tool use LLMs, optimized for long-horizon, real-environment tasks like SWE-Bench, built on top of VeRL and OpenHands. Using SkyRL, we are able to achieve promising results on SWE-Bench-Verified across model lines, using around 300 samples of training data! |
2025.0512 | Tool-N1 | NVIDIA | Paper GitHub |
—— | —— | ClickThis paper presents Nemotron-Research-Tool-N1, a family of tool-using reasoning language models. These models are trained with an R1-style reinforcement learning algorithm that uses a binary reward to supervise only the structural format and functional correctness of tool calls, without requiring explicit reasoning annotations. |
2025.0512 | ZeroTIR | FDU & Xiaohongshu | Paper GitHub |
—— | —— | ClickThis paper investigates RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. |
2025.0513 | AgentCPM-GUI | OpenBMB | GitHub |
openbmb/AgentCPM-GUI | —— | ClickAgentCPM-GUI is an open-source on-device LLM agent model jointly developed by THUNLP, Renmin University of China and ModelBest. Built on MiniCPM-V with 8 billion parameters, it accepts smartphone screenshots as input and autonomously executes user-specified tasks. |
2025.0514 | AlphaEvolve | Google DeepMind | Blog | —— | —— | ClickAlphaEvolve is an evolutionary coding agent powered by large language models for general-purpose algorithm discovery and optimization. |
2025.0515 | GiGPO | NTU&Skywork | Paper GitHub |
—— | —— | ClickThis paper proposes Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. |
2025.0516 | AutoRefine | USTC | Paper GitHub |
hf models | hf datasets | ClickThis paper proposes AutoRefine, a reinforcement learning posttraining framework that adopts a new "search-and-refine-during-think" paradigm. |
2025.0520 | Time-R1 | UIUC | Paper GitHub |
—— | —— | ClickThis paper introduces Time-R1, the first framework to endow a moderate-sized (3B-parameter) LLM with comprehensive temporal abilities: understanding, prediction, and creative generation. |
2025.0521 | Empirical Study | UIUC | Paper GitHub |
—— | —— | ClickThis paper highlights several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. |
2025.0521 | StepSearch | SenseTime | Paper GitHub |
—— | —— | ClickThis paper introduces StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. |
2025.0521 | GUI-G1 | RUC | Paper GitHub |
—— | —— | ClickThis paper identifies three distinct challenges in the R1-Zero-like training pipeline of R1-style GUI agents: grounding is harmed by longer reasoning due to grounding's reliance on image tokens; common reward functions induce size-sensitive reward hacking; and GRPO biases agents toward simpler examples due to its objective. |
2025.0522 | Tool-Star | RUC | Paper GitHub |
Tool-Star-Qwen-3B | Multi-Tool-RL-10K Tool-Star-SFT-54K |
ClickThis paper introduces Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. |
2025.0522 | R1-Searcher++ | RUC | Paper GitHub |
—— | —— | Clickinsights and contributions about RL for reasoning within 30 words. |
2025.0522 | ARPO | CUHK | Paper GitHub |
—— | —— | ClickThis paper investigates end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks. |
2025.0522 | AgentThink | THU & McGill | Paper | —— | —— | ClickThis paper introduces AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. |
2025.0523 | Agent-Distillation | KAIST | Paper GitHub |
—— | —— | ClickThis paper proposes Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. |
2025.0526 | DeepEyes | Xiaohongshu | Paper GitHub |
DeepEyes-7B | DeepEyes-Datasets-47k | ClickThis paper explores the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. |
2025.0527 | rStar | MSRA | Paper GitHub |
—— | —— | ClickThis paper introduces rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems, 580K long-reasoning solutions along with rich test cases of varying difficulty. |
2025.0527 | SPA-RL-Agent | PolyU | Paper GitHub |
—— | —— | ClickThis paper proposes Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion. |
2025.0528 | WebDancer | Tongyi Lab | Paper GitHub |
—— | —— | ClickThe paper introduces a unified, data-centric training paradigm for developing agentic web research agents, exemplified by WebDancer, which combines supervised learning and reinforcement learning to achieve strong multi-step information-seeking performance on GAIA and WebWalkerQA benchmarks. |
2025.0529 | ML-Agent | SJTU | Paper | —— | —— | ClickThis paper explores the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). |
2025.0530 | Pangu DeepDiver | Huawei | Paper | —— | —— | ClickThe paper introduces Pangu DeepDiver, a reinforcement learning framework that equips large language models with adaptive search intensity scaling (SIS) for open-web question answering, using a new WebPuzzle dataset to improve evidence-seeking behavior under real-world ambiguity and noise. |
2025.0601 | VerlTool | TIGER AI Lab | GitHub |
Qwen2.5-Math-VerlTool | —— | ClickVerlTool is a unified and easy-to-extend tool-agent training framework based on verl. |
2025.0602 | SCA | UCB & Meta | Paper | —— | —— | ClickLLMs generate and solve their own tasks via a "Code-as-Task" setup, using RL for learning. Yields >2× gains on tool-use benchmarks. |
2025.0602 | MMedAgent-RL | UNC | Paper | —— | —— | ClickMulti-agent reinforcement learning for medical reasoning with multimodal data. Promotes coordination and robustness across specialized agents. |
2025.0603 | CURE | ByteDance Seed | Paper GitHub |
reasonflux-coder | —— | ClickIntroduces CURE, a framework where code generation and unit testing co-evolve through RL, enhancing code accuracy without ground-truth supervision. |
2025.0604 | Seed-Coder | ByteDance Seed | Paper GitHub |
Seed-Coder | —— | ClickProposes a self-curating code model that generates and selects its own training data, enhancing code generation capabilities without external supervision. |
2025.0604 | DyMo | Cohere | Paper | —— | —— | ClickPresents a self-verification sampling method for LLMs to enhance tool use by predicting and verifying intermediate steps before proceeding. |
2025.0604 | R-Search | CAS | Paper GitHub |
—— | —— | ClickPresents a multi-reward RL framework enabling LLMs to integrate reasoning with search, improving performance on complex logic and knowledge tasks. |
2025.0605 | MedAgentGym | Emory Univ. | Paper GitHub |
—— | —— | ClickIntroduces a training environment for LLM agents focused on code-based medical reasoning, facilitating the development of AI in healthcare applications. |
2025.0605 | CI-RL | Purdue&Microsoft | Paper | —— | —— | ClickApplies reinforcement learning to enhance contextual integrity in LLMs, aligning their outputs with privacy and safety norms. |
2025.0611 | Grounding-R1 | Salesforce | Blog | —— | —— | ClickGUI grounding via GRPO RL—clicks relevant areas without bounding-box or rationale supervision. |
2025.0611 | Agent-RLVR | Scale AI | Paper | —— | —— | ClickTrains software agents using both environmental feedback and expert guidance—targeting real-world SE tasks. |
2025.0611 | ReVeal | MAR & THU | Paper | —— | —— | ClickSelf-evolving agents improve code generation via iterative RL-based generate–verify cycles. |
2025.0611 | CAGSR-vLLM-MTC | UC Berkeley | Paper | —— | —— | ClickEnhances multi-turn reasoning via vLLM + self-supervised fine-tuning + RL on CoT traces. |
2025.0x0x |
Paper GitHub |
hf models | hf datasets | Clickinsights and contributions about RL for reasoning within 30 words. |
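
Many agentic entries above (e.g., Search-R1, ReSearch, R1-Searcher, ReTool, Tool-Star) train on rollouts in which model-emitted tool calls are executed by the environment and the results are appended to the context before generation resumes. The loop below is a schematic sketch of that interleaved rollout; the tag names, the `generate_until` helper, and the stub policy/retriever are placeholders, not the APIs of any listed project.

```python
import re

def run_search(query):
    """Stub retriever: a real setup would query a search engine or a local corpus."""
    return f"[stub results for: {query.strip()}]"

class DummyPolicy:
    """Stand-in for an LLM policy; emits one search call, then a final answer."""
    def __init__(self):
        self.turn = 0

    def generate_until(self, context, stop):
        self.turn += 1
        if self.turn == 1:
            return "<search>capital of France</search>"
        return "<answer>Paris</answer>"

def interleaved_rollout(policy, prompt, max_turns=4):
    """Schematic multi-turn rollout: <search> actions are executed and fed back
    as <information> blocks until the model emits a final <answer>."""
    context = prompt
    for _ in range(max_turns):
        completion = policy.generate_until(context, stop=["</search>", "</answer>"])
        context += completion
        if "</answer>" in completion:
            break
        query = re.search(r"<search>(.*?)</search>", completion, re.S)
        if query:
            context += f"<information>{run_search(query.group(1))}</information>"
    return context  # the outcome reward is typically computed on the final <answer>

print(interleaved_rollout(DummyPolicy(), "Question: What is the capital of France?\n"))
```
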
If you have any updates or improvements for this document, please feel free to submit a Pull Request. Thank you!
Project or Paper | Project name or Paper title |
---|---|
GitHub | Username/Project |
Backbone Model | (Base / Instruct / Reasoning; HF Model) |
RL Algorithm | (PPO / GRPO / RLOO / REINFORCE++; OpenRLHF / Verl / Trl) |
Training Dataset | (Size / Source / HF Dataset) |
Rollout Configuration | (Batch Size * N Samples ; Temperature; Dynamic Sampling) |
Reward Function | (Outcome; Process; Repetition & Length) |
Policy Optimization | (KL Loss; Length Penalty; Token-level loss) |
Benchmark | (MATH/GPQA; R1 level; GPT-4o level) |
Core Insights | (Empirical / Theoretical / Insightful Curves) |
Additional Notes | (e.g., code snippet) |
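
For the "Reward Function" and "Policy Optimization" rows of the template, here is a minimal sketch of the rule-based outcome reward with a soft over-length penalty that many of the recipes above rely on; the answer tag, the whitespace token count, and the constants are illustrative (the penalty shape loosely follows DAPO's overlong shaping), not the exact rules of any listed project.

```python
import re

def outcome_reward(response, gold_answer, max_len=4096, overlong_buffer=512):
    """Rule-based verifiable reward: 1/0 for correctness of the tagged final answer,
    minus a penalty that ramps up linearly once the response exceeds the length
    budget (all constants are illustrative)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    correct = match is not None and match.group(1).strip() == gold_answer.strip()
    reward = 1.0 if correct else 0.0

    n_tokens = len(response.split())  # crude whitespace proxy for a tokenizer
    if n_tokens > max_len:
        overflow = min(n_tokens - max_len, overlong_buffer)
        reward -= overflow / overlong_buffer
    return reward

print(outcome_reward("... reasoning ... <answer>42</answer>", "42"))  # 1.0
```
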
If you find our repository useful in your research, please star us ⭐ and consider citing:
@misc{zhang2025TripleR,
title={Awesome RL Recipes for Reasoning},
author={Kaiyan Zhang and Yuchen Fan and Yuxin Zuo and Guoli Jia and Kai Tian and Xingtai Lv and Xuekai Zhu and Ermo Hua and Ning Ding and Biqing Qi and Bowen Zhou},
year={2025},
howpublished={\url{https://github.com/}},
note={Github Repository},
}