Awesome RL Reasoning Recipes ("Triple R")

A curated collection covering models, datasets, reward designs, optimization methods, hyperparameters, empirical findings, theoretical insights, and everything about reasoning with reinforcement learning.

News

  • [2025-05-27]: 🔥We are very excited to release MARTI: A Framework for LLM-based Multi-Agent Reinforced Training and Inference. Check it out: Github.
  • [2025-04-23]: 🔥Introducing TTRL — an open-source solution for online RL on data without ground-truth labels, especially test data. Check it out: Github and Paper.
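
As a rough sketch of the TTRL idea introduced above: TTRL constructs rewards on unlabeled (e.g., test) data by majority-voting over sampled answers and rewarding agreement with the voted pseudo-label. The snippet below is a simplified illustration of that reward, assuming exact-match answers; the helper name is ours, not from the paper's code.

```python
from collections import Counter

def ttrl_style_rewards(sampled_answers):
    """Sketch of a TTRL-style reward on unlabeled data: the majority-voted
    answer across rollouts serves as a pseudo-label, and each rollout is
    rewarded by agreement with it."""
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]

print(ttrl_style_rewards(["42", "42", "41", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```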

Contents

⚠️⚠️⚠️ For the most recent updates, please scroll to the bottom of each table:

Overview

This collection covers recent progress in reinforcement learning for large language model reasoning, with the timeline starting in 2025.

Large Language Models

Date Project Org Intro HF Model HF Dataset Takeaway Messages
2025.0102 PRIME-RL THU & UIUC
Shanghai AI Lab
Paper
GitHub
More
Eurus-2-7B-PRIME
Eurus-2-7B-PRIME-Zero
Eurus-2-RL-Data
ClickPRIME offers scalable Reinforcement Learning by using dense, token-level implicit rewards derived only from final outcomes. This bypasses costly step-by-step annotations, providing fine-grained feedback to improve sample efficiency and reasoning.
2025.0122 DeepSeek-R1 DeepSeek Paper
GitHub
More
DeepSeek-R1
DeepSeek-R1-Zero
——
ClickDeepSeek-R1's core contribution is demonstrating large-scale RL from scratch (600B+ parameters) without SFT, achieving emergent "aha moments" (self-reflective reasoning) and matching OpenAI o1's performance at roughly 1/30 of the cost.
2025.0122 Kimi k1.5 Kimi Paper
GitHub
More
—— ——
ClickKimi 1.5 introduces a simplified RL framework that leverages long-context scaling (128k tokens) and improved policy optimization (e.g., online mirror descent) to enhance reasoning and multimodal performance.
2025.0124 TinyZero Berkeley Twitter
GitHub
More
—— Countdown-Tasks-3to4
ClickTinyZero's core contribution is demonstrating that smaller language models (e.g., 1.5B-3B parameters) can develop complex reasoning, search, and self-verification abilities through Reinforcement Learning, replicating capabilities of larger models like DeepSeek R1-Zero at extremely low cost (<$30).
2025.0124 Open-R1 Huggingface GitHub OpenR1-Qwen-7B
OlympicCoder-7B
OlympicCoder-32B
OpenR1-Math-220k
codeforces
ClickOpen-R1's core contribution is providing the first fully open-source replication and release of the DeepSeek R1-Zero Reinforcement Learning training pipeline. Its main insight or goal is to democratize access to these advanced RL techniques for enhancing LLM reasoning and planning.
2025.0125 simpleRL-reason HKUST Paper
GitHub
More
Qwen-2.5-Math-7B-SimpleRL-Zero
Qwen-2.5-Math-7B-SimpleRL
MATH
ClickResearchers replicated the DeepSeek-R1-Zero and DeepSeek-R1 training using a 7B model with only 8K MATH examples, achieving strong results on complex mathematical reasoning.
2025.0205 Demystify-long-cot CMU Paper
GitHub
More
—— ——
ClickThe paper elucidates the role of RL in stabilizing and enhancing long CoT reasoning in LLMs, highlighting the necessity of reward shaping and verifiable reward signals for complex reasoning tasks.
2025.0207 No-aha-moment Sea AI Lab Blog
GitHub
—— Countdown-Tasks-3to4
ClickThis is the first public critique of the 'aha moment' associated with DeepSeek-R1-Zero-style training, suggesting that changes in response length are an intrinsic part of the reinforcement learning dynamics.
2025.0210 DeepScaler Agentica-Org Blog
GitHub
More
DeepScaleR-1.5B-Preview DeepScaleR-Preview-Dataset
ClickDeepScaleR's core contribution is demonstrating that a small 1.5B parameter model, fine-tuned using scaled Reinforcement Learning (RL) and an iterative context lengthening scheme, can surpass the reasoning performance of larger, state-of-the-art models like OpenAI's O1-Preview on complex benchmarks (e.g., AIME math problems).
2025.0210 Logic-RL MSRA & Ubiquant Paper
GitHub
More
—— knights-and-knaves knights-and-knaves-ZH
ClickThe paper introduces Logic-RL, a rule-based reinforcement learning framework that enables large language models to develop o3-mini-level reasoning skills through training on logic puzzles. The reasoning capabilities can also be transferred to other domains like math.
2025.0210 OREAL Shanghai AI Lab
SJTU & CUHK
Paper
GitHub
More
OREAL-32B OREAL-7B
OREAL-DeepSeek-R1-Distill-Qwen-7B
OREAL-32B-SFT
OREAL-7B-SFT
OREAL-RL-Prompts
ClickThe paper introduces OREAL, a reinforcement learning framework for mathematical reasoning with binary feedback. It proves that behavior cloning on positive samples is sufficient for optimal learning and proposes reward reshaping for negative samples. A token-level reward model addresses sparse rewards in long reasoning chains. OREAL achieves state-of-the-art results on math benchmarks.
2025.0217 LIMR SJTU Paper
GitHub
More
LIMR LIMR
ClickThe paper challenges the assumption that scaling up RL training data inherently improves performance in language models, instead finding that a strategically selected subset of 1,389 samples can outperform a full 8,523-sample dataset.
2025.0218 Open-Reasoner-Zero StepFun & THU Paper
GitHub
More
Open-Reasoner-Zero-7B
Open-Reasoner-Zero-32B
ORZ-Math-57k
ClickThe Open-Reasoner-Zero model has achieved notable performance, with Open-Reasoner-Zero-32B outperforming DeepSeek-R1-Zero-Qwen-32B on the GPQA Diamond benchmark while requiring significantly fewer training steps.
2025.0225 SWE-RL FAIR at Meta Paper
GitHub
More
—— ——
ClickSWE-RL enhances LLMs' code reasoning through RL using open-source software evolution data, achieving state-of-the-art results in software engineering tasks and demonstrating generalized reasoning capabilities beyond coding.
2025.0227 Med-RLVR Microsoft Research Paper
More
—— ——
ClickThe Med-RLVR framework demonstrates emergent medical reasoning via RL, achieving performance parity with SFT on in-distribution tasks and improving out-of-distribution generalization, all without explicit reasoning supervision, showcasing RL's potential in medicine.
2025.0303 VC-PPO Bytedance Paper
More
—— ——
ClickVC-PPO (Value-Calibrated PPO) diagnoses PPO's collapse in long CoT tasks as stemming from value function inaccuracies (initialization bias and reward signal decay in long sequences). Its core contribution is modifying PPO with value pretraining and decoupled GAE for actor and critic.
2025.0306 LCPO-L1 CMU Paper
GitHub
More
L1-Qwen-1.5B-Max
L1-Qwen-1.5B-Exact
——
ClickL1 introduces Length Controlled Policy Optimization (LCPO), an RL method enabling precise control over a reasoning model's thinking time (output length) via prompt instructions. It shows that RL effectively controls reasoning duration and unexpectedly enhances even short-chain reasoning capabilities.
2025.0310 MRT CMU Paper
Project
GitHub
—— ——
ClickMRT (Mixed-Reality Trajectory Preferences) introduces a novel method for fine-tuning cooperative LLM agents. It effectively blends human preferences on real interaction trajectories with AI preferences on synthetic variations, improving data efficiency. This mixed-reality approach surpasses purely AI-driven feedback (RLAIF), especially for complex, multi-turn collaborative tasks.
2025.0318 TOPR Mila & Reliant AI Paper
More
—— ——
ClickTOPR (Tapered Off-Policy REINFORCE) introduces a novel RL algorithm for fine-tuning LLMs. Its core contribution is using asymmetric, tapered importance sampling to modify REINFORCE, enabling stable and efficient off-policy learning. This allows reusing past data effectively without the instability often seen in other methods and without needing explicit KL regularization.
2025.0318 DAPO Bytedance
THU
Paper
GitHub
More
—— DAPO-Math-17k
ClickThe DAPO algorithm introduces four key techniques (Clip-Higher, Dynamic Sampling, Token-Level Loss, Overlong Shaping) for stable and efficient long-chain-of-thought RL training, surpassing previous SoTA results.
2025.0320 Open RS VNU University of Science & Knovel Engineering Lab Paper
GitHub
More
Open-RS1
Open-RS2
Open-RS3
open-s1
open-deepscaler
open-rs
ClickThe study investigates the potential of RL to improve reasoning in small LLMs. The results demonstrate rapid reasoning gains, with accuracy improvements on mathematical reasoning benchmarks, and highlight the efficacy of RL-based fine-tuning for small LLMs as a cost-effective alternative to large-scale approaches, using high-quality training data.
2025.0321 Dr. GRPO Sea AI Lab Paper
GitHub
More
Qwen2.5-Math-7B-Oat-Zero
Qwen2.5-Math-1.5B-Oat-Zero
Llama-3.2-3B-Oat-Zero
MATH
ClickThis work critically analyzes R1-Zero-like RL training. It reveals base model properties and GRPO algorithm biases (e.g., length bias) significantly impact outcomes. It contributes the efficient, unbiased Dr. GRPO algorithm and an open-source recipe/codebase for better understanding and reproduction.
2025.0321 FastCuRL Tencent Hunyuan Paper
GitHub
FastCuRL-1.5B-Preview FastCuRL
ClickFastCuRL introduces a simple, efficient Curriculum RL method for LLMs. Its core contribution uses target perplexity to dynamically scale the standard RL loss (like PPO), creating an effective curriculum without complex reward models or auxiliary components, enabling faster, more stable training.
2025.0328 AGRO Meta Paper
—— ——
ClickThis paper derives Any-Generation Reward Optimization (AGRO) from a consistency condition that holds across all possible generations of the model. AGRO achieves better convergence than the KL-regularized policy gradient method.
2025.0401 Z1 THU Paper
GitHub
Z1-7B Z1-Code-Reasoning-107K
ClickThis paper proposes training LLMs on code-related reasoning trajectories using a curated dataset and a "Shifted Thinking Window" technique. This allows models to reduce excessive thinking tokens, achieving efficient test-time scaling and generalizing reasoning abilities.
2025.0401 VAPO ByteDance Seed Paper
—— ——
ClickVAPO offers an integrated solution that effectively alleviates value-model bias, heterogeneous sequence lengths, and sparse reward signals.
2025.0407 ConciseRL Wand AI Paper —— ——
ClickThis work challenges the idea that longer reasoning chains in LLMs inherently mean better accuracy. It uses mathematical analysis of RL principles, particularly PPO, to show that lengthier responses often arise from the optimization process itself, not necessarily improved reasoning.
2025.0409 AdaRFT USC LIME Lab Paper
GitHub
—— DeepScaleR_Difficulty
ClickAdaRFT proposes Adaptive Curriculum Reinforcement Finetuning to improve LLM reasoning training efficiency. It dynamically adjusts task difficulty based on recent reward signals, accelerating learning by keeping challenges optimally balanced. Experiments on competition math benchmarks show up to 2x fewer steps and improved accuracy, using standard PPO with minimal changes.
2025.0410 Seed-Thinking-v1.5 ByteDance Seed GitHub —— ——
ClickSeed-Thinking-v1.5 is a high-performing reasoning model that combines curated chain-of-thought data, stable reinforcement learning, and advanced infrastructure to achieve strong results across math, coding, and logic tasks.
2025.0410 d1 & diffu-GRPO UCLA & Meta Paper
GitHub
Project
—— ——
Click This paper proposes d1, which adapts pre-trained masked dLLMs into reasoning models via a combination of SFT and RL. The RL method used is named diffu-GRPO.
2025.0413 Skywork-OR1 Skywork AI Paper
Blog
GitHub
Skywork-OR1-32B-Preview
Skywork-OR1-7B-Preview
Skywork-OR1-Math-7B
Skywork-OR1-RL-Data
Click Skywork-OR1 is a series of robust open-source models trained on carefully curated math and code data. The training process incorporates several modifications to the original GRPO, including offline and online data filtering, multi-stage training, and adaptive entropy control.
2025.0415 DeepMath Tencent & SJTU Paper
GitHub
zwhe99/DeepMath-Zero-7B
zwhe99/DeepMath-Zero-Math-7B
zwhe99/DeepMath-1.5B
zwhe99/DeepMath-Omn-1.5B
zwhe99/DeepMath-103K
Click DeepMath-103K is a large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. Trained on DeepMath-103K, DeepMath series models achieve state-of-the-art performance on many math benchmarks.
2025.0421 LUFFY Shanghai AI Lab Paper
GitHub
LUFFY-Qwen-Math-7B-Zero
LUFFY-Qwen-Math-1.5B-Zero
Openr1-Math-46k-8192
ClickThis paper introduces LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training.
2025.0423 TTRL THU&Shanghai AI Lab Paper
GitHub
—— ——
ClickThis paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs).
2025.0430 Phi-4-reasoning Microsoft Paper Phi-4-reasoning ——
ClickThis paper introduces Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks.
2025.0511 BLEUBERI Maryland Paper
GitHub
—— ——
ClickDemonstrates that BLEU, a simple string-matching metric, can effectively serve as a reward function for instruction-following tasks, rivaling complex reward models.
2025.0512 INTELLECT-2 PrimeIntellect-ai Paper
GitHub
INTELLECT-2 ——
ClickINTELLECT-2 is a 32 billion parameter language model trained through a reinforcement learning run leveraging globally distributed, permissionless GPU resources contributed by the community.
2025.0514 Qwen3 Alibaba Qwen Paper
GitHub
Qwen3 ——
Clickinsights and contributions about RL for reasoning within 30 words.
2025.0516 Subnetwork RL UIUC Paper —— ——
Clickinsights and contributions about RL for reasoning within 30 words.
2025.0516 Data Synthesis RL PKU&MIT Paper
GitHub
—— ——
Clickinsights and contributions about RL for reasoning within 30 words.
2025.0519 AR-Lopti CUHK Paper
GitHub
—— ——
ClickThis paper identifies a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes.
2025.0519 AnytimeReasoner Sea AI Lab Paper
GitHub
—— DeepScaleR-Preview-Dataset
ClickThis paper proposes a framework for optimizing anytime reasoning under arbitrary token budgets, featuring decoupled optimization of thinking and summarization, dense verifiable rewards, and budget relative policy optimization.
2025.0521 EM-PT UIUC Paper
GitHub
—— ——
ClickThis paper shows that a simple entropy-minimization objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks.
2025.0521 NOVER KCL&SJTU Paper
GitHub
—— ——
ClickThis paper presents verifier-free R1-Zero-like training, which enables training on any data (beyond math and coding)!
2025.0522 AceReason-Nemotron Nvidia Paper AceReason-Nemotron-14B ——
ClickThis paper demonstrates that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models.
2025.0522 KTAE CAS Paper
GitHub
KTAE-7B/1.5B ——
ClickThis paper improves the advantage calculation of GRPO, providing more fine-grained token-level advantages and effectively reducing generation length.
2025.0523 QwenLong-L1 Qwen-Doc Paper
GitHub
QwenLong-L1-32B ——
Clickinsights and contributions about RL for reasoning within 30 words.
2025.0523 Trinity-RFT Alibaba Group Paper
GitHub
—— ——
ClickTrinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models.
2025.0524 LlamaRL Meta Paper —— ——
ClickDistributed async RL framework for LLMs, achieving 10× training speed over DeepSpeed; scales to 405B parameters.
2025.0525 SeRL ZJU Paper
GitHub
—— ——
ClickThis paper proposes Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data.
2025.0525 BRIDGE CMU Paper
GitHub
—— ——
ClickThe paper proposes behavior injection, a task-agnostic data augmentation method that enhances the effectiveness of reinforcement fine-tuning for language models by improving rollout accuracy and data co-influence, leading to consistently better post-RL performance.
2025.0526 REA-RL HIT Paper
GitHub
—— ——
ClickIntroduces REA-RL, which enhances the efficiency of LRMs by introducing a reflection model for efficient online scaling, together with a reflection reward that discourages non-reflective responses.
2025.0527 ConciseR Tencent Hunyuan Paper
GitHub
—— ——
ClickThis paper proposes a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in LLMs, named ConciseR.
2025.0527 VeriFree Sea AI Lab Paper
GitHub
—— ——
ClickThis paper proposes a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer.
2025.0527 One-Shot-EM Ubiquant Paper
GitHub
—— ——
ClickThis paper trained 13,440 large language models and found that entropy minimization requires only a single unlabeled example and about 10 optimization steps to achieve performance improvements greater than those obtained using thousands of examples and carefully designed rewards in rule-based reinforcement learning.
2025.0528 Entropy-RL Shanghai AI Lab & THU Paper
GitHub
—— ——
ClickThis paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy.
2025.0528 RENT-RL CMU Paper
GitHub
—— ——
ClickRENT: Reinforcement Learning via Entropy Minimization is a fully unsupervised reinforcement learning method that improves reasoning performance by using the model's own confidence as a reward.
2025.0528 SynLogic MiniMax-AI Paper
GitHub
—— ——
ClickThis paper presents SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks.
2025.0530 ProRL Nvidia Paper Nemotron-Qwen-1.5B ——
ClickThis paper challenges prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling.
2025.0530 ReasoningGym OpenThought Paper
GitHub
—— ——
ClickThis paper introduces Reasoning Gym, a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games.
2025.0530 AReaL InclusionAI Paper
GitHub
—— ——
ClickAReaL introduces a fully asynchronous reinforcement learning system for language reasoning tasks, decoupling rollout generation from model training to significantly improve GPU utilization and training speed—achieving up to 2.57× speedup over synchronous systems—while maintaining or improving model performance.
2025.0602 HighEntropyRL Qwen&THU Paper
Project
—— ——
ClickHigh-entropy minority tokens play an outsized role in RLVR training. This paper provides actionable insights into optimizing reward design.
2025.0602 RLVR-Decomposed Princeton Paper
GitHub
—— ——
ClickShows that penalizing incorrect answers alone can significantly boost LLM reasoning via PPO—challenging conventional RLHF approaches.
2025.0602 Writing-Zero Star Writing Paper —— ——
ClickApplies RLVR to creative tasks like story writing by converting non-verifiable tasks into verifiable subgoals.
2025.0602 SRPO ByteDance Seed & OSU Paper —— ——
ClickProposes a two-stage RL framework combining self-reflection and Group Relative Policy Optimization to boost multimodal reasoning.
2025.0603 KDRL HIT&Huawei Paper —— ——
ClickPresents KDRL, a unified framework combining knowledge distillation and RL to enhance LLM reasoning post-training, improving sample efficiency and generalization.
2025.0603 TRePO Amazon Paper —— ——
ClickProposes that response-level rewards suffice for effective online RL in LLMs, offering a mathematical foundation for this approach.
2025.0603 Critique-GRPO CUHK Paper
GitHub
—— ——
ClickCombines natural language critiques with numerical rewards in RL to overcome performance plateaus in LLM reasoning tasks.
2025.0603 Unlikeliness Rewards CMU Paper —— ——
ClickThe paper introduces an unlikeliness reward mechanism to address biases in Group Relative Policy Optimization (GRPO), enhancing the diversity and accuracy of large language models on structured tasks like formal theorem proving.
2025.0604 RewardAnything PKU&WeChatAI Paper
GitHub
RewardAnything-8B-v1 ——
ClickIntroduces principle-following reward models that generalize across tasks by adhering to natural language specifications, improving alignment without retraining.
2025.0605 ALP Stanford Paper —— ——
ClickIntroduces adaptive length penalties in reinforcement learning to encourage concise reasoning in large language models, enhancing efficiency without sacrificing performance.
2025.0605 PatternSelection HKU Paper —— ——
ClickExplores mechanisms for selecting reasoning patterns in reinforcement learning for language models, aiming to enhance decision-making processes.
2025.0605 LogicPuzzleRL PKU Paper
GitHub
—— ——
ClickUtilizes reinforcement learning on custom logic puzzles to cultivate robust mathematical reasoning in large language models.
2025.0605 DOTS UIUC&NYU Paper —— ——
ClickProposes methods to improve data efficiency in reinforcement fine-tuning of LLMs through difficulty-targeted online data selection and rollout replay.
2025.0605 ether0 FutureHouse Paper
GitHub
—— ——
ClickA 24B parameter model trained for scientific reasoning in chemistry, capable of generating molecular structures from natural language prompts.
2025.0605 Writing-RL Alibaba Paper —— ——
ClickCurriculum-based RL improves long-form narrative coherence through structured rewards.
2025.0606 Confidence Moscow Paper —— ——
ClickConfidence-driven few-shot RL fine-tuning improves sample efficiency without reward supervision.
2025.0607 Thinking vs. Doing CMU Paper
GitHub
—— ——
ClickLLMs learn test-time interaction: when to think, when to act—enhances reasoning efficiency.
2025.0607 OptimalReasoning THU Paper —— ——
ClickTheoretical study on RL-optimality gap for chain-of-thought reasoning.
2025.0608 YouronMath Keio Univ Paper
GitHub
—— ——
ClickGamified interface improves LLM math performance via reward shaping and iterative gameplay.
2025.0608 Play to Generalize Rice Paper
GitHub
—— ——
ClickTrains reasoning via gameplay to transfer skills across tasks.
2025.0608 RPT Microsoft Paper —— ——
ClickUses RL objectives during pretraining to equip LLMs with better downstream reasoning capabilities.
2025.0608 MARL WM Univ. Paper —— ——
ClickLLMs critique each other in a reflective multi-agent framework to iteratively refine reasoning chains.
2025.0608 RLT Alibaba Paper —— ——
ClickRL teachers dynamically allocate thinking-time during inference to balance latency and accuracy.
2025.0608 SwS Microsoft Paper —— ——
ClickLLM self-assesses its weaknesses, then generates challenging tasks to improve via RL.
2025.0608 RuleReasoner UCLA Paper —— ——
ClickBlends rule-based logic with RL-driven dynamic sampling to solve structured reasoning problems.
2025.0608 Bingo Microsoft Paper —— ——
ClickRL method improves reasoning by amplifying attention on critical intermediate steps.
2025.0609 CoRT Qwen Paper
GitHub
—— ——
ClickTool-augmented RL trains LLMs to reason via code synthesis and self-refinement loops.
2025.0609 VerIF THU Paper
GitHub
—— ——
ClickVerification-first RL training: modularly verifies and rewrites faulty LLM outputs during policy updates.
2025.0609 Router-R1 UIUC Paper
GitHub
—— ——
ClickRL-based routing policies optimize multi-round tool use and answer aggregation.
2025.0609 RePO CUHK + AILab Paper
GitHub
—— ——
ClickReplay-Enhanced Policy Optimization: improves sample efficiency and stability of reasoning training.
2025.0609 SSA CUNY Paper —— ——
ClickPromotes consistency by aligning reasoning traces across training samples with shared structure.
2025.0609 ComfyUI-R1 HIT & Alibaba Paper
GitHub
—— ——
ClickReasoning-powered LLM agent for UI pipeline automation inspired by ComfyUI workflows.
2025.0609 Learning to Clarify Adobe Paper —— ——
ClickLLMs learn when and how to ask clarification questions via reward-weighted fine-tuning.
2025.0610 Magistral Mistral AI Paper Magistral-Small-2506 ——
ClickFirst RL-trained reasoning LLM from Europe. Strong multilingual chain-of-thought and tool use. Open-source (Apache 2.0).
2025.0610 FastEasy & Deep Hard FDU Paper —— ——
ClickApplies dynamic penalty on output length to focus model effort on harder inputs.
2025.0610 PAG ByteDance Paper
GitHub
—— ——
ClickLLMs generate, verify, and correct responses in multi-turn RL framework inspired by verifier-agent loops.
2025.0610 SAL MIT Paper —— ——
ClickExplores self-adjusting LLM behaviors at inference using RL-inspired introspection and elicitability measures.
2025.0610 Unsupervised Elicitation Anthropic Paper —— ——
ClickReveals hidden reasoning capacities without supervision—implications for reward-free training.
2025.0611 LearnAlign CUHK Paper —— ——
ClickGradient-alignment-driven reasoning data selection for better RL fine-tuning of LLMs.
2025.0611 Continue-Thinking Token CTK Paper —— ——
ClickNew token inserted at inference to trigger deeper reasoning steps with zero-shot generalization.
2025.0611 TreeRL THU-DM Paper
GitHub
—— ——
ClickCombines on-policy RL and tree search for interpretable decision traces in reasoning tasks.
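
Many of the recipes in the table above (DeepSeek-R1, DAPO, Dr. GRPO, KTAE, and others) build on GRPO-style group-relative advantages computed from rule-based outcome rewards. Below is a minimal illustrative sketch of that advantage computation, not any project's exact implementation; the function name and epsilon are our own choices.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each rollout's scalar reward against
    the mean and std of all rollouts sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    # Dr. GRPO (listed above) argues for dropping the std division to remove a difficulty bias.
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for one prompt, scored 1/0 by a rule-based verifier on the final answer.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [ 1. -1. -1.  1.]
```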

Multimodal Models

Date Project Org Intro HF Model HF Dataset Takeaway Messages
2025.0128 Open-R1-MultiModal LLMs Lab GitHub
More
Qwen2-VL-2B-GRPO-8k
Qwen2-VL-7B-GRPO-8k
multimodal-open-r1-8k-verified
ClickOpen-R1-MultiModal provides an open-source replication of R1-Zero-like RL for Multimodal LLMs, aiming to enhance complex visual reasoning. It demonstrates the effectiveness of these RL techniques for boosting multimodal performance and promotes reproducibility in the field.
2025.0202 R1-V Deep Agent Blog
GitHub
More
—— Clevr_CoGenT_TrainA_R1
ClickR1-V applies RL, specifically RLV-Instruct, to fine-tune VLMs. It enhances complex visual reasoning and instruction-following capabilities in VLMs beyond standard supervised fine-tuning.
2025.0215 VLM-R1 OmAI Lab Blog
GitHub
More
OVD
Math
REC
——
ClickVLM-R1 applies R1-style RL to VLMs, improving stability and generalization on visual reasoning tasks. It shows that RL enhances VLM generalization beyond standard fine-tuning, achieving SOTA results, particularly on complex or out-of-domain benchmarks.
2025.0303 Visual-RFT SJTU & Shanghai AI Lab & CUHK Paper
GitHub
More
Reasoning Grounding COCO_base65
COCO
COCO_8_classes_4_shot
LVIS_few_shot
Flower_4_shot
FGVC_Aircraft_4_shot
Car196_4_shot
Pets37_4_shot
ClickVisual-RFT introduces Visual Reinforcement Fine-tuning, which extends reinforcement learning with verifiable rewards to visual perception tasks and is effective for fine-tuning with limited data.
2025.0306 R1-VLM GroundLight Blog
GitHub
More
—— ——
ClickR1-VLM enhances VLMs using RL, contributing significantly improved performance on complex visual reasoning tasks (spatial, counting, logic) where standard models falter. It shows that RL effectively unlocks advanced, multi-step reasoning capabilities specifically for vision-language understanding.
2025.0310 VisualThinker-R1-Zero TurningPoint Paper
GitHub
More
VisualThinker-R1-Zero ——
ClickVisualThinker-R1-Zero adapts the R1-Zero RL paradigm (no supervised fine-tuning) to VLMs, achieving SOTA visual reasoning. It shows that complex visual reasoning can be effectively cultivated directly via RL on a base VLM, bypassing the need for supervised data.
2025.0310 MM-EUREKA USTC & ZTE & NEU Paper
Github
More
MM-Eureka-Qwen-7B MM-Eureka-Dataset
ClickMM-EUREKA reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space, which demonstrates that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, showing superior data efficiency compared to alternative approaches.
2025.0310 Curr-ReFT Shanghai AI Lab & SJTU & HKU Paper
GitHub
More
3B-Curr-ReFT
7B-Curr-ReFT
Curr-ReFT-data
ClickCurr-ReFT proposes a Curriculum Reinforcement Finetuning strategy to enhance out-of-distribution generalization and reasoning abilities. The curriculum paradigm ensures steady progression. Moreover, a rejection-sampling-based self-improvement stage is proposed to maintain the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples.
2025.0311 LLM-R1 CUHK & Ant Group Paper
GitHub
—— ——
ClickLLM-R1 contributes the RMAVO algorithm to stably enhance LLM reasoning using RL, preventing reward hacking and achieving SOTA results with smaller models via an open-source implementation. It shows that reward model assistance in value optimization is key for stable RL.
2025.0311 Vision-R1 ECNU & Xiaohongshu Paper
GitHub
—— Vision-R1-cold
ClickVision-R1 adapts the R1-Zero RL paradigm for VLMs, training them on visual reasoning chains. Its contribution is significantly boosting complex multimodal reasoning performance. It shows that RL applied to explicit reasoning steps effectively enhances VLM capabilities.
2025.0311 MMR1 NTU & SUTD & LASA GitHub MMR1-Math-v0-7B MMR1-Math-RL-Data-v0
ClickMMR1-Math-v0 achieves state-of-the-art performance among open-source 7B multimodal models, competing effectively even against proprietary models with significantly larger parameter sizes—all trained using only 6k carefully curated data instances.
2025.0315 MetaSpatial Northwestern University Paper
Project
GitHub
—— 3D_Reasoning
ClickMetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, realistic, and adaptive scene generation for applications in the metaverse, AR/VR, and game development.
2025.0327 Reason-RFT PKU & BAAI & CASIA & School of Artificial Intelligence, University of Chinese Academy of Sciences Paper
GitHub
Project
—— tanhuajie2001/Reason-RFT-CoT-Dataset
ClickReason-RFT introduces a two-phase training paradigm: (1) SFT with CoT data to activate reasoning potential, followed by (2) GRPO-based reinforcement learning to enhance generalization, with potential applications in Embodied AI.
2025.0404 MAYE SJTU & GAIR Paper
GitHub
—— ManTle/MAYE
ClickMAYE is a transparent, reproducible framework and a comprehensive evaluation scheme for applying reinforcement learning (RL) to vision-language models (VLMs). Its codebase is developed entirely from scratch without relying on any existing RL toolkits.
2025.0408 Step-R1-V-Mini StepFun Website —— ——
ClickStep-R1-V-Mini excels in the domain of visual reasoning, while also demonstrating top-tier performance in mathematical, code, and textual reasoning tasks. It supports a context length of 100k.
2025.0409 Kimi-VL-Thinking Kimi Team Technical Report
GitHub
moonshotai/Kimi-VL-A3B-Thinking ——
ClickKimi-VL-Thinking is designed to enhance long-horizon reasoning capabilities in vision-language tasks. Built on a foundation of long CoT SFT and RL, with only 2.8B activated parameters, Kimi-VL-Thinking achieves strong performance across a range of tasks requiring long-term reasoning. It excels in domains such as MMMU, MathVision, and MathVista, achieving impressive scores of 61.7, 36.8, and 71.3, respectively.
2025.0409 VideoChat-R1 Shanghai AI Lab & NJU & ZJU & USTC & Shanghai Innovation Institute & SIAT Paper
GitHub
—— ——
ClickVideoChat-R1 provides a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, exhibiting remarkable performance on spatio-temporal perception tasks without sacrificing chat ability, while showing emerging spatio-temporal reasoning abilities.
2025.0410 Perception-R1 HUST & BUPT & StepFun & JHU & Tsinghua University Paper
GitHub
Perception-R1 Perception-R1
ClickPerception-R1 explores the effects of RL on different perception tasks; the researchers observe that perceptual perplexity is a major factor in determining the effectiveness of RL. The scalable Perception-R1 achieves remarkable performance on perception tasks.
2025.0410 VL-Rethinker TIGER-Lab Paper
GitHub
TIGER-Lab/VL-Rethinker-7B
TIGER-Lab/VL-Rethinker-72B
——
ClickVL-Rethinker proposes Selective Sample Replay (SSR) and Forced Rethinking to enhance fast-thinking models. The model achieves remarkable performance on multi-disciplinary benchmarks.
2025.0501 T2I-R1 CUHK MMLab & CUHK MiuLar Lab & Shanghai AI Lab Paper
GitHub
CaraJ/ORM-T2I-R1 ——
ClickT2I-R1 is a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. The semantic-level CoT is utilized for high-level planning of the prompt, and the token-level CoT is designed for low-level pixel processing during patch-by-patch generation.
2025.0516 VisualPlanning Cambridge & UCL & Google Paper
GitHub
—— ——
Click VisualPlanning enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions.
2025.0521 GRIT UCSC & eBay Paper
GitHub
Project
Demo
yfan1997/GRIT-20-InternVL-2B
yfan1997/GRIT-20-Qwen2.5-VL-3B
yfan1997/GRIT_data
Click GRIT proposes grounded reasoning with images and text for training MLLMs to think with images. The models generate reasoning chains that interleave natural language and explicit bounding box coordinates. Moreover, built upon the GRPO algorithm, GRIT eliminates the need for annotated reasoning chains or explicit bounding box labels, requiring as few as 20 image-question-answer triplets to train the model.
2025.0522 GoT-R1 HKU MMLab & CUHK MMLab & Sensetime & BUAA Paper
GitHub
gogoduan/GoT-R1-1B
gogoduan/GoT-R1-7B
——
Click GoT-R1 applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. To achieve this, a dual-stage multi-dimensional reward framework is proposed that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach.
2025.0529 Jigsaw-R1 ESAT-PSI Paper
GitHub
—— ——
ClickThis paper provides a comprehensive study of rule-based visual RL using jigsaw puzzles as a structured experimental framework, revealing several key findings.
2025.0603 SynthRL NUS&CUHK Paper
GitHub
SynthRL ——
ClickIntroduces SynthRL, a pipeline that synthesizes verifiable data to train vision-language models, boosting performance on visual math reasoning tasks.
2025.0603 Cell-o1 UIUC Paper
GitHub
—— ——
ClickPresents Cell-o1, an LLM trained via RL to annotate single-cell RNA sequencing data, achieving expert-level reasoning in batch-level contexts.
2025.0604 MiMo-VL XiaomiMimo Paper
GitHub
MiMo-VL-7B ——
ClickDetails MiMo-VL-7B models achieving state-of-the-art performance in visual understanding and multimodal reasoning through mixed on-policy RL.
2025.0604 ReVisual-R1 ZJU&FDU Paper
GitHub
—— ——
ClickIntroduces ReVisual-R1, a staged RL approach enhancing MLLM reasoning by combining optimized cold starts with text-only RL fine-tuning.
2025.0604 LaF-GRPO PolyU Paper
GitHub
—— ——
ClickDevelops an LLM-as-Follower reward mechanism to generate in-situ navigation instructions for the visually impaired, enhancing instruction usability.
2025.0611 Visual PTRL UC Berkeley Paper —— ——
ClickTrains visual backbones on raw image data with reinforcement rewards—unsupervised and scalable.
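
Most of the multimodal recipes above follow the same rule-based, verifiable-reward setup as their text-only counterparts: a small format reward for emitting think/answer tags plus an outcome reward checked against ground truth. The toy sketch below illustrates that pattern; the tags, weights, and exact-match check are illustrative assumptions, not any listed project's actual reward.

```python
import re

THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(response: str, ground_truth: str,
                      format_weight: float = 0.1) -> float:
    """Toy R1-style verifiable reward: a small bonus for following the
    <think>/<answer> format plus an outcome reward for a correct answer."""
    match = THINK_ANSWER.search(response)
    if match is None:
        return 0.0  # no well-formed answer, no reward
    answer = match.group(1).strip()
    accuracy = 1.0 if answer == ground_truth.strip() else 0.0
    return format_weight + accuracy

print(rule_based_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 1.1
```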

Agentic Applications

Date Project Org Intro HF Model HF Dataset Takeaway Messages
2025.0126 RAGEN RAGEN-AI Paper
GitHub
—— ——
ClickRAGEN introduces a RL framework to train reasoning-capable LLM agents for interactive, stochastic environments. Its core contribution is the Reasoning-Interaction Chain Optimization (RICO) algorithm, which jointly optimizes reasoning and action strategies by reinforcing entire trajectories.
2025.0203 Verifiers Independent GitHub —— ——
ClickThis repository contains a set of tools for reinforcement learning with LLMs in verifiable environments, and can be used for LLM agent RL.
2025.0207 AgenticReasoning Univ. of Oxford Paper
GitHub
—— ——
ClickThis framework introduces the Mind Map agent, which constructs a structured knowledge graph to track logical relationships, improving deductive reasoning.
2025.0303 ReSearch Agent-RL GitHub
More
—— ——
ClickThe project trains LLMs from scratch, utilizing RL with GRPO to learn to reason via search operations, without reliance on pre-existing reasoning frameworks or supervised data.
2025.0312 Search-R1 UIUC & UMass Amherst Paper
GitHub
More
Search-R1 2018 Wikipedia
ClickThe paper introduces Search-R1, a novel RL framework that enables LLMs to interact with search engines in an interleaved manner with their own reasoning. The framework is shown to be effective, with experiments demonstrating average relative improvements of 41% and 20% over RAG baselines, and providing insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning.
2025.0318 R1-Searcher RUC Paper
GitHub
Llama-3.1-8B-instruct-RAG-RL
Qwen-2.5-7B-base-RAG-RL
RAG-RL-Hotpotqa
ClickR1-Searcher enhances LLM reasoning via RL by training the model to perform adaptive model-based search during generation. This integration enables flexible thinking depth, improving reasoning efficiency and performance compared to fixed-step methods like R1-Zero.
2025.0319 SWEET-RL Meta AI Paper
GitHub
—— collaborative_agent_bench
ClickSweet-RL introduces a novel RL algorithm for multi-turn collaborative reasoning LLM agents. Its core contribution is improving credit assignment across long interactions by using an asymmetric actor-critic structure where the critic leverages additional training-time information for step-wise evaluation.
2025.0327 UI-R1 Vivo AI Lab & CUHK Paper
GitHub
Qwen2.5-VL-3B-UI-R1 UI-R1-3B-Train
ClickThis paper proposes UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks.
2025.0404 DeepResearcher SJTU Paper
GitHub
DeepResearcher-7b ——
ClickThis paper introduces DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions.
2025.0414 ART OpenPipe Blog
GitHub
—— ——
ClickThis release is an early alpha focused on best-in-class training efficiency and agentic multi-turn support.
2025.0414 GUI-R1 CAS & NUS Paper
GitHub
—— GUI-R1
ClickThis paper proposes GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling.
2025.0415 ReTool ByteDance Paper
GitHub
More
ReTool-Qwen-32B ReTool-SFT
ClickReTool is a reinforcement learning framework that integrates code interpreter execution into the reasoning loop of large language models (LLMs) to improve their mathematical reasoning capabilities. The framework consists of two primary stages: cold-start supervised fine-tuning and reinforcement learning with interleaved code execution rollout, allowing the model to learn when and how to invoke tools based on outcome feedback.
2025.0428 ARTIST Microsoft Paper —— ——
ClickARTIST enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains, leveraging outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision.
2025.0430 WebThinker RUC Paper
GitHub
More
WebThinker-QwQ-32B
WebThinker-R1-7B
WebThinker-R1-14B
WebThinker-R1-32B
——
ClickWebThinker is a deep research agent that empowers large reasoning models (LRMs) to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. It integrates a Deep Web Explorer module and employs an Autonomous Think-Search-and-Draft strategy, allowing for real-time report writing and information gathering.
2025.0506 SkyRL-v0 NovaSky-AI blog
GitHub
SkyRL-Agent-7B-v0
SkyRL-Agent-8B-v0
SkyRL-Agent-14B-v0
SkyRL-v0-293-data
ClickThis paper introduces SkyRL, the RL training pipeline for multi-turn tool use LLMs, optimized for long-horizon, real-environment tasks like SWE-Bench, built on top of VeRL and OpenHands. Using SkyRL, we are able to achieve promising results on SWE-Bench-Verified across model lines, using around 300 samples of training data!
2025.0512 Tool-N1 NVIDIA Paper
GitHub
—— ——
ClickThis paper presents Nemotron-Research-Tool-N1, a family of tool-using reasoning language models. These models are trained with an R1-style reinforcement learning algorithm that uses a binary reward to supervise only the structural format and functional correctness of tool calls, without requiring explicit reasoning annotations.
2025.0512 ZeroTIR FDU & Xiaohongshu Paper
GitHub
—— ——
ClickThis paper investigates RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples.
2025.0513 AgentCPM-GUI OpenBMB GitHub openbmb/AgentCPM-GUI ——
ClickAgentCPM-GUI is an open-source on-device LLM agent model jointly developed by THUNLP, Renmin University of China and ModelBest. Built on MiniCPM-V with 8 billion parameters, it accepts smartphone screenshots as input and autonomously executes user-specified tasks.
2025.0514 AlphaEvolve Google DeepMind Blog —— ——
ClickAlphaEvolve is an evolutionary coding agent powered by large language models for general-purpose algorithm discovery and optimization.
2025.0515 GiGPO NTU&Skywork Paper
GitHub
—— ——
ClickThis paper proposes Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence.
2025.0516 AutoRefine USTC Paper
GitHub
—— ——
ClickThis paper proposes AutoRefine, a reinforcement learning posttraining framework that adopts a new "search-and-refine-during-think" paradigm.
2025.0520 Time-R1 UIUC Paper
GitHub
—— ——
ClickThis paper introduces Time-R1, the first framework to endow a moderate-sized (3B-parameter) LLM with comprehensive temporal abilities: understanding, prediction, and creative generation.
2025.0521 Empirical Study UIUC Paper
GitHub
—— ——
ClickThis paper highlights several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference.
2025.0521 StepSearch SenseTime Paper
GitHub
—— ——
ClickThis paper introduces StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method.
2025.0521 GUI-G1 RUC Paper
GitHub
—— ——
ClickThis paper identifies three distinct challenges in the R1-Zero-like training pipeline of R1-style GUI agents: grounding is harmed by longer reasoning due to grounding's reliance on image tokens; common reward functions induce size-sensitive reward hacking; and GRPO biases agents toward simpler examples due to its objective.
2025.0522 Tool-Star RUC Paper
GitHub
Tool-Star-Qwen-3B Multi-Tool-RL-10K
Tool-Star-SFT-54K
ClickThis paper introduces Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning.
2025.0522 R1-Searcher++ RUC Paper
GitHub
—— ——
Clickinsights and contributions about RL for reasoning within 30 words.
2025.0522 ARPO CUHK Paper
GitHub
—— ——
ClickThis paper investigates end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks.
2025.0522 AgentThink THU&McGill Paper —— ——
ClickThis paper introduces AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks.
2025.0523 Agent-Distillation KAIST Paper
GitHub
—— ——
ClickThis paper proposes Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools.
2025.0526 DeepEyes Xiaohongshu Paper
GitHub
DeepEyes-7B DeepEyes-Datasets-47k
ClickThis paper explores the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT.
2025.0527 rStar-Coder MSRA Paper
GitHub
—— ——
ClickThis paper introduces rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems, 580K long-reasoning solutions along with rich test cases of varying difficulty.
2025.0527 SPA-RL-Agent PolyU Paper
GitHub
—— ——
ClickThis paper proposes Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion.
2025.0528 WebDancer Tongyi Lab Paper
GitHub
—— ——
ClickThe paper introduces a unified, data-centric training paradigm for developing agentic web research agents, exemplified by WebDancer, which combines supervised learning and reinforcement learning to achieve strong multi-step information-seeking performance on GAIA and WebWalkerQA benchmarks.
2025.0529 ML-Agent SJTU Paper —— ——
ClickThis paper explores the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL).
2025.0530 Pangu DeepDiver Huawei Paper —— ——
ClickThe paper introduces Pangu DeepDiver, a reinforcement learning framework that equips large language models with adaptive search intensity scaling (SIS) for open-web question answering, using a new WebPuzzle dataset to improve evidence-seeking behavior under real-world ambiguity and noise.
2025.0601 VerlTool TIGER AI Lab GitHub Qwen2.5-Math-VerlTool ——
ClickVerlTool is a unified and easy-to-extend tool-agent training framework based on verl.
2025.0602 SCA UCB & Meta Paper —— ——
ClickLLMs generate and solve their own tasks via a "Code-as-Task" setup, using RL for learning. Yields >2× gains on tool-use benchmarks.
2025.0602 MMedAgent-RL UNC Paper —— ——
ClickMulti-agent reinforcement learning for medical reasoning with multimodal data. Promotes coordination and robustness across specialized agents.
2025.0603 CURE ByteDance Seed Paper
GitHub
reasonflux-coder ——
ClickIntroduces CURE, a framework where code generation and unit testing co-evolve through RL, enhancing code accuracy without ground-truth supervision.
2025.0604 Seed-Coder ByteDance Seed Paper
GitHub
Seed-Coder ——
ClickProposes a self-curating code model that generates and selects its own training data, enhancing code generation capabilities without external supervision.
2025.0604 DyMo Cohere Paper —— ——
ClickPresents a self-verification sampling method for LLMs to enhance tool use by predicting and verifying intermediate steps before proceeding.
2025.0604 R-Search CAS Paper
GitHub
—— ——
ClickPresents a multi-reward RL framework enabling LLMs to integrate reasoning with search, improving performance on complex logic and knowledge tasks.
2025.0605 MedAgentGym Emory Univ. Paper
GitHub
—— ——
ClickIntroduces a training environment for LLM agents focused on code-based medical reasoning, facilitating the development of AI in healthcare applications.
2025.0605 CI-RL Purdue&Microsoft Paper —— ——
ClickApplies reinforcement learning to enhance contextual integrity in LLMs, aligning their outputs with privacy and safety norms.
2025.0611 Grounding-R1 Salesforce Blog —— ——
ClickGUI grounding via GRPO RL—clicks relevant areas without bounding-box or rationale supervision.
2025.0611 Agent-RLVR Scale AI Paper —— ——
ClickTrains software agents using both environmental feedback and expert guidance—targeting real-world SE tasks.
2025.0611 ReVeal MAR & THU Paper —— ——
ClickSelf-evolving agents improve code generation via iterative RL-based generate–verify cycles.
2025.0611 CAGSR-vLLM-MTC UC Berkeley Paper —— ——
ClickEnhances multi-turn reasoning via vLLM + self-supervised fine-tuning + RL on CoT traces.
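
The agentic entries above (Search-R1, ReTool, Tool-N1, WebThinker, and others) generally optimize a multi-turn loop in which the model interleaves reasoning with tool or search calls and receives a verifiable reward only on the final outcome. The skeleton below sketches that loop; the `policy` and `env` interfaces are hypothetical placeholders, not an API from any listed project.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)  # (thought, action, observation) triples
    reward: float = 0.0

def collect_trajectory(policy, env, max_turns: int = 8) -> Trajectory:
    """One multi-turn rollout: the policy alternates between reasoning and tool
    calls, and only the final outcome is scored by a verifiable reward."""
    traj, obs, done = Trajectory(), env.reset(), False
    for _ in range(max_turns):
        thought, action = policy.act(obs)    # e.g. emit <search>query</search> or a final answer
        obs, done = env.step(action)         # tool / search / code-execution result fed back
        traj.steps.append((thought, action, obs))
        if done:
            break
    traj.reward = env.outcome_reward()       # e.g. 1.0 if the final answer verifies, else 0.0
    return traj
```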

Projects

Contributing

If you have any updates or improvements for this document, please feel free to submit a Pull Request. Thank you!

202x.0x0x, Template

Project or Paper Project name or Paper title
GitHub Username/Project
Backbone Model (Base / Instruct / Reasoning; HF Model)
RL Algorithm (PPO / GRPO / RLOO / REINFORCE++; OpenRLHF / Verl / Trl)
Training Dataset (Size / Source / HF Dataset)
Rollout Configuration (Batch Size * N Samples ; Temperature; Dynamic Sampling)
Reward Function (Outcome; Process; Repetition & Length)
Policy Optimization (KL Loss; Length Penalty; Token-level loss)
Benchmark (MATH/GPQA; R1 level; GPT-4o level)
Core Insights (Empirical / Theoretical / Insightful Curves)
Additional Notes (e.g., code snippet)
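
To make the fields above concrete, here is one hedged example of how an entry's settings might be recorded when contributing; every value is a hypothetical placeholder rather than the configuration of any project in the tables above.

```python
# Hypothetical placeholder values only; not taken from any project in the tables above.
example_entry = {
    "backbone_model": "Qwen2.5-7B-Instruct",          # Base / Instruct / Reasoning
    "rl_algorithm": "GRPO",                           # PPO / GRPO / RLOO / REINFORCE++
    "framework": "verl",                              # OpenRLHF / Verl / Trl
    "training_dataset": {"size": 17000, "source": "competition math (HF dataset link)"},
    "rollout": {"batch_size": 128, "n_samples": 8, "temperature": 1.0, "dynamic_sampling": True},
    "reward": {"outcome": "rule-based 0/1 verifier", "process": None, "repetition_length_penalty": False},
    "policy_optimization": {"kl_loss": False, "length_penalty": False, "token_level_loss": True},
    "benchmark": ["MATH-500", "GPQA"],
}
```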

Citation

If you find our repository useful in your research, please star us ⭐ and consider citing:

@misc{zhang2025TripleR,
  title={Awesome RL Recipes for Reasoning},
  author={Kaiyan Zhang and Yuchen Fan and Yuxin Zuo and Guoli Jia and Kai Tian and Xingtai Lv and Xuekai Zhu and Ermo Hua and Ning Ding and Biqing Qi and Bowen Zhou},
  year={2025},
  howpublished={\url{https://github.com/}},
  note={Github Repository},
}

Star History

Star History Chart
