A curated collection covering models, datasets, reward designs, optimization methods, hyperparameters, empirical findings, theoretical insights, and everything about reasoning with reinforcement learning.
- [2025-05-27]: 🔥 We are very excited to release MARTI: A Framework for LLM-based Multi-Agent Reinforced Training and Inference. Check it out: GitHub.
- [2025-04-23]: 🔥 Introducing TTRL, an open-source solution for online RL on data without ground-truth labels, especially test data. Check it out: GitHub and Paper.
⚠️ ⚠️ ⚠️ For the most recent updates, please scroll to the bottom of the table:
This collection covers recent progress in reinforcement learning for large language model reasoning, starting from 2025 in the timeline.
Date | Project | Org | Intro | HF Model | HF Dataset | Takeaway Messages |
---|---|---|---|---|---|---|
2025.0102 | PRIME-RL | THU & UIUC & Shanghai AI Lab |
Paper GitHub More |
Eurus-2-7B-PRIME Eurus-2-7B-PRIME-Zero |
Eurus-2-RL-Data | ClickPRIME offers scalable Reinforcement Learning by using dense, token-level implicit rewards derived only from final outcomes. This bypasses costly step-by-step annotations, providing fine-grained feedback to improve sample efficiency and reasoning. |
2025.0122 | DeepSeek-R1 | DeepSeek | Paper GitHub More |
DeepSeek-R1 DeepSeek-R1-Zero |
—— | ClickDeepSeek-R1's core contribution is demonstrating large-scale RL from scratch on a 600B+ base model without SFT, achieving emergent "aha moments" (self-reflective reasoning) and matching OpenAI o1's performance at roughly 1/30 of the cost. |
2025.0122 | Kimi k1.5 | Kimi | Paper GitHub More |
—— | —— | ClickKimi 1.5 introduces a simplified RL framework that leverages long-context scaling (128k tokens) and improved policy optimization (e.g., online mirror descent) to enhance reasoning and multimodal performance. |
2025.0124 | TinyZero | Berkeley | Twitter GitHub More |
—— | Countdown-Tasks-3to4 | ClickTinyZero's core contribution is demonstrating that smaller language models (e.g., 1.5B-3B parameters) can develop complex reasoning, search, and self-verification abilities through Reinforcement Learning, replicating capabilities of larger models like DeepSeek R1-Zero at extremely low cost (<$30). |
2025.0124 | Open-R1 | Huggingface | GitHub |
OpenR1-Qwen-7B OlympicCoder-7B OlympicCoder-32B |
OpenR1-Math-220k codeforces |
ClickOpen-R1's core contribution is providing the first fully open-source replication and release of the DeepSeek R1-Zero Reinforcement Learning training pipeline. Its main insight or goal is to democratize access to these advanced RL techniques for enhancing LLM reasoning and planning. |
2025.0125 | simpleRL-reason | HKUST | Paper GitHub More |
Qwen-2.5-Math-7B-SimpleRL-Zero Qwen-2.5-Math-7B-SimpleRL |
MATH | ClickResearchers replicated the DeepSeek-R1-Zero and DeepSeek-R1 training using a 7B model with only 8K MATH examples, achieving strong results on complex mathematical reasoning. |
2025.0205 | Demystify-long-cot | CMU | Paper GitHub More |
—— | —— | ClickThe paper elucidates the role of RL in stabilizing and enhancing long CoT reasoning in LLMs, highlighting the necessity of reward shaping and verifiable reward signals for complex reasoning tasks. |
2025.0207 | No-aha-moment | Sea AI Lab | Blog GitHub |
—— | Countdown-Tasks-3to4 | ClickThis is the first public critique of the 'aha moment' associated with DeepSeek-R1-Zero-style training, suggesting that changes in response length are an intrinsic part of the reinforcement learning dynamics. |
2025.0210 | DeepScaler | Agentica-Org | Blog GitHub More |
DeepScaleR-1.5B-Preview | DeepScaleR-Preview-Dataset | ClickDeepScaleR's core contribution is demonstrating that a small 1.5B parameter model, fine-tuned using scaled Reinforcement Learning (RL) and an iterative context lengthening scheme, can surpass the reasoning performance of larger, state-of-the-art models like OpenAI's O1-Preview on complex benchmarks (e.g., AIME math problems). |
2025.0210 | Logic-RL | MSRA & Ubiquant | Paper GitHub More |
—— | knights-and-knaves knights-and-knaves-ZH | ClickThe paper introduces Logic-RL, a rule-based reinforcement learning framework that enables large language models to develop o3-mini-level reasoning skills through training on logic puzzles. The reasoning capabilities can also be transferred to other domains like math. |
2025.0210 | OREAL | Shanghai AI Lab & SJTU & CUHK |
Paper GitHub More |
OREAL-32B OREAL-7B OREAL-DeepSeek-R1-Distill-Qwen-7B OREAL-32B-SFT OREAL-7B-SFT |
OREAL-RL-Prompts | ClickThe paper introduces OREAL, a reinforcement learning framework for mathematical reasoning with binary feedback. It proves that behavior cloning on positive samples is sufficient for optimal learning and proposes reward reshaping for negative samples. A token-level reward model addresses sparse rewards in long reasoning chains. OREAL achieves state-of-the-art results on math benchmarks. |
2025.0217 | LIMR | SJTU | Paper GitHub More |
LIMR | LIMR | ClickThe paper challenges the assumption that scaling up RL training data inherently improves performance in language models, instead finding that a strategically selected subset of 1,389 samples can outperform a full 8,523-sample dataset. |
2025.0218 | Open-Reasoner-Zero | StepFun & THU | Paper GitHub More |
Open-Reasoner-Zero-7B Open-Reasoner-Zero-32B |
ORZ-Math-57k | ClickThe Open-Reasoner-Zero model has achieved notable performance, with Open-Reasoner-Zero-32B outperforming DeepSeek-R1-Zero-Qwen-32B on the GPQA Diamond benchmark while requiring significantly fewer training steps. |
2025.0225 | SWE-RL | FAIR at Meta | Paper GitHub More |
—— | —— | ClickSWE-RL enhances LLMs' code reasoning through RL using open-source software evolution data, achieving state-of-the-art results in software engineering tasks and demonstrating generalized reasoning capabilities beyond coding. |
2025.0227 | Med-RLVR | Microsoft Research | Paper More |
—— | —— | ClickThe Med-RLVR framework demonstrates emergent medical reasoning via RL, achieving performance parity with SFT on in-distribution tasks and improving out-of-distribution generalization, all without explicit reasoning supervision, showcasing RL's potential in medicine. |
2025.0303 | VC-PPO | Bytedance | Paper More |
—— | —— | ClickVC-PPO (Value-Calibrated PPO) diagnoses PPO's collapse in long CoT tasks as stemming from value function inaccuracies (initialization bias and reward signal decay in long sequences). Its core contribution is modifying PPO with value pretraining and decoupled GAE for actor and critic. |
2025.0306 | LCPO-L1 | CMU | Paper GitHub More |
L1-Qwen-1.5B-Max L1-Qwen-1.5B-Exact |
—— | ClickL1 introduces Length Controlled Policy Optimization (LCPO), an RL method enabling precise control over a reasoning model's thinking time (output length) via prompt instructions. It shows that RL effectively controls reasoning duration and unexpectedly enhances even short-chain reasoning capabilities. |
2025.0310 | MRT | CMU | Paper Project GitHub |
—— | —— | ClickMRT (Mixed-Reality Trajectory Preferences) introduces a novel method for fine-tuning cooperative LLM agents. It effectively blends human preferences on real interaction trajectories with AI preferences on synthetic variations, improving data efficiency. This mixed-reality approach surpasses purely AI-driven feedback (RLAIF), especially for complex, multi-turn collaborative tasks. |
2025.0318 | TOPR | Mila & Reliant AI | Paper More |
—— | —— | ClickTOPR (Tapered Off-Policy REINFORCE) introduces a novel RL algorithm for fine-tuning LLMs. Its core contribution is using asymmetric, tapered importance sampling to modify REINFORCE, enabling stable and efficient off-policy learning. This allows reusing past data effectively without the instability often seen in other methods and without needing explicit KL regularization. |
2025.0318 | DAPO | ByteDance & THU |
Paper GitHub More |
—— | DAPO-Math-17k | ClickDAPO algorithm introduces four key techniques (Clip-Higher, Dynamic Sampling, Token-Level Loss, Overlong Shaping) for stable and efficient long-chain-of-thought RL training, surpassing previous SoTA results efficiently. |
2025.0320 | Open RS | VNU University of Science & Knovel Engineering Lab | Paper GitHub More |
Open-RS1 Open-RS2 Open-RS3 |
open-s1 open-deepscaler open-rs |
ClickThe study investigates the potential of RL to improve reasoning in small LLMs. The results demonstrate rapid reasoning gains, with accuracy improvements on mathematical reasoning benchmarks, and highlight the efficacy of RL-based fine-tuning for small LLMs as a cost-effective alternative to large-scale approaches, using high-quality training data. |
2025.0321 | Dr. GRPO | Sea AI Lab | Paper GitHub More |
Qwen2.5-Math-7B-Oat-Zero Qwen2.5-Math-1.5B-Oat-Zero Llama-3.2-3B-Oat-Zero |
MATH | ClickThis work critically analyzes R1-Zero-like RL training. It reveals that base model properties and GRPO algorithm biases (e.g., length bias) significantly impact outcomes. It contributes the efficient, unbiased Dr. GRPO algorithm and an open-source recipe/codebase for better understanding and reproduction (see the advantage-computation sketch after this table). |
2025.0321 | FastCuRL | Tencent Hunyuan | Paper GitHub |
FastCuRL-1.5B-Preview | FastCuRL | ClickFastCuRL introduces a simple, efficient Curriculum RL method for LLMs. Its core contribution uses target perplexity to dynamically scale the standard RL loss (like PPO), creating an effective curriculum without complex reward models or auxiliary components, enabling faster, more stable training. |
2025.0328 | AGRO | Meta | Paper |
—— | —— | ClickThis paper derives Any-Generation Reward Optimization (AGRO) from a consistency condition that holds across all possible generations of the model. AGRO achieves better convergence than KL-regularized policy gradient methods. |
2025.0401 | Z1 | THU | Paper GitHub |
Z1-7B | Z1-Code-Reasoning-107K | ClickThis paper proposes training LLMs on code-related reasoning trajectories using a curated dataset and a "Shifted Thinking Window" technique. This allows models to reduce excessive thinking tokens, achieving efficient test-time scaling and generalizing reasoning abilities. |
2025.0401 | VAPO | ByteDance Seed | Paper |
—— | —— | ClickVAPO offers an integrated solution that effectively alleviates value-model bias, handles heterogeneous sequence lengths, and mitigates reward-signal sparsity. |
2025.0407 | ConciseRL | Wand AI | Paper | —— | —— | ClickThis work challenges the idea that longer reasoning chains in LLMs inherently mean better accuracy. It uses mathematical analysis of RL principles, particularly PPO, to show that lengthier responses often arise from the optimization process itself, not necessarily improved reasoning. |
2025.0409 | AdaRFT | USC LIME Lab | Paper GitHub |
—— | DeepScaleR_Difficulty | ClickAdaRFT proposes Adaptive Curriculum Reinforcement Finetuning to improve LLM reasoning training efficiency. It dynamically adjusts task difficulty based on recent reward signals, accelerating learning by keeping challenges optimally balanced. Experiments on competition math benchmarks show up to 2x fewer steps and improved accuracy, using standard PPO with minimal changes. |
2025.0410 | Seed-Thinking-v1.5 | ByteDance Seed | GitHub |
—— | —— | ClickSeed-Thinking-v1.5 is a high-performing reasoning model that combines curated chain-of-thought data, stable reinforcement learning, and advanced infrastructure to achieve strong results across math, coding, and logic tasks. |
2025.0410 | d1 & diffu-GRPO | UCLA & Meta | Paper GitHub Project |
—— | —— | ClickThis paper proposes d1, which adapts pre-trained masked dLLMs into reasoning models via a combination of SFT and RL. The RL method is named diffu-GRPO. |
2025.0413 | Skywork-OR1 | Skywork AI | Paper Blog GitHub |
Skywork-OR1-32B-Preview Skywork-OR1-7B-Preview Skywork-OR1-Math-7B |
Skywork-OR1-RL-Data | ClickSkywork-OR1 is a series of robust open-source models trained on carefully curated math and code data. The training process incorporates several modifications to the original GRPO, including offline and online data filtering, multi-stage training, and adaptive entropy control. |
2025.0415 | DeepMath | Tencent & SJTU | Paper GitHub |
zwhe99/DeepMath-Zero-7B zwhe99/DeepMath-Zero-Math-7B zwhe99/DeepMath-1.5B zwhe99/DeepMath-Omn-1.5B |
zwhe99/DeepMath-103K | ClickDeepMath-103K is a large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. Trained on DeepMath-103K, DeepMath series models achieve state-of-the-art performance on many math benchmarks. |
2025.0421 | LUFFY | Shanghai AI Lab | Paper GitHub |
LUFFY-Qwen-Math-7B-Zero LUFFY-Qwen-Math-1.5B-Zero |
Openr1-Math-46k-8192 | ClickThis paper introduces LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. |
2025.0423 | TTRL | THU&Shanghai AI Lab | Paper GitHub |
—— | —— | ClickThis paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). |
2025.0430 | Phi-4-reasoning | Microsoft | Paper | Phi-4-reasoning | —— | ClickThis paper introduces Phi-4-reasoning, a 14-billion-parameter reasoning model that achieves strong performance on complex reasoning tasks. |
2025.0511 | BLEUBERI | Maryland | Paper GitHub |
—— | —— | ClickDemonstrates that BLEU, a simple string-matching metric, can effectively serve as a reward function for instruction-following tasks, rivaling complex reward models. |
2025.0512 | INTELLECT-2 | PrimeIntellect-ai | Paper GitHub |
INTELLECT-2 | —— | ClickINTELLECT-2 is a 32 billion parameter language model trained through a reinforcement learning run leveraging globally distributed, permissionless GPU resources contributed by the community. |
2025.0514 | Qwen3 | Alibaba Qwen | Paper GitHub |
Qwen3 | —— | ClickQwen3 unifies thinking and non-thinking modes in a single model family with a controllable thinking budget, trained via a multi-stage post-training pipeline that includes reasoning-focused RL. |
2025.0516 | Subnetwork RL | UIUC | Paper | —— | —— | Clickinsights and contributions about RL for reasoning within 30 words. |
2025.0516 | Data Synthesis RL | PKU&MIT | Paper GitHub |
—— | —— | Clickinsights and contributions about RL for reasoning within 30 words. |
2025.0519 | AR-Lopti | CUHK | Paper GitHub |
—— | —— | ClickThis paper identifies a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. |
2025.0519 | AnytimeReasoner | Sea AI Lab | Paper GitHub |
—— | DeepScaleR-Preview-Dataset | ClickThis paper proposes a framework for optimizing anytime reasoning under arbitrary token budgets, featuring decoupled optimization of thinking and summarization, dense verifiable rewards, and budget relative policy optimization. |
2025.0521 | EM-PT | UIUC | Paper GitHub |
—— | —— | ClickThis paper shows that an entropy-minimization objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. |
2025.0521 | NOVER | KCL&SJTU | Paper GitHub |
—— | —— | ClickThis paper presents verifier-free R1-Zero-like training, which enables training on arbitrary data beyond math and coding. |
2025.0522 | AceReason-Nemotron | Nvidia | Paper | AceReason-Nemotron-14B | —— | ClickThis paper demonstrates that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. |
2025.0522 | KTAE | CAS | Paper GitHub |
KTAE-7B/1.5B | —— | ClickThis paper improves GRPO's advantage calculation by providing more fine-grained token-level advantages, effectively reducing generation length. |
2025.0523 | QwenLong-L1 | Qwen-Doc | Paper GitHub |
QwenLong-L1-32B | —— | Clickinsights and contributions about RL for reasoning within 30 words. |
2025.0523 | Trinity-RFT | Alibaba Group | Paper GitHub |
—— | —— | ClickTrinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models. |
2025.0524 | LlamaRL | Meta | Paper | —— | —— | ClickDistributed async RL framework for LLMs, achieving 10× training speed over DeepSpeed; scales to 405B parameters. |
2025.0525 | SeRL | ZJU | Paper GitHub |
—— | —— | ClickThis paper proposes Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. |
2025.0525 | BRIDGE | CMU | Paper GitHub |
—— | —— | ClickThe paper proposes behavior injection, a task-agnostic data augmentation method that enhances the effectiveness of reinforcement fine-tuning for language models by improving rollout accuracy and data co-influence, leading to consistently better post-RL performance. |
2025.0526 | REA-RL | HIT | Paper GitHub |
—— | —— | ClickIntroduces REA-RL, which enhances the efficiency of LRMs by introducing a reflection model for efficient online scaling and a reflection reward to prevent non-reflective responses. |
2025.0527 | ConciseR | Tencent Hunyuan | Paper GitHub |
—— | —— | ClickThis paper proposes a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in LLMs, named ConciseR. |
2025.0527 | VeriFree | Sea AI Lab | Paper GitHub |
—— | —— | ClickThis paper proposes a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. |
2025.0527 | One-Shot-EM | Ubiquant | Paper GitHub |
—— | —— | ClickThis paper trained 13,440 large language models and found that entropy minimization requires only a single unlabeled example and 10 optimization steps to achieve performance improvements greater than those obtained using thousands of examples and carefully designed rewards in rule-based reinforcement learning. |
2025.0528 | Entropy-RL | Shanghai AI Lab & THU | Paper GitHub |
—— | —— | ClickThis paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. |
2025.0528 | RENT-RL | CMU | Paper GitHub |
—— | —— | ClickRENT: Reinforcement Learning via Entropy Minimization is a fully unsupervised reinforcement learning method that improves reasoning performance by using the model's own confidence as a reward. |
2025.0528 | SynLogic | MiniMax-AI | Paper GitHub |
—— | —— | ClickThis paper presents SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. |
2025.0530 | ProRL | Nvidia | Paper | Nemotron-Qwen-1.5B | —— | ClickThis paper challenges prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. |
2025.0530 | ReasoningGym | OpenThought | Paper GitHub |
—— | —— | ClickThis paper introduces Reasoning Gym, a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. |
2025.0530 | AReaL | InclusionAI | Paper GitHub |
—— | —— | ClickAReaL introduces a fully asynchronous reinforcement learning system for language reasoning tasks, decoupling rollout generation from model training to significantly improve GPU utilization and training speed—achieving up to 2.57× speedup over synchronous systems—while maintaining or improving model performance. |
2025.0602 | HighEntropyRL | Qwen & THU | Paper Project |
—— | —— | ClickHigh-entropy minority tokens play an outsized role in RLVR training. This paper provides actionable insights into which tokens drive effective policy-gradient updates. |
2025.0602 | RLVR-Decomposed | Princeton | Paper GitHub |
—— | —— | ClickShows that penalizing incorrect answers alone can significantly boost LLM reasoning via PPO—challenging conventional RLHF approaches. |
2025.0602 | Writing-Zero | Star Writing | Paper | —— | —— | ClickApplies RLVR to creative tasks like story writing by converting non-verifiable tasks into verifiable subgoals. |
2025.0602 | SRPO | ByteDance Seed & OSU | Paper | —— | —— | ClickProposes a two-stage RL framework combining self-reflection and Group Relative Policy Optimization to boost multimodal reasoning. |
2025.0603 | KDRL | HIT&Huawei | Paper | —— | —— | ClickPresents KDRL, a unified framework combining knowledge distillation and RL to enhance LLM reasoning post-training, improving sample efficiency and generalization. |
2025.0603 | TRePO | Amazon | Paper | —— | —— | ClickProposes that response-level rewards suffice for effective online RL in LLMs, offering a mathematical foundation for this approach. |
2025.0603 | Critique-GRPO | CUHK | Paper GitHub |
—— | —— | ClickCombines natural language critiques with numerical rewards in RL to overcome performance plateaus in LLM reasoning tasks. |
2025.0603 | Unlikeliness Rewards | CMU | Paper | —— | —— | ClickThe paper introduces an unlikeliness reward mechanism to address biases in Group Relative Policy Optimization (GRPO), enhancing the diversity and accuracy of large language models on structured tasks like formal theorem proving. |
2025.0604 | RewardAnything | PKU&WeChatAI | Paper GitHub |
RewardAnything-8B-v1 | —— | ClickIntroduces principle-following reward models that generalize across tasks by adhering to natural language specifications, improving alignment without retraining. |
2025.0605 | ALP | Stanford | Paper | —— | —— | ClickIntroduces adaptive length penalties in reinforcement learning to encourage concise reasoning in large language models, enhancing efficiency without sacrificing performance. |
2025.0605 | PatternSelection | HKU | Paper | —— | —— | ClickExplores mechanisms for selecting reasoning patterns in reinforcement learning for language models, aiming to enhance decision-making processes. |
2025.0605 | LogicPuzzleRL | PKU | Paper GitHub |
—— | —— | ClickUtilizes reinforcement learning on custom logic puzzles to cultivate robust mathematical reasoning in large language models. |
2025.0605 | DOTS | UIUC&NYU | Paper | —— | —— | ClickProposes methods to improve data efficiency in reinforcement fine-tuning of LLMs through difficulty-targeted online data selection and rollout replay. |
2025.0605 | ether0 | FutureHouse | Paper GitHub |
—— | —— | ClickA 24B parameter model trained for scientific reasoning in chemistry, capable of generating molecular structures from natural language prompts. |
2025.0605 | Writing-RL | Alibaba | Paper | —— | —— | ClickCurriculum-based RL improves long-form narrative coherence through structured rewards. |
2025.0606 | Confidence | Moscow | Paper | —— | —— | ClickConfidence-driven few-shot RL fine-tuning improves sample efficiency without reward supervision. |
2025.0607 | Thinking vs. Doing | CMU | Paper GitHub |
—— | —— | ClickLLMs learn test-time interaction, deciding when to think and when to act, which enhances reasoning efficiency. |
2025.0607 | OptimalReasoning | THU | Paper | —— | —— | ClickTheoretical study on RL-optimality gap for chain-of-thought reasoning. |
2025.0608 | YouronMath | Keio Univ | Paper GitHub |
—— | —— | ClickGamified interface improves LLM math performance via reward shaping and iterative gameplay. |
2025.0608 | Play to Generalize | Rice | Paper GitHub |
—— | —— | ClickTrains reasoning via gameplay to transfer skills across tasks. |
2025.0608 | RPT | Microsoft | Paper | —— | —— | ClickUses RL objectives during pretraining to equip LLMs with better downstream reasoning capabilities. |
2025.0608 | MARL | WM Univ. | Paper | —— | —— | ClickLLMs critique each other in a reflective multi-agent framework to iteratively refine reasoning chains. |
2025.0608 | RLT | Alibaba | Paper | —— | —— | ClickRL teachers dynamically allocate thinking-time during inference to balance latency and accuracy. |
2025.0608 | SwS | Microsoft | Paper | —— | —— | ClickLLM self-assesses its weaknesses, then generates challenging tasks to improve via RL. |
2025.0608 | RuleReasoner | UCLA | Paper | —— | —— | ClickBlends rule-based logic with RL-driven dynamic sampling to solve structured reasoning problems. |
2025.0608 | Bingo | Microsoft | Paper | —— | —— | ClickRL method improves reasoning by amplifying attention on critical intermediate steps. |
2025.0609 | CoRT | Qwen | Paper GitHub |
—— | —— | ClickTool-augmented RL trains LLMs to reason via code synthesis and self-refinement loops. |
2025.0609 | VerIF | THU | Paper GitHub |
—— | —— | ClickVerification-first RL training: modularly verifies and rewrites faulty LLM outputs during policy updates. |
2025.0609 | Router-R1 | UIUC | Paper GitHub |
—— | —— | ClickRL-based routing policies optimize multi-round tool use and answer aggregation. |
2025.0609 | RePO | CUHK + AILab | Paper GitHub |
—— | —— | ClickReplay-Enhanced Policy Optimization: improves sample efficiency and stability of reasoning training. |
2025.0609 | SSA | CUNY | Paper | —— | —— | ClickPromotes consistency by aligning reasoning traces across training samples with shared structure. |
2025.0609 | ComfyUI-R1 | HIT & Alibaba | Paper GitHub |
—— | —— | ClickReasoning-powered LLM agent for UI pipeline automation inspired by ComfyUI workflows. |
2025.0609 | Learning to Clarify | Adobe | Paper | —— | —— | ClickLLMs learn when and how to ask clarification questions via reward-weighted fine-tuning. |
2025.0610 | Magistral | Mistral AI | Paper | Magistral-Small-2506 | —— | ClickFirst RL-trained reasoning LLM from Europe. Strong multilingual chain-of-thought and tool use. Open-source (Apache 2.0). |
2025.0610 | FastEasy & Deep Hard | FDU | Paper | —— | —— | ClickApplies dynamic penalty on output length to focus model effort on harder inputs. |
2025.0610 | PAG | ByteDance | Paper GitHub |
—— | —— | ClickLLMs generate, verify, and correct responses in multi-turn RL framework inspired by verifier-agent loops. |
2025.0610 | SAL | MIT | Paper | —— | —— | ClickExplores self-adjusting LLM behaviors at inference using RL-inspired introspection and elicitability measures. |
2025.0610 | Unsupervised Elicitation | Anthropic | Paper | —— | —— | ClickReveals hidden reasoning capacities without supervision—implications for reward-free training. |
2025.0611 | LearnAlign | CUHK | Paper | —— | —— | ClickGradient-alignment-driven reasoning data selection for better RL fine-tuning of LLMs. |
2025.0611 | Continue-Thinking Token | CTK | Paper | —— | —— | ClickNew token inserted at inference to trigger deeper reasoning steps with zero-shot generalization. |
2025.0611 | TreeRL | THU-DM | Paper GitHub |
—— | —— | ClickCombines on-policy RL and tree search for interpretable decision traces in reasoning tasks. |
2025.0x0x |
Paper GitHub |
hf models | hf datasets | Clickinsights and contributions about RL for reasoning within 30 words. |
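
Several entries in the table above (GRPO in DeepSeek-R1, Dr. GRPO, DAPO, KTAE) differ mainly in how advantages are computed from group-wise outcome rewards and in how the policy-gradient ratio is clipped. Below is a minimal, illustrative sketch of those two pieces; the function names, the `unbiased` flag, and the clip constants are illustrative shorthand, not code from any of the listed repositories.

```python
import numpy as np

def group_relative_advantages(rewards, unbiased=False, eps=1e-6):
    """Group-relative advantage for rollouts of a single prompt.
    unbiased=False -> vanilla GRPO: (r - mean) / (std + eps).
    unbiased=True  -> Dr. GRPO-style: drop the std normalization and the
                      bias it introduces."""
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = rewards - rewards.mean()
    if not unbiased:
        adv = adv / (rewards.std() + eps)
    return adv

def clipped_ratio(ratio, eps_low=0.2, eps_high=0.28):
    """DAPO-style 'Clip-Higher': a wider upper clip range than lower
    (the 0.28 upper bound follows the value reported by DAPO; treat it as illustrative)."""
    return np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)

# Four rollouts sampled for one math prompt, rewarded 1 if the final answer is correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))                 # GRPO-style
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0], unbiased=True))  # Dr. GRPO-style
```

Dr. GRPO additionally removes the per-response length normalization in the token-level loss, which is omitted here for brevity.
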
Date | Project | Org | Intro | HF Model | HF Dataset | Takeaway Messages |
---|---|---|---|---|---|---|
2025.0128 | Open-R1-MultiModal | LLMs Lab | GitHub More |
Qwen2-VL-2B-GRPO-8k Qwen2-VL-7B-GRPO-8k |
multimodal-open-r1-8k-verified | ClickOpen-R1-MultiModal provides an open-source replication of R1-Zero-like RL for Multimodal LLMs, aiming to enhance complex visual reasoning. It demonstrates the effectiveness of these RL techniques for boosting multimodal performance and promotes reproducibility in the field. |
2025.0202 | R1-V | Deep Agent | Blog GitHub More |
—— | Clevr_CoGenT_TrainA_R1 | ClickR1-V applies RL, specifically RLV-Instruct, to fine-tune VLMs. It enhances complex visual reasoning and instruction-following capabilities in VLMs beyond standard supervised fine-tuning. |
2025.0215 | VLM-R1 | OmAI Lab | Blog GitHub More |
OVD Math REC |
—— | ClickVLM-R1 applies R1-style RL to VLMs, improving stability and generalization on visual reasoning tasks. It shows that RL enhances VLM generalization beyond standard fine-tuning, achieving SOTA results, particularly on complex or out-of-domain benchmarks. |
2025.0303 | Visual-RFT | SJTU & Shanghai AI Lab & CUHK | Paper GitHub More |
Reasoning Grounding | COCO_base65 COCO COCO_8_classes_4_shot LVIS_few_shot Flower_4_shot FGVC_Aircraft_4_shot Car196_4_shot Pets37_4_shot |
ClickVisual-RFT introduces Visual Reinforcement Fine-tuning, which extends reinforcement learning with verified rewards on visual perception tasks that are effective with limited data for fine-tuning. |
2025.0306 | R1-VLM | GroundLight | Blog GitHub More |
—— | —— | ClickR1-VLM enhances VLMs using RL, contributing significantly improved performance on complex visual reasoning tasks (spatial, counting, logic) where standard models falter. It shows that RL effectively unlocks advanced, multi-step reasoning capabilities specifically for vision-language understanding. |
2025.0310 | VisualThinker-R1-Zero | TurningPoint | Paper GitHub More |
VisualThinker-R1-Zero | —— | ClickVisualThinker-R1-Zero adapts the R1-Zero RL paradigm (no supervised fine-tuning) to VLMs, achieving SOTA visual reasoning. It shows that complex visual reasoning can be effectively cultivated directly via RL on a base VLM, bypassing supervised data needs. |
2025.0310 | MM-EUREKA | USTC & ZTE & NEU | Paper Github More |
MM-Eureka-Qwen-7B | MM-Eureka-Dataset | ClickMM-EUREKA reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space, which demonstrates that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, showing superior data efficiency compared to alternative approaches. |
2025.0310 | Curr-ReFT | Shanghai AI Lab & SJTU & HKU | Paper GitHub More |
3B-Curr-ReFT 7B-Curr-ReFT |
Curr-ReFT-data | ClickCurr-ReFT proposes a Curriculum Reinforcement Finetuning strategy to enhance out-of-distribution generalization and reasoning abilities. The curriculum paradigm ensures steady progression. Moreover, a rejection sampling-based self-improvement step is proposed to maintain the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples. |
2025.0311 | LLM-R1 | CUHK & Ant Group | Paper GitHub |
—— | —— | ClickLLM-R1 contributes the RMAVO algorithm to stably enhance LLM reasoning using RL, preventing reward hacking and achieving SOTA results with smaller models via an open-source implementation. It shows that reward model assistance in value optimization is key for stable RL. |
2025.0311 | Vision-R1 | ECNU & Xiaohongshu | Paper GitHub |
—— | Vision-R1-cold | ClickVision-R1 adapts the R1-Zero RL paradigm for VLMs, training them on visual reasoning chains. Its contribution is significantly boosting complex multimodal reasoning performance. It shows that RL applied to explicit reasoning steps effectively enhances VLM capabilities. |
2025.0311 | MMR1 | NTU & SUTD & LASA | GitHub | MMR1-Math-v0-7B | MMR1-Math-RL-Data-v0 | ClickMMR1-Math-v0 achieves state-of-the-art performance among open-source 7B multimodal models, competing effectively even against proprietary models with significantly larger parameter sizes—all trained using only 6k carefully curated data instances. |
2025.0315 | MetaSpatial | Northwestern University | Paper Project GitHub |
—— | 3D_Reasoning | ClickMetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, realistic, and adaptive scene generation for applications in the metaverse, AR/VR, and game development. |
2025.0327 | Reason-RFT | PKU & BAAI & CASIA & School of Artificial Intelligence, University of Chinese Academy of Sciences | Paper GitHub Project |
—— | tanhuajie2001/Reason-RFT-CoT-Dataset | ClickReason-RFT introduces a two-phase training paradigm: (1) SFT with CoT data to activate reasoning potential, followed by (2) GRPO-based reinforcement learning to enhance generalization, with further potential applications in Embodied AI. |
2025.0404 | MAYE | SJTU & GAIR | Paper GitHub |
—— | ManTle/MAYE | ClickMAYE is a transparent, reproducible framework and a comprehensive evaluation scheme for applying reinforcement learning (RL) to vision-language models (VLMs). Its codebase is developed entirely from scratch without relying on any existing RL toolkits. |
2025.0408 | Step-R1-V-Mini | StepFun | Website | —— | —— | ClickStep-R1-V-Mini excels in the domain of visual reasoning, while also demonstrating top-tier performance in mathematical, code, and textual reasoning tasks. It supports a context length of 100k. |
2025.0409 | Kimi-VL-Thinking | Kimi Team | Technical Report GitHub |
moonshotai/Kimi-VL-A3B-Thinking | —— | ClickKimi-VL-Thinking is designed to enhance long-horizon reasoning capabilities in vision-language tasks. Built on a foundation of long CoT SFT and RL, with only 2.8B activated parameters, Kimi-VL-Thinking achieves strong performance across a range of tasks requiring long-term reasoning. It excels in domains such as MMMU, MathVision, and MathVista, achieving impressive scores of 61.7, 36.8, and 71.3, respectively. |
2025.0409 | VideoChat-R1 | Shanghai AI Lab & NJU & ZJU & USTC & Shanghai Innovation Institute & SIAT | Paper GitHub |
—— | —— | ClickVideoChat-R1 provides a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, exhibiting remarkable performance on spatio-temporal perception tasks without sacrificing chat ability, while showing emerging spatio-temporal reasoning abilities. |
2025.0410 | Perception-R1 | HUST & BUPT & StepFun & JHU & Tsinghua University | Paper GitHub |
Perception-R1 | Perception-R1 | ClickPerception-R1 explores the effects of RL on different perception tasks; the researchers observe that perceptual perplexity is a major factor in determining the effectiveness of RL. The scalable Perception-R1 achieves remarkable performance on perception tasks. |
2025.0410 | VL-Rethinker | TIGER-Lab | Paper GitHub |
TIGER-Lab/VL-Rethinker-7B TIGER-Lab/VL-Rethinker-72B |
—— | ClickVL-Rethinker proposes Selective Sample Replay (SSR) and Forced Rethinking to enhance fast-thinking models. The model achieves remarkable performance on multi-disciplinary benchmarks. |
2025.0501 | T2I-R1 | CUHK MMLab & CUHK MiuLar Lab & Shanghai AI Lab | Paper GitHub |
CaraJ/ORM-T2I-R1 | —— | ClickT2I-R1 is a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. The semantic-level CoT is utilized for high-level planning of the prompt, and the token-level CoT is designed for low-level pixel processing during patch-by-patch generation. |
2025.0516 | VisualPlanning | Cambridge & UCL & Google | Paper GitHub |
—— | —— | ClickVisualPlanning enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. |
2025.0521 | GRIT | UCSC & eBay | Paper GitHub Project Demo |
yfan1997/GRIT-20-InternVL-2B yfan1997/GRIT-20-Qwen2.5-VL-3B |
yfan1997/GRIT_data | ClickGRIT proposes grounded reasoning with images and text for training MLLMs to think with images. The models generate reasoning chains that interleave natural language and explicit bounding box coordinates. Moreover, built upon the GRPO algorithm, GRIT eliminates the need for annotated reasoning chains or explicit bounding box labels, requiring as few as 20 image-question-answer triplets to train the model. |
2025.0522 | GoT-R1 | HKU MMLab & CUHK MMLab & Sensetime & BUAA | Paper GitHub |
gogoduan/GoT-R1-1B gogoduan/GoT-R1-7B |
—— | ClickGoT-R1 applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. To achieve this, a dual-stage multi-dimensional reward framework is proposed that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. |
2025.0529 | Jigsaw-R1 | ESAT-PSI | Paper GitHub |
—— | —— | ClickThis paper provides a comprehensive study of rule-based visual RL using jigsaw puzzles as a structured experimental framework, revealing several key findings. |
2025.0603 | SynthRL | NUS&CUHK | Paper GitHub |
SynthRL | —— | ClickIntroduces SynthRL, a pipeline that synthesizes verifiable data to train vision-language models, boosting performance on visual math reasoning tasks. |
2025.0603 | Cell-o1 | UIUC | Paper GitHub |
—— | —— | ClickPresents Cell-o1, an LLM trained via RL to annotate single-cell RNA sequencing data, achieving expert-level reasoning in batch-level contexts. |
2025.0604 | MiMo-VL | XiaomiMimo | Paper GitHub |
MiMo-VL-7B | —— | ClickDetails MiMo-VL-7B models achieving state-of-the-art performance in visual understanding and multimodal reasoning through mixed on-policy RL. |
2025.0604 | ReVisual-R1 | ZJU&FDU | Paper GitHub |
—— | —— | ClickIntroduces ReVisual-R1, a staged RL approach enhancing MLLM reasoning by combining optimized cold starts with text-only RL fine-tuning. |
2025.0604 | LaF-GRPO | PolyU | Paper GitHub |
—— | —— | ClickDevelops an LLM-as-Follower reward mechanism to generate in-situ navigation instructions for the visually impaired, enhancing instruction usability. |
2025.0611 | Visual PTRL | UC Berkeley | Paper | —— | —— | ClickTrains visual backbones on raw image data with reinforcement rewards—unsupervised and scalable. |
2025.0x0x |
Paper GitHub |
hf models | hf datasets | Clickinsights and contributions about RL for reasoning within 30 words. |
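
Several multimodal entries above (e.g., Visual-RFT, VLM-R1, UI-R1) rely on verifiable visual rewards such as box IoU for grounding/REC-style tasks. The sketch below shows that kind of reward, assuming (x1, y1, x2, y2) box coordinates; the actual reward shaping and thresholds differ across projects.

```python
def box_iou(pred, gt):
    """IoU between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / (union + 1e-6)

def grounding_reward(pred_box, gt_box, threshold=0.5):
    """Binary verifiable reward for a grounding rollout: 1 if the predicted box
    overlaps the ground truth above the threshold, else 0.
    A dense alternative is to return the IoU value itself."""
    return 1.0 if box_iou(pred_box, gt_box) >= threshold else 0.0

print(grounding_reward((10, 10, 60, 60), (15, 15, 60, 60)))  # heavily overlapping boxes -> 1.0
```
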
Date | Project | Org | Intro | HF Model | HF Dataset | Takeaway Messages |
---|---|---|---|---|---|---|
2025.0126 | RAGEN | RAGEN-AI | Paper GitHub |
—— | —— | ClickRAGEN introduces a RL framework to train reasoning-capable LLM agents for interactive, stochastic environments. Its core contribution is the Reasoning-Interaction Chain Optimization (RICO) algorithm, which jointly optimizes reasoning and action strategies by reinforcing entire trajectories. |
2025.0203 | Verifiers | Independent | GitHub |
—— | —— | ClickThis repository contains a set of tools for reinforcement learning with LLMs in verifiable environments, and can be used for LLM-agent RL in such settings. |
2025.0207 | AgenticReasoning | Univ. of Oxford | Paper GitHub |
—— | —— | ClickThis framework introduces the Mind Map agent, which constructs a structured knowledge graph to track logical relationships, improving deductive reasoning. |
2025.0303 | ReSearch | Agent-RL | GitHub More |
—— | —— | ClickThe project trains LLMs from scratch, utilizing RL with GRPO to learn to reason via search operations, without reliance on pre-existing reasoning frameworks or supervised data. |
2025.0312 | Search-R1 | UIUC & UMass Amherst | Paper GitHub More |
Search-R1 | 2018 Wikipedia | ClickThe paper introduces Search-R1, a novel RL framework that enables LLMs to interact with search engines in an interleaved manner with their own reasoning. The framework is shown to be effective, with experiments demonstrating average relative improvements of 41% and 20% over RAG baselines, and providing insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. |
2025.0318 | R1-Searcher | RUC | Paper GitHub |
Llama-3.1-8B-instruct-RAG-RL Qwen-2.5-7B-base-RAG-RL |
RAG-RL-Hotpotqa | ClickR1-Searcher enhances LLM reasoning via RL by training the model to perform adaptive model-based search during generation. This integration enables flexible thinking depth, improving reasoning efficiency and performance compared to fixed-step methods like R1-Zero. |
2025.0319 | SWEET-RL | Meta AI | Paper GitHub |
—— | collaborative_agent_bench | ClickSweet-RL introduces a novel RL algorithm for multi-turn collaborative reasoning LLM agents. Its core contribution is improving credit assignment across long interactions by using an asymmetric actor-critic structure where the critic leverages additional training-time information for step-wise evaluation. |
2025.0327 | UI-R1 | Vivo AI Lab & CUHK | Paper GitHub |
Qwen2.5-VL-3B-UI-R1 | UI-R1-3B-Train | ClickThis paper proposes UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. |
2025.0404 | DeepResearcher | SJTU | Paper GitHub |
DeepResearcher-7b | —— | ClickThis paper introduces DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. |
2025.0414 | ART | OpenPipe | Blog GitHub |
—— | —— | ClickThis release is an early alpha focused on best-in-class training efficiency and agentic multi-turn support. |
2025.0414 | GUI-R1 | CAS & NUS | Paper GitHub |
—— | GUI-R1 | ClickThis paper proposes GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. |
2025.0415 | ReTool | ByteDance | Paper GitHub More |
ReTool-Qwen-32B | ReTool-SFT | ClickReTool is a reinforcement learning framework that integrates code interpreter execution into the reasoning loop of large language models (LLMs) to improve their mathematical reasoning capabilities. The framework consists of two primary stages: cold-start supervised fine-tuning and reinforcement learning with interleaved code execution rollout, allowing the model to learn when and how to invoke tools based on outcome feedback. |
2025.0428 | ARTIST | Microsoft | Paper | —— | —— | ClickARTIST enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains, leveraging outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision. |
2025.0430 | WebThinker | RUC | Paper GitHub More |
WebThinker-QwQ-32B WebThinker-R1-7B WebThinker-R1-14B WebThinker-R1-32B |
—— | ClickWebThinker is a deep research agent that empowers large reasoning models (LRMs) to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. It integrates a Deep Web Explorer module and employs an Autonomous Think-Search-and-Draft strategy, allowing for real-time report writing and information gathering. |
2025.0506 | SkyRL-v0 | NovaSky-AI | blog GitHub |
SkyRL-Agent-7B-v0 SkyRL-Agent-8B-v0 SkyRL-Agent-14B-v0 |
SkyRL-v0-293-data | ClickThis paper introduces SkyRL, the RL training pipeline for multi-turn tool use LLMs, optimized for long-horizon, real-environment tasks like SWE-Bench, built on top of VeRL and OpenHands. Using SkyRL, we are able to achieve promising results on SWE-Bench-Verified across model lines, using around 300 samples of training data! |
2025.0512 | Tool-N1 | NVIDIA | Paper GitHub |
—— | —— | ClickThis paper presents Nemotron-Research-Tool-N1, a family of tool-using reasoning language models. These models are trained with an R1-style reinforcement learning algorithm that uses a binary reward to supervise only the structural format and functional correctness of tool calls, without requiring explicit reasoning annotations. |
2025.0512 | ZeroTIR | FDU & Xiaohongshu | Paper GitHub |
—— | —— | ClickThis paper investigates RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. |
2025.0513 | AgentCPM-GUI | OpenBMB | GitHub |
openbmb/AgentCPM-GUI | —— | ClickAgentCPM-GUI is an open-source on-device LLM agent model jointly developed by THUNLP, Renmin University of China and ModelBest. Built on MiniCPM-V with 8 billion parameters, it accepts smartphone screenshots as input and autonomously executes user-specified tasks. |
2025.0514 | AlphaEvolve | Google DeepMind | Blog | —— | —— | ClickAlphaEvolve is an evolutionary coding agent powered by large language models for general-purpose algorithm discovery and optimization. |
2025.0515 | GiGPO | NTU&Skywork | Paper GitHub |
—— | —— | ClickThis paper proposes Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. |
2025.0516 | AutoRefine | USTC | Paper GitHub |
hf models | hf datasets | ClickThis paper proposes AutoRefine, a reinforcement learning posttraining framework that adopts a new "search-and-refine-during-think" paradigm. |
2025.0520 | Time-R1 | UIUC | Paper GitHub |
—— | —— | ClickThis paper introduces Time-R1, the first framework to endow a moderate-sized (3B-parameter) LLM with comprehensive temporal abilities: understanding, prediction, and creative generation. |
2025.0521 | Empirical Study | UIUC | Paper GitHub |
—— | —— | ClickThis paper highlights several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. |
2025.0521 | StepSearch | SenseTime | Paper GitHub |
—— | —— | ClickThis paper introduces StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. |
2025.0521 | GUI-G1 | RUC | Paper GitHub |
—— | —— | ClickThis paper identifies three distinct challenges in the R1-Zero-like training pipeline of R1-style GUI agents: grounding is harmed by longer reasoning due to grounding's reliance on image tokens; common reward functions induce size-sensitive reward hacking; and GRPO biases agents toward simpler examples due to its objective. |
2025.0522 | Tool-Star | RUC | Paper GitHub |
Tool-Star-Qwen-3B | Multi-Tool-RL-10K Tool-Star-SFT-54K |
ClickThis paper introduces Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. |
2025.0522 | R1-Searcher++ | RUC | Paper GitHub |
—— | —— | Clickinsights and contributions about RL for reasoning within 30 words. |
2025.0522 | ARPO | CUHK | Paper GitHub |
—— | —— | ClickThis paper investigates end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks. |
2025.0522 | AgentThink | THU & McGill | Paper | —— | —— | ClickThis paper introduces AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. |
2025.0523 | Agent-Distillation | KAIST | Paper GitHub |
—— | —— | ClickThis paper proposes Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. |
2025.0526 | DeepEyes | Xiaohongshu | Paper GitHub |
DeepEyes-7B | DeepEyes-Datasets-47k | ClickThis paper explores the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. |
2025.0527 | rStar | MSRA | Paper GitHub |
—— | —— | ClickThis paper introduces rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems, 580K long-reasoning solutions along with rich test cases of varying difficulty. |
2025.0527 | SPA-RL-Agent | PolyU | Paper GitHub |
—— | —— | ClickThis paper proposes Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion. |
2025.0528 | WebDancer | Tongyi Lab | Paper GitHub |
—— | —— | ClickThe paper introduces a unified, data-centric training paradigm for developing agentic web research agents, exemplified by WebDancer, which combines supervised learning and reinforcement learning to achieve strong multi-step information-seeking performance on GAIA and WebWalkerQA benchmarks. |
2025.0529 | ML-Agent | SJTU | Paper | —— | —— | ClickThis paper explores the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). |
2025.0530 | Pangu DeepDiver | Huawei | Paper | —— | —— | ClickThe paper introduces Pangu DeepDiver, a reinforcement learning framework that equips large language models with adaptive search intensity scaling (SIS) for open-web question answering, using a new WebPuzzle dataset to improve evidence-seeking behavior under real-world ambiguity and noise. |
2025.0601 | VerlTool | TIGER AI Lab | GitHub |
Qwen2.5-Math-VerlTool | —— | ClickVerlTool is a unified and easy-to-extend tool-agent training framework based on verl. |
2025.0602 | SCA | UCB & Meta | Paper | —— | —— | ClickLLMs generate and solve their own tasks via a "Code-as-Task" setup, using RL for learning. Yields >2× gains on tool-use benchmarks. |
2025.0602 | MMedAgent-RL | UNC | Paper | —— | —— | ClickMulti-agent reinforcement learning for medical reasoning with multimodal data. Promotes coordination and robustness across specialized agents. |
2025.0603 | CURE | ByteDance Seed | Paper GitHub |
reasonflux-coder | —— | ClickIntroduces CURE, a framework where code generation and unit testing co-evolve through RL, enhancing code accuracy without ground-truth supervision. |
2025.0604 | Seed-Coder | ByteDance Seed | Paper GitHub |
Seed-Coder | —— | ClickProposes a self-curating code model that generates and selects its own training data, enhancing code generation capabilities without external supervision. |
2025.0604 | DyMo | Cohere | Paper | —— | —— | ClickPresents a self-verification sampling method for LLMs to enhance tool use by predicting and verifying intermediate steps before proceeding. |
2025.0604 | R-Search | CAS | Paper GitHub |
—— | —— | ClickPresents a multi-reward RL framework enabling LLMs to integrate reasoning with search, improving performance on complex logic and knowledge tasks. |
2025.0605 | MedAgentGym | Emory Univ. | Paper GitHub |
—— | —— | ClickIntroduces a training environment for LLM agents focused on code-based medical reasoning, facilitating the development of AI in healthcare applications. |
2025.0605 | CI-RL | Purdue&Microsoft | Paper | —— | —— | ClickApplies reinforcement learning to enhance contextual integrity in LLMs, aligning their outputs with privacy and safety norms. |
2025.0611 | Grounding-R1 | Salesforce | Blog | —— | —— | ClickGUI grounding via GRPO RL—clicks relevant areas without bounding-box or rationale supervision. |
2025.0611 | Agent-RLVR | Scale AI | Paper | —— | —— | ClickTrains software agents using both environmental feedback and expert guidance—targeting real-world SE tasks. |
2025.0611 | ReVeal | MAR & THU | Paper | —— | —— | ClickSelf-evolving agents improve code generation via iterative RL-based generate–verify cycles. |
2025.0611 | CAGSR-vLLM-MTC | UC Berkeley | Paper | —— | —— | ClickEnhances multi-turn reasoning via vLLM + self-supervised fine-tuning + RL on CoT traces. |
2025.0x0x |
Paper GitHub |
hf models | hf datasets | Clickinsights and contributions about RL for reasoning within 30 words. |
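
Many agentic entries above (e.g., Search-R1, ReSearch, R1-Searcher, ReTool, Tool-Star) train on rollouts in which model-emitted tool calls are executed by the environment and the results are appended to the context before generation resumes. The loop below is a schematic sketch of that interleaved rollout; the tag names, the `generate_until` helper, and the stub policy/retriever are placeholders, not the APIs of any listed project.

```python
import re

def run_search(query):
    """Stub retriever: a real setup would query a search engine or a local corpus."""
    return f"[stub results for: {query.strip()}]"

class DummyPolicy:
    """Stand-in for an LLM policy; emits one search call, then a final answer."""
    def __init__(self):
        self.turn = 0

    def generate_until(self, context, stop):
        self.turn += 1
        if self.turn == 1:
            return "<search>capital of France</search>"
        return "<answer>Paris</answer>"

def interleaved_rollout(policy, prompt, max_turns=4):
    """Schematic multi-turn rollout: <search> actions are executed and fed back
    as <information> blocks until the model emits a final <answer>."""
    context = prompt
    for _ in range(max_turns):
        completion = policy.generate_until(context, stop=["</search>", "</answer>"])
        context += completion
        if "</answer>" in completion:
            break
        query = re.search(r"<search>(.*?)</search>", completion, re.S)
        if query:
            context += f"<information>{run_search(query.group(1))}</information>"
    return context  # the outcome reward is typically computed on the final <answer>

print(interleaved_rollout(DummyPolicy(), "Question: What is the capital of France?\n"))
```
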
If you have any updates or improvements for this document, please feel free to submit a Pull Request. Thank you!
Project or Paper | Project name or Paper title |
---|---|
GitHub | Username/Project |
Backbone Model | (Base / Instruct / Reasoning; HF Model) |
RL Algorithm | (PPO / GRPO / RLOO / REINFORCE++; OpenRLHF / Verl / Trl) |
Training Dataset | (Size / Source / HF Dataset) |
Rollout Configuration | (Batch Size * N Samples ; Temperature; Dynamic Sampling) |
Reward Function | (Outcome; Process; Repetition & Length) |
Policy Optimization | (KL Loss; Length Penalty; Token-level loss) |
Benchmark | (MATH/GPQA; R1 level; GPT-4o level) |
Core Insights | (Empirical / Theoretical / Insightful Curves) |
Additional Notes | (e.g., code snippet) |
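
For the "Reward Function" and "Policy Optimization" rows of the template, here is a minimal sketch of the rule-based outcome reward with a soft over-length penalty that many of the recipes above rely on; the answer tag, the whitespace token count, and the constants are illustrative (the penalty shape loosely follows DAPO's overlong shaping), not the exact rules of any listed project.

```python
import re

def outcome_reward(response, gold_answer, max_len=4096, overlong_buffer=512):
    """Rule-based verifiable reward: 1/0 for correctness of the tagged final answer,
    minus a penalty that ramps up linearly once the response exceeds the length
    budget (all constants are illustrative)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    correct = match is not None and match.group(1).strip() == gold_answer.strip()
    reward = 1.0 if correct else 0.0

    n_tokens = len(response.split())  # crude whitespace proxy for a tokenizer
    if n_tokens > max_len:
        overflow = min(n_tokens - max_len, overlong_buffer)
        reward -= overflow / overlong_buffer
    return reward

print(outcome_reward("... reasoning ... <answer>42</answer>", "42"))  # 1.0
```
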
If you find our repository useful in your research, please star us ⭐ and consider citing:
@misc{zhang2025TripleR,
title={Awesome RL Recipes for Reasoning},
author={Kaiyan Zhang and Yuchen Fan and Yuxin Zuo and Guoli Jia and Kai Tian and Xingtai Lv and Xuekai Zhu and Ermo Hua and Ning Ding and Biqing Qi and Bowen Zhou},
year={2025},
howpublished={\url{https://github.com/}},
note={Github Repository},
}