We are witnessing an exciting era in which LLM capabilities have advanced rapidly within just a few years, bringing lower costs and stronger performance to real-world applications.
The next key step is to enhance Language Agents' ability to handle diverse tasks, which is crucial for deployment. We also focus on optimizing their structure and training methods to improve task completion rates.
Training Language Agents is an essential yet still emerging technology. This repository is dedicated to pushing the boundaries and exploring new possibilities in this field.
- [2503] ATLaS: Agent Tuning via Learning Critical Steps (UTS, UMD) (Learns not from every step of the behavior, but only from the critical ones.)
- [2502] STeCa: Step-level Trajectory Calibration for LLM Agent Learning (PolyU) [The behaviors to be learned are not always gold-standard, so calibrate them at the step level.]
- [2502] Training a Generally Curious Agent (CMU) [Uses rejection sampling on self-generated data to teach the model better behaviors; reads like a combination of ArCHer and PAE, i.e., multi-turn decision making on self-generated data. See the rejection-sampling + SFT sketch after this list.]
- [2501] Improving Vision-Language-Action Model with Online Reinforcement Learning (THU) [Generates successful trajectories with RL, then uses SL to learn from the new data; a two-stage method (iRe-VLA).]
- [2412] Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents (ICML 25 | UC Berkeley) [Good behavior data is limited, so synthesize more of it.]
- [2406] Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement (EMNLP 24 | PKU) [Outcome-level rewards supervise good behavior less effectively than step-level process rewards.]
- [2310] AgentTuning: Enabling Generalized Agent Abilities for LLMs (ICLR 24-r | Tsinghua, 144) [Similar to FireAct]
- [2310] ⭐️ FireAct: Toward Language Agent Fine-tuning (ICLR 23 | Princeton, 107) [Fine-tuning Llama-2-7B on 500 agent trajectories generated by GPT-4 yields a 77% HotpotQA performance increase.]
- [1700] World of Bits: An Open-Domain Platform for Web-Based Agents (ICML 17 | Stanford, 237) [A foundational, originator-level paper for web agents.]
- Comment: I think iRe-VLA and PAE are similar.
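The common thread in the list above (FireAct, PAE, iRe-VLA, the curiosity-driven CMU paper) is: roll out the agent, keep only the trajectories that succeed, and fine-tune on them with ordinary supervised learning. Below is a minimal, toy-scale sketch of that rejection-sampling + behavior-cloning loop; the environment, policy, and success criterion are all invented for illustration and are not taken from any of the papers.

```python
# Minimal rejection-sampling + behavior-cloning sketch (toy scale; illustrative only).
# Roll out the current policy, keep only successful trajectories, imitate them with cross-entropy.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

N_STATES, N_ACTIONS, HORIZON = 8, 4, 3
policy = nn.Sequential(nn.Embedding(N_STATES, 32), nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def run_episode():
    """Toy environment: the rollout counts as a success iff the agent picks action 0 at every step."""
    states, actions, state = [], [], random.randrange(N_STATES)
    with torch.no_grad():
        for _ in range(HORIZON):
            logits = policy(torch.tensor([state]))[0]
            action = torch.distributions.Categorical(logits=logits).sample().item()
            states.append(state)
            actions.append(action)
            state = (state + action + 1) % N_STATES
    return states, actions, all(a == 0 for a in actions)

for it in range(50):
    # 1) Rejection sampling: collect rollouts and keep only the successful ones.
    kept = []
    for _ in range(32):
        states, actions, success = run_episode()
        if success:
            kept.append((states, actions))
    if not kept:
        continue
    # 2) Behavior cloning: maximize the log-likelihood of the kept trajectories.
    loss = torch.tensor(0.0)
    for states, actions in kept:
        logits = policy(torch.tensor(states))            # (T, N_ACTIONS)
        loss = loss + F.cross_entropy(logits, torch.tensor(actions))
    loss = loss / len(kept)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"iter {it}: kept {len(kept)}/32 rollouts, BC loss = {loss.item():.3f}")
```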
1.2 Behavior Cloning (Learning from Both Good and Bad Behaviors, Utilizing Mistakes Selectively/Comparatively)
- [2501] ⭐️ Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training (ByteDance) [Plain behavior cloning is not robust in the real world; learning from mistakes (self-critique) is important. Uses MCTS to create datasets; self-reflection.]
- [2409] E2CL: Exploration-based Error Correction Learning for Embodied Agents (EMNLP2024) [Empowering embodied agents with self-correction through exploration-induced errors and environmental feedback, enabling adaptive alignment and improved task performance.]
- [2408] Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (ICLR 25-R | MultiOn, Stanford) (a combination of DPO, MCTS, and process supervision for web navigation tasks)
- [2403] Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents (ACL 24 | PKU) [Lets agents learn from their exploration failures: gathers failure trajectories to build contrastive trajectory pairs, then trains with DPO.]
- Comment: Compared with learning only from ideally good existing behaviors, the challenge here is that the collected behaviors are not always right, so they are either filtered out or contrasted via DPO (see the trajectory-level DPO sketch below).
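Trial and Error (ETO) and Agent Q above turn success/failure trajectory pairs into preference data and train with a DPO-style loss. Below is a minimal sketch of that loss at the trajectory level, assuming you have already summed the token log-probabilities of each full trajectory under the trainable policy and a frozen reference model; the function and argument names are illustrative, not taken from either paper.

```python
# Minimal sketch of a DPO preference loss over chosen (successful) vs. rejected (failed)
# trajectories. Inputs are per-trajectory sums of token log-probs; names are illustrative.
import torch
import torch.nn.functional as F

def trajectory_dpo_loss(
    policy_logp_chosen: torch.Tensor,    # (B,) sum of log pi(action_t | history) over the good trajectory
    policy_logp_rejected: torch.Tensor,  # (B,) same for the failed trajectory
    ref_logp_chosen: torch.Tensor,       # (B,) log-probs under the frozen reference policy
    ref_logp_rejected: torch.Tensor,     # (B,)
    beta: float = 0.1,
) -> torch.Tensor:
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * (chosen_margin - rejected_margin)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up log-probabilities:
loss = trajectory_dpo_loss(
    torch.tensor([-12.0]), torch.tensor([-15.0]),
    torch.tensor([-13.0]), torch.tensor([-14.0]),
)
print(loss.item())
```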
- [2506] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
- [2505] Group-in-Group Policy Optimization for LLM Agent Training
- [2505] SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution [Advanced reward redistribution for multi-turn RL agents, transforming delayed rewards into stepwise progress signals]
- [2504] ⭐️ RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning (NWU) [Multi-turn RL may be part of the success behind models like o4.]
- [2504] Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models [supplementary material]
- [2503] ⭐️ SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (Meta, UC Berkeley) [We want agents not only to complete a task, but to complete it in the way we want it done; proposes the new ColBench benchmark.]
- [2502] Multi-Turn Code Generation Through Single-Step Rewards [supplementary material]
- [2502] EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning (UCAS) [ArCHer plus]
- [2409] Building Math Agents with Multi-Turn Iterative Preference Learning (ICLR 25 | UIUC) []
- [2406] Direct Multi-Turn Preference Optimization for Language Agents [Multi-turn DPO that conditions on previous trajectory turns and designs a weighting strategy.]
- [2402] ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL (ICML 24 | UC Berkeley) [Optimizes for the end goal with multi-turn RL rather than short-term goals; frames multi-step tasks as a two-level hierarchical MDP, where the high-level MDP treats whole completions as actions and the low-level MDP treats tokens as actions.]
- [2505] Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning (considering causal weight)
- [2504] ⭐️ LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities (DeepMind) [failure modes in decision-making: greediness, frequency bias, and the knowing-doing gap. Mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales.]
- [2503] UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning (VIVO Lab, CUHK) [Explores GUI agents with GRPO, inspired by DeepSeek-R1; see the group-relative advantage sketch after this list.]
- [2502] Digi-Q: Learning Q-Value Functions for Training Device-Control Agents (ICLR 25 | UC Berkeley) [DigiRL plus a VLM-based Q-function]
- [2411] WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning (ICLR 25 | Tsinghua) (DigiRL plus)
- [2406] ⭐️ DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (NeurIPS 24 | UC Berkeley) [Stage 1: learn in a clean environment; stage 2: learn in the real world, because the real world contains many unexpected situations.]
- [2302] ⭐️ Grounding large language models in interactive environments with online reinforcement learning (ICML 23 | Huggingface, 196)
- [2401] ⭐️ True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning (ICLR 24 | NTU, 47)
- Comment: WebRL and DigiRL can be unstable and highly sensitive to hyperparameters and reward design, as noted by PLAN-AND-ACT.
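UI-R1 above applies GRPO-style training to GUI actions: sample a group of rollouts per task, score them with a rule-based reward, and use the group-normalized reward as each rollout's advantage. Below is a minimal sketch of just that advantage computation; the full objective also includes PPO-style clipping and a KL penalty, which are omitted, and the names are illustrative rather than taken from the paper.

```python
# Minimal sketch of GRPO-style group-relative advantages (as popularized by DeepSeek-R1
# and applied to GUI agents in UI-R1). Clipping and KL terms of the full objective are omitted.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G rollouts of the same task."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts of one GUI task, rewarded 1.0 on success and 0.0 on failure.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)   # positive for successes, negative for failures
print(advantages)

# The policy-gradient term then weights each rollout's token log-probs by its advantage, e.g.:
# loss = -(advantages.detach()[:, None] * token_logps).mean()   (clipping / KL omitted)
```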
- [2507] MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent [GIGPO plus]
- [2505] Group-in-Group Policy Optimization for LLM Agent Training [Adds sliding windows to StarPO to account for long-horizon tasks]
- [2503] PLAN-AND-ACT: Improving Planning of Agents for Long-Horizon Tasks (UC Berkeley) [Recent work separates high-level planning from low-level execution, enabling a better balance between objectives and details. However, generating accurate plans remains difficult because LLMs are not inherently trained for this; this paper focuses on generating good plans (a minimal planner/executor sketch follows this list).]
- [2503] MPO: Boosting LLM Agents with Meta Plan Optimization (PKU) [Improves planning via meta plans]
- [2502] Reinforcement Learning for Long-Horizon Interactive LLM Agents (Apple) [Uses the AppWorld benchmark rather than WebShop]
- Comment: Generating data seems to be used as a strategy.
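PLAN-AND-ACT and MPO above both argue for separating a high-level planner from a low-level executor. The sketch below shows that split in its simplest form; the planner and executor are stub functions standing in for LLM calls, and all names and strings are illustrative, not taken from either paper.

```python
# Minimal planner/executor split for long-horizon tasks, in the spirit of PLAN-AND-ACT:
# a high-level plan is produced first, then low-level actions are emitted per step.
from typing import List

def planner(goal: str) -> List[str]:
    # Stub: a real system would prompt a planner LLM for a step-by-step plan.
    return [f"search for '{goal}'", "open the best result", "extract the answer"]

def executor(step: str, observation: str) -> str:
    # Stub: a real system would prompt an executor LLM to emit a concrete environment action.
    return f"ACTION({step!r}) given OBS({observation!r})"

def run(goal: str) -> None:
    observation = "initial page"
    for step in planner(goal):                  # high-level plan, fixed up front
        action = executor(step, observation)    # low-level action, conditioned on the latest observation
        observation = f"result of {action}"     # environment transition (stubbed)
        print(f"{step:35s} -> {action}")

run("cheapest flight to Tokyo")
```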
- [2505] MARTI: A Framework for Multi-Agent LLM Systems - Reinforced Training and Inference (Tsinghua)
- [2504] MARFT: Multi-Agent Reinforcement Fine-Tuning (SJTU)
- [2503] M3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality (KCL)
- [2502] Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search (USTC) []
- [2410] OPTIMA: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System (ICLR 25-r | Tsinghua) []
- [2406] Autonomous Agents for Collaborative Task under Information Asymmetry (NeurIPS 24 | Tsinghua) [Collaboration under information asymmetry, without training; a basis for SWEET-RL.]
- Comment: Generating data seems to be used as a strategy.
- [2506] Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction (CMU, UIUC)
- [2505] WEBCOT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback (CityU, Tencent) ()
- [2505] BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism (Xiaomi)
- [2504] Enhancing Web Agents with Explicit Rollback Mechanisms (Tencent)
- [2406] ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (Tsinghua 131)
- [2401] Self-Rewarding Language Models (Meta 550)
- [2312] ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent (Deepmind 46)
- [2308] Reinforced Self-Training (ReST) for Language Modeling (Deepmind 265) (SWEET-RL is similar to this)
- [2305] Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Stanford 3295)
- [2203] Training language models to follow instructions with human feedback (OpenAI 14011)
- [2203] STaR: Bootstrapping Reasoning With Reasoning (Stanford 704)
- [2009] Learning to summarize from human feedback (OpenAI 2187)
- [1706] Deep Reinforcement Learning from Human Preferences (OpenAI 3940)
- [2405] AGILE: A Novel Reinforcement Learning Framework of LLM Agents (NeurIPS 24 | ByteDance) [Work in the same vein as ReAct and WebGPT; also presents a dataset.]
- [2303] Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 23 | Shunyu Yao, 1420)
- [2210] ReAct: Synergizing Reasoning and Acting in Language Models (ICLR 23 | Shunyu Yao, 2731) [code] (a minimal Thought/Action/Observation loop sketch follows this list)
- [2112] WebGPT: Browser-assisted question-answering with human feedback (OpenAI, 1275)
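ReAct above is the template most agents in this list build on: the model alternates free-form thoughts with actions, each action produces an observation, and everything is appended back into the context. Here is a minimal sketch of that loop with a stubbed model and tool; the strings and tags are illustrative and do not follow ReAct's exact prompt format.

```python
# Minimal ReAct-style loop: Thought + Action from the model, Observation from the environment,
# all appended back into the growing prompt. Model and tool are stubs so the sketch runs.
def llm(prompt: str) -> str:
    # Stub: a real agent would call a language model here.
    if "Observation:" not in prompt:
        return "Thought: I should look this up.\nAction: search[capital of France]"
    return "Thought: I have the answer.\nAction: finish[Paris]"

def run_tool(action: str) -> str:
    # Stub tool executor; a real agent would dispatch to a search engine, browser, etc.
    if "search[" in action:
        return "Observation: Paris is the capital and most populous city of France."
    return "Observation: (no tool call)"

def react(question: str, max_turns: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm(prompt)                  # model emits Thought + Action
        prompt += step + "\n"
        if "finish[" in step:               # terminal action: stop the loop
            break
        prompt += run_tool(step) + "\n"     # environment feedback goes back into the context
    return prompt

print(react("What is the capital of France?"))
```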
Difference between Multi-turn RL and Single-turn RL
Single-turn RL: Like scratching a lottery ticket: see the ticket, scratch it, see whether you won, and the round ends. For an LLM: the user enters a prompt, the model returns an answer, the answer is scored, and the episode ends.
Multi-turn RL: Like playing Super Mario: you keep seeing new screens, taking new actions, and collecting rewards (gold coins or deaths). For an LLM: the input to each new answer is not only the current prompt but also the previous prompts, answers, and rewards (i.e., the trajectory so far).
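To make the contrast concrete: in the single-turn case each prompt/answer pair is an independent training example, while in the multi-turn case each turn conditions on the whole history and the delayed outcome has to be propagated back across turns, e.g. as a discounted return. Below is a minimal sketch over a made-up shopping trajectory; the formats, rewards, and field names are invented for illustration.

```python
# Minimal sketch: turning one multi-turn trajectory into per-turn training examples.
# Each turn's context is the full history so far; its learning signal is the discounted return.
from typing import List, Tuple

# One episode: (user_prompt, model_answer, reward) per turn. Only the last turn succeeds.
trajectory: List[Tuple[str, str, float]] = [
    ("find a red t-shirt under $20",      "search[red t-shirt]", 0.0),
    ("results: item A $25, item B $18",   "click[item B]",       0.0),
    ("item B page, price $18",            "buy[item B]",         1.0),
]

def per_turn_examples(traj, gamma: float = 0.95):
    examples = []
    for t, (prompt, answer, _) in enumerate(traj):
        # Multi-turn: the input is the *entire* history so far, not just the current prompt.
        history = "".join(f"USER: {p}\nAGENT: {a}\n" for p, a, _ in traj[:t])
        context = history + f"USER: {prompt}\n"
        # Credit assignment: discounted sum of this turn's reward and everything after it.
        ret = sum(gamma ** (k - t) * r for k, (_, _, r) in enumerate(traj) if k >= t)
        examples.append({"context": context, "target": answer, "return": ret})
    return examples

for ex in per_turn_examples(trajectory):
    print(f"return={ex['return']:.2f}  context_chars={len(ex['context']):3d}  target={ex['target']}")
```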
- [2504] Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation [Behavior cloning teaches the LLM correct trajectories during training; this method instead corrects the LLM's trajectory at test time using a trained process-reward model. But reality is complex and changing: doesn't the reward model itself still have to keep learning?]
- [2504] ⭐️ τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (ICLR 25 | Shunyu Yao)
- [2504] Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use (Stanford)
- [2504] OTC: Optimal Tool Calls via Reinforcement Learning (CUHK)
- [2503] R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning (RUC)
- [2503] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning (UIUC) (see the search-interleaved rollout sketch after this list)
- [2503] ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning (Baichuan)
- [2504] Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey [Meta Thinking]
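Search-R1, R1-Searcher, and ReSearch above all train the policy to emit search queries inside special tags; at rollout time, generation pauses at the tag, the query is executed, and the retrieved text is spliced back into the context before generation resumes. Below is a minimal sketch of that interleaving with a stubbed model and retriever; the exact tag names differ across the papers, and nothing here is their actual code.

```python
# Minimal sketch of search-interleaved generation (Search-R1 / R1-Searcher style):
# pause at <search>query</search>, run retrieval, append <information>...</information>, resume.
import re

def generate(prompt: str) -> str:
    # Stub LLM: first emits a search call, then an answer once information is present.
    if "<information>" not in prompt:
        return "I need to check this. <search>capital of France</search>"
    return "<answer>Paris</answer>"

def retrieve(query: str) -> str:
    # Stub retriever: a real system would call a search engine or dense retriever.
    return "Paris is the capital and largest city of France."

def rollout(question: str, max_calls: int = 3) -> str:
    text = f"Question: {question}\n"
    for _ in range(max_calls):
        chunk = generate(text)
        text += chunk
        match = re.search(r"<search>(.*?)</search>", chunk, re.DOTALL)
        if match is None:                        # no tool call -> the model has answered
            break
        docs = retrieve(match.group(1))          # execute the emitted query
        text += f"\n<information>{docs}</information>\n"
    return text

print(rollout("What is the capital of France?"))
```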
RAGEN (Training agent)
Search-R1 (Train your LLMs to reason and call a search engine with reinforcement learning)
OpenManus-RL (A live-stream development of RL tuning for LLM agents)
MetaSpatial (Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse)
- Feel free to contribute more papers or any other resources!