We are witnessing an exciting era in which LLM capabilities have advanced rapidly within just a few years, bringing lower costs and stronger performance to real-world applications.
The next key step is to enhance Language Agents' ability to handle diverse tasks, which is crucial for deployment. We also focus on optimizing their structure and training methods to improve task completion rates.
Training Language Agents is an essential yet still emerging technology. This repository is dedicated to pushing the boundaries and exploring new possibilities in this field.
- [2503] ATLaS: Agent Tuning via Learning Critical Steps (UTS, UMD) (Learns not from every step of the behavior, but only from the critical ones.)
- [2502] STeCa: Step-level Trajectory Calibration for LLM Agent Learning (PolyU) [The behaviors to be learned are not always gold-standard, so calibrate them at the step level.]
- [2502] Training a Generally Curious Agent (CMU) [Uses rejection sampling on self-generated data to teach the model better behaviors; reads like a combination of ArCHer and PAE, i.e., multi-turn decision making on self-generated data. See the rejection-sampling + SFT sketch after this list.]
- [2501] Improving Vision-Language-Action Model with Online Reinforcement Learning (THU) [Generates successful trajectories with RL, then uses SL to learn from the new data; a two-stage method (iRe-VLA).]
- [2412] Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents (ICML 25 | UC Berkeley) [Good behavior data is limited, so synthesize more of it.]
- [2406] Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement (EMNLP 24 | PKU) [Outcome-level rewards supervise good behavior less effectively than step-level process rewards.]
- [2310] AgentTuning: Enabling Generalized Agent Abilities for LLMs (ICLR 24-r | Tsinghua, 144) [Similar to FireAct]
- [2310] ⭐️ FireAct: Toward Language Agent Fine-tuning (ICLR 23 | Princeton, 107) [Fine-tuning Llama-2-7B on 500 agent trajectories generated by GPT-4 yields a 77% HotpotQA performance increase.]
- [1700] World of Bits: An Open-Domain Platform for Web-Based Agents (ICML 17 | Stanford, 237) [A foundational, originator-level paper for web agents.]
- Comment: I think iRe-VLA and PAE are similar.
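The common thread in the list above (FireAct, PAE, iRe-VLA, the curiosity-driven CMU paper) is: roll out the agent, keep only the trajectories that succeed, and fine-tune on them with ordinary supervised learning. Below is a minimal, toy-scale sketch of that rejection-sampling + behavior-cloning loop; the environment, policy, and success criterion are all invented for illustration and are not taken from any of the papers.

```python
# Minimal rejection-sampling + behavior-cloning sketch (toy scale; illustrative only).
# Roll out the current policy, keep only successful trajectories, imitate them with cross-entropy.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

N_STATES, N_ACTIONS, HORIZON = 8, 4, 3
policy = nn.Sequential(nn.Embedding(N_STATES, 32), nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def run_episode():
    """Toy environment: the rollout counts as a success iff the agent picks action 0 at every step."""
    states, actions, state = [], [], random.randrange(N_STATES)
    with torch.no_grad():
        for _ in range(HORIZON):
            logits = policy(torch.tensor([state]))[0]
            action = torch.distributions.Categorical(logits=logits).sample().item()
            states.append(state)
            actions.append(action)
            state = (state + action + 1) % N_STATES
    return states, actions, all(a == 0 for a in actions)

for it in range(50):
    # 1) Rejection sampling: collect rollouts and keep only the successful ones.
    kept = []
    for _ in range(32):
        states, actions, success = run_episode()
        if success:
            kept.append((states, actions))
    if not kept:
        continue
    # 2) Behavior cloning: maximize the log-likelihood of the kept trajectories.
    loss = torch.tensor(0.0)
    for states, actions in kept:
        logits = policy(torch.tensor(states))            # (T, N_ACTIONS)
        loss = loss + F.cross_entropy(logits, torch.tensor(actions))
    loss = loss / len(kept)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"iter {it}: kept {len(kept)}/32 rollouts, BC loss = {loss.item():.3f}")
```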
1.2 Behavior Cloning (Learning from Both Good and Bad Behaviors, Utilizing Mistakes Selectively/Comparatively)
- [2501] ⭐️ Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training (ByteDance) [Plain behavior cloning is not robust in the real world; learning from mistakes (self-critique) is important. Uses MCTS to create datasets; self-reflection.]
- [2409] E2CL: Exploration-based Error Correction Learning for Embodied Agents (EMNLP2024) [Empowering embodied agents with self-correction through exploration-induced errors and environmental feedback, enabling adaptive alignment and improved task performance.]
- [2408] Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (ICLR 25-R | MultiOn, Stanford) (a combination of DPO, MCTS, and process supervision for web navigation tasks)
- [2403] Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents (ACL 24 | PKU) [Lets agents learn from their exploration failures: gathers failure trajectories to build contrastive trajectory pairs, then trains with DPO.]
- Comment: Compared with learning only from ideally good existing behaviors, the challenge here is that the collected behaviors are not always right, so they are either filtered out or contrasted via DPO (see the trajectory-level DPO sketch below).
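Trial and Error (ETO) and Agent Q above turn success/failure trajectory pairs into preference data and train with a DPO-style loss. Below is a minimal sketch of that loss at the trajectory level, assuming you have already summed the token log-probabilities of each full trajectory under the trainable policy and a frozen reference model; the function and argument names are illustrative, not taken from either paper.

```python
# Minimal sketch of a DPO preference loss over chosen (successful) vs. rejected (failed)
# trajectories. Inputs are per-trajectory sums of token log-probs; names are illustrative.
import torch
import torch.nn.functional as F

def trajectory_dpo_loss(
    policy_logp_chosen: torch.Tensor,    # (B,) sum of log pi(action_t | history) over the good trajectory
    policy_logp_rejected: torch.Tensor,  # (B,) same for the failed trajectory
    ref_logp_chosen: torch.Tensor,       # (B,) log-probs under the frozen reference policy
    ref_logp_rejected: torch.Tensor,     # (B,)
    beta: float = 0.1,
) -> torch.Tensor:
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * (chosen_margin - rejected_margin)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up log-probabilities:
loss = trajectory_dpo_loss(
    torch.tensor([-12.0]), torch.tensor([-15.0]),
    torch.tensor([-13.0]), torch.tensor([-14.0]),
)
print(loss.item())
```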
- [2506] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
- [2505] Group-in-Group Policy Optimization for LLM Agent Training
- [2505] SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution [Advanced reward redistribution for multi-turn RL agents, transforming delayed rewards into stepwise progress signals]
- [2504] ⭐️ RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning (NWU) [Multi-turn RL may be part of the success behind models like o4.]
- [2504] Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models [supplementary material]
- [2503] ⭐️ SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (Meta, UC Berkeley) [We want agents not only to complete a task, but to complete it in the way we want it done; proposes the new ColBench benchmark.]
- [2502] Multi-Turn Code Generation Through Single-Step Rewards [supplementary material]
- [2502] EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning (UCAS) [ArCHer plus]
- [2409] Building Math Agents with Multi-Turn Iterative Preference Learning (ICLR 25 | UIUC) []
- [2406] Direct Multi-Turn Preference Optimization for Language Agents [Multi-turn DPO that conditions on previous trajectory turns and designs a weighting strategy.]
- [2402] ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL (ICML 24 | UC Berkeley) [Optimizes for the end goal with multi-turn RL rather than short-term goals; frames multi-step tasks as a two-level hierarchical MDP, where the high-level MDP treats whole completions as actions and the low-level MDP treats tokens as actions.]
- [2505] Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning (considering causal weight)
- [2504] ⭐️ LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities (DeepMind) [failure modes in decision-making: greediness, frequency bias, and the knowing-doing gap. Mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales.]
- [2503] UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning (VIVO Lab, CUHK) [Explores GUI agents with GRPO, inspired by DeepSeek-R1; see the group-relative advantage sketch after this list.]
- [2502] Digi-Q: Learning Q-Value Functions for Training Device-Control Agents (ICLR 25 | UC Berkeley) [DigiRL plus a VLM-based Q-function]
- [2411] WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning (ICLR 25 | Tsinghua) (DigiRL plus)
- [2406] ⭐️ DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (NeurIPS 24 | UC Berkeley) [Stage 1: learn in a clean environment; stage 2: learn in the real world, because the real world contains many unexpected situations.]
- [2302] ⭐️ Grounding large language models in interactive environments with online reinforcement learning (ICML 23 | Huggingface, 196)
- [2401] ⭐️ True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning (ICLR 24 | NTU, 47)
- Comment: WebRL and DigiRL can be unstable and highly sensitive to hyperparameters and reward design, as noted by PLAN-AND-ACT.
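UI-R1 above applies GRPO-style training to GUI actions: sample a group of rollouts per task, score them with a rule-based reward, and use the group-normalized reward as each rollout's advantage. Below is a minimal sketch of just that advantage computation; the full objective also includes PPO-style clipping and a KL penalty, which are omitted, and the names are illustrative rather than taken from the paper.

```python
# Minimal sketch of GRPO-style group-relative advantages (as popularized by DeepSeek-R1
# and applied to GUI agents in UI-R1). Clipping and KL terms of the full objective are omitted.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G rollouts of the same task."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts of one GUI task, rewarded 1.0 on success and 0.0 on failure.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)   # positive for successes, negative for failures
print(advantages)

# The policy-gradient term then weights each rollout's token log-probs by its advantage, e.g.:
# loss = -(advantages.detach()[:, None] * token_logps).mean()   (clipping / KL omitted)
```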
- [2507] MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent [GIGPO plus]
- [2505] Group-in-Group Policy Optimization for LLM Agent Training [Adds sliding windows to StarPO to account for long-horizon tasks]
- [2503] PLAN-AND-ACT: Improving Planning of Agents for Long-Horizon Tasks (UC Berkeley) [Recent work separates high-level planning from low-level execution, enabling a better balance between objectives and details. However, generating accurate plans remains difficult because LLMs are not inherently trained for this; this paper focuses on generating good plans (a minimal planner/executor sketch follows this list).]
- [2503] MPO: Boosting LLM Agents with Meta Plan Optimization (PKU) [Improves planning via meta plans]
- [2502] Reinforcement Learning for Long-Horizon Interactive LLM Agents (Apple) [Uses the AppWorld benchmark rather than WebShop]
- Comment: Generating data seems to be used as a strategy.
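PLAN-AND-ACT and MPO above both argue for separating a high-level planner from a low-level executor. The sketch below shows that split in its simplest form; the planner and executor are stub functions standing in for LLM calls, and all names and strings are illustrative, not taken from either paper.

```python
# Minimal planner/executor split for long-horizon tasks, in the spirit of PLAN-AND-ACT:
# a high-level plan is produced first, then low-level actions are emitted per step.
from typing import List

def planner(goal: str) -> List[str]:
    # Stub: a real system would prompt a planner LLM for a step-by-step plan.
    return [f"search for '{goal}'", "open the best result", "extract the answer"]

def executor(step: str, observation: str) -> str:
    # Stub: a real system would prompt an executor LLM to emit a concrete environment action.
    return f"ACTION({step!r}) given OBS({observation!r})"

def run(goal: str) -> None:
    observation = "initial page"
    for step in planner(goal):                  # high-level plan, fixed up front
        action = executor(step, observation)    # low-level action, conditioned on the latest observation
        observation = f"result of {action}"     # environment transition (stubbed)
        print(f"{step:35s} -> {action}")

run("cheapest flight to Tokyo")
```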
- [2505] MARTI: A Framework for Multi-Agent LLM Systems - Reinforced Training and Inference (Tsinghua)
- [2504] MARFT: Multi-Agent Reinforcement Fine-Tuning (SJTU)
- [2503] M3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality (KCL)
- [2502] Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search (USTC) []
- [2410] OPTIMA: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System (ICLR 25-r | Tsinghua) []
- [2406] Autonomous Agents for Collaborative Task under Information Asymmetry (NeurIPS 24 | Tsinghua) [Collaboration under information asymmetry, without training; a basis for SWEET-RL.]
- Comment: Generating data seems to be used as a strategy.
- [2506] Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction (CMU, UIUC)
- [2505] WEBCOT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback (CityU, Tencent) ()
- [2505] BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism (Xiaomi)
- [2504] Enhancing Web Agents with Explicit Rollback Mechanisms (Tencent)
- [2406] ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (Tsinghua 131)
- [2401] Self-Rewarding Language Models (Meta 550)
- [2312] ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent (Deepmind 46)
- [2308] Reinforced Self-Training (ReST) for Language Modeling (Deepmind 265) (SWEET-RL is similar to this)
- [2305] Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Stanford 3295)
- [2203] Training language models to follow instructions with human feedback (OpenAI 14011)
- [2203] STaR: Bootstrapping Reasoning With Reasoning (Stanford 704)
- [2009] Learning to summarize from human feedback (OpenAI 2187)
- [1706] Deep Reinforcement Learning from Human Preferences (OpenAI 3940)
- [2405] AGILE: A Novel Reinforcement Learning Framework of LLM Agents (NeurIPS 24 | ByteDance) [Work in the same vein as ReAct and WebGPT; also presents a dataset.]
- [2303] Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 23 | Shunyu Yao, 1420)
- [2210] ReAct: Synergizing Reasoning and Acting in Language Models (ICLR 23 | Shunyu Yao, 2731) [code] (a minimal Thought/Action/Observation loop sketch follows this list)
- [2112] WebGPT: Browser-assisted question-answering with human feedback (OpenAI, 1275)
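ReAct above is the template most agents in this list build on: the model alternates free-form thoughts with actions, each action produces an observation, and everything is appended back into the context. Here is a minimal sketch of that loop with a stubbed model and tool; the strings and tags are illustrative and do not follow ReAct's exact prompt format.

```python
# Minimal ReAct-style loop: Thought + Action from the model, Observation from the environment,
# all appended back into the growing prompt. Model and tool are stubs so the sketch runs.
def llm(prompt: str) -> str:
    # Stub: a real agent would call a language model here.
    if "Observation:" not in prompt:
        return "Thought: I should look this up.\nAction: search[capital of France]"
    return "Thought: I have the answer.\nAction: finish[Paris]"

def run_tool(action: str) -> str:
    # Stub tool executor; a real agent would dispatch to a search engine, browser, etc.
    if "search[" in action:
        return "Observation: Paris is the capital and most populous city of France."
    return "Observation: (no tool call)"

def react(question: str, max_turns: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm(prompt)                  # model emits Thought + Action
        prompt += step + "\n"
        if "finish[" in step:               # terminal action: stop the loop
            break
        prompt += run_tool(step) + "\n"     # environment feedback goes back into the context
    return prompt

print(react("What is the capital of France?"))
```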
Difference between Multi-turn RL and Single-turn RL
Single-turn RL: Like scratching a lottery ticket: see the ticket, scratch it, see whether you won, and the round ends. For an LLM: the user enters a prompt, the model returns an answer, the answer is scored, and the episode ends.
Multi-turn RL: Like playing Super Mario: you keep seeing new screens, taking new actions, and collecting rewards (gold coins or deaths). For an LLM: the input to each new answer is not only the current prompt but also the previous prompts, answers, and rewards (i.e., the trajectory so far).
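To make the contrast concrete: in the single-turn case each prompt/answer pair is an independent training example, while in the multi-turn case each turn conditions on the whole history and the delayed outcome has to be propagated back across turns, e.g. as a discounted return. Below is a minimal sketch over a made-up shopping trajectory; the formats, rewards, and field names are invented for illustration.

```python
# Minimal sketch: turning one multi-turn trajectory into per-turn training examples.
# Each turn's context is the full history so far; its learning signal is the discounted return.
from typing import List, Tuple

# One episode: (user_prompt, model_answer, reward) per turn. Only the last turn succeeds.
trajectory: List[Tuple[str, str, float]] = [
    ("find a red t-shirt under $20",      "search[red t-shirt]", 0.0),
    ("results: item A $25, item B $18",   "click[item B]",       0.0),
    ("item B page, price $18",            "buy[item B]",         1.0),
]

def per_turn_examples(traj, gamma: float = 0.95):
    examples = []
    for t, (prompt, answer, _) in enumerate(traj):
        # Multi-turn: the input is the *entire* history so far, not just the current prompt.
        history = "".join(f"USER: {p}\nAGENT: {a}\n" for p, a, _ in traj[:t])
        context = history + f"USER: {prompt}\n"
        # Credit assignment: discounted sum of this turn's reward and everything after it.
        ret = sum(gamma ** (k - t) * r for k, (_, _, r) in enumerate(traj) if k >= t)
        examples.append({"context": context, "target": answer, "return": ret})
    return examples

for ex in per_turn_examples(trajectory):
    print(f"return={ex['return']:.2f}  context_chars={len(ex['context']):3d}  target={ex['target']}")
```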
- [2504] Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation [Behavior cloning teaches the LLM correct trajectories during training; this method instead corrects the LLM's trajectory at test time using a trained process-reward model. But reality is complex and changing: doesn't the reward model itself still have to keep learning?]
- [2504] ⭐️ τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (ICLR 25 | Shunyu Yao)
- [2504] Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use (Stanford)
- [2504] OTC: Optimal Tool Calls via Reinforcement Learning (CUHK)
- [2503] R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning (RUC)
- [2503] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning (UIUC) (see the search-interleaved rollout sketch after this list)
- [2503] ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning (Baichuan)
- [2504] Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey [Meta Thinking]
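Search-R1, R1-Searcher, and ReSearch above all train the policy to emit search queries inside special tags; at rollout time, generation pauses at the tag, the query is executed, and the retrieved text is spliced back into the context before generation resumes. Below is a minimal sketch of that interleaving with a stubbed model and retriever; the exact tag names differ across the papers, and nothing here is their actual code.

```python
# Minimal sketch of search-interleaved generation (Search-R1 / R1-Searcher style):
# pause at <search>query</search>, run retrieval, append <information>...</information>, resume.
import re

def generate(prompt: str) -> str:
    # Stub LLM: first emits a search call, then an answer once information is present.
    if "<information>" not in prompt:
        return "I need to check this. <search>capital of France</search>"
    return "<answer>Paris</answer>"

def retrieve(query: str) -> str:
    # Stub retriever: a real system would call a search engine or dense retriever.
    return "Paris is the capital and largest city of France."

def rollout(question: str, max_calls: int = 3) -> str:
    text = f"Question: {question}\n"
    for _ in range(max_calls):
        chunk = generate(text)
        text += chunk
        match = re.search(r"<search>(.*?)</search>", chunk, re.DOTALL)
        if match is None:                        # no tool call -> the model has answered
            break
        docs = retrieve(match.group(1))          # execute the emitted query
        text += f"\n<information>{docs}</information>\n"
    return text

print(rollout("What is the capital of France?"))
```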
RAGEN (Training agent)
Search-R1 (Train your LLMs to reason and call a search engine with reinforcement learning)
OpenManus-RL (A live-stream development of RL tuning for LLM agents)
MetaSpatial (Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse)
- Feel free to contribute more papers or any other resources!