Yueen Ma¹, Zixing Song¹, Yuzheng Zhuang², Jianye Hao², Irwin King¹

¹ Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China (Email: {yema21, zxsong, king}@cse.cuhk.edu.hk)

² Huawei Noah's Ark Lab, Shenzhen, China (Email: {zhuangyuzheng, haojianye}@huawei.com)
The official repo of the survey, containing a curated list of papers on Vision-Language-Action Models for Embodied AI.
Feel free to send us pull requests or emails to add papers!
If you find this repository useful, please consider citing, starring, and sharing it with others!
- Definitions
- Latest
- Taxonomy
- Components of VLA
- Low-level Control Policies
- High-level Task Planners
- Related Surveys
- Latest Developments
- Citation
- Generalized VLA
  Input: state, instruction.
  Output: action.
- Large VLA
  A special type of generalized VLA that is adapted from large VLMs (the same sense as the VLA defined by RT-2). A minimal interface sketch follows below.
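For concreteness, here is a minimal, illustrative Python sketch of the generalized VLA interface. The names (`Observation`, `VLAPolicy`, `predict_action`) are hypothetical and do not correspond to any specific model listed below.

```python
# Illustrative sketch only: maps (state, instruction) -> action, as in the
# "Generalized VLA" definition above. All names here are hypothetical.
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    image: np.ndarray           # RGB camera frame, e.g. shape (H, W, 3)
    proprioception: np.ndarray  # robot state, e.g. joint angles / gripper pose


class VLAPolicy:
    """Generalized VLA: (state, instruction) -> action."""

    def predict_action(self, obs: Observation, instruction: str) -> np.ndarray:
        # A large VLA in the RT-2 sense would feed the image and instruction
        # through a pretrained VLM backbone and decode an action vector,
        # e.g. a 7-DoF end-effector delta plus a gripper command.
        raise NotImplementedError
```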
We use various charts to visualize key aspects of VLA developments from 2020 to 2025. To supplement the VLAs discussed in the main text, we employed a hybrid approach, combining automated scripting with manual searching, to retrieve VLA-related papers published between January 2020 and December 2025. We queried the keywords "VLA", "Vision-language-action", and "Vision language action", and filtered out false positives that were not relevant to embodied AI or robotics. This pipeline yielded approximately 400 VLA-related papers. Since the automated step may introduce errors, we welcome feedback and correction requests regarding the included data.
The raw data for these visualizations are available in the data folder (a minimal loading/filtering sketch follows the list):
- data/vla_models.json: VLA model data
- data/institute_abbr.json: Institute abbreviations
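Below is a minimal Python sketch of the keyword-filtering step described above, run against the raw data shipped in this repo. It assumes data/vla_models.json stores a list of entries with a "title" field; the actual schema may differ.

```python
# Minimal sketch of the keyword filtering described above. The JSON schema
# (a list of dicts with a "title" field) is an assumption, not a guarantee.
import json
import re

VLA_KEYWORDS = re.compile(r"\bVLA\b|vision[\s-]language[\s-]action", re.IGNORECASE)


def load_entries(path: str = "data/vla_models.json") -> list:
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def keyword_hits(entries: list) -> list:
    # Keep entries whose title matches the VLA keywords; remaining false
    # positives are pruned manually based on relevance to embodied AI/robotics.
    return [e for e in entries if VLA_KEYWORDS.search(e.get("title", ""))]


if __name__ == "__main__":
    entries = load_entries()
    print(f"{len(keyword_hits(entries))} of {len(entries)} entries match the VLA keywords")
```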
- LeRobot: https://github.com/huggingface/lerobot
- StarVLA: https://github.com/starVLA/starVLA
- EmbodiChain: https://github.com/DexForce/EmbodiChain
There are many other lists related to Embodied AI that are actively being updated. You may also want to check them out:
- Awesome World Models: https://github.com/leofan90/Awesome-World-Models
- Awesome Embodied VLA: https://github.com/jonyzhang2023/awesome-embodied-vla-va-vln
- Awesome LLM Robotics: https://github.com/GT-RIPL/Awesome-LLM-Robotics
- Awesome Physical AI: https://github.com/keon/awesome-physical-ai
- Embodied AI Paper List: https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List
- Awesome RL VLA: https://github.com/Denghaoyuan123/Awesome-RL-VLA
- 3D Gaussian Splatting Papers: https://github.com/Awesome3DGS/3D-Gaussian-Splatting-Papers
- VLM Survey: https://github.com/jingyi0000/VLM_survey
A number of other survey papers on VLA models, embodied AI, robotics, etc. are also available:
- "A Survey on Reinforcement Learning of Vision-Language-Action Models for Robotic Manipulation", Dec 2025 [Paper]
- "An Anatomy of Vision-Language-Action Models- From Modules to Milestones and Challenges", Dec 2025 [Paper]
- "Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications", Oct 2025 [Paper]
- "Vision Language Action Models in Robotic Manipulation: A Systematic Review", Jul 2025 [Paper]
- "A Survey on Vision-Language-Action Models: An Action Tokenization Perspective", Jul 2025 [Paper]
- "Vision-Language-Action Models: Concepts, Progress, Applications and Challenges", May 2025 [Paper]
- "Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models", Dec 2024 [Paper]
- "Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI", Aug 2025 [Paper]
- "Real-World Robot Applications of Foundation Models: A Review", Feb 2024 [Paper]
- "Large Language Models for Robotics: Opportunities, Challenges, and Perspectives", Jan 2024 [Paper]
- "Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis", Dec 2023 [Paper]
- "Foundation Models in Robotics: Applications, Challenges, and the Future", Dec 2023 [Paper]
- "A Survey of Embodied AI: From Simulators to Research Tasks", Jan 2022 [Paper]
- "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", May 2024 [Paper]
- "Understanding the planning of LLM agents: A survey", Feb 2024 [Paper]
- "Foundation Models for Decision Making: Problems, Methods, and Opportunities", Mar 2023 [Paper]
- "Neural Fields in Robotics: A Survey", Oct 2024 [Paper]
- DT: "Decision Transformer: Reinforcement Learning via Sequence Modeling", NeurIPS, 2021 [Paper][Code]
- Trajectory Transformer: "Offline Reinforcement Learning as One Big Sequence Modeling Problem", NeurIPS, 2021 [Paper][Code]
- SEED: "Primitive Skill-based Robot Learning from Human Evaluative Feedback", IROS, 2023 [Paper][Code]
- Reflexion: "Reflexion: Language Agents with Verbal Reinforcement Learning", NeurIPS, 2023 [Paper][Code]
- "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 [Paper][Website][Code]
- Voltron: "Language-Driven Representation Learning for Robotics", RSS, 2023 [Paper]
- VC-1: "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?", NeurIPS, 2023 [Paper][Website][Code]
- "The (Un)surprising Effectiveness of Pre-Trained Vision Models for Control", ICML, 2022 [Paper]
- R3M: "R3M: A Universal Visual Representation for Robot Manipulation", CoRL, 2022 [Paper][Website][Code]
- VIP: "VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training", ICLR, 2023 [Paper][Website][Code]
- DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", Trans. Mach. Learn. Res., 2023 [Paper][Code]
- I-JEPA: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", CVPR, 2023 [Paper]
- Theia: "Theia: Distilling Diverse Vision Foundation Models for Robot Learning", CoRL, 2024 [Paper]
- HRP: "HRP: Human Affordances for Robotic Pre-Training", RSS, 2024 [Paper][Website][Code]
- HPT: "Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers", NeurIPS, 2024 [Paper][Website][Code]
- F3RM: "Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation", CoRL, 2023 [Paper][Website][Code]
- PhysGaussian: "PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics", CVPR, 2024 [Paper][Website][Code]
- UniGS: "UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting", ICLR, 2025 [Paper][Code]
- That Sounds Right: "That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation", CoRL, 2023 [Paper][Code]
- MaskDP: "Masked Autoencoding for Scalable and Generalizable Decision Making", NeurIPS, 2022 [Paper][Code]
- PACT: "PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training", IROS, 2023 [Paper]
- GR-1: "Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation", ICLR, 2024 [Paper]
- SMART: "SMART: Self-supervised Multi-task pretrAining with contRol Transformers", ICLR, 2023 [Paper]
- MIDAS: "Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning", ICML, 2024 [Paper][Website]
- Vi-PRoM: "Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods", IROS, 2023 [Paper][Website]
- VPT: "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos", NeurIPS, 2022 [Paper]
- "A Path Towards Autonomous Machine Intelligence", OpenReview, 2022 [Paper]
- DreamerV1: "Dream to Control: Learning Behaviors by Latent Imagination", ICLR, 2020 [Paper]
- DreamerV2: "Mastering Atari with Discrete World Models", ICLR, 2021 [Paper]
- DreamerV3: "Mastering Diverse Domains through World Models", arXiv, Jan 2023 [Paper]
- DayDreamer: "DayDreamer: World Models for Physical Robot Learning", CoRL, 2022 [Paper]
- TWM: "Transformer-based World Models Are Happy With 100k Interactions", ICLR, 2023 [Paper]
- DECKARD: "Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling", ICML, 2023 [Paper][Website][Code]
- LLM-MCTS: "Large Language Models as Commonsense Knowledge for Large-Scale Task Planning", NeurIPS, 2023 [Paper]
- RAP: "Reasoning with Language Model is Planning with World Model", EMNLP, 2023 [Paper]
- LLM+P: "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency", arXiv, Apr 2023 [Paper][Code]
- LLM-DM: "Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning", NeurIPS, 2023 [Paper][Website][Code]
- E2WM: "Language Models Meet World Models: Embodied Experiences Enhance Language Models", NeurIPS, 2023 [Paper][Code]
- ThinkBot: "ThinkBot: Embodied Instruction Following with Thought Chain Reasoning", arXiv, Dec 2023 [Paper]
- ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR, 2023 [Paper]
- RAT: "RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation", arXiv, Mar 2024 [Paper]
- Tree-Planner: "Tree-Planner: Efficient Close-loop Task Planning with Large Language Models", ICLR, 2024 [Paper]
- ECoT: "Robotic Control via Embodied Chain-of-Thought Reasoning", arXiv, Jul 2024 [Paper]
- CoT-VLA: "CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models", CVPR, 2025 [Paper][Website]
- V-GPS: "Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance", CoRL, 2024 [Paper][Website][Code]
- RoboMonkey: "RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models", arXiv, Oct 2024 [Paper][Website][Code]
- Transporter Networks: "Transporter Networks: Rearranging the Visual World for Robotic Manipulation", CoRL, 2020 [Paper]
- CLIPort: "CLIPort: What and Where Pathways for Robotic Manipulation", CoRL, 2021 [Paper][Website][Code]
- BC-Z: "BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning", CoRL, 2021 [Paper][Website][Code]
- HULC: "What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data", arXiv, Apr 2022 [Paper][Website][Code]
- HULC++: "Grounding Language with Visual Affordances over Unstructured Data", ICRA, 2023 [Paper][Website]
- MCIL: "Language Conditioned Imitation Learning over Unstructured Data", Robotics: Science and Systems, 2021 [Paper][Website]
- UniPi: "Learning Universal Policies via Text-Guided Video Generation", NeurIPS, 2023 [Paper][Website]
- RoboFlamingo: "Vision-Language Foundation Models as Effective Robot Imitators", ICLR, 2024 [Paper][Website][Code]
- ACT: "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", Robotics: Science and Systems, 2023 [Paper]
- RoboCat: "RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation", arXiv, Jun 2023 [Paper]
- Gato: "A Generalist Agent", Trans. Mach. Learn. Res., 2022 [Paper]
- RT-Trajectory: "RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches", ICLR, 2024 [Paper]
- Q-Transformer: "Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions", arXiv, Sep 2023 [Paper]
- Interactive Language: "Interactive Language: Talking to Robots in Real Time", arXiv, Oct 2022 [Paper]
- MT-ACT: "RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking", ICRA, 2024 [Paper][Code]
- Hiveformer: "Instruction-driven history-aware policies for robotic manipulations", CoRL, 2022 [Paper][Website][Code]
- VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", arXiv, Oct 2022 [Paper]
- MOO: "Open-World Object Manipulation using Pre-trained Vision-Language Models", CoRL, 2023 [Paper]
- VER: "Volumetric Environment Representation for Vision-Language Navigation", CVPR, 2024 [Paper][Code]
- RVT: "RVT: Robotic View Transformer for 3D Object Manipulation", CoRL, 2023 [Paper]
- RVT-2: "RVT-2: Learning Precise Manipulation from Few Demonstrations", arXiv, Jun 2024 [Paper]
- RoboUniView: "RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation" [Code]
- PerAct: "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation", CoRL, 2022 [Paper]
- Act3D: "Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation", CoRL, 2023 [Paper][Website][Code]
- MDT: "Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals", Robotics: Science and Systems, 2024 [Paper][Website][Code]
- RDT-1B: "RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation", arXiv, Oct 2024 [Paper][Website][Code]
- Diffusion Policy: "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", Robotics: Science and Systems, 2023 [Paper][Website][Code]
- Octo: "Octo: An Open-Source Generalist Robot Policy", Robotics: Science and Systems, 2024 [Paper][Website][Code]
- SUDD: "Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition", CoRL, 2023 [Paper][Code]
- ScaleDP: "Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation", ICRA, 2025 [Paper][Website][Code]
- 3D Diffuser Actor: "3D Diffuser Actor: Policy Diffusion with 3D Scene Representations", arXiv, Feb 2024 [Paper][Code]
- DP3: "3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations", Robotics: Science and Systems, 2024 [Paper][Website][Code]
- VoxPoser: "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", CoRL, 2023 [Paper][Website][Code]
- Language costs: "Correcting Robot Plans with Natural Language Feedback", Robotics: Science and Systems, 2022 [Paper][Website]
- RoboTAP: "RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation", ICRA, 2024 [Paper][Website]
- ReKep: "ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation", arXiv, Sep 2024 [Paper][Website][Code]
- RoboPoint: "RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics", arXiv, Jun 2024 [Paper][Website][Code]
- PIVOT: "PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs", ICML, 2024 [Paper][Website]
- RT-2: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", CoRL, 2023 [Paper][Website]
- RT-H: "RT-H: Action Hierarchies Using Language", Robotics: Science and Systems, 2024 [Paper][Website]
- RT-X, OXE: "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", arXiv, Oct 2023 [Paper][Website][Code]
- RT-A: "RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation", ICRA, 2025 [Paper][Website]
- OpenVLA: "OpenVLA: An Open-Source Vision-Language-Action Model", CoRL, 2024 [Paper][Website][Code]
- OpenVLA-OFT: "Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success", arXiv, 2025 [Paper][Website][Code]
- TraceVLA: "TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies", ICLR, 2025 [Paper]
- π0: "π0: A Vision-Language-Action Flow Model for General Robot Control", arXiv, Oct 2024 [Paper][Website]
- π0.5: "π0.5: a Vision-Language-Action Model with Open-World Generalization", arXiv, Apr 2025 [Paper][Website]
- RoboMamba: "RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation", NeurIPS, 2024 [Paper][Website]
- SpatialVLA: "SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model", arXiv, 2025 [Paper][Website]
- LAPA: "Latent Action Pretraining from Videos", ICLR, 2025 [Paper][Website][Code]
- TinyVLA: "TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation", arXiv, 2024 [Paper][Website][Code]
- CogACT: "CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation", arXiv, 2024 [Paper][Website][Code]
- DexVLA: "DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control", CoRL, 2025 [Paper][Website][Code]
- HybridVLA: "HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model", arXiv, 2025 [Paper][Website][Code]
- WorldVLA: "WorldVLA: Towards Autoregressive Action World Model", arXiv, Jun 2025 [Paper][Code]
- UniVLA: "Unified Vision-Language-Action Model", arXiv, Jun 2025 [Paper][Website][Code]
- Instruct2Act: "Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model", arXiv, 2023 [Paper][Code]
- VLA-Adapter: "VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model", arXiv, 2025 [Paper][Website][Code]
- SmolVLA: "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics", arXiv, 2025 [Paper]
- UP-VLA: "UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent", arXiv, Jan 2025 [Paper][Code]
- DreamVLA: "DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge", arXiv, Jul 2025 [Paper][Website][Code]
- HiMoE-VLA: "HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies", arXiv, Jul 2025 [Paper][Code]
- InternVLA-M1: "InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy", arXiv, Oct 2025 [Paper][Website][Code]
- (SL)^3: "Skill Induction and Planning with Latent Language", ACL, 2022 [Paper]
- Translated <LM>: "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents", ICML, 2022 [Paper][Code]
- SayCan: "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", CoRL, 2022 [Paper][Website][Code]
- EmbodiedGPT: "EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought", NeurIPS, 2023 [Paper][Code]
- MultiPLY: "MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World", CVPR, 2024 [Paper]
- ShapeLLM: "ShapeLLM: Universal 3D Object Understanding for Embodied Interaction", ECCV, 2024 [Paper][Website][Code]
- ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR, 2023 [Paper][Website][Code]
- Socratic Models: "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language", ICLR, 2023 [Paper]
- LID: "Pre-Trained Language Models for Interactive Decision-Making", NeurIPS, 2022 [Paper][Website][Code]
- Inner Monologue: "Inner Monologue: Embodied Reasoning through Planning with Language Models", arXiv, Jul 2022 [Paper][Website]
- LLM-Planner: "LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models", ICCV, 2023 [Paper][Website]
- ChatGPT for Robotics: "ChatGPT for Robotics: Design Principles and Model Abilities", IEEE Access, 2023 [Paper][Website][Code]
- DEPS: "Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents", arXiv, Feb 2023 [Paper][Code]
- ConceptGraphs: "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning", ICRA, 2024 [Paper][Website][Code]
- CaP: "Code as Policies: Language Model Programs for Embodied Control", ICRA, 2023 [Paper][Website][Code]
- ProgPrompt: "ProgPrompt: Generating Situated Robot Task Plans using Large Language Models", ICRA, 2023 [Paper][Website][Code]
- COME-robot: "Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V", arXiv, Apr 2024 [Paper][Website]
- "Foundation Models in Robotics: Applications, Challenges, and the Future", arXiv, Dec 2023 [Paper]
- "Real-World Robot Applications of Foundation Models: A Review", arXiv, Feb 2024 [Paper]
- "Large Language Models for Robotics: Opportunities, Challenges, and Perspectives", arXiv, Jan 2024 [Paper]
- "Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis", arXiv, Dec 2023 [Paper]
- "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, May 2024 [Paper]
- LLaRA: "LLaRA: Supercharging Robot Learning Data for Vision-Language Policy", ICLR, 2025 [Paper][Code]
- Mobility VLA: "Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs", CoRL, 2024 [Paper]
- GR00T N1: "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots", arXiv, Mar 2025 [Paper][Code]
- Humanoid-VLA: "Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration", arXiv, Feb 2025 [Paper]
- QUAR-VLA: "QUAR-VLA: Vision-Language-Action Model for Quadruped Robots", ECCV, 2024 [Paper]
- QUART-Online: "QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning", ICRA, 2025 [Paper][Website][Code]
- MoRE: "MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models", ICRA, 2025 [Paper]
- DexGraspVLA: "DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping", arXiv, Feb 2025 [Paper][Website][Code]
Thank you for your interest! If you find our work helpful, please consider citing us with:
```bibtex
@article{DBLP:journals/corr/abs-2405-14093,
  author  = {Yueen Ma and
             Zixing Song and
             Yuzheng Zhuang and
             Jianye Hao and
             Irwin King},
  title   = {A Survey on Vision-Language-Action Models for Embodied {AI}},
  journal = {CoRR},
  volume  = {abs/2405.14093},
  year    = {2024}
}
```