Awesome VLA

A Survey on Vision-Language-Action Models for Embodied AI


Yueen Ma¹, Zixing Song¹, Yuzheng Zhuang², Jianye Hao², Irwin King¹

  1. The Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China (Email: {yema21, zxsong, king}@cse.cuhk.edu.hk)

  2. Huawei Noah's Ark Lab, Shenzhen, China (Email: {zhuangyuzheng, haojianye}@huawei.com)

The official repo of the survey, containing a curated list of papers on Vision-Language-Action Models for Embodied AI.

(Figure: overall architecture of VLA models)

Feel free to send us pull requests or emails to add papers!

If you find this repository useful, please consider citing, starring, and sharing it with others!


Definitions

  • Generalized VLA
    Input: state, instruction.
    Output: action.

  • Large VLA
    A special type of generalized VLA that is adapted from a large VLM. (This matches the definition of VLA introduced by RT-2.) A minimal interface sketch is given after the Venn diagram below.

(Figure: Venn diagram relating generalized VLAs and large VLAs)
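
To make the definition of a generalized VLA concrete, the following is a minimal, hypothetical interface sketch in Python; the Observation fields, the act signature, and the 7-DoF action format are illustrative assumptions rather than anything prescribed by the survey.

from dataclasses import dataclass
from typing import Protocol

import numpy as np


@dataclass
class Observation:
    """Hypothetical state container: what the robot perceives at one timestep."""
    rgb: np.ndarray       # camera image, H x W x 3
    proprio: np.ndarray   # joint angles / gripper state


class GeneralizedVLA(Protocol):
    """Generalized VLA: maps (state, instruction) to an action."""

    def act(self, obs: Observation, instruction: str) -> np.ndarray:
        """Return a low-level action, e.g. a 7-DoF end-effector delta plus a gripper command."""
        ...


# A large VLA is the special case where act() is backed by a large VLM
# (the RT-2-style definition): the VLM consumes the image and the instruction
# and decodes action tokens that are mapped back to continuous controls.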

Latest

Trends

We use several charts to visualize key aspects of VLA development from 2020 to 2025. To supplement the VLAs discussed in the main text, we used a hybrid approach that combines automated scripting with manual searching to retrieve VLA-related papers published between January 2020 and December 2025. We queried the keywords "VLA", "Vision-language-action", and "Vision language action", and filtered out false positives based on their relevance to "embodied AI" and "robotics". This pipeline yielded approximately 400 VLA-related papers. Since the automated step can introduce errors, we welcome feedback and correction requests regarding the included data.
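
As a rough illustration of the automated part of this pipeline (not the exact script behind the charts), the sketch below queries the arXiv API for the same keywords using the third-party arxiv package (pip install arxiv) and applies a naive relevance filter; the filter terms, the result cap, and the date-range check are assumptions.

import arxiv

KEYWORDS = ['"vision-language-action"', '"vision language action"', "VLA"]
RELEVANCE = ("robot", "robotic", "embodied")   # assumed filter terms

client = arxiv.Client()
search = arxiv.Search(
    query=" OR ".join(f"all:{kw}" for kw in KEYWORDS),
    max_results=500,                            # assumed cap
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

papers = []
for result in client.results(search):
    text = (result.title + " " + result.summary).lower()
    # Keep 2020-2025 papers that mention robotics / embodied AI, to drop
    # false positives where "VLA" means something else entirely.
    if 2020 <= result.published.year <= 2025 and any(t in text for t in RELEVANCE):
        papers.append((result.published.date(), result.title))

print(f"{len(papers)} candidate VLA papers (manual verification still needed)")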

The raw data for these visualizations are available in the data folder:

(Figures: Timeline_2025, Landscape, Stats, Institutes)
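
For example, assuming the data folder contains a file like data/papers.csv with a date column in YYYY-MM format (a hypothetical file name and schema; check the folder for the actual layout), the yearly paper-count chart could be reproduced roughly as follows:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical schema: one row per paper, with at least a "date" column (YYYY-MM).
df = pd.read_csv("data/papers.csv", parse_dates=["date"])

# Count papers per year and plot a simple bar chart.
counts = df["date"].dt.to_period("Y").value_counts().sort_index()
counts.plot(kind="bar", title="VLA-related papers per year (2020-2025)")
plt.xlabel("year")
plt.ylabel("number of papers")
plt.tight_layout()
plt.savefig("vla_trend.png", dpi=200)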

VLA Development Platforms

Related Repositories

There are many other lists related to Embodied AI that are actively being updated. You may also want to check them out:

Related Surveys

A number of other survey papers on VLA models, embodied AI, robotics, etc. are also available:

VLA

  • "A Survey on Reinforcement Learning of Vision-Language-Action Models for Robotic Manipulation", Dec 2025 [Paper]
  • "An Anatomy of Vision-Language-Action Models- From Modules to Milestones and Challenges", Dec 2025 [Paper]
  • "Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications", Oct 2025 [Paper]
  • "Vision Language Action Models in Robotic Manipulation: A Systematic Review", Jul 2025 [Paper]
  • "A Survey on Vision-Language-Action Models: An Action Tokenization Perspective", Jul 2025 [Paper]
  • "Vision-Language-Action Models: Concepts, Progress, Applications and Challenges", May 2025 [Paper]
  • "Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models", Dec 2024 [Paper]

Robotics & Embodied AI

  • "Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI", Aug 2025 [Paper]
  • "Real-World Robot Applications of Foundation Models: A Review", Feb 2024 [Paper]
  • "Large Language Models for Robotics: Opportunities, Challenges, and Perspectives", Jan 2024 [Paper]
  • "Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis", Dec 2023 [Paper]
  • "Foundation Models in Robotics: Applications, Challenges, and the Future", Dec 2023 [Paper]
  • "A Survey of Embodied AI: From Simulators to Research Tasks", Jan 2022 [Paper]

Others

  • "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", May 2024 [Paper]
  • "Understanding the planning of LLM agents: A survey", Feb 2024 [Paper]
  • "Foundation Models for Decision Making: Problems, Methods, and Opportunities", Mar 2023 [Paper]
  • "Neural Fields in Robotics: A Survey", Oct 2024 [Paper]

Taxonomy

(Figure: taxonomy of VLA models)

Timelines

(Figure: timelines of VLA models)

Components of VLA

Reinforcement Learning

  • DT: "Decision Transformer: Reinforcement Learning via Sequence Modeling", NeurIPS, 2021 [Paper][Code]
  • Trajectory Transformer: "Offline Reinforcement Learning as One Big Sequence Modeling Problem", NeurIPS, 2021 [Paper][Code]
  • SEED: "Primitive Skill-based Robot Learning from Human Evaluative Feedback", IROS, 2023 [Paper][Code]
  • Reflexion: "Reflexion: Language Agents with Verbal Reinforcement Learning", NeurIPS, 2023 [Paper][Code]

Pretrained Visual Representations

  • "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 [Paper][Website][Code]
  • MVP: "Real-World Robot Learning with Masked Visual Pre-training", CoRL, 2022 [Paper][Website][Code]
  • Voltron: "Language-Driven Representation Learning for Robotics", RSS, 2023 [Paper]
  • VC-1: "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?", NeurIPS, 2023 [Paper][Website][Code]
  • "The (Un)surprising Effectiveness of Pre-Trained Vision Models for Control", ICML, 2022 [Paper]
  • R3M: "R3M: A Universal Visual Representation for Robot Manipulation", CoRL, 2022 [Paper][Website][Code]
  • VIP: "VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training", ICLR, 2023 [Paper][Website][Code]
  • DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", Trans. Mach. Learn. Res., 2023 [Paper][Code]
  • RPT: "Robot Learning with Sensorimotor Pre-training", CoRL, 2023 [Paper][Website]
  • I-JEPA: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", CVPR, 2023 [Paper]
  • Theia: "Theia: Distilling Diverse Vision Foundation Models for Robot Learning", CoRL, 2024 [Paper]

  • HRP: "HRP: Human Affordances for Robotic Pre-Training", RSS, 2024 [Paper][Website][Code]

  • HPT: "Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers", NeurIPS, 2024 [Paper][Website][Code]

Video Representations

  • F3RM: "Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation", CoRL, 2023 [Paper][Website][Code]
  • PhysGaussian: "PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics", CVPR, 2024 [Paper][Website][Code]
  • UniGS: "UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting", ICLR, 2025 [Paper][Code]
  • That Sounds Right: "That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation", CoRL, 2023 [Paper][Code]

Dynamics Learning

  • MaskDP: "Masked Autoencoding for Scalable and Generalizable Decision Making", NeurIPS, 2022 [Paper][Code]
  • PACT: "PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training", IROS, 2023 [Paper]
  • GR-1: "Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation", ICLR, 2024 [Paper]
  • SMART: "SMART: Self-supervised Multi-task pretrAining with contRol Transformers", ICLR, 2023 [Paper]
  • MIDAS: "Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning", ICML, 2024 [Paper][Website]
  • Vi-PRoM: "Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods", IROS, 2023 [Paper][Website]
  • VPT: "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos", NeurIPS, 2022 [Paper]

World Models

  • "A Path Towards Autonomous Machine Intelligence", OpenReview, 2022 [Paper]
  • DreamerV1: "Dream to Control: Learning Behaviors by Latent Imagination", ICLR, 2020 [Paper]
  • DreamerV2: "Mastering Atari with Discrete World Models", ICLR, 2021 [Paper]
  • DreamerV3: "Mastering Diverse Domains through World Models", arXiv, Jan 2023 [Paper]
  • DayDreamer: "DayDreamer: World Models for Physical Robot Learning", CoRL, 2022 [Paper]
  • TWM: "Transformer-based World Models Are Happy With 100k Interactions", ICLR, 2023 [Paper]
  • IRIS: "Transformers are Sample-Efficient World Models", ICLR, 2023 [Paper][Code]

LLM-induced World Models

  • DECKARD: "Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling", ICML, 2023 [Paper][Website][Code]
  • LLM-MCTS: "Large Language Models as Commonsense Knowledge for Large-Scale Task Planning", NeurIPS, 2023 [Paper]
  • RAP: "Reasoning with Language Model is Planning with World Model", EMNLP, 2023 [Paper]
  • LLM+P: "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency", arXiv, Apr 2023 [Paper][Code]
  • LLM-DM: "Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning", NeurIPS, 2023 [Paper][Website][Code]

Visual World Models

  • E2WM: "Language Models Meet World Models: Embodied Experiences Enhance Language Models", NeurIPS, 2023 [Paper][Code]
  • Genie: "Genie: Generative Interactive Environments", ICML, 2024 [Paper][Website]
  • 3D-VLA: "3D-VLA: A 3D Vision-Language-Action Generative World Model", ICML, 2024 [Paper][Code]
  • UniSim: "Learning Interactive Real-World Simulators", ICLR, 2024 [Paper][Code]

Reasoning

  • ThinkBot: "ThinkBot: Embodied Instruction Following with Thought Chain Reasoning", arXiv, Dec 2023 [Paper]
  • ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR, 2023 [Paper]
  • RAT: "RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation", arXiv, Mar 2024 [Paper]
  • Tree-Planner: "Tree-Planner: Efficient Close-loop Task Planning with Large Language Models", ICLR, 2024 [Paper]
  • ECoT: "Robotic Control via Embodied Chain-of-Thought Reasoning", arXiv, Jul 2024 [Paper]
  • CoT-VLA: "CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models", CVPR, 2025 [Paper][Website]

Policy Steering

  • V-GPS: "Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance", CoRL, 2024 [Paper][Website][Code]
  • RoboMonkey: "RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models", arXiv, Oct 2024 [Paper][Website][Code]

Low-level Control Policies

Control Policy Architectures

Non-Transformer Control Policies

  • Transporter Networks: "Transporter Networks: Rearranging the Visual World for Robotic Manipulation", CoRL, 2020 [Paper]
  • CLIPort: "CLIPort: What and Where Pathways for Robotic Manipulation", CoRL, 2021 [Paper][Website][Code]
  • BC-Z: "BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning", CoRL, 2021 [Paper][Website][Code]
  • HULC: "What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data", arXiv, Apr 2022 [Paper][Website][Code]
  • HULC++: "Grounding Language with Visual Affordances over Unstructured Data", ICRA, 2023 [Paper][Website]
  • MCIL: "Language Conditioned Imitation Learning over Unstructured Data", Robotics: Science and Systems, 2021 [Paper][Website]
  • UniPi: "Learning Universal Policies via Text-Guided Video Generation", NeurIPS, 2023 [Paper][Website]

Transformer-based Control Policies

  • RoboFlamingo: "Vision-Language Foundation Models as Effective Robot Imitators", arXiv, Jan 2025 [Paper][Website][Code]
  • ACT: "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", Robotics: Science and Systems, 2023 [Paper]
  • RoboCat: "RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation", arXiv, Jun 2023 [Paper]
  • Gato: "A Generalist Agent", Trans. Mach. Learn. Res., 2022 [Paper]
  • RT-Trajectory: "RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches", ICLR, 2024 [Paper]
  • Q-Transformer: "Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions", arXiv, Sep 2023 [Paper]
  • Interactive Language: "Interactive Language: Talking to Robots in Real Time", arXiv, Oct 2022 [Paper]
  • RT-1: "RT-1: Robotics Transformer for Real-World Control at Scale", RSS, 2023 [Paper][Website]
  • MT-ACT: "RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking", ICRA, 2024 [Paper][Code]
  • Hiveformer: "Instruction-driven history-aware policies for robotic manipulations", CoRL, 2022 [Paper][Website][Code]

Control Policies for Multimodal Instructions

  • VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", arXiv, Oct 2022 [Paper]
  • MOO: "Open-World Object Manipulation using Pre-trained Vision-Language Models", CoRL, 2023 [Paper]

Control Policies with 3D Vision

  • VER: "Volumetric Environment Representation for Vision-Language Navigation", CVPR, 2024 [Paper][Code]
  • RVT: "RVT: Robotic View Transformer for 3D Object Manipulation", CoRL, 2023 [Paper]
  • RVT-2: "RVT-2: Learning Precise Manipulation from Few Demonstrations", arXiv, Jun 2024 [Paper]
  • RoboUniView: "RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation", arXiv, 2024 [Paper][Code]
  • PerAct: "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation", CoRL, 2022 [Paper]
  • Act3D: "Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation", CoRL, 2023 [Paper][Website][Code]

Diffusion-based Control Policies

  • MDT: "Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals", Robotics: Science and Systems, 2024 [Paper][Website][Code]
  • RDT-1B: "RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation", arXiv, Oct 2024 [Paper][Website][Code]
  • Diffusion Policy: "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", Robotics: Science and Systems, 2023 [Paper][Website][Code]
  • Octo: "Octo: An Open-Source Generalist Robot Policy", Robotics: Science and Systems, 2024 [Paper][Website][Code]
  • SUDD: "Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition", CoRL, 2023 [Paper][Code]
  • ScaleDP: "Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation", ICRA, 2025 [Paper][Website][Code]

Diffusion-based Control Policies with 3D Vision

  • 3D Diffuser Actor: "3D Diffuser Actor: Policy Diffusion with 3D Scene Representations", arXiv, Feb 2024 [Paper][Code]
  • DP3: "3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations", Robotics: Science and Systems, 2024 [Paper][Website][Code]

Control Policies for Motion Planning

  • VoxPoser: "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", CoRL, 2023 [Paper][Website][Code]
  • Language costs: "Correcting Robot Plans with Natural Language Feedback", Robotics: Science and Systems, 2022 [Paper][Website]
  • RoboTAP: "RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation", ICRA, 2024 [Paper][Website]

Control Policies with Point-based Action

  • ReKep: "ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation", arXiv, Sep 2024 [Paper][Website][Code]
  • RoboPoint: "RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics", arXiv, Jun 2024 [Paper][Website][Code]
  • PIVOT: "PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs", ICML, 2024 [Paper][Website]

Large VLA

  • RT-2: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", CoRL, 2023 [Paper][Website]
  • RT-H: "RT-H: Action Hierarchies Using Language", Robotics: Science and Systems, 2024 [Paper][Website]
  • RT-X, OXE: "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", arXiv, Oct 2023 [Paper][Website][Code]
  • RT-A: "RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation", ICRA, 2025 [Paper][Website]
  • OpenVLA: "OpenVLA: An Open-Source Vision-Language-Action Model", CoRL, 2024 [Paper][Website][Code]
  • OpenVLA-OFT: "Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success", arXiv, 2025 [Paper][Website][Code]
  • TraceVLA: "TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies", ICLR, 2025 [Paper]
  • π0: "π0: A Vision-Language-Action Flow Model for General Robot Control", arXiv, Oct 2024 [Paper][Website]
  • π0.5: "π0.5: a Vision-Language-Action Model with Open-World Generalization", arXiv, Apr 2025 [Paper][Website]

  • RoboMamba: "RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation", NeurIPS, 2024 [Paper][Website]

  • SpatialVLA: "SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model", arXiv, 2025 [Paper][Website]

  • LAPA: "Latent Action Pretraining from Videos", ICLR, 2025 [Paper][Website][Code]

  • TinyVLA: "TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation", arXiv, 2024 [Paper][Website][Code]

  • CogACT: "CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation", arXiv, 2024 [Paper][Website][Code]

  • DexVLA: "DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control", CoRL, 2025 [Paper][Website][Code]

  • HybridVLA: "HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model", arXiv, 2025 [Paper][Website][Code]

  • WorldVLA: "WorldVLA: Towards Autoregressive Action World Model", arXiv, Jun 2025 [Paper][Code]

  • UniVLA: "Unified Vision-Language-Action Model", arXiv, Jun 2025 [Paper][Website][Code]

  • Instruct2Act: "Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model", arXiv, 2023 [Paper][Code]

  • VLA-Adapter: "VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model", arXiv, 2025 [Paper][Website][Code]
  • SmolVLA: "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics", arXiv, 2025 [Paper]
  • UP-VLA: "UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent", arXiv, Jan 2025 [Paper][Code]
  • DreamVLA: "DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge", arXiv, Jul 2025 [Paper][Website][Code]
  • HiMoE-VLA: "HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies", arXiv, Jul 2025 [Paper][Code]
  • InternVLA-M1: "InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy", arXiv, Oct 2025 [Paper][Website][Code]

High-level Task Planners

(Figure: hierarchical policy combining a high-level task planner with a low-level control policy)

Monolithic Task Planners

Grounded Task Planners

  • (SL)^3: "Skill Induction and Planning with Latent Language", ACL, 2022 [Paper]
  • Translated <LM>: "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents", ICML, 2022 [Paper][Code]
  • SayCan: "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", CoRL, 2022 [Paper][Website][Code]

End-to-end Task Planners

  • EmbodiedGPT: "EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought", NeurIPS, 2023 [Paper][Code]
  • PaLM-E: "PaLM-E: An Embodied Multimodal Language Model", ICML, 2023 [Paper][Website]

End-to-end Task Planners with 3D Vision

  • MultiPLY: "MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World", CVPR, 2024 [Paper]
  • 3D-LLM: "3D-LLM: Injecting the 3D World into Large Language Models", NeurIPS, 2023 [Paper][Website]
  • LEO: "An Embodied Generalist Agent in 3D World", ICML, 2024 [Paper][Website][Code]
  • ShapeLLM: "ShapeLLM: Universal 3D Object Understanding for Embodied Interaction", ECCV, 2024 [Paper][Website][Code]

Modular Task Planners

(Figure: modular task planners)

Language-based Task Planners

  • ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR, 2023 [Paper][Website][Code]
  • Socratic Models: "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language", ICLR, 2023 [Paper]
  • LID: "Pre-Trained Language Models for Interactive Decision-Making", NeurIPS, 2022 [Paper][Website][Code]
  • Inner Monologue: "Inner Monologue: Embodied Reasoning through Planning with Language Models", arXiv, Jul 2022 [Paper][Website]
  • LLM-Planner: "LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models", ICCV, 2023 [Paper][Website]

Code-based Task Planners

  • ChatGPT for Robotics: "ChatGPT for Robotics: Design Principles and Model Abilities", IEEE Access, 2023 [Paper][Website][Code]
  • DEPS: "Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents", arXiv, Feb 2023 [Paper][Code]
  • ConceptGraphs: "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning", ICRA, 2024 [Paper][Website][Code]
  • CaP: "Code as Policies: Language Model Programs for Embodied Control", ICRA, 2023 [Paper][Website][Code]
  • ProgPrompt: "ProgPrompt: Generating Situated Robot Task Plans using Large Language Models", ICRA, 2023 [Paper][Website][Code]
  • COME-robot: "Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V", arXiv, Apr 2024 [Paper][Website]


Latest Developments

  • LLaRA: "LLaRA: Supercharging Robot Learning Data for Vision-Language Policy", ICLR, 2025 [Paper][Code]

  • Mobility VLA: "Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs", CoRL, 2024 [Paper]

Humanoid Robot

  • GR00T N1: "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots", arXiv, Mar 2025 [Paper][Code]

  • Humanoid-VLA: "Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration", arXiv, Feb 2025 [Paper]

Quadruped Robot

  • QUAR-VLA: "QUAR-VLA: Vision-Language-Action Model for Quadruped Robots", ECCV, 2024 [Paper]

  • QUART-Online: "QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning", ICRA, 2025 [Paper][Website][Code]

  • MoRE: "MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models", ICRA, 2025 [Paper]

Dexterous Hand

  • DexGraspVLA: "DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping", arXiv, Feb 2025 [Paper][Website][Code]


Citation

Thank you for your interest! If you find our work helpful, please consider citing us with:

@article{DBLP:journals/corr/abs-2405-14093,
  author       = {Yueen Ma and
                  Zixing Song and
                  Yuzheng Zhuang and
                  Jianye Hao and
                  Irwin King},
  title        = {A Survey on Vision-Language-Action Models for Embodied {AI}},
  journal      = {CoRR},
  volume       = {abs/2405.14093},
  year         = {2024}
}
