mll-lab-nu/Awesome-Spatial-Intelligence-in-VLM

Awesome Spatial Intelligence in VLMs

This carefully curated list brings together key methods, datasets, and benchmarks in the field of spatial intelligence for VLMs.

With the development of multimodal models, evaluating and enhancing their spatial intelligence has become a key research frontier. This list aims to provide researchers and engineers with a quick index to track the latest advancements in the field.

Contributions are welcome: if you find an excellent resource, please open a Pull Request!

Table of Contents

Methods

Visual-based methods

Title Introduction Date Code

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
image 2025-11 -

Cambrian-S: Towards Spatial Supersensing in Video
image 2025-11 Github

Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images
image 2025-11 Github

SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
image 2025-11 Github

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
image 2025-10 Github

Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
image 2025-10 Github

Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
image 2025-10 Github

Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
image 2025-10 Github

SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
image 2025-10 Github

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
image 2025-10 Github
See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model
image 2025-09 -

3D Aware Region Prompted Vision Language Model
image 2025-09 Github

UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding
image 2025-08 Github

SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
image 2025-08 Github

Enhancing Spatial Reasoning through Visual and Textual Thinking
image 2025-07 -
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
image 2025-07 Github
Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models
image 2025-06 Github
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
image 2025-06 Github
SpatialLM: Training Large Language Models for Structured Indoor Modeling
image 2025-06 Github

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data
image 2025-06 Github

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
image 2025-06 Github

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
image 2025-05 Github
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
image 2025-05 Github
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
image 2025-05 Github
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
image 2025-05 Github
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
image 2025-05 Github

STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
image 2025-05 Github

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
image 2025-05 Github

LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding
image 2025-05 -
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
image 2025-05 Github
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
image 2025-04 Github

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
image 2025-04 Github

Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning
image 2025-04 Github
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
image 2025-04 Github
ROSS3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
image 2025-04 Github
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
image 2025-04 -

Visual Agentic AI for Spatial Reasoning with a Dynamic API
image 2025-02 Github

SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning
image 2025-01 Github

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
image 2024-08 Github
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
image 2024-06 Github
SpatialBot: Precise Spatial Understanding with Vision Language Models
image 2024-06 Github
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
image 2024-04 Github

SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
image 2024-03 -
Can Transformers Capture Spatial Relations between Objects?
image 2024-03 Github
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
image 2024-01 Github

Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
image 2024-01 Github

3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V
image 2023-12 -

Text-based methods

Title Introduction Date Code

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
image 2025-01 -

Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs
image 2024-11 -
Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning
image 2024-10 -
SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models
image 2024-06 Github
Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models
image 2024-02 -
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark
image 2024-01 Github

Datasets & Benchmarks

Visual-based data

Title Introduction Date Code

Visual Spatial Tuning
image 2025-11 Github

DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
image 2025-10 Github

Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
image 2025-10 -

NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
image 2025-10 Github

SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
image 2025-09 Github

Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
image 2025-09 Github

Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture
image 2025-09 Github

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
image 2025-09 Github
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
image 2025-08 Github

11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis
image 2025-08 -
Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting
image 2025-07 Github

Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models
image 2025-07 -

SpatialViz-Bench: An MLLM Benchmark for Spatial Visualization
image 2025-07 Github

Spatial Mental Modeling from Limited Views
image 2025-06 Github

SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
image 2025-06 -
IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
image 2025-06 Github
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
image 2025-06 Github

Can Vision Language Models Infer Human Gaze Direction? A Controlled Study
image 2025-06 Github

SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
image 2025-06 Github

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
image 2025-06 Github

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
image 2025-06 Github

InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
image 2025-06 -

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
image 2025-05 Github

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
image 2025-05 Github

SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
image 2025-05 Github

MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence
image 2025-05 Github

Can Multimodal Large Language Models Understand Spatial Relations?
image 2025-05 Github

Visuospatial Cognitive Assistant
image 2025-05 Github

Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
image 2025-05 Github

Vision language models have difficulty recognizing virtual objects
image 2025-05 -

ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
image 2025-05 Github

Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames
image 2025-05 -
SITE: towards Spatial Intelligence Thorough Evaluation
image 2025-05 Github

CameraBench: Towards Understanding Camera Motions in Any Video
image 2025-04 Github

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
image 2025-04 Github

From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
image 2025-03 Github

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLM
image 2025-03 -

Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
image 2025-03 Github
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
image 2025-03 Github
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
image 2025-03 Github

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
image 2025-03 Github

LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
image 2025-03 Github
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
image 2025-02 Github

FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks
image 2025-02 -

iVISPAR — An Interactive Visual-Spatial Reasoning Benchmark for VLMs
image 2025-02 Github

Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics
image 2025-02 -
SAT: Spatial Aptitude Training for Multimodal Language Models
image 2024-12 Github
SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models
image 2024-12 Github
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
image 2024-12 Github
Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces
image 2024-12 Github
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
image 2024-11 -
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models
image 2024-11 -

Is ‘Right’ Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning
image 2024-10 Github
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
image 2024-10 Github
Does Spatial Cognition Emerge in Frontier Models?
image 2024-10 -
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
image 2024-10 Github

R2D3: Imparting Spatial Reasoning by Reconstructing 3D Scenes from 2D Images
image 2024-10 Github
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models
image 2024-09 Github
Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?
image 2024-09 -

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs
image 2024-07 Github
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models
image 2024-06 Github
TopViewRS: Vision-Language Models as Top-View Spatial Reasoners
image 2024-06 Github
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
image 2024-06 Github
GSR-Bench: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs
image 2024-06 -
Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning
image 2024-05 Github
Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
image 2022-03 Github
SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings
image 2020-03 Github

Text-based data

Title Introduction Date Code

Do Multimodal Language Models Really Understand Direction? A Benchmark for Compass Direction Reasoning
image 2024-12 -

GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning
image 2024-07 -

Findings

Title Introduction Date Code

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
image 2025-10 Github

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
image 2025-09 Github

Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
image 2025-08 -

A Call for New Recipes to Enhance Spatial Reasoning in MLLMs
image 2025-03 Github

Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
image 2025-03 Github

Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
image 2025-03 Github

Applications

Title Introduction Date Code

Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model
image 2025-10 Github

SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models
image 2025-05 -

InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning
image 2025-05 Github

EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks
image 2025-03 -

SOFAR: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
image 2025-02 Github

ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
image 2025-03 Github

VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning
image 2025-02 -

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models
image 2025-01 Github
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
image 2024-12 Github

EMMA-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
image 2024-12 Github
Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
image 2024-04 Github
Improving Vision-and-Language Reasoning via Spatial Relations Modeling
image 2023-11 -
