Baiqiao Yin1,3*, Qineng Wang1*‡, Pingyue Zhang1, Jianshu Zhang1, Kangrui Wang1, Zihan Wang1, Jieyu Zhang4, Keshigeyan Chandrasegaran2, Han Liu1, Ranjay Krishna4, Saining Xie3, Manling Li†1, Jiajun Wu2†, Li Fei-Fei2†
*Equal contribution, ‡Project Lead, †Equal advising
1Northwestern University, 2Stanford University, 3New York University, 4University of Washington
- [2025-06-26] Our paper is now available on arXiv; check it out here.
- [2025-06-24] Our website is online; check it out here.
- [2025-06-23] We open-sourced the MindCube framework and dataset.
MindCube is a modular framework for generating and evaluating spatial reasoning datasets for multimodal AI models. The project follows a complete pipeline from raw data to model evaluation, with specialized modules for scaffold data curation, prompt generation, model inference, training, and comprehensive evaluation.
Follow these steps to set up your development environment. This process will create an isolated Python environment with all necessary dependencies for running MindCube.
git clone git@github.com:mll-lab-nu/MindCube.git
cd MindCube

First, we'll create a dedicated conda environment to avoid conflicts with other projects:
conda create -n mindcube python=3.10 -y
conda activate mindcube

Next, install PyTorch with CUDA support. Make sure to adjust the CUDA version according to your system:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124 # change to your CUDA version

Finally, install flash-attention and the remaining required dependencies:
pip install flash-attn==2.7.4.post1 --no-build-isolation
pip install -r requirements.txt

Once your environment is ready, download the MindCube dataset, which contains the spatial reasoning questions and images:
bash scripts/bash_scripts/download_data.bash

The data generation process transforms raw spatial reasoning data into structured formats suitable for model training and evaluation.
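Before running the generation steps, you can optionally confirm that the raw files are in place. This is just a quick sanity check; the data/raw/ paths are the same ones used by the processing commands below:

ls data/raw/ # expect files such as MindCube_train.jsonl and MindCube_tinybench.jsonl
wc -l data/raw/*.jsonl # number of examples per split (one JSON object per line)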
For convenience, use this single command to generate all required data formats:
bash scripts/bash_scripts/generate_eval_data.bash

If you prefer to understand each step or need fine-grained control, follow these detailed steps:
Step 1: Scaffold Data Generation
This step processes raw JSONL files and generates cognitive maps and reasoning chains that serve as scaffolds for spatial understanding:
python scripts/data_processing.py \
--input data/raw/MindCube_train.jsonl \
--task full_pipeline
python scripts/data_processing.py \
--input data/raw/MindCube_tinybench.jsonl \
--task full_pipeline

Step 2: General Prompts Generation
Now we create various prompt formats (8 different task types) that will be used for model training and evaluation:
python scripts/generate_prompts.py \
--input data/scaffold/all/MindCube_train.jsonl \
--all_tasks
python scripts/generate_prompts.py \
--input data/scaffold/all/MindCube_tinybench.jsonl \
--all_tasks

Step 3: Model Format Data Transformation
Finally, convert the general prompts into model-specific formats. Currently, we support Qwen2.5VL format:
python scripts/convert_to_sft.py \
--input_dir data/prompts/general/ \
--model qwen2.5vl # Currently, we only support the Qwen2.5-VL format

After completing these steps, you should see the following directory structure:
data/scaffold/all: 2 files
data/prompts/general: 16 files
data/prompts/training/qwen2.5vl: 16 files
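If you want to spot-check the generated prompts, you can peek at one of the files listed above. This is only a sketch; the exact JSONL fields are whatever the generation scripts emit, so inspect them rather than assume a schema:

wc -l data/prompts/general/*.jsonl # one prompt per line in each task file
head -n 1 data/prompts/general/MindCube_tinybench_raw_qa.jsonl | python -m json.tool # keys of the first record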
With your data prepared, you can now run inference using pre-trained vision-language models without any fine-tuning.
Run inference on all task configurations simultaneously for comprehensive evaluation:
bash scripts/bash_scripts/run_frozen_vlm_all_tasks_qwen.sh --max-tasks-per-gpu 2 # You can adjust this number based on your GPU's capacity (default is 1)

For more control, or when working with limited GPU memory, run inference on specific tasks:
python scripts/run_inference.py \
--model-type qwen2.5vl \
--input-file data/prompts/general/MindCube_tinybench_raw_qa.jsonl \
--output-dir data/results/frozen_vlm
# you can adjust input-file and output-dir here

The inference results will be saved in structured directories for easy analysis:
data/results/frozen_vlm/: n jsonl files (based on your command)
logs/inference: All frozen inference logs
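As a rough sanity check (assuming one JSON object per line), you can confirm that each output file contains one response per input prompt:

ls data/results/frozen_vlm/
wc -l data/results/frozen_vlm/*.jsonl # should match the line counts of the corresponding prompt files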
After obtaining model predictions, evaluate the performance using our comprehensive evaluation metrics.
Evaluate all inference results at once for a complete performance overview:
bash scripts/bash_scripts/run_batch_evaluation.sh data/results/frozen_vlm/ # You can adjust the path to the JSONL files you would like to evaluate

For detailed analysis of specific models or tasks, run evaluation individually:
python scripts/run_evaluation.py \
-i data/results/frozen_vlm/MindCube_tinybench_raw_qa_qwen2.5-vl-3b-instruct_responses.jsonl \
-o data/evaluate/frozen_vlm/MindCube_tinybench_raw_qa_qwen2.5-vl-3b-instruct_responses_eval_results.json
# you can adjust the input and output here

Evaluation results will be organized for easy interpretation and comparison:
data/evaluate/frozen_vlm: n json files (based on your command)
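To read a result without writing any code, pretty-print one of the JSON files; the exact schema is defined by the evaluation script, so treat the fields you see there as authoritative rather than this sketch:

python -m json.tool data/evaluate/frozen_vlm/MindCube_tinybench_raw_qa_qwen2.5-vl-3b-instruct_responses_eval_results.json | head -n 30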
This section guides you through supervised fine-tuning (SFT) to adapt pre-trained models specifically for spatial reasoning tasks.
Install ffmpeg if you have not installed it yet:
conda install -c conda-forge ffmpeg -y # skip this if ffmpeg is already installed on your device

We need the specialized Qwen2.5-VL repository that contains our custom modifications for MindCube training:
git clone git@github.com:QinengWang-Aiden/Qwen2.5-VL-MindCube.git

This step integrates the MindCube datasets into the Qwen training pipeline.
First, let's check if the MindCube datasets are properly registered in the training system:
python experiments/sft/patch_qwen_data.py verify

Expected output:
Project root: /path/to/your/MindCube
Target file: /path/to/your/MindCube/Qwen2.5-VL-MindCube/qwen-vl-finetune/qwenvl/data/__init__.py
Command: verify
Found 0/6 MindCube datasets in data_dict:
Missing datasets:
❌ raw_qa
❌ plain_cgmap_out
❌ ff_rsn
❌ aug_cgmap_out
❌ aug_cgmap_ffr_out
❌ plain_cgmap_ffr_out

Now apply the patches to enable MindCube dataset support in the training pipeline:
python experiments/sft/patch_qwen_data.py patch

Expected output:
✅ Successfully patched Qwen __init__.py with MindCube datasets
Found 6/6 MindCube datasets in data_dict:
✅ raw_qa
✅ aug_cgmap_out
✅ plain_cgmap_out
✅ ff_rsn
✅ aug_cgmap_ffr_out
✅ plain_cgmap_ffr_out

Before starting training, you may want to adjust the configuration based on your hardware and training preferences.
Configure your GPU settings according to your available hardware:
# Hardware configuration
GPU_DEVICES="0" # Modify based on available GPUs (e.g., "0,1,2,3" for 4 GPUs)
NUM_PROCESSES=1 # Should match number of GPUs
BATCH_SIZE=1 # Per-device batch size (adjust based on GPU memory)

Adjust the training hyperparameters for optimal performance on your specific task:
# Training hyperparameters
LEARNING_RATE=1e-5
NUM_EPOCHS=3
# Output configuration
OUTPUT_BASE_DIR="experiments/sft/results"
RUN_NAME="qwen2vl-${TASK_NAME}_sft"
# Additional training arguments
MAX_PIXELS=90000
MIN_PIXELS=784
MODEL_MAX_LENGTH=8192
SAVE_STEPS=5
SAVE_TOTAL_LIMIT=12

Now we're ready to start the actual training process. The model will learn to better understand spatial relationships through supervised fine-tuning.
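As a point of reference, a multi-GPU run might set the hardware variables like this (illustrative values only, reusing the variable names from the configuration above):

# Example: four GPUs, one process per GPU, a slightly larger per-device batch
GPU_DEVICES="0,1,2,3"
NUM_PROCESSES=4 # should match the number of GPUs listed above
BATCH_SIZE=2 # raise or lower until it fits in GPU memory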
Train on all task types sequentially for comprehensive spatial reasoning capabilities:
bash scripts/bash_scripts/run_sft_all_tasks_qwen.sh

For focused training, or when resources are constrained, train on a specific task:
bash experiments/sft/train_qwen_sft.sh config_raw_qa.sh # or replace with any valid task config

Training artifacts will be organized for easy access and model deployment:
checkpoints/sft/: List of all tasks saved checkpoints
logs/sft_training/: All training logs
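You can also verify that checkpoints are being written as expected; raw_qa and checkpoint-5 below are just examples and match the checkpoint path used in the inference command further down:

ls checkpoints/sft/ # one sub-directory per trained task
ls checkpoints/sft/raw_qa/ # e.g., checkpoint-5, checkpoint-10, ... (depends on SAVE_STEPS)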
After training, test your fine-tuned models on the evaluation datasets to measure improvement.
Run inference using all trained checkpoints for comprehensive evaluation:
bash scripts/bash_scripts/run_sft_ckpt_inference_qwen.sh --max-tasks-per-gpu 2 # You can adjust this number based on your GPU's capacity (default is 1)

Test specific checkpoints for detailed analysis:
python scripts/run_inference.py \
--model-type qwen2.5vl \
--model-path checkpoints/sft/raw_qa/checkpoint-5 \
--input-file data/prompts/general/MindCube_tinybench_raw_qa.jsonl \
--output-dir data/results/sft/raw_qa
# you can adjust input-file and output-dir here

Fine-tuned model results will be organized by task for easy comparison with baseline models:
data/results/sft/: a list of task name directories
data/results/sft/<task_name>: a list of jsonl files of inference results
Finally, evaluate your fine-tuned models to quantify the improvement in spatial reasoning capabilities.
Evaluate all fine-tuned model results comprehensively:
bash scripts/bash_scripts/run_batch_evaluation.sh data/results/sft/ # You can adjust the path to the JSONL files you would like to evaluate

For detailed analysis of specific fine-tuned models:
python scripts/run_evaluation.py \
-i data/results/sft/raw_qa/MindCube_tinybench_raw_qa_checkpoint-5_responses.jsonl \
-o data/evaluate/sft/raw_qa/MindCube_tinybench_raw_qa_checkpoint-5_responses_eval_results.json
# you can adjust the input and output here

Evaluation results will show the effectiveness of your fine-tuning approach:
data/evaluate/sft: a list of task name directories
data/evaluate/sft/<task_name>: a list of JSON files with evaluation results
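To compare against the frozen-VLM baseline, you can print the summaries from both result trees side by side; this loop is schema-agnostic and simply pretty-prints whatever the evaluation script wrote:

for f in data/evaluate/frozen_vlm/*.json data/evaluate/sft/*/*.json; do
  echo "== $f"
  python -m json.tool "$f" | head -n 10
done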
Raw Data → Scaffold Data → Model Prompts → SFT Training → Model Inference & Evaluation
 (Step 1)      (Step 2)        (Step 3)       (Step 4)            (Step 5)
- Step 1: Raw Data Processing - Original question-answer pairs with spatial annotations
- Step 2: Scaffold Data Generation - Cognitive maps and reasoning chains
- Step 3: Model Prompt Generation - 8 task variants for comprehensive training
- Step 4: SFT Training Data Generation - Multi-model format support (Qwen2.5-VL, LLaVA, InstructBLIP)
- Step 5: Model Operations & Evaluation - Inference and comprehensive evaluation metrics
Get help for any script:
python scripts/data_processing.py --help
python scripts/generate_prompts.py --help
python scripts/run_inference.py --help
python scripts/run_evaluation.py --help

- Add RL Training Description
- Release RL Training Checkpoints
You can find a raw version of our RL training data here: https://huggingface.co/datasets/yinbq/MindCube_RL/tree/main
Explore other exciting projects from our MLL-Lab.
This project is licensed under the MIT License.