Baiqiao Yin1,3*, Qineng Wang1*‡, Pingyue Zhang1, Jianshu Zhang1, Kangrui Wang1, Zihan Wang1, Jieyu Zhang4, Keshigeyan Chandrasegaran2, Han Liu1, Ranjay Krishna4, Saining Xie3, Manling Li†1, Jiajun Wu2†, Li Fei-Fei2†
*Equal contribution, ‡Project Lead, †Equal advising
1Northwestern University, 2Stanford University, 3New York University, 4University of Washington
- [2025-06-26] Our paper is now available on arXiv; check it out here.
- [2025-06-24] Our website is online; check it out here.
- [2025-06-23] We open-sourced the MindCube framework and dataset.
MindCube is a modular framework for generating and evaluating spatial reasoning datasets for multimodal AI models. The project follows a complete pipeline from raw data to model evaluation, with specialized modules for scaffold data curation, prompt generation, model inference, training, and comprehensive evaluation.
Follow these steps to set up your development environment. This process will create an isolated Python environment with all necessary dependencies for running MindCube.
git clone git@github.com:mll-lab-nu/MindCube.git
cd MindCube

First, we'll create a dedicated conda environment to avoid conflicts with other projects:
conda create -n mindcube python=3.10 -y
conda activate mindcube

Next, install PyTorch with CUDA support. Make sure to adjust the CUDA version according to your system:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124 # change to your CUDA version

Finally, install flash-attention and the remaining required dependencies:
pip install flash-attn==2.7.4.post1 --no-build-isolation
pip install -r requirements.txt

Once your environment is ready, download the MindCube dataset, which contains the spatial reasoning questions and images:
bash scripts/bash_scripts/download_data.bash

The data generation process transforms raw spatial reasoning data into structured formats suitable for model training and evaluation.
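Before running the generation steps, you can optionally confirm that the raw files are in place. This is just a quick sanity check; the data/raw/ paths are the same ones used by the processing commands below:

ls data/raw/ # expect files such as MindCube_train.jsonl and MindCube_tinybench.jsonl
wc -l data/raw/*.jsonl # number of examples per split (one JSON object per line)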
For convenience, use this single command to generate all required data formats:
bash scripts/bash_scripts/generate_eval_data.bash

If you prefer to understand each step or need fine-grained control, follow these detailed steps:
Step 1: Scaffold Data Generation
This step processes raw JSONL files and generates cognitive maps and reasoning chains that serve as scaffolds for spatial understanding:
python scripts/data_processing.py \
--input data/raw/MindCube_train.jsonl \
--task full_pipeline
python scripts/data_processing.py \
--input data/raw/MindCube_tinybench.jsonl \
--task full_pipeline

Step 2: General Prompts Generation
Now we create various prompt formats (8 different task types) that will be used for model training and evaluation:
python scripts/generate_prompts.py \
--input data/scaffold/all/MindCube_train.jsonl \
--all_tasks
python scripts/generate_prompts.py \
--input data/scaffold/all/MindCube_tinybench.jsonl \
--all_tasks

Step 3: Model Format Data Transformation
Finally, convert the general prompts into model-specific formats. Currently, we support Qwen2.5VL format:
python scripts/convert_to_sft.py \
--input_dir data/prompts/general/ \
--model qwen2.5vl # Currently, we only support the Qwen2.5-VL format

After completing these steps, you should see the following directory structure:
data/scaffold/all: 2 files
data/prompts/general: 16 files
data/prompts/training/qwen2.5vl: 16 files
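If you want to spot-check the generated prompts, you can peek at one of the files listed above. This is only a sketch; the exact JSONL fields are whatever the generation scripts emit, so inspect them rather than assume a schema:

wc -l data/prompts/general/*.jsonl # one prompt per line in each task file
head -n 1 data/prompts/general/MindCube_tinybench_raw_qa.jsonl | python -m json.tool # keys of the first record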
With your data prepared, you can now run inference using pre-trained vision-language models without any fine-tuning.
Run inference on all task configurations simultaneously for comprehensive evaluation:
bash scripts/bash_scripts/run_frozen_vlm_all_tasks_qwen.sh --max-tasks-per-gpu 2 # You can adjust this number based on your GPU's capacity (default is 1)

For more control, or when working with limited GPU memory, run inference on specific tasks:
python scripts/run_inference.py \
--model-type qwen2.5vl \
--input-file data/prompts/general/MindCube_tinybench_raw_qa.jsonl \
--output-dir data/results/frozen_vlm
# you can adjust input-file and output-dir here

The inference results will be saved in structured directories for easy analysis:
data/results/frozen_vlm/: n jsonl files (based on your command)
logs/inference: All frozen inference logs
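As a rough sanity check (assuming one JSON object per line), you can confirm that each output file contains one response per input prompt:

ls data/results/frozen_vlm/
wc -l data/results/frozen_vlm/*.jsonl # should match the line counts of the corresponding prompt files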
After obtaining model predictions, evaluate the performance using our comprehensive evaluation metrics.
Evaluate all inference results at once for a complete performance overview:
bash scripts/bash_scripts/run_batch_evaluation.sh data/results/frozen_vlm/ # You can adjust the path to the JSONL files you would like to evaluate

For detailed analysis of specific models or tasks, run evaluation individually:
python scripts/run_evaluation.py \
-i data/results/frozen_vlm/MindCube_tinybench_raw_qa_qwen2.5-vl-3b-instruct_responses.jsonl \
-o data/evaluate/frozen_vlm/MindCube_tinybench_raw_qa_qwen2.5-vl-3b-instruct_responses_eval_results.json
# you can adjust the input and output here

Evaluation results will be organized for easy interpretation and comparison:
data/evaluate/frozen_vlm: n json files (based on your command)
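To read a result without writing any code, pretty-print one of the JSON files; the exact schema is defined by the evaluation script, so treat the fields you see there as authoritative rather than this sketch:

python -m json.tool data/evaluate/frozen_vlm/MindCube_tinybench_raw_qa_qwen2.5-vl-3b-instruct_responses_eval_results.json | head -n 30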
This section guides you through supervised fine-tuning (SFT) to adapt pre-trained models specifically for spatial reasoning tasks.
Install ffmpeg if you have not installed it yet:
conda install -c conda-forge ffmpeg -y # skip this if ffmpeg is already installed on your device

We need the specialized Qwen2.5-VL repository that contains our custom modifications for MindCube training:
git clone git@github.com:QinengWang-Aiden/Qwen2.5-VL-MindCube.git

This step integrates the MindCube datasets into the Qwen training pipeline.
First, let's check if the MindCube datasets are properly registered in the training system:
python experiments/sft/patch_qwen_data.py verify

Expected output:
Project root: /path/to/your/MindCube
Target file: /path/to/your/MindCube/Qwen2.5-VL-MindCube/qwen-vl-finetune/qwenvl/data/__init__.py
Command: verify
Found 0/6 MindCube datasets in data_dict:
Missing datasets:
❌ raw_qa
❌ plain_cgmap_out
❌ ff_rsn
❌ aug_cgmap_out
❌ aug_cgmap_ffr_out
❌ plain_cgmap_ffr_out

Now apply the patches to enable MindCube dataset support in the training pipeline:
python experiments/sft/patch_qwen_data.py patch

Expected output:
✅ Successfully patched Qwen __init__.py with MindCube datasets
Found 6/6 MindCube datasets in data_dict:
✅ raw_qa
✅ aug_cgmap_out
✅ plain_cgmap_out
✅ ff_rsn
✅ aug_cgmap_ffr_out
✅ plain_cgmap_ffr_out

Before starting training, you may want to adjust the configuration based on your hardware and training preferences.
Configure your GPU settings according to your available hardware:
# Hardware configuration
GPU_DEVICES="0" # Modify based on available GPUs (e.g., "0,1,2,3" for 4 GPUs)
NUM_PROCESSES=1 # Should match number of GPUs
BATCH_SIZE=1 # Per-device batch size (adjust based on GPU memory)

Adjust the training hyperparameters for optimal performance on your specific task:
# Training hyperparameters
LEARNING_RATE=1e-5
NUM_EPOCHS=3
# Output configuration
OUTPUT_BASE_DIR="experiments/sft/results"
RUN_NAME="qwen2vl-${TASK_NAME}_sft"
# Additional training arguments
MAX_PIXELS=90000
MIN_PIXELS=784
MODEL_MAX_LENGTH=8192
SAVE_STEPS=5
SAVE_TOTAL_LIMIT=12

Now we're ready to start the actual training process. The model will learn to better understand spatial relationships through supervised fine-tuning.
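As a point of reference, a multi-GPU run might set the hardware variables like this (illustrative values only, reusing the variable names from the configuration above):

# Example: four GPUs, one process per GPU, a slightly larger per-device batch
GPU_DEVICES="0,1,2,3"
NUM_PROCESSES=4 # should match the number of GPUs listed above
BATCH_SIZE=2 # raise or lower until it fits in GPU memory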
Train on all task types sequentially for comprehensive spatial reasoning capabilities:
bash scripts/bash_scripts/run_sft_all_tasks_qwen.sh

For focused training, or when resources are constrained, train on a specific task:
bash experiments/sft/train_qwen_sft.sh config_raw_qa.sh # or replace with any valid task config

Training artifacts will be organized for easy access and model deployment:
checkpoints/sft/: List of all tasks saved checkpoints
logs/sft_training/: All training logs
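You can also verify that checkpoints are being written as expected; raw_qa and checkpoint-5 below are just examples and match the checkpoint path used in the inference command further down:

ls checkpoints/sft/ # one sub-directory per trained task
ls checkpoints/sft/raw_qa/ # e.g., checkpoint-5, checkpoint-10, ... (depends on SAVE_STEPS)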
After training, test your fine-tuned models on the evaluation datasets to measure improvement.
Run inference using all trained checkpoints for comprehensive evaluation:
bash scripts/bash_scripts/run_sft_ckpt_inference_qwen.sh --max-tasks-per-gpu 2 # You can adjust this number based on your GPU's capacity (default is 1)

Test specific checkpoints for detailed analysis:
python scripts/run_inference.py \
--model-type qwen2.5vl \
--model-path checkpoints/sft/raw_qa/checkpoint-5 \
--input-file data/prompts/general/MindCube_tinybench_raw_qa.jsonl \
--output-dir data/results/sft/raw_qa
# you can adjust input-file and output-dir here

Fine-tuned model results will be organized by task for easy comparison with baseline models:
data/results/sft/: a list of task name directories
data/results/sft/<task_name>: a list of jsonl files of inference results
Finally, evaluate your fine-tuned models to quantify the improvement in spatial reasoning capabilities.
Evaluate all fine-tuned model results comprehensively:
bash scripts/bash_scripts/run_batch_evaluation.sh data/results/sft/ # You can adjust the path to the JSONL files you would like to evaluate

For detailed analysis of specific fine-tuned models:
python scripts/run_evaluation.py \
-i data/results/sft/raw_qa/MindCube_tinybench_raw_qa_checkpoint-5_responses.jsonl \
-o data/evaluate/sft/raw_qa/MindCube_tinybench_raw_qa_checkpoint-5_responses_eval_results.json
# you can adjust the input and output here

Evaluation results will show the effectiveness of your fine-tuning approach:
data/evaluate/sft: a list of task name directories
data/evaluate/sft/<task_name>: a list of JSON files with evaluation results
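To compare against the frozen-VLM baseline, you can print the summaries from both result trees side by side; this loop is schema-agnostic and simply pretty-prints whatever the evaluation script wrote:

for f in data/evaluate/frozen_vlm/*.json data/evaluate/sft/*/*.json; do
  echo "== $f"
  python -m json.tool "$f" | head -n 10
done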
Raw Data → Scaffold Data → Model Prompts → SFT Training → Model Inference & Evaluation
 (Step 1)      (Step 2)        (Step 3)       (Step 4)            (Step 5)
- Step 1: Raw Data Processing - Original question-answer pairs with spatial annotations
- Step 2: Scaffold Data Generation - Cognitive maps and reasoning chains
- Step 3: Model Prompt Generation - 8 task variants for comprehensive training
- Step 4: SFT Training Data Generation - Multi-model format support (Qwen2.5-VL, LLaVA, InstructBLIP)
- Step 5: Model Operations & Evaluation - Inference and comprehensive evaluation metrics
Get help for any script:
python scripts/data_processing.py --help
python scripts/generate_prompts.py --help
python scripts/run_inference.py --help
python scripts/run_evaluation.py --help

- Add RL Training Description
- Release RL Training Checkpoints
You can find a raw version of our RL training data here: https://huggingface.co/datasets/yinbq/MindCube_RL/tree/main
Explore other exciting projects from our MLL-Lab.
This project is licensed under the MIT License.