NeurIPS 2025
Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
MindJourney is a test-time scaling framework that leverages the 3D imagination capability of World Models to strengthen spatial reasoning in Vision-Language Models (VLMs). We evaluate on the SAT dataset and provide a baseline pipeline, a Stable Virtual Camera (SVC) based spatial beam search pipeline, and a Search World Model (SWM) based spatial beam search pipeline.
- 2025/10: Updated codebase and released Search World Model.
- 2025/09: MindJourney is accepted to NeurIPS 2025!
- 2025/07: Inference code for SAT with Stable Virtual Camera released.
- 2025/07: Paper is on arXiv: https://arxiv.org/abs/2507.12508
- pipelines/
  - pipeline_baseline.py: baseline inference without a world model
  - pipeline_svc_scaling_spatial_beam_search.py: SVC-based spatial beam search
  - pipeline_wan_scaling_beam_search_double_rank.py: SWM-based spatial beam search
- scripts/
  - pipeline_baseline.sh: baseline example script
  - inference_pipeline_svc_scaling_parallel_sat_test.sh: alternative SVC inference driver
  - inference_pipeline_wan_scaling_parallel_sat-test.sh: example driver for WAN-based experiments
- utils/
  - api.py: Azure OpenAI wrapper and configuration
  - args.py: unified CLI arguments (pipeline + SVC)
  - vlm_wrapper.py, prompt_formatting.py: VLM wrapper and prompt construction
  - data_process.py: SAT dataset preprocessing (download and JSON organization)
- stable_virtual_camera/: SVC module (editable install; see pyproject.toml)
- assets/: logo and figures
- wan2.2/: WAN-related experimental code
We recommend two Conda environments to isolate the main runtime and SVC.
- Main runtime (VLM + framework):
conda create -n mindjourney python=3.11 -y
conda activate mindjourney
# CUDA 12.6 builds of PyTorch (adjust if needed)
pip install torch==2.6.0+cu126 torchvision==0.21.0+cu126 torchaudio==2.6.0+cu126 \
--extra-index-url https://download.pytorch.org/whl/cu126
# General dependencies
pip install -r requirements.txt
- Stable Virtual Camera (separate env to avoid conflicts):
conda create -n mindjourney_svc python=3.10 -y
conda activate mindjourney_svc
# Editable install of the SVC module (dependencies defined in pyproject.toml)
pip install -e stable_virtual_camera/
# Optionally reuse shared utilities if needed
pip install -r requirements_svc.txt
Hardware suggestions:
- NVIDIA GPU (80 GB VRAM or more recommended), CUDA 12.6 drivers
- Sufficient disk space for intermediate videos and results
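Optional sanity check that the CUDA build of PyTorch is visible from the main environment (a generic command, not part of the repo's scripts):
conda activate mindjourney
# should print the torch version and True if a GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"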
- Set your Azure endpoint in utils/api.py:
  - File: MindJourney-dev-new/utils/api.py
  - Field: AzureConfig.azure_endpoint = "YOUR_API_ENDPOINT"
- Export the API key:
export AZURE_OPENAI_API_KEY=YOUR_API_KEY
Supported models: gpt-4o, gpt-4.1, o4-mini, o1. You can also choose OpenGVLab/InternVL3-8B or OpenGVLab/InternVL3-14B (ensure adequate VRAM and dependencies).
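A quick, optional check that the key is actually exported in the current shell before launching a GPT-backed run:
# fails fast if the variable is missing
python -c "import os; assert os.environ.get('AZURE_OPENAI_API_KEY'), 'AZURE_OPENAI_API_KEY not set'"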
Download the Search World Model (SWM) checkpoint.
Update the checkpoint path in the bash scripts under scripts/.
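If you are unsure where that path lives, grepping the run scripts is a quick way to locate it (the variable name in the comment below is illustrative, not necessarily the script's actual name):
# locate checkpoint references in the run scripts
grep -rn "checkpoint" scripts/
# then point the matching variable at your downloaded SWM checkpoint, e.g.
# CKPT_PATH=/path/to/swm_checkpoint   # illustrative name and path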
Request access to the SVC weights and log in:
# https://huggingface.co/stabilityai/stable-virtual-camera
huggingface-cli login
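Optionally, once access is granted, you can pre-download the weights into the local Hugging Face cache:
huggingface-cli download stabilityai/stable-virtual-camera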
Prepare SAT from Hugging Face using the helper script:
python utils/data_process.py --split val
python utils/data_process.py --split test
Outputs under ./data/:
- val.json / test.json: questions with choices, correct answers, and image paths
- Per-type splits: val_<type>.json / test_<type>.json
- Images: ./data/<split>/image_*.png
Per-question JSON fields: database_idx, question_type, question, answer_choices, correct_answer, img_paths.
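A minimal sketch for inspecting one entry and confirming those fields (it assumes the top-level JSON is a list of question records; adjust if the layout differs):
python - <<'PY'
import json
with open("data/val.json") as f:
    questions = json.load(f)
q = questions[0]  # first question record (assumes a list at the top level)
for key in ["database_idx", "question_type", "question", "answer_choices", "correct_answer", "img_paths"]:
    print(key, "->", q.get(key))
PY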
Before running:
- Export AZURE_OPENAI_API_KEY and set azure_endpoint in utils/api.py if using GPT models
- Add the repo root to PYTHONPATH: export PYTHONPATH=$PYTHONPATH:./
- Set WORLD_MODEL_TYPE=svc for Stable Virtual Camera
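For convenience, the three settings above can be exported together in one shell before launching a run (values are placeholders):
export AZURE_OPENAI_API_KEY=YOUR_API_KEY
export PYTHONPATH=$PYTHONPATH:./
export WORLD_MODEL_TYPE=svc   # only needed for the Stable Virtual Camera pipeline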
- Baseline: bash scripts/pipeline_baseline.sh
- SWM: bash scripts/inference_pipeline_wan_scaling_parallel_sat-test.sh 0
- SVC: bash scripts/inference_pipeline_svc_scaling_parallel_sat-test.sh 0
Key arguments (see utils/args.py for the full list):
- --vlm_model_name / --vlm_qa_model_name: scoring and answering VLMs (Azure OpenAI or InternVL3)
- --num_questions, --split: number of questions and split (val/test)
- --max_steps_per_question: max iterations per question (beam search)
- --num_beams, --num_top_candidates: beam width and candidate count
- --helpful_score_threshold, --exploration_score_threshold: filtering thresholds
- --max_images: max images per question (typically 1–2)
You may set num_question_chunks to split the questions into chunks for parallel runs; the trailing index in the example commands (e.g., 0) selects which chunk to process.
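As a sketch of how these arguments fit together, a direct invocation of the SVC pipeline might look like the following; the flag names come from the list above, the values are illustrative, and defaults live in utils/args.py (in practice the provided bash scripts wrap this call):
python pipelines/pipeline_svc_scaling_spatial_beam_search.py \
    --vlm_model_name gpt-4o \
    --vlm_qa_model_name gpt-4o \
    --split val \
    --num_questions 50 \
    --max_steps_per_question 3 \
    --num_beams 4 \
    --num_top_candidates 2 \
    --max_images 2 \
    --output_dir ./results/svc_val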
Outputs under --output_dir:
- results.json: overall accuracy, per-type accuracy, skipped indices, parsing stats
- <qid>/: starting image(s) and per-question gpt.json, timing.json logs
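A small helper for pulling the headline numbers out of a finished run; the key names are illustrative, so check results.json for the actual fields:
python - <<'PY'
import json
with open("results/svc_val/results.json") as f:  # path matches the example --output_dir above
    results = json.load(f)
print(results.get("overall_accuracy"))           # key name is illustrative
print(results.get("per_type_accuracy"))          # key name is illustrative
PY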
If you find this repository helpful, please cite:
@misc{yang2025mindjourneytesttimescalingworld,
title={MindJourney: Test-Time Scaling with World Models for Spatial Reasoning},
author={Yuncong Yang and Jiageng Liu and Zheyuan Zhang and Siyuan Zhou and Reuben Tan and Jianwei Yang and Yilun Du and Chuang Gan},
year={2025},
eprint={2507.12508},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.12508},
}
- This repository is under active development; interfaces and scripts may change.
- Configure a valid Azure endpoint and API key if using GPT-family models; you are responsible for any API costs.
- SVC weights require approval on Hugging Face and appreciable VRAM.
- For issues with environments or arguments, see utils/args.py and the code comments in pipelines/.
