NeurIPS 2025
Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
MindJourney is a test-time scaling framework that leverages the 3D imagination capability of World Models to strengthen spatial reasoning in Vision-Language Models (VLMs). We evaluate on the SAT dataset and provide a baseline pipeline, a Stable Virtual Camera (SVC) based spatial beam search pipeline, and a Search World Model (SWM) based spatial beam search pipeline.
- 2025/10: Updated codebase and released Search World Model.
- 2025/09: MindJourney is accepted to NeurIPS 2025!
- 2025/07: Inference code for SAT with Stable Virtual Camera released.
- 2025/07: Paper is on arXiv: https://arxiv.org/abs/2507.12508
- pipelines/
  - pipeline_baseline.py: baseline inference without a world model
  - pipeline_svc_scaling_spatial_beam_search.py: SVC-based spatial beam search
  - pipeline_wan_scaling_beam_search_double_rank.py: SWM-based spatial beam search
- scripts/
  - pipeline_baseline.sh: baseline example script
  - inference_pipeline_svc_scaling_parallel_sat_test.sh: alternative SVC inference driver
  - inference_pipeline_wan_scaling_parallel_sat-test.sh: example driver for WAN-based experiments
- utils/
  - api.py: Azure OpenAI wrapper and configuration
  - args.py: unified CLI arguments (pipeline + SVC)
  - vlm_wrapper.py, prompt_formatting.py: VLM wrapper and prompt construction
  - data_process.py: SAT dataset preprocessing (download and JSON organization)
- stable_virtual_camera/: SVC module (editable install; see pyproject.toml)
- assets/: logo and figures
- wan2.2/: WAN-related experimental code
We recommend two Conda environments to isolate the main runtime and SVC.
- Main runtime (VLM + framework):
conda create -n mindjourney python=3.11 -y
conda activate mindjourney
# CUDA 12.6 builds of PyTorch (adjust if needed)
pip install torch==2.6.0+cu126 torchvision==0.21.0+cu126 torchaudio==2.6.0+cu126 \
--extra-index-url https://download.pytorch.org/whl/cu126
# General dependencies
pip install -r requirements.txt
- Stable Virtual Camera (separate env to avoid conflicts):
conda create -n mindjourney_svc python=3.10 -y
conda activate mindjourney_svc
# Editable install of the SVC module (dependencies defined in pyproject.toml)
pip install -e stable_virtual_camera/
# Optionally reuse shared utilities if needed
pip install -r requirements_svc.txt
Hardware suggestions:
- NVIDIA GPU (80 GB VRAM or more recommended), CUDA 12.6 drivers
- Sufficient disk space for intermediate videos and results
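Optional sanity check that the CUDA build of PyTorch is visible from the main environment (a generic command, not part of the repo's scripts):
conda activate mindjourney
# should print the torch version and True if a GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"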
- Set your Azure endpoint in utils/api.py:
  - File: MindJourney-dev-new/utils/api.py
  - Field: AzureConfig.azure_endpoint = "YOUR_API_ENDPOINT"
- Export the API key:
export AZURE_OPENAI_API_KEY=YOUR_API_KEY
Supported models: gpt-4o, gpt-4.1, o4-mini, o1. You can also choose OpenGVLab/InternVL3-8B or OpenGVLab/InternVL3-14B (ensure adequate VRAM and dependencies).
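A quick, optional check that the key is actually exported in the current shell before launching a GPT-backed run:
# fails fast if the variable is missing
python -c "import os; assert os.environ.get('AZURE_OPENAI_API_KEY'), 'AZURE_OPENAI_API_KEY not set'"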
Download the Search World Model (SWM) checkpoint.
Update the checkpoint path in the bash scripts under scripts/.
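If you are unsure where that path lives, grepping the run scripts is a quick way to locate it (the variable name in the comment below is illustrative, not necessarily the script's actual name):
# locate checkpoint references in the run scripts
grep -rn "checkpoint" scripts/
# then point the matching variable at your downloaded SWM checkpoint, e.g.
# CKPT_PATH=/path/to/swm_checkpoint   # illustrative name and path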
Request access to the SVC weights and log in:
# https://huggingface.co/stabilityai/stable-virtual-camera
huggingface-cli login
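Optionally, once access is granted, you can pre-download the weights into the local Hugging Face cache:
huggingface-cli download stabilityai/stable-virtual-camera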
Prepare SAT from Hugging Face using the helper script:
python utils/data_process.py --split val
python utils/data_process.py --split test
Outputs under ./data/:
- val.json / test.json: questions with choices, correct answers, and image paths
- Per-type splits: val_<type>.json / test_<type>.json
- Images: ./data/<split>/image_*.png
Per-question JSON fields: database_idx, question_type, question, answer_choices, correct_answer, img_paths.
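A minimal sketch for inspecting one entry and confirming those fields (it assumes the top-level JSON is a list of question records; adjust if the layout differs):
python - <<'PY'
import json
with open("data/val.json") as f:
    questions = json.load(f)
q = questions[0]  # first question record (assumes a list at the top level)
for key in ["database_idx", "question_type", "question", "answer_choices", "correct_answer", "img_paths"]:
    print(key, "->", q.get(key))
PY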
Before running:
- Export AZURE_OPENAI_API_KEY and set azure_endpoint in utils/api.py if using GPT models
- Add the repo root to PYTHONPATH: export PYTHONPATH=$PYTHONPATH:./
- Set WORLD_MODEL_TYPE=svc for Stable Virtual Camera
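For convenience, the three settings above can be exported together in one shell before launching a run (values are placeholders):
export AZURE_OPENAI_API_KEY=YOUR_API_KEY
export PYTHONPATH=$PYTHONPATH:./
export WORLD_MODEL_TYPE=svc   # only needed for the Stable Virtual Camera pipeline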
- Baseline: bash scripts/pipeline_baseline.sh
- SWM: bash scripts/inference_pipeline_wan_scaling_parallel_sat-test.sh 0
- SVC: bash scripts/inference_pipeline_svc_scaling_parallel_sat-test.sh 0
Key arguments (see utils/args.py for the full list):
- --vlm_model_name / --vlm_qa_model_name: scoring and answering VLMs (Azure OpenAI or InternVL3)
- --num_questions, --split: number of questions and split (val/test)
- --max_steps_per_question: max iterations per question (beam search)
- --num_beams, --num_top_candidates: beam width and candidate count
- --helpful_score_threshold, --exploration_score_threshold: filtering thresholds
- --max_images: max images per question (typically 1–2)
You may set num_question_chunks to split the questions into chunks for parallel runs; the trailing index in the example commands (e.g., 0) selects which chunk to process.
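As a sketch of how these arguments fit together, a direct invocation of the SVC pipeline might look like the following; the flag names come from the list above, the values are illustrative, and defaults live in utils/args.py (in practice the provided bash scripts wrap this call):
python pipelines/pipeline_svc_scaling_spatial_beam_search.py \
    --vlm_model_name gpt-4o \
    --vlm_qa_model_name gpt-4o \
    --split val \
    --num_questions 50 \
    --max_steps_per_question 3 \
    --num_beams 4 \
    --num_top_candidates 2 \
    --max_images 2 \
    --output_dir ./results/svc_val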
Outputs under --output_dir:
- results.json: overall accuracy, per-type accuracy, skipped indices, parsing stats
- <qid>/: starting image(s) and per-question gpt.json, timing.json logs
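A small helper for pulling the headline numbers out of a finished run; the key names are illustrative, so check results.json for the actual fields:
python - <<'PY'
import json
with open("results/svc_val/results.json") as f:  # path matches the example --output_dir above
    results = json.load(f)
print(results.get("overall_accuracy"))           # key name is illustrative
print(results.get("per_type_accuracy"))          # key name is illustrative
PY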
If you find this repository helpful, please cite:
@misc{yang2025mindjourneytesttimescalingworld,
title={MindJourney: Test-Time Scaling with World Models for Spatial Reasoning},
author={Yuncong Yang and Jiageng Liu and Zheyuan Zhang and Siyuan Zhou and Reuben Tan and Jianwei Yang and Yilun Du and Chuang Gan},
year={2025},
eprint={2507.12508},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.12508},
}
- This repository is under active development; interfaces and scripts may change.
- Configure a valid Azure endpoint and API key if using GPT-family models; you are responsible for any API costs.
- SVC weights require approval on Hugging Face and appreciable VRAM.
- For issues with environments or arguments, see utils/args.py and the code comments in pipelines/.
