UniSandbox: Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Yuwei Niu¹²*, Weiyang Jin³*, Jiaqi Liao, Chaoran Feng¹, Peng Jin¹, Bin Lin¹, Zongjian Li¹, Bin Zhu¹, Weihao Yu¹, Li Yuan¹⁴†
¹Peking University, ²Chongqing University, ³HKU MMLab, ⁴PengCheng Laboratory
*Equal contribution, †Corresponding Author. Contact: [email protected], [email protected]
We introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation.
```bash
git clone https://github.com/PKU-YuanGroup/UniSandBox.git
cd UniSandBox
conda create -n unisandbox python=3.10 -y
conda activate unisandbox
pip install -r requirements.txt
pip install flash_attn==2.5.8 --no-build-isolation
```

Note: The provided training scripts are configured for 8 GPUs (e.g., 8×A100 80G) via `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. You can adjust `NUM_GPUS`, memory, and device settings in the shell scripts to fit your hardware.
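For example, a 4-GPU run should only require changing the device-related lines in the training scripts. This is a hypothetical edit: only `NUM_GPUS` and `CUDA_VISIBLE_DEVICES` are named above, so verify the actual variable names inside the scripts.

```bash
# Hypothetical single-node, 4-GPU configuration inside train_reasoning.sh / train_knowledge.sh.
# Only CUDA_VISIBLE_DEVICES and NUM_GPUS are mentioned in this README; check the scripts for other settings.
export CUDA_VISIBLE_DEVICES=0,1,2,3
NUM_GPUS=4
```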
This section evaluates and improves the model's ability to generate images that require mathematical calculation or logical deduction, using the STARS (Self-Training with Rejection Sampling) framework.
To train the model using the STARS framework:
- Training data
  - Download the reasoning splits (math and mapping) from `Yuwei-Niu/UniSandbox` (see the download sketch after this list).
  - Make sure `data/dataset_info.py` correctly points to these folders (see the `math*_reject_5k` and `mapping*_1w_reject` entries).
- Benchmark JSONL
  - Evaluation JSONL files for reasoning are stored in `benchmark/test_reasoning` (e.g., `math_1.jsonl`, `math_2.jsonl`, `mapping1.jsonl`, etc.). These are only used for inference/evaluation, not for training.
- Run training

```bash
bash train_reasoning.sh   # default: Math (see --dataset_config_file in the script)
```

You can switch to other reasoning tasks (e.g., math2, mapping1) by changing `--dataset_config_file` in `train_reasoning.sh` to the corresponding YAML under `data/configs/Math` or `data/configs/Mapping`.
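If the reasoning splits are hosted as a Hugging Face dataset under `Yuwei-Niu/UniSandbox` (as referenced above), a download along these lines should work. The `--repo-type` and the local target directory are assumptions; adjust them to match the actual repository, then point `data/dataset_info.py` at the resulting folders.

```bash
# Hedged sketch: fetch the reasoning splits referenced above to a local folder.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Yuwei-Niu/UniSandbox \
    --repo-type dataset \
    --local-dir data/UniSandbox   # placeholder location
```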
After training, run the weight completion and precision conversion step:

```bash
# --src_dir:     the original BAGEL-7B-MoT checkpoint (input)
# --target_dirs: your trained checkpoint_dir
python tool/CKPT_Transfer.py \
    --src_dir path/to/checkpoints/BAGEL-7B-MoT \
    --target_dirs path/to/results/checkpoints-xxx
```

To generate images on the reasoning benchmarks:
- Edit `batch_run_eval.sh`:
  - Set `MODEL_PATH` to your checkpoint_dir.
  - Set `OUTPUT_DIR` to where you want to save generated images.
  - Set `FILES` to the list of absolute paths of reasoning JSONL files, e.g.:

```bash
MODEL_PATH="/abs/path/to/results/checkpoints-math1"
OUTPUT_DIR="/abs/path/to/inference_results"
FILES=(
    "/abs/path/to/UniSandBox/benchmark/test_reasoning/math_1.jsonl"
    "/abs/path/to/UniSandBox/benchmark/test_reasoning/math_2.jsonl"
)
```

- Run batch inference (normal + CoT/think mode):
```bash
bash batch_run_eval.sh
```

This script wraps `batch_inference.py` and will:
- Automatically detect the file type (math vs. mapping) from the filename.
- Generate images for both normal (no explicit CoT) and think (with CoT) modes.
- Organize results under `OUTPUT_DIR/test/{normal,think}/{jsonl_stem}/` (e.g., `.../test/normal/math_1/`).
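With the example settings above, the output would be organized roughly as follows (an illustrative layout derived from the path pattern just described):

```bash
# Illustrative layout for OUTPUT_DIR=/abs/path/to/inference_results
# inference_results/
# └── test/
#     ├── normal/          # generated without explicit CoT
#     │   ├── math_1/
#     │   └── math_2/
#     └── think/           # generated with CoT ("think" mode)
#         ├── math_1/
#         └── math_2/
```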
For more advanced usage (e.g., single-GPU, custom memory limits, regenerating existing images), please refer to the inline comments and argument descriptions in batch_inference.py.
We use vLLM and Qwen2.5-VL-7B-Instruct for evaluation.
Prerequisites:
- Install vLLM (refer to the vLLM Qwen2.5-VL docs).
- Download the vision-language model `Qwen/Qwen2.5-VL-7B-Instruct`.
Step 1: Launch the vLLM Server
```bash
vllm serve path/to/Qwen2.5-VL-7B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    --dtype bfloat16
```

Step 2: Run Evaluation Scripts
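Before invoking the evaluation scripts, you can optionally confirm the server is reachable. vLLM exposes an OpenAI-compatible API, so a simple check against the host/port used above is:

```bash
# Lists the model(s) served by the vLLM instance launched in Step 1.
curl http://localhost:8000/v1/models
```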
Once the server is running, evaluate both math and mapping benchmarks. Example commands:
```bash
# Math reasoning (e.g., math_1)
python eval/eval_math.py \
    --jsonl benchmark/test_reasoning/math_1.jsonl \
    --image-dir /abs/path/to/inference_results/test/normal/math_1 \
    --model path/to/Qwen2.5-VL-7B-Instruct \
    --overwrite

# Symbolic mapping (e.g., mapping1)
python eval/eval_mapping.py \
    --jsonl benchmark/test_reasoning/mapping1.jsonl \
    --image-dir /abs/path/to/inference_results/test/normal/mapping1 \
    --model path/to/Qwen2.5-VL-7B-Instruct \
    --overwrite
```

The scripts will:
- Call the vLLM server with the two-stage prompts.
- Write `evaluation_results.csv` and `evaluation_results.log` into the corresponding `image-dir`.
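After an evaluation finishes, the per-sample scores can be inspected straight from the CSV. A quick look might be (the exact column names depend on the eval scripts, and the path below assumes the earlier example settings):

```bash
# Pretty-print the first few rows of the results written into the image-dir.
column -s, -t /abs/path/to/inference_results/test/normal/math_1/evaluation_results.csv | head
```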
For more detailed options (e.g., --max-workers, JSONL format), please check the top-of-file docstrings and inline comments in eval/eval_math.py and eval/eval_mapping.py.
This section evaluates whether the model can utilize newly injected knowledge (e.g., virtual character profiles) for visual generation.
To inject new knowledge into the understanding module:
- Training data
  - The knowledge injection training JSONLs (e.g., `Lysendria.jsonl`, `Aurelius_Nyxella.jsonl`) are provided under `data/knowledge`.
  - In `data/dataset_info.py`, ensure each character / pair entry under `DATASET_INFO["vlm_sft"]` has:
    - `data_dir`: folder containing rendered images for that character/pair.
    - `jsonl_path`: path to the corresponding `data/knowledge/*.jsonl`.
- Benchmark JSONL
  - Evaluation JSONLs for knowledge transfer are in `benchmark/test_knowledge` (e.g., `Aurelius.jsonl`, `Aurelius_Nyxella.jsonl`, etc.).
- Run training

```bash
bash train_knowledge.sh   # default: Aurelius (see --dataset_config_file in the script)
```

You can switch to other characters or pairwise knowledge injection by changing `--dataset_config_file` in `train_knowledge.sh` to the desired YAML under `data/configs/Knowledge/Forward` or `data/configs/Knowledge/Inverse`, e.g. as sketched below.
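For example, switching to a pairwise (Inverse) injection could look like this. The YAML filename is an assumption; check `data/configs/Knowledge/Inverse` for the actual names.

```bash
# Hypothetical: in train_knowledge.sh, change the config flag to something like
#   --dataset_config_file data/configs/Knowledge/Inverse/Aurelius_Nyxella.yaml \
# then rerun training:
bash train_knowledge.sh
```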
Similar to the reasoning section, process the checkpoint after training:
```bash
# --src_dir:     the original BAGEL-7B-MoT checkpoint (input)
# --target_dirs: your trained checkpoint_dir
python tool/CKPT_Transfer.py \
    --src_dir path/to/checkpoints/BAGEL-7B-MoT \
    --target_dirs path/to/results/checkpoints-xxx
```

Run inference on the knowledge transfer benchmarks using the same `batch_run_eval.sh` pipeline as in the reasoning section:
- Edit `MODEL_PATH`, `OUTPUT_DIR`, and `FILES` in `batch_run_eval.sh` so that:
  - `MODEL_PATH` points to your converted knowledge-augmented checkpoint.
  - `FILES` contains absolute paths to JSONLs under `benchmark/test_knowledge` (e.g., `Aurelius.jsonl`, `Lysendria_Kaelorix.jsonl`, etc.).
- Execute:

```bash
bash batch_run_eval.sh
```

Images will be saved under `OUTPUT_DIR/test/{normal,think}/{jsonl_stem}/` (e.g., `.../test/normal/Aurelius/`).
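A concrete configuration for the knowledge benchmarks, mirroring the reasoning example above (the checkpoint directory name is a placeholder):

```bash
MODEL_PATH="/abs/path/to/results/checkpoints-aurelius"   # placeholder name for your converted checkpoint
OUTPUT_DIR="/abs/path/to/inference_results"
FILES=(
    "/abs/path/to/UniSandBox/benchmark/test_knowledge/Aurelius.jsonl"
    "/abs/path/to/UniSandBox/benchmark/test_knowledge/Lysendria_Kaelorix.jsonl"
)
```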
Using the same vLLM setup as the Reasoning section:
Step 1: Launch vLLM Server (if not already running)
```bash
vllm serve path/to/Qwen2.5-VL-7B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    --dtype bfloat16
```

Step 2: Run Evaluation Script
Example command:
```bash
python eval/eval_knowledge.py \
    --jsonl benchmark/test_knowledge/Aurelius.jsonl \
    --image-dir /abs/path/to/inference_results/test/normal/Aurelius \
    --model path/to/Qwen2.5-VL-7B-Instruct \
    --overwrite
```

The script will:
- Apply the strict person / flower / fruit captioning and evaluation prompts.
- Produce `evaluation_results.csv` and `evaluation_results.log` in `image-dir`.
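To evaluate several knowledge benchmarks in one pass, a simple loop over the JSONL stems (names taken from `benchmark/test_knowledge` above) works, assuming the image directories follow the `OUTPUT_DIR/test/normal/{jsonl_stem}` pattern described earlier:

```bash
# Run eval_knowledge.py once per listed benchmark.
for name in Aurelius Aurelius_Nyxella; do
    python eval/eval_knowledge.py \
        --jsonl benchmark/test_knowledge/${name}.jsonl \
        --image-dir /abs/path/to/inference_results/test/normal/${name} \
        --model path/to/Qwen2.5-VL-7B-Instruct \
        --overwrite
done
```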
For additional usage details (e.g., JSONL format, multi-threading), please refer to the docstring and comments at the top of eval/eval_knowledge.py.
This codebase is built upon BAGEL. We thank the authors for their great work and contribution to the community.
If you have any questions, feel free to contact Yuwei Niu at [email protected].
For issues related to the BAGEL codebase, please refer to ByteDance-Seed/Bagel.
If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work 📝 :)
```bibtex
@article{niu2025doesunderstandinginformgeneration,
  title={Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward},
  author={Yuwei Niu and Weiyang Jin and Jiaqi Liao and Chaoran Feng and Peng Jin and Bin Lin and Zongjian Li and Bin Zhu and Weihao Yu and Li Yuan},
  journal={arXiv preprint arXiv:2511.20561},
  year={2025}
}
```