Emu3.5 Team, BAAI
| Core Concept | Description |
|---|---|
| Unified World Modeling | Predicts the next state jointly across vision and language, enabling coherent world modeling and generation. |
| End-to-End Pretraining | Trained with a unified next-token prediction objective over interleaved vision-language sequences. |
| 10T+ Multimodal Tokens | Pre-trained on over 10 trillion interleaved tokens from video frames and transcripts, capturing spatiotemporal structure. |
| Native Multimodal I/O | Processes and generates interleaved visual-text sequences without modality adapters or task-specific heads. |
| RL Post-Training | Large-scale reinforcement learning enhances reasoning, compositionality, and generation quality. |
| Discrete Diffusion Adaptation (DiDA) | Converts sequential decoding into bidirectional parallel prediction, achieving roughly 20x faster inference without performance loss. |
| Versatile Generation | Excels in long-horizon vision-language generation, any-to-image (X2I) synthesis, and text-rich image creation. |
| Generalizable World Modeling | Enables spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios. |
| Performance Benchmark | Matches Gemini 2.5 Flash Image (Nano Banana) on image generation/editing and outperforms it on interleaved generation tasks. |
- 2025-11-19 · vLLM Offline Inference Released: `inference_vllm.py` ships with a new cond/uncond batch scheduler, delivering 4-5x faster end-to-end generation on vLLM 0.11.0 across Emu3.5 tasks. Jump to #Run Inference with vLLM for setup guidance and see PR #47 for full details.
- 2025-11-17 · Gradio Demo (Transformers Backend): `gradio_demo_image.py` and `gradio_demo_interleave.py` presets for the standard Transformers runtime provide turnkey T2I/X2I and interleaved generation experiences with streaming output. Try the commands in #Gradio Demo to launch both UIs locally.
| Model name | HF Weight |
|---|---|
| Emu3.5 | 🤗 HF link |
| Emu3.5-Image | 🤗 HF link |
| Emu3.5-VisionTokenizer | 🤗 HF link |
Note:
- Emu3.5 supports general-purpose multimodal predictions, including interleaved image-text generation and single-image generation (T2I/X2I) tasks.
- Emu3.5-Image is a model focused on T2I/X2I tasks for best performance on these scenarios.
- Both models are pure next-token predictors without DiDA acceleration (each image may take several minutes to generate).
- Stay tuned for DiDA-accelerated weights.
Usage tip:
For interleaved image-text generation, use Emu3.5.
For single-image generation (T2I and X2I), use Emu3.5-Image for the best quality.
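To fetch the weights, a minimal download sketch with `huggingface_hub` is shown below. The repo IDs are assumptions (BAAI organization); substitute the actual repositories from the HF links in the table above if they differ.

```python
# Minimal weight-download sketch using huggingface_hub.
# The repo IDs below are assumptions; replace them with the actual repositories
# linked in the table above if they differ.
from huggingface_hub import snapshot_download

for repo_id in [
    "BAAI/Emu3.5-Image",            # T2I/X2I-focused model
    "BAAI/Emu3.5-VisionTokenizer",  # vision tokenizer used by both models
]:
    local_dir = snapshot_download(repo_id=repo_id)
    print(f"Downloaded {repo_id} to {local_dir}")
```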
# Requires Python 3.12 or higher.
git clone https://github.com/baaivision/Emu3.5
cd Emu3.5
pip install -r requirements/transformers.txt
pip install flash_attn==2.8.3 --no-build-isolation

Edit `configs/config.py` to set:
- Paths: `model_path`, `vq_path`
- Task template: `task_type` in {t2i, x2i, howto, story, explore, vla}
- Input image: `use_image` (set to True to provide reference images; controls the `<|IMAGE|>` token). Set `reference_image` in each prompt to specify the image path. For the x2i task, we recommend passing `reference_image` as a list containing one or more image paths so that multi-image input is supported.
- Sampling: `sampling_params` (classifier_free_guidance, temperature, top_k/top_p, etc.)
- Aspect ratio (t2i task only): `aspect_ratio` ("4:3", "21:9", "1:1", "auto", etc.)
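For orientation, here is a hypothetical sketch of the kinds of values a config like `configs/config.py` might define, using only the fields listed above. The exact structure (module-level variables vs. a config object, the prompt entry format) may differ from the shipped example configs, so treat this as illustrative rather than authoritative.

```python
# Hypothetical sketch of configs/config.py-style settings (field names from the
# list above; the actual layout in the shipped example configs may differ).

# Paths (assumed local locations of downloaded weights).
model_path = "weights/Emu3.5-Image"          # main model checkpoint
vq_path = "weights/Emu3.5-VisionTokenizer"   # vision tokenizer checkpoint

# Task template: one of t2i, x2i, howto, story, explore, vla.
task_type = "x2i"

# Reference-image handling: use_image controls the <|IMAGE|> token; for x2i,
# reference_image is recommended to be a list of image paths.
use_image = True
prompts = [
    {
        "prompt": "Replace the background with a snowy mountain range.",
        "reference_image": ["assets/example_input.jpg"],  # hypothetical path
    },
]

# Sampling parameters (illustrative values, not tuned recommendations).
sampling_params = dict(
    classifier_free_guidance=3.0,
    temperature=1.0,
    top_k=2048,
    top_p=1.0,
)

# Aspect ratio, used by the t2i task ("auto" lets the model decide).
aspect_ratio = "auto"
```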
python inference.py --cfg configs/config.py

Below are example commands for different tasks. Make sure to set CUDA_VISIBLE_DEVICES according to your available GPUs.
# Text-to-Image (T2I) task
CUDA_VISIBLE_DEVICES=0 python inference.py --cfg configs/example_config_t2i.py
# Any-to-Image (X2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_x2i.py
# Visual Guidance task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_guidance.py
# Visual Narrative task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_narrative.py
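If you want to run several of these example configs back to back, a small driver script can wrap the same commands. This is only a convenience sketch: the config filenames and the use of CUDA_VISIBLE_DEVICES are taken from the commands above, while the helper function itself is an assumption.

```python
# Convenience sketch: run several example configs sequentially with inference.py.
# Config paths and CUDA_VISIBLE_DEVICES usage mirror the commands above; adjust
# the GPU list and config selection to your setup.
import os
import subprocess

EXAMPLE_CONFIGS = [
    "configs/example_config_t2i.py",
    "configs/example_config_x2i.py",
    "configs/example_config_visual_guidance.py",
    "configs/example_config_visual_narrative.py",
]

def run_all(gpus: str = "0,1") -> None:
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpus}
    for cfg in EXAMPLE_CONFIGS:
        print(f"[run] inference.py --cfg {cfg} (GPUs {gpus})")
        subprocess.run(["python", "inference.py", "--cfg", cfg], env=env, check=True)

if __name__ == "__main__":
    run_all()
```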
After running inference, the model will generate results in protobuf format (.pb files) for each input prompt. Protobuf outputs are written to `outputs/<exp_name>/proto/`. For better throughput, we recommend using ≥2 GPUs.
- [Optional] Use a new virtual environment for the vLLM backend.

conda create -n Emu3p5 python=3.12

- Install vLLM and apply the patch files.
# Requires Python 3.12 or higher.
# Recommended: CUDA 12.8.
pip install -r requirements/vllm.txt
pip install flash_attn==2.8.3 --no-build-isolation
cd Emu3.5
python src/patch/apply.py

Then run the example tasks with the vLLM backend:

# Text-to-Image (T2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_t2i.py
# Any-to-Image (X2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_x2i.py
# Visual Guidance task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_guidance.py
# Visual Narrative task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_narrative.py

To visualize generated protobuf files:

python src/utils/vis_proto.py --input <input_proto_path> [--output <output_dir>] [--video]

- `--input`: supports a single `.pb` file or a directory; directories are scanned recursively.
- `--output`: optional; defaults to `<input_dir>/results/<file_stem>` for files, or `<parent_dir_of_input>/results` for directories.
- `--video`: generate video visualizations for interleaved output.
Expected output directory layout (example):
results/<pb_name>/
├── 000_question.txt
├── 000_global_cot.txt
├── 001_text.txt
├── 001_00_image.png
├── 001_00_image_cot.txt
├── 002_text.txt
├── 002_00_image.png
├── ...
└── video.mp4   # only when --video is enabled
Each `*_text.txt` stores decoded text segments, `*_image.png` stores generated frames, and the matching `*_image_cot.txt` keeps image-level chain-of-thought notes when available.
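As an illustration of how the layout above can be consumed programmatically, the following sketch walks a `results/<pb_name>/` directory and prints the interleaved segments in order. The filename patterns come from the layout above; the function name, example path, and printing behavior are assumptions made for this example.

```python
# Illustrative walk over a vis_proto.py results directory, relying only on the
# filename patterns shown above (NNN_question.txt, NNN_text.txt, NNN_MM_image.png, ...).
from pathlib import Path

def summarize_results(result_dir: str) -> None:
    root = Path(result_dir)
    for path in sorted(root.iterdir()):
        if path.suffix == ".txt":
            text = path.read_text(encoding="utf-8").strip()
            preview = text[:120].replace("\n", " ")
            print(f"[text ] {path.name}: {preview}")
        elif path.suffix == ".png":
            print(f"[image] {path.name}")
        elif path.suffix == ".mp4":
            print(f"[video] {path.name}")

if __name__ == "__main__":
    summarize_results("results/example_pb")  # hypothetical path
```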
We provide two Gradio Demos for different application scenarios:
Emu3.5-Image Demo: an interactive interface optimized for Text-to-Image (T2I) and Any-to-Image (X2I) tasks.

CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_image.py --host 0.0.0.0 --port 7860

Emu3.5-Interleave Demo: launches the Gradio demo for the Emu3.5 interleaved tasks (Visual Guidance and Visual Narrative).

CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_interleave.py --host 0.0.0.0 --port 7860

- Image Generation: supports Text-to-Image generation and multimodal image generation
- Interleaved Generation: supports long-sequence creation with alternating image and text generation
- Multiple Aspect Ratios for T2I: 9 preset aspect ratios (4:3, 16:9, 1:1, etc.) plus auto mode
- Chain-of-Thought Display: automatically parses and formats the model's internal thinking process
- Real-time Streaming: streams text and image generation with live updates
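If you want both UIs up at once, a small launcher can start them on separate ports. The script names, the --host/--port flags, and the CUDA_VISIBLE_DEVICES variable come from the commands above; the second port, the shared GPU assignment, and the process handling are assumptions made for this sketch.

```python
# Convenience sketch: start both Gradio demos on separate ports.
# Script names and flags come from the commands above; the second port (7861)
# and the shared GPU assignment are arbitrary choices for this example. Note
# that each demo loads its own model, so memory limits may require separate GPUs.
import os
import subprocess

def launch_demos(gpus: str = "0,1", host: str = "0.0.0.0") -> None:
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpus}
    demos = [
        ("gradio_demo_image.py", 7860),
        ("gradio_demo_interleave.py", 7861),
    ]
    procs = [
        subprocess.Popen(
            ["python", script, "--host", host, "--port", str(port)], env=env
        )
        for script, port in demos
    ]
    for proc in procs:
        proc.wait()  # keep the launcher alive while the demos run

if __name__ == "__main__":
    launch_demos()
```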
- Inference Code (NTP Version)
- Advanced Image Decoder
- Discrete Diffusion Adaptation (DiDA) Inference & Weights
@misc{cui2025emu35nativemultimodalmodels,
title={Emu3.5: Native Multimodal Models are World Learners},
author={Yufeng Cui and Honghao Chen and Haoge Deng and Xu Huang and Xinghang Li and Jirong Liu and Yang Liu and Zhuoyan Luo and Jinsheng Wang and Wenxuan Wang and Yueze Wang and Chengyuan Wang and Fan Zhang and Yingli Zhao and Ting Pan and Xianduo Li and Zecheng Hao and Wenxuan Ma and Zhuo Chen and Yulong Ao and Tiejun Huang and Zhongyuan Wang and Xinlong Wang},
year={2025},
eprint={2510.26583},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.26583},
}
