
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

[📑 Technical Report (Coming Soon)]   [💜 Project Page (Demo & Benchmark)]   [🤗 Model]

¹Shanghai AI Laboratory, ²Shanghai Innovation Institute, ³Shanghai Jiao Tong University, ⁴Nanjing University

⁡The University of Sydney, ⁢The Chinese University of Hong Kong, ⁷Tsinghua University

📚 Introduction

We introduce Lumina-DiMOO, an omni foundational model for seamless multimodal generation and understanding. Lumina-DiMOO is distinguished by four key innovations:

  • Unified Discrete Diffusion Architecture: Lumina-DiMOO sets itself apart from prior unified models by employing fully discrete diffusion modeling to handle inputs and outputs across modalities.

  • Versatile Multimodal Capabilities: Lumina-DiMOO supports a broad spectrum of multimodal tasks, including text-to-image generation (at arbitrary and high resolutions), image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), and advanced image understanding.

  • Higher Sampling Efficiency: Compared with previous AR or hybrid AR-diffusion paradigms, Lumina-DiMOO demonstrates remarkable sampling efficiency. In addition, we design a bespoke caching method that further accelerates sampling by 2x.

  • Superior Performance: Lumina-DiMOO achieves state-of-the-art results on multiple benchmarks, surpassing existing open-source unified multimodal models and setting a new standard in the field.

🔥 News

  • [2025-09-12] 🎉🎉🎉 We have open-sourced the Image Inpainting & Extrapolation code.
  • [2025-09-11] 🎉🎉🎉 We have open-sourced the Max Logit-based Cache solution, offering a 2x sampling speed-up.
  • [2025-09-10] 🎉🎉🎉 We release the initial version of Lumina-DiMOO, including:
    • 🎯 Model checkpoints on HuggingFace!
    • 🎯 Text-to-image & image-to-image generation inference code!
    • 🎯 Image understanding inference code!
    • 🎯 Website & demo on the Project Page!

πŸ“ Open-Source Plan

  • Image Inpainting & Extrapolation Code
  • Fast Sampling with Max Logit-based Cache
  • Gradio Demo
  • Bechmark Evaluation Code
  • Fine-Tuning Code
  • Self-GRPO Training Code
  • Technical Report

πŸ“½οΈ Qualitative Results

Here we present some comparative generation results with other models. For additional visualization results, please see our Project Page.

Text-to-Image Comparison
Image Editing Comparison
Controllable & Subject-Driven Generation Comparison
Image Inpainting & Extrapolation

📊 Quantitative Performance

GenEval Benchmark
DPG Benchmark
OneIG-EN Benchmark
TIIF Benchmark
Image-to-Image Benchmark
Image Understanding Benchmark

🚀 Sampling Speed Analysis

  • Text generation proceeds block by block, unlike image generation, which decodes the whole image token grid globally; its speed therefore depends on both the number of blocks and the number of sampling steps. As a result, the speed-up for image understanding is less pronounced than for image generation.

  • Lumina-DiMOO settings: for image generation we sample with 64 steps; for image understanding we set the block length to 256 and the number of sampling steps to 128. A back-of-the-envelope accounting is sketched below.
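
As a rough illustration of this step accounting, the sketch below contrasts the two regimes. It is a minimal, hypothetical example assuming LLaDA-style block decoding in which the sampling-step budget is split evenly across blocks; it is not code from this repository.

# step_accounting.py -- illustrative decoding-step arithmetic (sketch only).

def total_forward_passes(gen_length: int, block_length: int, steps: int) -> int:
    """Forward passes when the step budget is split evenly across blocks."""
    num_blocks = gen_length // block_length   # blocks decoded left to right
    steps_per_block = steps // num_blocks     # per-block share of the budget
    return num_blocks * steps_per_block

# Image generation decodes the whole token grid globally: one "block", 64 steps.
print(total_forward_passes(gen_length=1024, block_length=1024, steps=64))   # 64

# Image understanding with block length 256 and 128 steps (settings above);
# longer answers add blocks, which is why understanding gains less speed-up.
print(total_forward_passes(gen_length=256, block_length=256, steps=128))    # 128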

📌 Quick Start

βš™οΈ Installation

1. Create a conda environment

git clone https://github.com/Alpha-VLLM/Lumina-DiMOO.git && cd Lumina-DiMOO
conda create -n lumina_dimoo python=3.10 -y
conda activate lumina_dimoo

2. Install dependencies

pip install -r requirements.txt
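
Before running inference, a quick environment sanity check can save a failed launch. This is a minimal sketch; it assumes PyTorch is installed via requirements.txt, which the torchrun-based scripts below rely on.

# sanity_check.py -- minimal environment check (illustrative, not part of the repo).
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")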

⛽ Text-to-Image Generation

1. Normal Sampling

python scripts/inference_t2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "A striking photograph of a glass of orange juice on a wooden kitchen table, capturing a playful moment. The orange juice splashes out of the glass and forms the word \"Smile\" in a whimsical, swirling script just above the glass. The background is softly blurred, revealing a cozy, homely kitchen with warm lighting and a sense of comfort." \
    --height 768 \
    --width 1536 \
    --timesteps 64 \
    --cfg_scale 4.0 \
    --seed 65513 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_text_to_image

2. DDP Sampling

To support large-scale sampling and testing, we also provide a DDP sampling script that supports multi-GPU parallel sampling.

torchrun --nproc_per_node=8 scripts/inference_t2i_ddp.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt_path /path/to/prompts.jsonl \
    --height 1024 \
    --width 1024 \
    --timesteps 64 \
    --cfg_scale 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image_ddp \
    --output_json output/results_image_to_image_ddp/results.json
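
The expected layout of prompts.jsonl is not documented here. Below is a minimal sketch for producing one, assuming each line is a JSON object with a "prompt" field; that field name is a guess, so check how scripts/inference_t2i_ddp.py actually loads the file.

# make_prompts.py -- writes a prompts.jsonl for DDP sampling (illustrative).
# The "prompt" key is an assumed schema, not one documented by the repo.
import json

prompts = [
    "A watercolor painting of a lighthouse at dawn.",
    "A macro photo of dew drops on a spider web.",
]

with open("prompts.jsonl", "w", encoding="utf-8") as f:
    for text in prompts:
        f.write(json.dumps({"prompt": text}) + "\n")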

3. Faster Sampling with Cache

  • Add --use-cache to accelerate sampling via the max logit-based cache (ML-Cache). The efficiency-quality trade-off can be tuned with cache_ratio (in (0,1); higher is faster), warmup_ratio (in [0,1); lower is faster), and refresh_interval (in (1, timesteps - int(warmup_ratio*timesteps) - 1]; higher is faster). A small sketch that validates these ranges follows the timing table below.
python scripts/inference_t2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "A striking photograph of a glass of orange juice on a wooden kitchen table, capturing a playful moment. The orange juice splashes out of the glass and forms the word \"Smile\" in a whimsical, swirling script just above the glass. The background is softly blurred, revealing a cozy, homely kitchen with warm lighting and a sense of comfort." \
    --height 768 \
    --width 1536 \
    --timesteps 64 \
    --cfg_scale 4.0 \
    --seed 65513 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_text_to_image_usecache \
    --use-cache \
    --cache_ratio 0.9 \
    --warmup_ratio 0.3 \
    --refresh_interval 5
  • We provide the inference time and GPU memory on one A800 as a reference:

    Method          Inference Time    GPU Memory
    Lumina-DiMOO    58.2 s            38.9 GB
    + ML-Cache      32.2 s            45.9 GB
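
As noted above, the valid ranges form a few interlocking constraints, so it can help to check them before launching a long run. The helper below is an illustrative sketch derived solely from the ranges quoted above, not repository code.

# check_cache_args.py -- validates ML-Cache hyperparameters (illustrative).

def check_cache_args(timesteps: int, cache_ratio: float,
                     warmup_ratio: float, refresh_interval: int) -> None:
    assert 0.0 < cache_ratio < 1.0, "cache_ratio must lie in (0, 1)"
    assert 0.0 <= warmup_ratio < 1.0, "warmup_ratio must lie in [0, 1)"
    warmup_steps = int(warmup_ratio * timesteps)
    max_interval = timesteps - warmup_steps - 1
    assert 1 < refresh_interval <= max_interval, (
        f"refresh_interval must lie in (1, {max_interval}]")

# Settings from the example above: int(0.3 * 64) = 19 warmup steps, so
# refresh_interval may be at most 64 - 19 - 1 = 44.
check_cache_args(timesteps=64, cache_ratio=0.9, warmup_ratio=0.3, refresh_interval=5)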

🌟 Image-to-Image Inference

1. Controllable Generation: "hed_control", "depth_control", "openpose_control", "subject_driven".

python scripts/inference_i2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "A functional wooden printer stand.Nestled next to a brick wall in a bustling city street, it stands firm as pedestrians hustle by, illuminated by the warm glow of vintage street lamps." \
    --image_path examples/example_2.jpg \
    --edit_type depth_control \
    --timesteps 64 \
    --cfg_scale 2.5 \
    --cfg_img 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image

2. Subject-Driven Generation.

python scripts/inference_i2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "A creamy, rich-flavored dark beverage.Captured in a bustling urban street at twilight, this item is placed on an outdoor cafΓ© table, as city lights begin to twinkle and passersby create a lively atmosphere." \
    --image_path examples/example_3.jpg \
    --edit_type subject_driven \
    --timesteps 64 \
    --cfg_scale 2.5 \
    --cfg_img 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image

3. Image Editing: "edit_add", "edit_remove", "edit_replace", "edit_background", "edit_text_transfer".

python scripts/inference_i2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "Add a beige shed with brown trim and double doors with a diamond pattern in the center-right, occupying more than a third of the image." \
    --image_path examples/example_4.png \
    --edit_type edit_add \
    --timesteps 64 \
    --cfg_scale 2.5 \
    --cfg_img 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image

4. Style Transfer (An Image as Style Reference)

python scripts/inference_i2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "Transform the current image into the style of the provided image." \
    --image_path examples/example_5.png \
    --ref_image_path examples/example_5_style.png \
    --edit_type image_ref_transfer \
    --timesteps 64 \
    --cfg_scale 2.5 \
    --cfg_img 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image

5. Dense Prediction: "canny_pred", "hed_pred", "depth_pred", "openpose_pred", "canny_control".

python scripts/inference_i2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "Generate a canny edge map accroding to the image." \
    --image_path examples/example_1.png \
    --edit_type canny_pred \
    --timesteps 64 \
    --cfg_scale 2.5 \
    --cfg_img 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image
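
To run all four prediction modes over the same input in one go, a simple driver loop works. This is an illustrative sketch mirroring the command above; the prompts for the non-canny modes are plausible guesses modeled on the canny example, and canny_control is omitted because, like the controllable-generation modes in item 1, it expects a descriptive scene prompt rather than a map-extraction instruction.

# run_dense_preds.py -- loops the i2i script over dense-prediction modes (illustrative).
import subprocess

# Prompts for modes other than canny_pred are assumptions patterned on the example.
MODE_PROMPTS = {
    "canny_pred": "Generate a canny edge map according to the image.",
    "hed_pred": "Generate an HED edge map according to the image.",
    "depth_pred": "Generate a depth map according to the image.",
    "openpose_pred": "Generate an OpenPose skeleton map according to the image.",
}

for mode, prompt in MODE_PROMPTS.items():
    subprocess.run([
        "python", "scripts/inference_i2i.py",
        "--checkpoint", "Alpha-VLLM/Lumina-DiMOO",
        "--prompt", prompt,
        "--image_path", "examples/example_1.png",
        "--edit_type", mode,
        "--timesteps", "64",
        "--cfg_scale", "2.5",
        "--cfg_img", "4.0",
        "--vae_ckpt", "Alpha-VLLM/Lumina-DiMOO",
        "--output_dir", "output/results_image_to_image",
    ], check=True)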

🌟 Image Inpainting & Extrapolation

1. Image Inpainting

python scripts/inference_t2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "Porsche showroom. Make there be a Porsche logo on the back wall behind the car." \
    --painting_mode inpainting \
    --painting_image examples/example_8.png \
    --mask_h_ratio 0.5 \
    --mask_w_ratio 0.5 \
    --timesteps 64 \
    --cfg_scale 4.0 \
    --seed 65513 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_text_to_image

2. Image Extrapolation

python scripts/inference_t2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "A photograph showcasing a pale gold moon, partially veiled by wispy cirrus clouds, dominating a dramatic twilight sky. The moon's soft glow reflects on the tranquil surface of a lake below, creating a shimmering mirror effect, while a small wooden rowboat gently bobs on the water's edge. Dark silhouettes of tall, ancient pine trees encircle the lake, their branches reaching towards the sky like skeletal fingers, as a gentle mist hangs low, diffusing the moonlight and adding a sense of serene mystery. The scene is bathed in soft, cool lighting, creating an ethereal and captivating atmosphere." \
    --painting_mode outpainting \
    --painting_image examples/example_7.png \
    --mask_h_ratio 1 \
    --mask_w_ratio 0.2 \
    --timesteps 64 \
    --cfg_scale 4.0 \
    --seed 65513 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_text_to_image

🌟 Image Understanding Inference

python scripts/inference_mmu.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "Please describe this image." \
    --image_path examples/example_6.jpg \
    --steps 128 \
    --gen_length 128 \
    --block_length 32 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/outputs_text_understanding

💬 Discussion

You can reach us via this WeChat QR code!


📜 Acknowledgements

This work was also supported and implemented by MindSpeed MM, an open-source training framework for large-scale multimodal models designed for distributed training, developed and maintained by Huawei's Computing Product Line. Specifically optimized for Huawei's Ascend AI chips, MindSpeed MM offers comprehensive support for distributed training and is tailored to a wide range of multimodal tasks.

🌟 Star History

Star History Chart

📖 BibTeX

@misc{lumina-dimoo,
      title={Lumina-DiMOO: A Unified Masked Diffusion Model for Multi-Modal Generation and Understanding},
      author={Alpha VLLM Team},
      year={2025},
      url={https://github.com/Alpha-VLLM/Lumina-DiMOO},
}
