Awesome Unified Multimodal Models

This is a repository for organizing papers, codes and other resources related to unified multimodal models.

🤔 What are unified multimodal models?

Traditional multimodal models can be broadly categorized into two types: multimodal understanding and multimodal generation. Unified multimodal models aim to integrate these two tasks within a single framework. Such models are also referred to as Any-to-Any generation in the community. These models operate on the principle of multimodal input and multimodal output, enabling them to process and generate content across various modalities seamlessly.

🔆 This project is still on-going, pull requests are welcomed!!

If you have any suggestions (missing papers, new papers, or typos), please feel free to edit and pull a request. Just letting us know the title of papers can also be a great contribution to us. You can do this by open issue or contact us directly via email.

⭐ If you find this repo useful, please star it!!!

Unified Multimodal Understanding and Generation

Show-o2: Improved Native Unified Multimodal Models (May. 2025, arXiv)
Emerging Properties in Unified Multimodal Pretraining (May. 2025, arXiv)
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation (May. 2025, arXiv)
BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset (May. 2025, arXiv)
Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction (May. 2025, arXiv)
Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing (Apr. 2025, arXiv)
Transfer between Modalities with MetaQueries (Apr. 2025, arXiv)
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding (Apr. 2025, arXiv)
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation (Mar. 2025, arXiv)
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement (Apr. 2025, arXiv)
UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning (Mar. 2025, arXiv)
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models (Mar. 2025, arXiv)
MMGen: Unified Multi-modal Image Generation and Understanding in One Go (Mar. 2025, arXiv)
MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation (Mar. 2025, arXiv)
UniTok: A Unified Tokenizer for Visual Generation and Understanding (Feb. 2025, arXiv)
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation (Jan. 2025, arXiv)
Dual Diffusion for Unified Image Generation and Understanding (Jan. 2025, arXiv)
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation (Dec. 2024, arXiv)
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning (Dec. 2024, arXiv)
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding (Dec. 2024, arXiv)
Liquid: Language Models are Scalable Multi-modal Generators (Dec. 2024, arXiv)
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation (Dec. 2024, arXiv)
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows (Dec. 2024, arXiv)
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads (Nov. 2024, arXiv)
JetFormer: An Autoregressive Generative Model of Raw Images and Text (Nov. 2024, arXiv)
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding (Nov. 2024, arXiv)
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation (Nov. 2024, arXiv)
Spider: Any-to-Many Multimodal LLM (Nov. 2024, arXiv)
MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding (Oct. 2024, arXiv)
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (Oct. 2024, arXiv)
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation (Oct. 2024, arXiv)
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling (Oct. 2024, arXiv)
Emu3: Next-Token Prediction is All You Need (Sep. 2024, arXiv)
MIO: A Foundation Model on Multimodal Tokens (Sep. 2024, arXiv)
MonoFormer: One Transformer for Both Diffusion and Autoregression (Sep. 2024, arXiv)
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (Sep. 2024, arXiv)
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (Aug. 2024, arXiv)
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (Aug. 2024, arXiv)
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation (Jul. 2024, arXiv)
X-VILA: Cross-Modality Alignment for Large Language Model (May. 2024, arXiv)
Chameleon: Mixed-Modal Early-Fusion Foundation Models (May 2024, arXiv)
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (Apr. 2024, arXiv)
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (Mar. 2024, arXiv)
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (Feb. 2024, arXiv)
World Model on Million-Length Video And Language With Blockwise RingAttention (Feb. 2024, arXiv)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (Feb. 2024, arXiv)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer (Jan. 2024, arXiv)
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (Dec. 2023, arXiv)
Emu2: Generative Multimodal Models are In-Context Learners (Jul. 2023, CVPR)
Gemini: A Family of Highly Capable Multimodal Models (Dec. 2023, arXiv)
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (Dec. 2023, arXiv)
DreamLLM: Synergistic Multimodal Comprehension and Creation (Dec. 2023, ICLR)
Making LLaMA SEE and Draw with SEED Tokenizer (Oct. 2023, ICLR)
NExT-GPT: Any-to-Any Multimodal LLM (Sep. 2023, ICML)
LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (Sep. 2023, ICLR)
Planting a SEED of Vision in Large Language Model (Jul. 2023, arXiv)
Emu: Generative Pretraining in Multimodality (Jul. 2023, ICLR)
CoDi: Any-to-Any Generation via Composable Diffusion (May. 2023, NeurIPS)
Multimodal unified attention networks for vision-and-language interactions (Aug. 2019)
UniMuMo: Unified Text, Music, and Motion Generation (Oct. 2024, arXiv)
MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation (Oct. 2024, arXiv)
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions (Dec. 2024, arXiv)

Acknowledgements

This template is provided by Awesome-Video-Diffusion and Awesome-MLLM-Hallucination.

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
assets		assets
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Awesome Unified Multimodal Models

🤔 What are unified multimodal models?

🔆 This project is still on-going, pull requests are welcomed!!

⭐ If you find this repo useful, please star it!!!

Unified Multimodal Understanding and Generation

Acknowledgements

About

Uh oh!

Releases

Packages

Contributors 9

Uh oh!

showlab/Awesome-Unified-Multimodal-Models

Folders and files

Latest commit

History

Repository files navigation

Awesome Unified Multimodal Models

🤔 What are unified multimodal models?

🔆 This project is still on-going, pull requests are welcomed!!

⭐ If you find this repo useful, please star it!!!

Unified Multimodal Understanding and Generation

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 9

Uh oh!

Packages