Simulating the Real World: Survey & Resources


This repository is divided into two main sections:

Our Survey Paper Collection - This section presents our survey, "Simulating the Real World: A Unified Survey of Multimodal Generative Models", which systematically unifies the study of 2D, video, 3D, and 4D generation within a single framework.

Text2X Resources – This section continues the original Awesome-Text2X-Resources, an open collection of state-of-the-art (SOTA) and novel Text-to-X (where X can be anything) methods, including papers, code, and datasets. The goal is to track the rapid progress in this field and provide researchers with up-to-date references.

⭐ If you find this repository useful for your research or work, a star is highly appreciated!

💗 This repository is continuously updated. If you find relevant papers, blog posts, videos, or other resources that should be included, feel free to submit a pull request (PR) or open an issue. Community contributions are always welcome!

Table of Contents

📜 Our Survey Paper Collection

𝐒𝐢𝐦𝐮𝐥𝐚𝐭𝐢𝐧𝐠 𝐭𝐡𝐞 𝐑𝐞𝐚𝐥 𝐖𝐨𝐫𝐥𝐝: 𝐀 𝐔𝐧𝐢𝐟𝐢𝐞𝐝 𝐒𝐮𝐫𝐯𝐞𝐲 𝐨𝐟 𝐌𝐮𝐥𝐭𝐢𝐦𝐨𝐝𝐚𝐥 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐌𝐨𝐝𝐞𝐥𝐬

arXiv: 2503.04641

Abstract

Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation, which integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D, and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions, fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.

⭐ Citation

If you find this paper and repo helpful for your research, please cite it below:

@article{hu2025simulating,
  title={Simulating the Real World: A Unified Survey of Multimodal Generative Models},
  author={Hu, Yuqi and Wang, Longguang and Liu, Xian and Chen, Ling-Hao and Guo, Yuwei and Shi, Yukai and Liu, Ce and Rao, Anyi and Wang, Zeyu and Xiong, Hui},
  journal={arXiv preprint arXiv:2503.04641},
  year={2025}
}

Paradigms

Tip

Feel free to open a pull request or contact us if you find any related papers that are not included here. The process to submit a pull request is as follows:

  • a. Fork the project into your own repository.
  • b. Add the Title, Paper link, Conference, Project/GitHub link in README.md using the following format (see the sample entry after this list):
[Origin] **Paper Title** [[Paper](Paper Link)] [[GitHub](GitHub Link)] [[Project Page](Project Page Link)]
  • c. Submit the pull request to this branch.
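
For illustration, a filled-in entry following the format above might look like the line below (all links are placeholders rather than real URLs):

[CVPR 2025] **An Example Paper Title** [[Paper](https://arxiv.org/abs/xxxx.xxxxx)] [[GitHub](https://github.com/username/repository)] [[Project Page](https://example.com)]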

2D Generation

Text-to-Image Generation.

Here are some seminal papers and models.

  • Imagen: [NeurIPS 2022] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [Paper] [Project Page]
  • DALL-E: [ICML 2021] Zero-shot text-to-image generation [Paper] [GitHub]
  • DALL-E 2: [arXiv 2022] Hierarchical Text-Conditional Image Generation with CLIP Latents [Paper]
  • DALL-E 3: [Platform Link]
  • DeepFloyd IF: [GitHub]
  • Stable Diffusion: [CVPR 2022] High-Resolution Image Synthesis with Latent Diffusion Models [Paper] [GitHub]
  • SDXL: [ICLR 2024 spotlight] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis [Paper] [GitHub]
  • FLUX.1: [Platform Link]
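
Several of the models listed above are distributed through the Hugging Face diffusers library. As a minimal usage sketch (assuming diffusers is installed, a CUDA GPU is available, and the Stable Diffusion v1.5 checkpoint can be downloaded; the model id and sampler settings are illustrative, not prescriptive):

import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained latent diffusion text-to-image pipeline (downloads weights on first use).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Sample one image from a text prompt with classifier-free guidance.
image = pipe(
    "a photorealistic photo of an astronaut riding a horse",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("astronaut.png")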

Video Generation

Text-to-video generation models adapt text-to-image frameworks to handle the additional dimension of dynamics in the real world. We group these models into three categories according to their underlying generative architecture: VAE- and GAN-based, diffusion-based, and autoregressive approaches.
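
A recurring recipe across the diffusion-based models below is to reuse a pretrained text-to-image backbone and interleave its spatial layers with temporal layers that attend across frames. The following is a minimal, hypothetical sketch of such a temporal attention block, written for illustration only; it is not the code of any specific paper listed here.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Attend across frames for every spatial token, leaving the spatial layers untouched."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim); spatial blocks process (batch*frames, tokens, dim),
        # whereas here each spatial token attends over the frame axis instead.
        b, f, n, d = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b * n, f, d)   # (batch*tokens, frames, dim)
        h = self.norm(h)
        out, _ = self.attn(h, h, h)
        out = out.reshape(b, n, f, d).permute(0, 2, 1, 3)
        return x + out                                    # residual connection

# Such blocks are typically interleaved with the (frozen or fine-tuned) per-frame
# layers of a pretrained text-to-image U-Net or diffusion transformer.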

Survey
  • [AIRC 2023] A Survey of AI Text-to-Image and AI Text-to-Video Generators [Paper]
  • [arXiv 2024] Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation [Paper]

Video Algorithms

(1) VAE- and GAN-based Approaches.

VAE-based Approaches.

GAN-based Approaches.

  • [CVPR 2018] MoCoGAN: Decomposing Motion and Content for Video Generation [Paper] [GitHub]
  • [CVPR 2022] StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 [Paper] [GitHub] [Project Page]
  • DIGAN: [ICLR 2022] Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks [Paper] [GitHub] [Project Page]
  • [ICCV 2023] StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation [Paper] [GitHub] [Project Page]
(2) Diffusion-based Approaches.

U-Net-based Architectures.

  • [NeurIPS 2022] Video Diffusion Models [Paper] [Project Page]
  • [arXiv 2022] Imagen Video: High Definition Video Generation with Diffusion Models [Paper] [Project Page]
  • [arXiv 2022] MagicVideo: Efficient Video Generation With Latent Diffusion Models [Paper] [Project Page]
  • [ICLR 2023 Poster] Make-A-Video: Text-to-Video Generation without Text-Video Data [Paper] [Project Page]
  • GEN-1: [ICCV 2023] Structure and Content-Guided Video Synthesis with Diffusion Models [Paper] [Project Page]
  • PYoCo: [ICCV 2023] Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models [Paper] [Project Page]
  • [CVPR 2023] Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [Paper] [Project Page]
  • [IJCV 2024] Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024] VideoComposer: Compositional Video Synthesis with Motion Controllability [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Spotlight] AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Make Pixels Dance: High-Dynamic Video Generation [Paper] [Project Page]
  • [ECCV 2024] Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning [Paper] [Project Page]
  • [SIGGRAPH Asia 2024] Lumiere: A Space-Time Diffusion Model for Video Generation [Paper] [Project Page]

Transformer-based Architectures.

  • [ICLR 2024 Poster] VDT: General-purpose Video Diffusion Transformers via Mask Modeling [Paper] [GitHub] [Project Page]
  • W.A.L.T: [ECCV 2024] Photorealistic Video Generation with Diffusion Models [Paper] [Project Page]
  • [CVPR 2024] Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [Paper] [Project Page]
  • [CVPR 2024] GenTron: Diffusion Transformers for Image and Video Generation [Paper] [Project Page]
  • [ICLR 2025 Poster] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [Paper] [GitHub]
  • [ICLR 2025 Spotlight] Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers [Paper] [GitHub]
(3) Autoregressive-based Approaches.
  • VQ-GAN: [CVPR 2021 Oral] Taming Transformers for High-Resolution Image Synthesis [Paper] [GitHub]
  • [CVPR 2023 Highlight] MAGVIT: Masked Generative Video Transformer [Paper] [GitHub] [Project Page]
  • [ICLR 2023 Poster] CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers [Paper] [GitHub]
  • [ICML 2024] VideoPoet: A Large Language Model for Zero-Shot Video Generation [Paper] [Project Page]
  • [ICLR 2024 Poster] Language Model Beats Diffusion - Tokenizer is key to visual generation [Paper]
  • [arXiv 2024] Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation [Paper] [GitHub]
  • [arXiv 2024] Emu3: Next-Token Prediction is All You Need [Paper] [GitHub] [Project Page]
  • [ICLR 2025 Poster] Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [Paper] [GitHub]
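
The common thread in the autoregressive line of work above is to quantize frames into discrete tokens with a VQ-GAN/MAGVIT-style tokenizer and then train a causal transformer with plain next-token prediction, exactly as in language modeling. A minimal, hypothetical training step might look like the sketch below; the vq_tokenizer and transformer interfaces are assumptions made for illustration.

import torch
import torch.nn.functional as F

def ar_training_step(video, vq_tokenizer, transformer, optimizer):
    # Discretize the video into a flat sequence of codebook indices (tokenizer is frozen).
    with torch.no_grad():
        tokens = vq_tokenizer.encode(video)              # (batch, seq_len) integer codes
    # Teacher forcing: predict token t+1 from tokens 0..t.
    logits = transformer(tokens[:, :-1])                 # (batch, seq_len-1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()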

Video Applications

Video Editing.
  • [ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [Paper] [GitHub] [Project Page]
  • [ICCV 2023] Pix2Video: Video Editing using Image Diffusion [Paper] [GitHub] [Project Page]
  • [CVPR 2024] VidToMe: Video Token Merging for Zero-Shot Video Editing [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Video-P2P: Video Editing with Cross-attention Control [Paper] [GitHub] [Project Page]
  • [CVPR 2024 Highlight] CoDeF: Content Deformation Fields for Temporally Consistent Video Processing [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024] Towards Consistent Video Editing with Text-to-Image Diffusion Models [Paper]
  • [ICLR 2024 Poster] Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [Paper] [GitHub] [Project Page]
  • [arXiv 2024] UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing [Paper] [GitHub] [Project Page]
  • [TMLR 2024] AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks [Paper] [GitHub] [Project Page]
Novel View Synthesis.
  • [arXiv 2024] ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis [Paper] [GitHub] [Project Page]
  • [CVPR 2024 Highlight] ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models [Paper] [GitHub] [Project Page]
  • [ICLR 2025 Poster] CameraCtrl: Enabling Camera Control for Video Diffusion Models [Paper] [GitHub] [Project Page]
  • [ICLR 2025 Poster] NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer [Paper] [GitHub]
Human Animation in Videos.
  • [ICCV 2019] Everybody Dance Now [Paper] [GitHub] [Project Page]
  • [ICCV 2019] Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis [Paper] [GitHub] [Project Page] [Dataset]
  • [NeurIPS 2019] First Order Motion Model for Image Animation [Paper] [GitHub] [Project Page]
  • [ICCV 2023] Adding Conditional Control to Text-to-Image Diffusion Models [Paper] [GitHub]
  • [ICCV 2023] HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation [Paper] [GitHub] [Project Page]
  • [CVPR 2023] Learning Locally Editable Virtual Humans [Paper] [GitHub] [Project Page] [Dataset]
  • [CVPR 2024] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation [Paper] [GitHub] [Project Page]
  • [CVPRW 2024] LatentMan: Generating Consistent Animated Characters using Image Diffusion Models [Paper] [GitHub] [Project Page]
  • [IJCAI 2024] Zero-shot High-fidelity and Pose-controllable Character Animation [Paper]
  • [arXiv 2024] UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [Paper] [GitHub] [Project Page]
  • [arXiv 2024] MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling [Paper] [GitHub] [Project Page]

3D Generation

3D Algorithms

Text-to-3D Generation.
Survey
  • [arXiv 2023] Generative AI meets 3D: A Survey on Text-to-3D in AIGC Era [Paper]
  • [arXiv 2024] Advances in 3D Generation: A Survey [Paper]
  • [arXiv 2024] A Survey On Text-to-3D Contents Generation In The Wild [Paper]
Feedforward Approaches.
  • [arXiv 2022] 3D-LDM: Neural Implicit 3D Shape Generation with Latent Diffusion Models [Paper] [GitHub]
  • [arXiv 2022] Point-E: A System for Generating 3D Point Clouds from Complex Prompts [Paper] [GitHub]
  • [arXiv 2023] Shap-E: Generating Conditional 3D Implicit Functions [Paper] [GitHub]
  • [NeurIPS 2023] Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation [Paper] [GitHub] [Project Page]
  • [ICCV 2023] ATT3D: Amortized Text-to-3D Object Synthesis [Paper] [Project Page]
  • [ICLR 2023 Spotlight] MeshDiffusion: Score-based Generative 3D Mesh Modeling [Paper] [GitHub] [Project Page]
  • [CVPR 2023] Diffusion-SDF: Text-to-Shape via Voxelized Diffusion [Paper] [GitHub] [Project Page]
  • [ICML 2024] HyperFields: Towards Zero-Shot Generation of NeRFs from Text [Paper] [GitHub] [Project Page]
  • [ECCV 2024] LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis [Paper] [Project Page]
  • [arXiv 2024] AToM: Amortized Text-to-Mesh using 2D Diffusion [Paper] [GitHub] [Project Page]
Optimization-based Approaches.
  • [ICLR 2023 notable top 5%] DreamFusion: Text-to-3D using 2D Diffusion [Paper] [Project Page]
  • [CVPR 2023 Highlight] Magic3D: High-Resolution Text-to-3D Content Creation [Paper] [Project Page]
  • [CVPR 2023] Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models [Paper] [Project Page]
  • [ICCV 2023] Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation [Paper] [GitHub] [Project Page]
  • [NeurIPS 2023 Spotlight] ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Poster] MVDream: Multi-view Diffusion for 3D Generation [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Oral] DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation [Paper] [GitHub] [Project Page]
  • [CVPR 2024] PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion [Paper]
  • [CVPR 2024] VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation [Paper] [Project Page]
  • [CVPR 2024] GSGEN: Text-to-3D using Gaussian Splatting [Paper] [GitHub] [Project Page]
  • [CVPR 2024] GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior [Paper] [GitHub] [Project Page]
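
Many of the optimization-based methods above, starting with DreamFusion, optimize a 3D representation through Score Distillation Sampling (SDS): a rendered view is noised, a frozen text-conditioned 2D diffusion model predicts the noise, and the residual is back-propagated through the renderer into the 3D parameters. The sketch below is purely illustrative; every function name is a hypothetical stand-in (render for any differentiable renderer such as a NeRF or 3D Gaussians, eps_pred for the frozen diffusion noise predictor).

import torch

def sds_step(params, render, eps_pred, text_emb, alphas_cumprod, optimizer):
    # Render the current 3D representation from a (randomly sampled) camera.
    image = render(params)                                 # (1, 3, H, W), requires grad
    t = torch.randint(20, 980, (1,), device=image.device)  # random diffusion timestep
    alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(image)
    noisy = alpha_bar.sqrt() * image + (1 - alpha_bar).sqrt() * noise

    with torch.no_grad():                                  # the 2D diffusion prior stays frozen
        eps_hat = eps_pred(noisy, t, text_emb)

    w = 1 - alpha_bar                                      # a common timestep weighting
    grad = w * (eps_hat - noise)                           # SDS gradient w.r.t. the rendered image
    # Surrogate loss whose gradient w.r.t. `image` equals `grad`; it flows only through the renderer.
    loss = (grad.detach() * image).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()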
MVS-based Approaches.
  • [ICLR 2024 Poster] Instant3D: Fast Text-to-3D with Sparse-view Generation and Large Reconstruction Model [Paper] [Project Page]
  • [CVPR 2024] Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior [Paper] [GitHub] [Project Page]
Image-to-3D Generation.
Feedforward Approaches.
  • [arXiv 2023] 3DGen: Triplane Latent Diffusion for Textured Mesh Generation [Paper]
  • [NeurIPS 2023] Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024] Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer [Paper] [GitHub] [Project Page]
  • [SIGGRAPH 2024 Best Paper Honorable Mention] CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets [Paper] [GitHub] [Project Page]
  • [arXiv 2024] CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner [Paper] [GitHub] [Project Page]
  • [arXiv 2024] Structured 3D Latents for Scalable and Versatile 3D Generation [Paper] [GitHub] [Project Page]
Optimization-based Approaches.
  • [arXiv 2023] Consistent123: Improve Consistency for One Image to 3D Object Synthesis [Paper] [Project Page]
  • [arXiv 2023] ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation [Paper] [GitHub] [Project Page]
  • [CVPR 2023] RealFusion: 360° Reconstruction of Any Object from a Single Image [Paper] [GitHub] [Project Page]
  • [ICCV 2023] Zero-1-to-3: Zero-shot One Image to 3D Object [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Poster] Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Poster] TOSS: High-quality Text-guided Novel View Synthesis from a Single Image [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Spotlight] SyncDreamer: Generating Multiview-consistent Images from a Single-view Image [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Wonder3D: Single Image to 3D using Cross-Domain Diffusion [Paper] [GitHub] [Project Page]
  • [ICLR 2025] IPDreamer: Appearance-Controllable 3D Object Generation with Complex Image Prompts [Paper] [GitHub]
MVS-based Approaches.
  • [NeurIPS 2023] One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization [Paper] [GitHub] [Project Page]
  • [ECCV 2024] CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model [Paper] [GitHub] [Project Page]
  • [arXiv 2024] InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models [Paper] [GitHub]
  • [ICLR 2024 Oral] LRM: Large Reconstruction Model for Single Image to 3D [Paper] [Project Page]
  • [NeurIPS 2024] Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image [Paper] [GitHub] [Project Page]
Video-to-3D Generation.
  • [CVPR 2024 Highlight] ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models [Paper] [GitHub] [Project Page]
  • [ICML 2024] IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation [Paper] [Project Page]
  • [arXiv 2024] V3D: Video Diffusion Models are Effective 3D Generators [Paper] [GitHub] [Project Page]
  • [ECCV 2024 Oral] SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image Using Latent Video Diffusion [Paper] [Project Page]
  • [NeurIPS 2024 Oral] CAT3D: Create Anything in 3D with Multi-View Diffusion Models [Paper] [Project Page]

3D Applications

Avatar Generation.
  • [CVPR 2023] Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation [Paper]
  • [SIGGRAPH 2023] DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance [Paper] [Project Page]
  • [NeurIPS 2023] Headsculpt: Crafting 3d head avatars with text [Paper] [GitHub] [Project Page]
  • [NeurIPS 2023] DreamWaltz: Make a Scene with Complex 3D Animatable Avatars [Paper] [GitHub] [Project Page]
  • [NeurIPS 2023 Spotlight] DreamHuman: Animatable 3D Avatars from Text [Paper] [Project Page]
Scene Generation.
  • [ACM MM 2023] RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture [Paper]
  • [TVCG 2024] Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields [Paper] [GitHub] [Project Page]
  • [ECCV 2024] DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling [Paper] [GitHub] [Project Page]
  • [ECCV 2024] DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting [Paper] [GitHub] [Project Page]
  • [arXiv 2024] Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior [Paper] [GitHub] [Project Page]
  • [arXiv 2024] CityCraft: A Real Crafter for 3D City Generation [Paper] [GitHub]
3D Editing.

4D Generation

4D Algorithms

Feedforward Approaches.
Optimization-based Approaches.
  • [arXiv 2023] Text-To-4D Dynamic Scene Generation [Paper] [Project Page]
  • [CVPR 2024] 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling [Paper] [GitHub] [Project Page]
  • [CVPR 2024] A Unified Approach for Text- and Image-guided 4D Scene Generation [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models [Paper] [Project Page]
  • [ECCV 2024] TC4D: Trajectory-Conditioned Text-to-4D Generation [Paper] [GitHub] [Project Page]
  • [ECCV 2024] SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer [Paper] [GitHub] [Project Page]
  • [ECCV 2024] STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024] 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models [Paper] [Project Page]
  • [NeurIPS 2024] Compositional 3D-aware Video Generation with LLM Director [Paper] [Project Page]
  • [NeurIPS 2024] DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024] DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation [Paper] [GitHub] [Project Page]
  • [arXiv 2024] Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis [Paper] [GitHub]

4D Applications

4D Editing.
  • [CVPR 2024] Control4D: Efficient 4D Portrait Editing with Text [Paper] [Project Page]
  • [CVPR 2024] Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion [Paper] [GitHub] [Project Page]
Human Animation.
  • [SIGGRAPH 2020] Robust Motion In-betweening [Paper]
  • [CVPR 2022] Generating Diverse and Natural 3D Human Motions from Text [Paper] [GitHub] [Project Page]
  • [SCA 2023] Motion In-Betweening with Phase Manifolds [Paper] [GitHub]
  • [CVPR 2023] T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations [Paper] [GitHub] [Project Page]
  • [ICLR 2023 notable top 25%] Human Motion Diffusion Model [Paper] [GitHub] [Project Page]
  • [NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language [Paper] [GitHub] [Project Page]
  • [ICML 2024] HumanTOMATO: Text-aligned Whole-body Motion Generation [Paper] [GitHub] [Project Page]
  • [CVPR 2024] MoMask: Generative Masked Modeling of 3D Human Motions [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives [Paper] [GitHub] [Project Page]

Other Related Resources

World Foundation Model Platform

  • NVIDIA Cosmos ([GitHub] [Paper]): a world foundation model platform for accelerating the development of physical AI systems.

    • Cosmos-Transfer1: a world-to-world transfer model designed to bridge the perceptual divide between simulated and real-world environments.
    • Cosmos-Predict1: a collection of general-purpose world foundation models for Physical AI that can be fine-tuned into customized world models for downstream applications.
    • Cosmos-Reason1: a model that understands physical common sense and generates appropriate embodied decisions in natural language through long chain-of-thought reasoning.

🔥 Awesome Text2X Resources

An open collection of state-of-the-art (SOTA) and novel Text-to-X (where X can be anything) methods, including papers, code, and datasets, intended to keep pace with the rapid progress in this field.


Update Logs

  • 2025.03.10 - CVPR 2025 Accepted Papers🎉
  • 2025.02.28 - update several papers status "CVPR 2025" to accepted papers, congrats to all 🎉
2025 Update Logs:
  • 2025.01.23 - update several papers status "ICLR 2025" to accepted papers, congrats to all 🎉
  • 2025.01.09 - update layout.

Previous 2024 Update Logs:
  • 2024.12.21 - adjusted the layouts of several sections, and Happy Winter Solstice ⚪🥣.
  • 2024.09.26 - update several papers status "NeurIPS 2024" to accepted papers, congrats to all 🎉
  • 2024.09.03 - add one new section "Text to Model".
  • 2024.07.02 - update several papers status "ECCV 2024" to accepted papers, congrats to all 🎉
  • 2024.06.30 - add one new section "Text to Video".
  • 2024.06.21 - add one hot topic, AIGC 4D Generation, to the Survey and Awesome Repos section.
  • 2024.06.17 - an awesome repo for CVPR 2024: https://github.com/52CV/CVPR-2024-Papers 👍🏻
  • 2024.04.05 - adjusted the layout and added accepted lists and arXiv lists to each section.
  • 2024.04.05 - an awesome repo for CVPR 2024 on 3DGS and NeRF: https://github.com/Yubel426/NeRF-3DGS-at-CVPR-2024 👍🏻
  • 2024.03.25 - add one new survey paper on 3D GS to the section "Survey and Awesome Repos – Topic 1: 3D Gaussian Splatting".
  • 2024.03.12 - add a new section "Dynamic Gaussian Splatting", including Neural Deformable 3D Gaussians, 4D Gaussians, and Dynamic 3D Gaussians.
  • 2024.03.11 - CVPR 2024 Accepted Papers: https://cvpr.thecvf.com/Conferences/2024/AcceptedPapers; update some papers accepted by CVPR 2024! Congratulations 🎉

Text to 4D

(Also, Image/Video to 4D)

🎉 4D Accepted Papers

Year | Title | Venue | Paper | Code | Project Page
2025 | GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking | CVPR 2025 | Link | Link | Link
Accepted Papers References
%accepted papers

@article{bian2025gsdit,
  title={GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking},
  author={Bian, Weikang and Huang, Zhaoyang and Shi, Xiaoyu and Li, Yijin and Wang, Fu-Yun and Li, Hongsheng},
  journal={arXiv preprint arXiv:2501.02690},
  year={2025}
}

💡 4D ArXiv Papers

1. AR4D: Autoregressive 4D Generation from Monocular Videos

Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, Jiang Bian (University of Science and Technology of China, Microsoft Research Asia)

Abstract Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, typically leading to issues such as limited diversity, spatial-temporal inconsistency and poor prompt alignment, due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. To begin with, for a monocular video that is either generated or captured, we first utilize pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Subsequently, motivated by the fact that videos happen naturally in an autoregressive manner, we propose to generate each frame's 3D representation based on its previous frame's representation, as this autoregressive generation manner can facilitate more accurate geometry and motion estimation. Meanwhile, to prevent overfitting during this process, we introduce a progressive view sampling strategy, utilizing priors from pre-trained large-scale 3D reconstruction models. To avoid appearance drift introduced by autoregressive generation, we further incorporate a refinement stage based on a global deformation field and the geometry of each frame's 3D representation. Extensive experiments have demonstrated that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.

2. WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes

Ling Yang, Kaixin Zhu, Juanxi Tian, Bohan Zeng, Mingbao Lin, Hongjuan Pei, Wentao Zhang, Shuicheng Yan (Peking University, University of the Chinese Academy of Sciences, National University of Singapore)

Abstract With the rapid development of 3D reconstruction technology, research in 4D reconstruction is also advancing, existing 4D reconstruction methods can generate high-quality 4D scenes. However, due to the challenges in acquiring multi-view video data, the current 4D reconstruction benchmarks mainly display actions performed in place, such as dancing, within limited scenarios. In practical scenarios, many scenes involve wide-range spatial movements, highlighting the limitations of existing 4D reconstruction datasets. Additionally, existing 4D reconstruction methods rely on deformation fields to estimate the dynamics of 3D objects, but deformation fields struggle with wide-range spatial movements, which limits the ability to achieve high-quality 4D scene reconstruction with wide-range spatial movements. In this paper, we focus on 4D scene reconstruction with significant object spatial movements and propose a novel 4D reconstruction benchmark, WideRange4D. This benchmark includes rich 4D scene data with large spatial variations, allowing for a more comprehensive evaluation of the generation capabilities of 4D generation methods. Furthermore, we introduce a new 4D reconstruction method, Progress4D, which generates stable and high-quality 4D results across various complex 4D scene reconstruction tasks. We conduct both quantitative and qualitative comparison experiments on WideRange4D, showing that our Progress4D outperforms existing state-of-the-art 4D reconstruction methods.

Year | Title | ArXiv Time | Paper | Code | Project Page
2025 | AR4D: Autoregressive 4D Generation from Monocular Videos | 3 Jan 2025 | Link | -- | Link
2025 | WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes | 17 Mar 2025 | Link | Link | Dataset Page
ArXiv Papers References
%arxiv papers

@misc{zhu2025ar4dautoregressive4dgeneration,
      title={AR4D: Autoregressive 4D Generation from Monocular Videos}, 
      author={Hanxin Zhu and Tianyu He and Xiqian Yu and Junliang Guo and Zhibo Chen and Jiang Bian},
      year={2025},
      eprint={2501.01722},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.01722}, 
}

@article{yang2025widerange4d,
  title={WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes},
  author={Yang, Ling and Zhu, Kaixin and Tian, Juanxi and Zeng, Bohan and Lin, Mingbao and Pei, Hongjuan and Zhang, Wentao and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2503.13435},
  year={2025}
}

Previous Papers

Year 2023

In 2023, tasks classified as text/image-to-4D and video-to-4D generally involve producing four-dimensional data from text/image or video input. For more details, please check the 2023 4D Papers, including 6 accepted papers and 3 arXiv papers.

Year 2024

For more details, please check the 2024 4D Papers, including 20 accepted papers and 14 arXiv papers.


Text to Video

🎉 T2V Accepted Papers

Year | Title | Venue | Paper | Code | Project Page
2025 | TransPixar: Advancing Text-to-Video Generation with Transparency | CVPR 2025 | Link | Link | Link
2025 | BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations | CVPR 2025 | Link | -- | Link
Accepted Papers References
%accepted papers

@misc{wang2025transpixar,
     title={TransPixar: Advancing Text-to-Video Generation with Transparency}, 
     author={Luozhou Wang and Yijun Li and Zhifei Chen and Jui-Hsien Wang and Zhifei Zhang and He Zhang and Zhe Lin and Yingcong Chen},
     year={2025},
     eprint={2501.03006},
     archivePrefix={arXiv},
     primaryClass={cs.CV},
     url={https://arxiv.org/abs/2501.03006}, 
}

@article{feng2025blobgen,
  title={BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations},
  author={Feng, Weixi and Liu, Chao and Liu, Sifei and Wang, William Yang and Vahdat, Arash and Nie, Weili},
  journal={arXiv preprint arXiv:2501.07647},
  year={2025}
}

💡 T2V ArXiv Papers

1. Multi-subject Open-set Personalization in Video Generation

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov

(Snap Inc., UC Merced, CMU)

Abstract Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist − a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
Year | Title | ArXiv Time | Paper | Code | Project Page
2025 | Multi-subject Open-set Personalization in Video Generation | 10 Jan 2025 | Link | -- | Link
ArXiv Papers References
%arxiv papers

@misc{chen2025multisubjectopensetpersonalizationvideo,
      title={Multi-subject Open-set Personalization in Video Generation}, 
      author={Tsai-Shien Chen and Aliaksandr Siarohin and Willi Menapace and Yuwei Fang and Kwot Sin Lee and Ivan Skorokhodov and Kfir Aberman and Jun-Yan Zhu and Ming-Hsuan Yang and Sergey Tulyakov},
      year={2025},
      eprint={2501.06187},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.06187}, 
}


Video Other Additional Info

Previous Papers

Year 2024

For more details, please check the 2024 T2V Papers, including 20 accepted papers and 7 arXiv papers.

  • OSS video generation models: Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence.
  • Survey: The Dawn of Video Generation: Preliminary Explorations with SORA-like Models, arXiv, Project Page, GitHub Repo

📚 Dataset Works

1. VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Hao Li

(Fudan University, Shanghai Academy of AI for Science)

Abstract The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, leads to a high computational load and leaves behind unclean data. As a result, there is a lack of appropriate training datasets for text-to-video models. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.
Year | Title | ArXiv Time | Paper | Code | Project Page
2024 | VidGen-1M: A Large-Scale Dataset for Text-to-video Generation | 5 Aug 2024 | Link | Link | Link
References
%arxiv papers

@article{tan2024vidgen,
  title={VidGen-1M: A Large-Scale Dataset for Text-to-video Generation},
  author={Tan, Zhiyu and Yang, Xiaomeng and Qin, Luozheng and Li, Hao},
  journal={arXiv preprint arXiv:2408.02629},
  year={2024}
}



Text to Scene

💡 3D Scene ArXiv Papers

1. LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation

Yang Zhou, Zongjin He, Qixuan Li, Chao Wang (Shanghai University)

Abstract Recently, the field of text-guided 3D scene generation has garnered significant attention. High-quality generation that aligns with physical realism and high controllability is crucial for practical 3D scene applications. However, existing methods face fundamental limitations: (i) difficulty capturing complex relationships between multiple objects described in the text, (ii) inability to generate physically plausible scene layouts, and (iii) lack of controllability and extensibility in compositional scenes. In this paper, we introduce LayoutDreamer, a framework that leverages 3D Gaussian Splatting (3DGS) to facilitate high-quality, physically consistent compositional scene generation guided by text. Specifically, given a text prompt, we convert it into a directed scene graph and adaptively adjust the density and layout of the initial compositional 3D Gaussians. Subsequently, dynamic camera adjustments are made based on the training focal point to ensure entity-level generation quality. Finally, by extracting directed dependencies from the scene graph, we tailor physical and layout energy to ensure both realism and flexibility. Comprehensive experiments demonstrate that LayoutDreamer outperforms other compositional scene generation quality and semantic alignment methods. Specifically, it achieves state-of-the-art (SOTA) performance in the multiple objects generation metric of T3Bench.
Year | Title | ArXiv Time | Paper | Code | Project Page
2025 | LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation | 4 Feb 2025 | Link | -- | --
ArXiv Papers References
%arxiv papers

@article{zhou2025layoutdreamer,
  title={LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation},
  author={Zhou, Yang and He, Zongjin and Li, Qixuan and Wang, Chao},
  journal={arXiv preprint arXiv:2502.01949},
  year={2025}
}

Previous Papers

Year 2023-2024

For more details, please check the 2023-2024 3D Scene Papers, including 21 accepted papers and 10 arXiv papers.


Text to Human Motion

💡 Motion ArXiv Papers

1. MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

Ziyan Guo, Zeyu Hu, Na Zhao, De Wen Soh

(Singapore University of Technology and Design, LightSpeed Studios)

Abstract Human motion generation and editing are key components of computer graphics and vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities, fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce the 1) MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion.
Year | Title | ArXiv Time | Paper | Code | Project Page
2025 | MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm | 6 Feb 2025 | Link | Link | Link
ArXiv Papers References
%arxiv papers

@article{guo2025motionlab,
  title={MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm},
  author={Guo, Ziyan and Hu, Zeyu and Zhao, Na and Soh, De Wen},
  journal={arXiv preprint arXiv:2502.02358},
  year={2025}
}


Motion Other Additional Info

Previous Papers

Year 2023-2024

For more details, please check the 2023-2024 Text to Human Motion Papers, including 32 accepted papers and 12 arXiv papers.

📚 Dataset Works

Datasets

Motion | Info | URL | Others
AIST | AIST Dance Motion Dataset | Link | --
AIST++ | AIST++ Dance Motion Dataset | Link | dance video database with SMPL annotations
AMASS | optical marker-based motion capture datasets | Link | --

Additional Info

AMASS

AMASS is a large database of human motion unifying different optical marker-based motion capture datasets by representing them within a common framework and parameterization. AMASS is readily useful for animation, visualization, and generating training data for deep learning.
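
As a rough sketch of what working with AMASS looks like in practice, the snippet below reads one motion sequence; the key names reflect the commonly distributed .npz format, but they should be verified against the specific release you download, and the file path is a placeholder.

import numpy as np

seq = np.load("path/to/amass_sequence.npz")   # placeholder path to one AMASS sequence
poses = seq["poses"]                          # (num_frames, 156) axis-angle SMPL-family pose parameters
trans = seq["trans"]                          # (num_frames, 3) root translation
betas = seq["betas"]                          # (16,) body shape coefficients
fps = float(seq["mocap_framerate"])           # capture frame rate
print(poses.shape, trans.shape, fps)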

Survey


Text to 3D Human

💡 Human ArXiv Papers

1. Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars

Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Nießner, Shunsuke Saito

(Technical University of Munich, Meta Reality Labs)

Abstract Traditionally, creating photo-realistic 3D head avatars requires a studio-level multi-view capture setup and expensive optimization during test-time, limiting the use of digital human doubles to the VFX industry or offline renderings. To address this shortcoming, we present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images, vastly reducing compute requirements during inference. More specifically, we make Large Reconstruction Models animatable and learn a powerful prior over 3D human heads from a large multi-view video dataset. For better 3D head reconstructions, we employ position maps from DUSt3R and generalized feature maps from the human foundation model Sapiens. To animate the 3D head, our key discovery is that simple cross-attention to an expression code is already sufficient. Finally, we increase robustness by feeding input images with different expressions to our model during training, enabling the reconstruction of 3D head avatars from inconsistent inputs, e.g., an imperfect phone capture with accidental movement, or frames from a monocular video. We compare Avat3r with current state-of-the-art methods for few-input and single-input scenarios, and find that our method has a competitive advantage in both tasks. Finally, we demonstrate the wide applicability of our proposed model, creating 3D head avatars from images of different sources, smartphone captures, single images, and even out-of-domain inputs like antique busts.
Year | Title | ArXiv Time | Paper | Code | Project Page
2025 | Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars | 27 Feb 2025 | Link | -- | Link
ArXiv Papers References
%arxiv papers

@misc{kirschstein2025avat3r,
      title={Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars},
      author={Tobias Kirschstein and Javier Romero and Artem Sevastopolsky and Matthias Nie\ss{}ner and Shunsuke Saito},
      year={2025},
      eprint={2502.20220},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20220},
}

Additional Info

Previous Papers

Year 2023-2024

For more details, please check the 2023-2024 3D Human Papers, including 17 accepted papers and 6 arXiv papers.

Survey and Awesome Repos

Survey

Awesome Repos

Pretrained Models
Pretrained Models (human body) | Info | URL
SMPL | smpl model (smpl weights) | Link
SMPL-X | smpl model (smpl weights) | Link
human_body_prior | vposer model (smpl weights) | Link
SMPL

SMPL is an easy-to-use, realistic model of the human body that is useful for animation and computer vision.

  • version 1.0.0 for Python 2.7 (female/male, 10 shape PCs)
  • version 1.1.0 for Python 2.7 (female/male/neutral, 300 shape PCs)
  • UV map in OBJ format
SMPL-X

SMPL-X extends SMPL with fully articulated hands and facial expressions (55 joints, 10,475 vertices).
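
For reference, below is a hedged sketch of loading and posing an SMPL-X body with the widely used smplx Python package; the model files must be downloaded separately from the official site, and the model path here is a placeholder.

import torch
import smplx

# Create a neutral SMPL-X body model (expects the downloaded model files under `model_path`).
model = smplx.create(
    model_path="models/",      # placeholder: folder containing the SMPL-X model files
    model_type="smplx",
    gender="neutral",
    use_pca=False,
)

# A zero pose and zero shape give the canonical T-pose body.
output = model(
    betas=torch.zeros(1, 10),          # body shape coefficients
    body_pose=torch.zeros(1, 21 * 3),  # axis-angle rotations for the 21 body joints
    global_orient=torch.zeros(1, 3),   # root orientation
)
vertices = output.vertices             # (1, 10475, 3) posed mesh vertices
joints = output.joints                 # 3D joint locations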


Related Resources

Text to 'other tasks'

(Here, other tasks refer to CAD, Model, Music, etc.)

Text to CAD

  • 2024 | CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM | arXiv 7 Nov 2024 | Paper | Code | Project Page
  • 2024 | Text2CAD: Generating Sequential CAD Designs from Beginner-to-Expert Level Text Prompts | NeurIPS 2024 Spotlight | Paper | Project Page

Text to Music

Text to Model

  • 2024 | Text-to-Model: Text-Conditioned Neural Network Diffusion for Train-Once-for-All Personalization | arXiv 23 May 2024 | Paper

Survey and Awesome Repos

🔥 Topic 1: 3D Gaussian Splatting

Survey

Awesome Repos

🔥 Topic 2: AIGC 3D

Survey

Awesome Repos

Benchmarks

🔥 Topic 3: LLM 3D

Awesome Repos

3D Human

🔥 Topic 4: AIGC 4D

Awesome Repos

Dynamic Gaussian Splatting
Neural Deformable 3D Gaussians

(CVPR 2024) Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction Paper Code Page

(CVPR 2024) 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering Paper Code Page

(CVPR 2024) SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes Paper Code Page

(CVPR 2024, Highlight) 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos Paper Code Page

4D Gaussians

(ArXiv 2024.02.07) 4D Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes Paper

(ICLR 2024) Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting Paper Code Page

Dynamic 3D Gaussians

(CVPR 2024) Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle Paper Page

(3DV 2024) Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis Paper Code Page
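
Conceptually, the neural deformable and dynamic variants above share one idea: keep a canonical set of 3D Gaussians and learn a time-conditioned deformation that moves and reshapes them per frame. The module below is a purely illustrative, hypothetical sketch of such a deformation field, not the code of any listed paper.

import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Map a canonical Gaussian center and a timestamp to offsets of position, rotation, and scale."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),  # Δposition (3), Δrotation quaternion (4), Δscale (3)
        )

    def forward(self, xyz: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) canonical Gaussian centers, t: (N, 1) normalized time in [0, 1]
        offsets = self.net(torch.cat([xyz, t], dim=-1))
        return offsets  # added to the canonical Gaussian parameters before rasterization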


License

This repo is released under the MIT license.

✉️ Any additions or suggestions, feel free to contact us.