Vision-to-Music Generation: A Survey (ISMIR 2025)
Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao
We provide a comprehensive survey on vision-to-music generation (V2M), including video-to-music and image-to-music generation. This survey aims to inspire further innovation in vision-to-music generation and the broader field of AI music generation in both academic research and industrial applications. In this repository, we have listed relevant papers related to methods, datasets, and evaluation of V2M. Notably, we list demo links for all papers. This collection will be continuously updated.
Methods taking general videos or images as input:

| Method | Paper Link | Demo Link | Date | Input Type | Modality | Music Length | Vision-Music Relationships | Vision Encoding | Vision-Music Projection | Music Generation |
|---|---|---|---|---|---|---|---|---|---|---|
| CMT | Paper | Demo | 2021/11 | General Video | Symbolic | 3min | Rhythm | - | Elements | AR (CP) |
| V-MusProd | Paper | Demo | 2022/11 | General Video | Symbolic | 6min | Semantics, Rhythm | CLIP2Video, Histogan | Feature | AR (CP) |
| V2Meow | Paper | Demo | 2023/05 | General Video | Audio | 10sec | Semantics, Rhythm | CLIP, I3D Flow, ViT-VQGAN | Feature | AR |
| MuMu-LLaMA | Paper | Demo | 2023/11 | General Video, Image | Audio | 30sec | Semantics | ViT, ViViT | Adapter | AR (LLaMA2) |
| Video2Music | Paper | Demo | 2023/11 | General Video | Symbolic | 5min | Semantics, Rhythm | CLIP | Feature | AR |
| EIMG | Paper | Demo | 2023/12 | Image | Symbolic | 15sec | Semantics | ALAE, β-VAE, VQ-VAE | Adapter | VAE (FNT, LSR) |
| Diff-BGM | Paper | Demo | 2024/05 | General Video | Symbolic | 5min | Semantics | VideoCLIP | Feature | Diff. (Polyffusion) |
| Mozart's Touch | Paper | Demo | 2024/05 | General Video, Image | Audio | 10sec | Semantics | BLIP | Text | AR (MusicGen) |
| MeLFusion | Paper | Demo | 2024/06 | Image | Audio | 10sec | Semantics | DDIM + T2I LDM | Feature | Diff. |
| VidMuse | Paper | Demo | 2024/06 | General Video | Audio | 20sec | Semantics | CLIP | Adapter | AR (MusicGen) |
| S2L2-V2M | Paper | Demo | 2024/08 | General Video | Audio | 10sec | Semantics | Enhanced Video Mamba | Adapter | AR (LLaMA2) |
| VMAS | Paper | Demo | 2024/09 | General Video | Audio | 10sec | Semantics, Rhythm | Hiera | Feature | AR |
| MuVi | Paper | Demo | 2024/10 | General Video | Audio | 20sec | Semantics, Rhythm | VideoMAE V2 | Adapter | Diff. (DiT) |
| SONIQUE | Paper | Demo | 2024/10 | General Video | Audio | 20sec | Semantics, Rhythm | Video-LLaMA, CLAP | Text | Diff. (Stable Audio) |
| VEH | Paper | - | 2024/10 | General Video | Symbolic | 30sec | Semantics | VideoChat | Text | AR (T5) |
| M2M-Gen | Paper | Demo | 2024/10 | Image (Manga) | Audio | 1min | Semantics | CLIP, GPT-4 | Text | AR (MusicLM) |
| HPM | Paper | Demo | 2024/11 | General Video | Audio | 10sec | Semantics | CLIP, TAVAR, WECL | Feature | Diff. (AudioLDM) |
| VidMusician | Paper | Demo | 2024/12 | General Video | Audio | 30sec | Semantics, Rhythm | CLIP, T5 | Adapter | AR (MusicGen) |
| MTM | Paper | Demo | 2024/12 | General Video, Image | Audio | 30sec | Semantics | InternVL2 | Text | Diff. (Stable Audio Open) |
| XMusic | Paper | Demo | 2025/01 | General Video, Image | Symbolic | 20sec | Semantics, Rhythm | ResNet, CLIP | Elements | AR (CP) |
| GVMGen | Paper | Demo | 2025/01 | General Video | Audio | 15sec | Semantics | CLIP | Adapter | AR (MusicGen) |
| AudioX | Paper | Demo | 2025/03 | General Video | Audio | 10sec | Semantics | CLIP | Feature | Diff. (Stable Audio Open) |
| FilmComposer | Paper | Demo | 2025/03 | General Video | Audio | 15sec | Semantics, Rhythm | Controllable Rhythm Transformer, GPT-4v, Motion Detector | Text | AR (MusicGen) |
| DyViM | Paper | - | 2025/04 | General Video | Audio | 10sec | Semantics, Dynamics | Optical flow, CLIP | Adapter, Cross-attention | AR (MusicGen) |
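Most rows in the table above share one pipeline shape: a frozen vision encoder produces per-frame features, a projection module ("Feature", "Adapter", or "Text" in the Vision-Music Projection column) maps them into the music model's space, and an autoregressive decoder generates music tokens conditioned on the result. The toy sketch below illustrates that shape with an adapter-style linear projection; all dimensions, weights, and the pooled decoding step are made up for illustration and correspond to no specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (purely illustrative): T frames, vision dim, music dim, vocab size.
T, D_VIS, D_MUS, VOCAB = 8, 512, 256, 1024

frame_features = rng.normal(size=(T, D_VIS))        # e.g. one CLIP feature per frame
W_adapter = rng.normal(size=(D_VIS, D_MUS)) * 0.02  # the "Adapter" projection
conditioning = frame_features @ W_adapter           # (T, D_MUS) conditioning sequence

def decode_step(history, conditioning, W_out):
    """One autoregressive step: score the next music token from pooled context."""
    context = conditioning.mean(axis=0) + history.mean(axis=0)
    logits = context @ W_out
    return int(np.argmax(logits))

W_out = rng.normal(size=(D_MUS, VOCAB))
history = np.zeros((1, D_MUS))  # embeddings of previously generated tokens
token = decode_step(history, conditioning, W_out)
```

Real systems replace the mean-pooling with cross-attention or token interleaving, and the decoder with a pretrained model such as MusicGen or LLaMA2, but the encode-project-decode factorization is the same.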
Methods taking human-centric videos (performance, dance, or movement) as input:

| Method | Paper Link | Demo Link | Date | Input Type | Modality | Music Length | Vision-Music Relationships | Vision Encoding | Vision-Music Projection | Music Generation |
|---|---|---|---|---|---|---|---|---|---|---|
| Audeo | Paper | Demo | 2020/06 | Performance Video | Symbolic | 30sec | Rhythm | ResNet | Feature | GAN |
| Foley Music | Paper | Demo | 2020/07 | Performance Video | Symbolic | 10sec | Rhythm | 2D Body Keypoints | Feature | AR |
| Multi-Instrument Net | Paper | - | 2020/12 | Performance Video | Audio | 10sec | Rhythm | 2D Body Keypoints | Feature | VAE |
| RhythmicNet | Paper | Demo | 2021/06 | Dance Video | Symbolic | 10sec | Rhythm | 2D Body Keypoints | Feature | AR (REMI) |
| Dance2Music | Paper | Demo | 2021/07 | Dance Video | Symbolic | 12sec | Rhythm | 2D Body Keypoints | Feature | AR |
| D2M-GAN | Paper | Demo | 2022/04 | Dance Video | Audio | 2sec | Rhythm | 2D Body Keypoints, I3D | Feature | GAN |
| CDCD | Paper | Demo | 2022/06 | Dance Video | Audio | 2sec | Rhythm | 2D Body Keypoints, I3D | Feature | Diff. |
| LORIS | Paper | Demo | 2023/05 | Movement Video | Audio | 50sec | Rhythm | 2D Body Keypoints, I3D | Feature | Diff. |
| VisBeatNet | Paper | - | 2024/01 | Dance Video | Symbolic | Realtime | Rhythm | 2D Body Keypoints | Feature | AR |
| UniMuMo | Paper | Demo | 2024/10 | Dance Video | Audio | 10sec | Rhythm | 2D Body Keypoints | Feature | Diff. |
Datasets of general videos:

| Dataset | Paper Link | Dataset Link | Date | Source | Modality | Size | Total Length (hr) | Avg. Length (sec) | Annotations |
|---|---|---|---|---|---|---|---|---|---|
| HIMV-200K | Paper | Link | 2017/04 | Music Video (YouTube-8M) | Audio | 200K | - | - | - |
| MVED | Paper | Link | 2020/09 | Music Video | Audio | 1.9K | 16.5 | 30 | Emotion |
| SymMV | Paper | Link | 2022/11 | Music Video | MIDI, Audio | 1.1K | 76.5 | 241 | Lyrics, Genre, Chord, Melody, Tonality, Beat |
| MusicCaps | Paper | Link | 2023/01 | Diverse Videos (AudioSet) | Audio | 5.5K | 15.3 | 10 | Genre, Caption, Emotion, Tempo, Instrument, ... |
| MV100K | Paper | - | 2023/05 | Music Video (YouTube-8M) | Audio | 110K | 5000 | 163 | Genre |
| EmoMV | Paper | Link | 2023/03 | Music Video (MVED, AudioSet) | Audio | 6K | 44.3 | 27 | Emotion |
| MUVideo | Paper | Link | 2023/11 | Diverse Videos (Balanced-AudioSet) | Audio | 14.5K | 40.3 | 10 | Instructions |
| MuVi-Sync | Paper | Link | 2023/11 | Music Video | MIDI, Audio | 784 | - | - | Scene Offset, Emotion, Motion, Semantic, Chord, Key, Loudness, Density, ... |
| BGM909 | Paper | Link | 2024/05 | Music Video | MIDI | 909 | - | - | Caption, Style, Chord, Melody, Beat, Shot |
| V2M | Paper | - | 2024/06 | Diverse Videos | Audio | 360K | 18000 | 180 | Genre |
| DISCO-MV | Paper | - | 2024/09 | Music Video (DISCO-10M) | Audio | 2200K | 47000 | 77 | Genre |
| FilmScoreDB | Paper | - | 2024/11 | Film Video | Audio | 32K | 90.3 | 10 | Movie Title |
| DVMSet | Paper | - | 2024/12 | Diverse Videos | Audio | 3.8K | - | - | - |
| HarmonySet | Paper | Link | 2025/03 | Diverse Videos | Audio | 48K | 458.8 | 32 | Description |
| MusicPro-7k | Paper | Link | 2025/03 | Film Video | Audio | 7K | - | - | Description, Melody, Rhythm Spots |
Datasets of human-centric videos:

| Dataset | Paper Link | Dataset Link | Date | Source | Modality | Size | Total Length (hr) | Avg. Length (sec) | Annotations |
|---|---|---|---|---|---|---|---|---|---|
| URMP | Paper | Link | 2016/12 | Performance Video | MIDI, Audio | 44 | 1.3 | 106 | Instruments |
| MUSIC | Paper | Link | 2018/04 | Performance Video | Audio | 685 | 45.7 | 239 | Instruments |
| AIST++ | Paper | Link | 2021/01 | Dance Video (AIST) | Audio | 1.4K | 5.2 | 13 | 3D Motion |
| TikTok Dance-Music | Paper | Link | 2022/04 | Dance Video | Audio | 445 | 1.5 | 12 | - |
| LORIS | Paper | Link | 2023/05 | Dance/Sports Video (AIST, FisV, FS1000) | Audio | 16K | 86.43 | 19 | 2D Pose |
Datasets of images:

| Dataset | Paper Link | Dataset Link | Date | Source | Modality | Size | Total Length (min) | Avg. Length (sec) | Annotations |
|---|---|---|---|---|---|---|---|---|---|
| Music-Image | Paper | Link | 2016/07 | Image (Music Video) | Audio | 22.6K | 377 | 60 | Lyrics |
| Shuttersong | Paper | Link | 2017/08 | Image (Shuttersong App) | Audio | 586 | - | - | Lyrics |
| IMAC | Paper | Link | 2019/04 | Image (FI) | Audio | 3.8K | 63.3 | 60 | Emotion |
| MUImage | Paper | Link | 2023/11 | Image (Balanced-AudioSet) | Audio | 14.5K | 40.3 | 10 | Instructions |
| EIMG | Paper | Link | 2023/12 | Image (IAPS, NAPS) | MIDI | 3K | 12.5 | 15 | VA Value |
| MeLBench | Paper | Link | 2024/06 | Image (Diverse Videos) | Audio | 11.2K | 31.2 | 10 | Genre, Caption |
Objective metrics for music quality:

| Metric | Modality | Type |
|---|---|---|
| Scale Consistency | MIDI | Pitch |
| Pitch Entropy | MIDI | Pitch |
| Pitch Class Histogram Entropy | MIDI | Pitch |
| Empty Beat Rate | MIDI | Rhythm |
| Average Inter-Onset Interval | MIDI | Rhythm |
| Grooving Pattern Similarity | MIDI | Rhythm |
| Structure Indicator | MIDI | Rhythm |
| Frechet Audio Distance (FAD) | Audio | Fidelity |
| Frechet Distance (FD) | Audio | Fidelity |
| Kullback-Leibler Divergence (KL) | Audio | Fidelity |
| Beats Coverage Score (BCS) | Audio | Rhythm |
| Beats Hit Score (BHS) | Audio | Rhythm |
| Inception Score (IS) | Audio | Fidelity |
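Two of the symbolic metrics above can be computed directly from a note list. The sketch below gives minimal, hypothetical implementations of Pitch Class Histogram Entropy (entropy of the 12-bin pitch-class distribution) and Empty Beat Rate (fraction of beats containing no note onset); the function names and input conventions are ours, not taken from any specific paper's toolkit.

```python
import math
from collections import Counter

def pitch_class_histogram_entropy(pitches):
    """Entropy (bits) of the 12-bin pitch-class histogram of MIDI pitch numbers."""
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def empty_beat_rate(onset_beats, num_beats):
    """Fraction of beats [0, num_beats) containing no note onset."""
    occupied = {int(b) for b in onset_beats}
    return sum(1 for b in range(num_beats) if b not in occupied) / num_beats

# A piece with onsets in beats 0, 1, 2 but none in beat 3:
rate = empty_beat_rate([0.0, 1.5, 2.0], 4)  # → 0.25
```

Lower entropy indicates a more tonally stable pitch distribution; a high empty beat rate suggests sparse, rhythmically inactive output.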
Objective metrics for vision-music correspondence:

| Metric | Modality | Type |
|---|---|---|
| ImageBind Score / Rank | Audio, Video/Image | Semantic |
| CLAP Score | Audio, Audio/Text | Semantic |
| Video-Music CLIP Precision (VMCP) | Audio, Video | Semantic |
| Video-Music Correspondence | Audio, Video | Semantic |
| Cross-modal Relevance | Audio, Video | Semantic |
| Temporal Alignment | Audio, Video | Rhythmic |
| Rhythm Alignment | Audio, Video | Rhythmic |
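The embedding-based semantic metrics above (ImageBind Score, CLAP Score, VMCP) all reduce to comparing video and audio embeddings in a shared space: a score is the cosine similarity of a pair, and a rank is the position of the ground-truth music among candidates. The sketch below shows that computation on placeholder embeddings; in practice the vectors would come from a pretrained model such as ImageBind or CLAP, which we do not invoke here.

```python
import numpy as np

def crossmodal_score(video_emb, audio_emb):
    """Cosine similarity between L2-normalized embeddings (ImageBind/CLAP-style score)."""
    v = video_emb / np.linalg.norm(video_emb)
    a = audio_emb / np.linalg.norm(audio_emb)
    return float(v @ a)

def retrieval_rank(query_emb, candidate_embs, target_idx):
    """1-based rank of the ground-truth audio among candidates for a video query."""
    sims = [crossmodal_score(query_emb, c) for c in candidate_embs]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    return order.index(target_idx) + 1

video = np.array([1.0, 0.0])                          # placeholder video embedding
candidates = [np.array([1.0, 0.1]), np.array([0.0, 1.0])]
rank = retrieval_rank(video, candidates, target_idx=0)  # → 1
```

Mean rank (or recall@k) over a test set then summarizes how often the generated or paired music is retrieved for its own video.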
Subjective metrics for music quality:

| Metric |
|---|
| Music Melody |
| Music Rhythm |
| Music Richness |
| Audio Quality |
| Overall Music Quality |
Subjective metrics for vision-music correspondence:

| Metric |
|---|
| Semantic Consistency |
| Rhythm Consistency |
| Emotion Consistency |
| Overall Correspondence |
- [2016/07] [TMM 2016] Bridging Music and Image via Cross-Modal Ranking Analysis [paper]
- [2016/12] [TMM 2018] Creating A Multi-track Classical Music Performance Dataset for Multi-modal Music Analysis: Challenges, Insights, and Applications [paper]
- [2017/04] [ICMR 2018] Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint [paper]
- [2017/08] [ICCV 2017] Image2song: Song Retrieval via Bridging Image Content and Lyric Words [paper]
- [2018/04] [ECCV 2018] The Sound of Pixels [paper]
- [2019/04] [ICASSP 2019] Learning Affective Correspondence between Music and Image [paper]
- [2020/04] [ICASSP 2020] Sight to Sound: An End-to-End Approach for Visual Piano Transcription [paper]
- [2020/06] [NeurIPS 2020] Audeo: Audio Generation for a Silent Performance Video [paper]
- [2020/07] [ECCV 2020] Foley Music: Learning to Generate Music from Videos [paper]
- [2020/07] [ICCC 2020] Automated Music Generation for Visual Art through Emotion [paper](https://computationalcreativity.net/iccc20/papers/137-iccc20.pdf)
- [2020/09] [Multimedia Tools and Applications 2021] Deep learning-based late fusion of multimodal information for emotion classification of music video [paper]
- [2020/12] Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements [paper]
- [2021/06] [NeurIPS 2021] How Does it Sound? Generation of Rhythmic Soundtracks for Human Movement Videos [paper]
- [2021/07] Dance2Music: Automatic Dance-driven Music Generation [paper]
- [2021/07] [ICCC 2021] MuSyFI - Music Synthesis From Images [paper]
- [2021/11] [ACMMM 2021] Video Background Music Generation with Controllable Music Transformer [paper]
- [2021/12] InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer [paper]
- [2021/12] [GCCE 2021] Semi-automatic music piece creation based on impression words extracted from object and background in color image [paper]
- [2022/04] [ECCV 2022] Quantized GAN for Complex Music Generation from Dance Videos [paper]
- [2022/05] [Applied Science 2022] Double Linear Transformer for Background Music Generation from Videos [paper]
- [2022/06] [ICLR 2023] Discrete Contrastive Diffusion for Crossmodal Music and Image Generation [paper]
- [2022/11] [ICCV 2023] Video Background Music Generation: Dataset, Method and Evaluation [paper]
- [2022/11] Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation [paper]
- [2023/01] MusicLM: Generating Music From Text [paper]
- [2023/01] [ISM 2022] Retaining Semantics in Image to Music Conversion [paper]
- [2023/01] [Computational Visual Media 2024] Dance2MIDI: Dance-driven multi-instruments music generation [paper]
- [2023/03] [Information Fusion 2023] EmoMV: Affective music-video correspondence learning datasets for classification and retrieval [paper]
- [2023/05] [AAAI 2024] V2Meow: Meowing to the Visual Beat via Video-to-Music Generation [paper]
- [2023/05] [ICML 2023] Long-Term Rhythmic Video Soundtracker [paper]
- [2023/11] MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models [paper]
- [2023/11] [Expert Systems with Applications 2024] Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [paper]
- [2023/11] [SIGGRAPH Asia 2023] Motion to Dance Music Generation using Latent Diffusion Model [paper]
- [2023/12] [TMM 2023] Continuous Emotion-Based Image-to-Music Generation [paper]
- [2023/12] CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling [paper]
- [2024/01] [WACV 2024] Let the Beat Follow You - Creating Interactive Drum Sounds From Body Rhythm [paper]
- [2024/01] [SIGGRAPH Asia 2024] Dance-to-Music Generation with Encoder-based Textual Inversion [paper]
- [2024/05] [CVPR 2024] Diff-BGM: A Diffusion Model for Video Background Music Generation [paper]
- [2024/05] [NeurIPS 2024] M3GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation [paper]
- [2024/05] [TMM 2024] DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator [paper]
- [2024/05] Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models [paper]
- [2024/06] [CVPR 2024] MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [paper]
- [2024/06] [CVPR 2025] VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [paper]
- [2024/07] [ICMEW 2024] Popular Hooks: A Multimodal Dataset of Musical Hooks for Music Understanding and Generation [paper]
- [2024/07] [Array 2024] D2MNet for music generation joint driven by facial expressions and dance movements [paper]
- [2024/07] MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [paper]
- [2024/08] [The Visual Computer 2024] Video-driven musical composition using large language model with memory-augmented state space [paper]
- [2024/09] [WACV 2025] VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos [paper]
- [2024/09] [EURASIP Journal on Audio, Speech, and Music Processing] Dance2Music-Diffusion: leveraging latent diffusion models for music generation from dance videos [paper]
- [2024/10] MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [paper]
- [2024/10] [ICASSP 2025] SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data [paper]
- [2024/10] [TCSS 2024] Video Echoed in Harmony: Learning and Sampling Video-Integrated Chord Progression Sequences for Controllable Video Background Music [paper]
- [2024/10] [TII 2025] Application and Research of Music Generation System Based on CVAE and Transformer-XL in Video Background Music [paper]
- [2024/10] M2M-Gen: A Multimodal Framework for Automated Background Music Generation in Japanese Manga Using Large Language Models [paper]
- [2024/10] UniMuMo: Unified Text, Music and Motion Generation [paper]
- [2024/11] Harmonizing Pixels and Melodies: Maestro-Guided Film Score Generation and Composition Style Transfer [paper]
- [2024/11] [NeurIPS 2024] MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence [paper]
- [2024/12] VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features [paper]
- [2024/12] Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation [paper]
- [2025/01] XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework [paper]
- [2025/01] GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions [paper]
- [2025/03] AudioX: Diffusion Transformer for Anything-to-Audio Generation [paper]
- [2025/03] [CVPR 2025] FilmComposer: LLM-Driven Music Production for Silent Film Clips [paper]
- [2025/03] [CVPR 2025] HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization [paper]
- [2025/04] Extending Visual Dynamics for Video-to-Music Generation [paper]
The repo is actively updated. If you notice any mistakes or would like a work to be included in our list, please let us know through GitHub pull requests or e-mail: [email protected].
If you find our work valuable for your research or applications, we would greatly appreciate a star ⭐ and a citation using the BibTeX entry provided below.
@article{Wang2025VisionToMusic,
title={Vision-to-Music Generation: A Survey},
author={Wang, Zhaokai and Bao, Chenxi and Zhuo, Le and Han, Jingrui and Yue, Yang and Tang, Yihong and Huang, Victor Shea-Jay and Liao, Yue},
journal={arXiv preprint arXiv:2503.21254},
year={2025}
}

