wzk1015/Awesome-Vision-to-Music-Generation

🎬 → 🎡 Awesome Vision-to-Music Generation

Vision-to-Music Generation: A Survey (ISMIR 2025)

Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao

[📚 Paper] [🎬 Video]

We provide a comprehensive survey on vision-to-music generation (V2M), covering both video-to-music and image-to-music generation. The survey aims to inspire further innovation in V2M and in the broader field of AI music generation, across academic research and industrial applications. This repository lists papers on V2M methods, datasets, and evaluation, with demo links where available. The collection will be continuously updated.

overview

trend

📖 Table of Contents

🧩 V2M Methods

🔻 General Videos and Images

| Method | Paper Link | Demo Link | Date | Input Type | Modality | Music Length | Vision-Music Relationships | Vision Encoding | Vision-Music Projection | Music Generation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CMT | Paper | Demo | 2021/11 | General Video | Symbolic | 3min | Rhythm | - | Elements | AR (CP) |
| V-MusProd | Paper | Demo | 2022/11 | General Video | Symbolic | 6min | Semantics, Rhythm | CLIP2Video, Histogan | Feature | AR (CP) |
| V2Meow | Paper | Demo | 2023/05 | General Video | Audio | 10sec | Semantics, Rhythm | CLIP, I3D Flow, ViT-VQGAN | Feature | AR |
| MuMu-LLaMA | Paper | Demo | 2023/11 | General Video, Image | Audio | 30sec | Semantics | ViT, ViViT | Adapter | AR (LLaMA2) |
| Video2Music | Paper | Demo | 2023/11 | General Video | Symbolic | 5min | Semantics, Rhythm | CLIP | Feature | AR |
| EIMG | Paper | Demo | 2023/12 | Image | Symbolic | 15sec | Semantics | ALAE, β-VAE, VQ-VAE | Adapter | VAE (FNT, LSR) |
| Diff-BGM | Paper | Demo | 2024/05 | General Video | Symbolic | 5min | Semantics | VideoCLIP | Feature | Diff. (Polyffusion) |
| Mozart's Touch | Paper | Demo | 2024/05 | General Video, Image | Audio | 10sec | Semantics | BLIP | Text | AR (MusicGen) |
| MeLFusion | Paper | Demo | 2024/06 | Image | Audio | 10sec | Semantics | DDIM + T2I LDM | Feature | Diff. |
| VidMuse | Paper | Demo | 2024/06 | General Video | Audio | 20sec | Semantics | CLIP | Adapter | AR (MusicGen) |
| S2L2-V2M | Paper | Demo | 2024/08 | General Video | Audio | 10sec | Semantics | Enhanced Video Mamba | Adapter | AR (LLaMA2) |
| VMAS | Paper | Demo | 2024/09 | General Video | Audio | 10sec | Semantics, Rhythm | Hiera | Feature | AR |
| MuVi | Paper | Demo | 2024/10 | General Video | Audio | 20sec | Semantics, Rhythm | VideoMAE V2 | Adapter | Diff. (DiT) |
| SONIQUE | Paper | Demo | 2024/10 | General Video | Audio | 20sec | Semantics, Rhythm | Video-LLaMA, CLAP | Text | Diff. (Stable Audio) |
| VEH | Paper | - | 2024/10 | General Video | Symbolic | 30sec | Semantics | VideoChat | Text | AR (T5) |
| M2M-Gen | Paper | Demo | 2024/10 | Image (Manga) | Audio | 1min | Semantics | CLIP, GPT-4 | Text | AR (MusicLM) |
| HPM | Paper | Demo | 2024/11 | General Video | Audio | 10sec | Semantics | CLIP, TAVAR, WECL | Feature | Diff. (AudioLDM) |
| VidMusician | Paper | Demo | 2024/12 | General Video | Audio | 30sec | Semantics, Rhythm | CLIP, T5 | Adapter | AR (MusicGen) |
| MTM | Paper | Demo | 2024/12 | General Video, Image | Audio | 30sec | Semantics | InternVL2 | Text | Diff. (Stable Audio Open) |
| XMusic | Paper | Demo | 2025/01 | General Video, Image | Symbolic | 20sec | Semantics, Rhythm | ResNet, CLIP | Elements | AR (CP) |
| GVMGen | Paper | Demo | 2025/01 | General Video | Audio | 15sec | Semantics | CLIP | Adapter | AR (MusicGen) |
| AudioX | Paper | Demo | 2025/03 | General Video | Audio | 10sec | Semantics | CLIP | Feature | Diff. (Stable Audio Open) |
| FilmComposer | Paper | Demo | 2025/03 | General Video | Audio | 15sec | Semantics, Rhythm | Controllable Rhythm Transformer, GPT-4v, Motion Detector | Text | AR (MusicGen) |
| DyViM | Paper | - | 2025/04 | General Video | Audio | 10sec | Semantics, Dynamics | Optical flow, CLIP | Adaptor, Cross-attention | AR (MusicGen) |

🔻 Human Movement Videos

| Method | Paper Link | Demo Link | Date | Input Type | Modality | Music Length | Vision-Music Relationships | Vision Encoding | Vision-Music Projection | Music Generation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Audeo | Paper | Demo | 2020/06 | Performance Video | Symbolic | 30sec | Rhythm | ResNet | Feature | GAN |
| Foley Music | Paper | Demo | 2020/07 | Performance Video | Symbolic | 10sec | Rhythm | 2D Body Keypoints | Feature | AR |
| Multi-Instrument Net | Paper | - | 2020/12 | Performance Video | Audio | 10sec | Rhythm | 2D Body Keypoints | Feature | VAE |
| RhythmicNet | Paper | Demo | 2021/06 | Dance Video | Symbolic | 10sec | Rhythm | 2D Body Keypoints | Feature | AR (REMI) |
| Dance2Music | Paper | Demo | 2021/07 | Dance Video | Symbolic | 12sec | Rhythm | 2D Body Keypoints | Feature | AR |
| D2M-GAN | Paper | Demo | 2022/04 | Dance Video | Audio | 2sec | Rhythm | 2D Body Keypoints, I3D | Feature | GAN |
| CDCD | Paper | Demo | 2022/06 | Dance Video | Audio | 2sec | Rhythm | 2D Body Keypoints, I3D | Feature | Diff. |
| LORIS | Paper | Demo | 2023/05 | Movement Video | Audio | 50sec | Rhythm | 2D Body Keypoints, I3D | Feature | Diff. |
| VisBeatNet | Paper | - | 2024/01 | Dance Video | Symbolic | Realtime | Rhythm | 2D Body Keypoints | Feature | AR |
| UniMuMo | Paper | Demo | 2024/10 | Dance Video | Audio | 10sec | Rhythm | 2D Body Keypoints | Feature | Diff. |

🎬 V2M Datasets

🔻 General Videos

| Dataset | Paper Link | Dataset Link | Date | Source | Modality | Size | Total Length (hr) | Avg. Length (sec) | Annotations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HIMV-200K | Paper | Link | 2017/04 | Music Video (YouTube-8M) | Audio | 200K | - | - | - |
| MVED | Paper | Link | 2020/09 | Music Video | Audio | 1.9K | 16.5 | 30 | Emotion |
| SymMV | Paper | Link | 2022/11 | Music Video | MIDI, Audio | 1.1K | 76.5 | 241 | Lyrics, Genre, Chord, Melody, Tonality, Beat |
| MV100K | Paper | - | 2023/05 | Music Video (YouTube-8M) | Audio | 110K | 5000 | 163 | Genre |
| MusicCaps | Paper | Link | 2023/01 | Diverse Videos (AudioSet) | Audio | 5.5K | 15.3 | 10 | Genre, Caption, Emotion, Tempo, Instrument, ... |
| EmoMV | Paper | Link | 2023/03 | Music Video (MVED, AudioSet) | Audio | 6K | 44.3 | 27 | Emotion |
| MUVideo | Paper | Link | 2023/11 | Diverse Videos (Balanced-AudioSet) | Audio | 14.5K | 40.3 | 10 | Instructions |
| MuVi-Sync | Paper | Link | 2023/11 | Music Video | MIDI, Audio | 784 | - | - | Scene Offset, Emotion, Motion, Semantic, Chord, Key, Loudness, Density, ... |
| BGM909 | Paper | Link | 2024/05 | Music Video | MIDI | 909 | - | - | Caption, Style, Chord, Melody, Beat, Shot |
| V2M | Paper | - | 2024/06 | Diverse Videos | Audio | 360K | 18000 | 180 | Genre |
| DISCO-MV | Paper | - | 2024/09 | Music Video (DISCO-10M) | Audio | 2200K | 47000 | 77 | Genre |
| FilmScoreDB | Paper | - | 2024/11 | Film Video | Audio | 32K | 90.3 | 10 | Movie Title |
| DVMSet | Paper | - | 2024/12 | Diverse Videos | Audio | 3.8K | - | - | - |
| HarmonySet | Paper | Link | 2025/03 | Diverse Videos | Audio | 48K | 458.8 | 32 | Description |
| MusicPro-7k | Paper | Link | 2025/03 | Film Video | Audio | 7K | - | - | Description, Melody, Rhythm Spots |

🔻 Human Movement Videos

| Dataset | Paper Link | Dataset Link | Date | Source | Modality | Size | Total Length (hr) | Avg. Length (sec) | Annotations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| URMP | Paper | Link | 2016/12 | Performance Video | MIDI, Audio | 44 | 1.3 | 106 | Instruments |
| MUSIC | Paper | Link | 2018/04 | Performance Video | Audio | 685 | 45.7 | 239 | Instruments |
| AIST++ | Paper | Link | 2021/01 | Dance Video (AIST) | Audio | 1.4K | 5.2 | 13 | 3D Motion |
| TikTok Dance-Music | Paper | Link | 2022/04 | Dance Video | Audio | 445 | 1.5 | 12 | - |
| LORIS | Paper | Link | 2023/05 | Dance/Sports Video (AIST, FisV, FS1000) | Audio | 16K | 86.43 | 19 | 2D Pose |

🔻 Images

| Dataset | Paper Link | Dataset Link | Date | Source | Modality | Size | Total Length (hr) | Avg. Length (sec) | Annotations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Music-Image | Paper | Link | 2016/07 | Image (Music Video) | Audio | 22.6K | 377 | 60 | Lyrics |
| Shuttersong | Paper | Link | 2017/08 | Image (Shuttersong App) | Audio | 586 | - | - | Lyrics |
| IMAC | Paper | Link | 2019/04 | Image (FI) | Audio | 3.8K | 63.3 | 60 | Emotion |
| MUImage | Paper | Link | 2023/11 | Image (Balanced-AudioSet) | Audio | 14.5K | 40.3 | 10 | Instructions |
| EIMG | Paper | Link | 2023/12 | Image (IAPS, NAPS) | MIDI | 3K | 12.5 | 15 | VA Value |
| MeLBench | Paper | Link | 2024/06 | Image (Diverse Videos) | Audio | 11.2K | 31.2 | 10 | Genre, Caption |
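
The Size, Total Length, and Avg. Length columns in the dataset tables are tied together by simple arithmetic: total length is roughly size times average length. The sketch below is only a sanity check on a few rows copied from the tables; the helper name `total_hours` is hypothetical, and because the listed sizes are rounded, small deviations from the listed totals are expected.

```python
def total_hours(num_clips: int, avg_seconds: float) -> float:
    """Estimate total duration in hours from clip count and average clip length."""
    return num_clips * avg_seconds / 3600.0


# (dataset, approximate size, avg. length in seconds), copied from the tables above
rows = [
    ("MUVideo", 14_500, 10),
    ("IMAC", 3_800, 60),
    ("EIMG", 3_000, 15),
]

for name, size, avg_sec in rows:
    print(f"{name}: ~{total_hours(size, avg_sec):.1f} hr")
```

For these three rows the estimates come out at about 40.3, 63.3, and 12.5, which agree with the listed totals when those totals are read as hours.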

📊 V2M Evaluation

🎯 Objective Metrics

🔻 Music-only

| Metric | Modality | Type |
| --- | --- | --- |
| Scale Consistency | MIDI | Pitch |
| Pitch Entropy | MIDI | Pitch |
| Pitch Class Histogram Entropy | MIDI | Pitch |
| Empty Beat Rate | MIDI | Rhythm |
| Average Inter-Onset Interval | MIDI | Rhythm |
| Grooving Pattern Similarity | MIDI | Rhythm |
| Structure Indicator | MIDI | Rhythm |
| Frechet Audio Distance (FAD) | Audio | Fidelity |
| Frechet Distance (FD) | Audio | Fidelity |
| Kullback-Leibler Divergence (KL) | Audio | Fidelity |
| Beats Coverage Score (BCS) | Audio | Rhythm |
| Beats Hit Score (BHS) | Audio | Rhythm |
| Inception Score (IS) | Audio | Fidelity |
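
To make two of the metrics above concrete, here is a minimal sketch, not the reference implementation from any listed paper. Pitch Class Histogram Entropy is taken as the Shannon entropy of the 12-bin pitch-class histogram; it is computed here over a whole note sequence, whereas papers often compute it per bar or per group of bars and average. The beat scores follow the common description of BCS as generated beats divided by reference beats and BHS as aligned beats divided by reference beats. The function names, the `tolerance` window of 0.1 s, and the choice to count alignment per reference beat are assumptions made for illustration.

```python
import math
from collections import Counter


def pitch_class_histogram_entropy(midi_pitches):
    """Shannon entropy (in bits) of the 12-bin pitch-class histogram."""
    if not midi_pitches:
        return 0.0
    counts = Counter(p % 12 for p in midi_pitches)
    total = len(midi_pitches)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def beat_scores(reference_beats, generated_beats, tolerance=0.1):
    """Return (coverage, hit) scores for beat times given in seconds.

    coverage = number of generated beats / number of reference beats
    hit      = reference beats with a generated beat within `tolerance`
               / number of reference beats  (assumed alignment rule)
    """
    if not reference_beats:
        raise ValueError("reference_beats must not be empty")
    coverage = len(generated_beats) / len(reference_beats)
    hits = sum(
        any(abs(ref - gen) <= tolerance for gen in generated_beats)
        for ref in reference_beats
    )
    return coverage, hits / len(reference_beats)


# A C major scale uses 7 distinct pitch classes, one note each.
c_major = [60, 62, 64, 65, 67, 69, 71]
print(pitch_class_histogram_entropy(c_major))

# Toy beat times in seconds (illustrative values, not from any dataset).
print(beat_scores([0.5, 1.0, 1.5, 2.0], [0.52, 1.0, 1.8]))
```

Lower entropy indicates a clearer tonal center, and beat scores closer to 1 indicate that the generated beats better match the reference rhythm.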

🔻 Vision-Music Correspondence

| Metric | Modality | Type |
| --- | --- | --- |
| ImageBind Score / Rank | Audio, Video/Image | Semantic |
| CLAP Score | Audio, Audio/Text | Semantic |
| Video-Music CLIP Precision (VMCP) | Audio, Video | Semantic |
| Video-Music Correspondence | Audio, Video | Semantic |
| Cross-modal Relevance | Audio, Video | Semantic |
| Temporal Alignment | Audio, Video | Rhythmic |
| Rhythm Alignment | Audio, Video | Rhythmic |
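
Embedding-based semantic metrics such as ImageBind Score and CLAP Score generally reduce to a similarity between two embeddings in a shared space, and a rank reports where the true pair lands among candidates. The sketch below illustrates that idea with cosine similarity on toy vectors. It is an assumption-laden illustration: the function names are hypothetical, the vectors stand in for embeddings that would come from a pretrained encoder (not shown), and individual papers may define the score or rank differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def retrieval_rank(query, candidates, true_index):
    """1-based rank of candidates[true_index] by similarity to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    target = scores[true_index]
    # Rank is one plus the number of candidates that score strictly higher.
    return 1 + sum(s > target for s in scores)


# Toy stand-ins for a video embedding and three candidate music embeddings.
video_embedding = [1.0, 0.0]
music_embeddings = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

print(cosine_similarity(video_embedding, music_embeddings[1]))
print(retrieval_rank(video_embedding, music_embeddings, true_index=1))
```

A higher similarity and a lower rank both indicate that the generated music sits closer to its source video or image in the embedding space.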

🎧 Subjective Metrics

🔻 Music-only

Metric
Music Melody
Music Rhythm
Music Richness
Audio Quality
Overall Music Quality

🔻 Vision-Music Correspondence

Metric
Semantic Consistency
Rhythm Consistency
Emotion Consistency
Overall Correspondence

📚 Full List

  1. [2016/07] [TMM 2016] Bridging Music and Image via Cross-Modal Ranking Analysis [paper]

  2. [2016/12] [TMM 2018] Creating A Multi-track Classical Music Performance Dataset for Multi-modal Music Analysis: Challenges, Insights, and Applications [paper]

  3. [2017/04] [ICMR 2018] Content-Based Video–Music Retrieval Using Soft Intra-Modal Structure Constraint [paper]

  4. [2017/08] [ICCV 2017] Image2song: Song Retrieval via Bridging Image Content and Lyric Words [paper]

  5. [2018/04] [ECCV 2018] The Sound of Pixels [paper]

  6. [2019/04] [ICASSP 2019] Learning Affective Correspondence between Music and Image [paper]

  7. [2020/04] [ICASSP 2020] Sight to Sound: An End-to-End Approach for Visual Piano Transcription [paper]

  8. [2020/06] [NeurIPS 2020] Audeo: Audio Generation for a Silent Performance Video [paper]

  9. [2020/07] [ECCV 2020] Foley Music: Learning to Generate Music from Videos [paper]

  10. [2020/07] [ICCC 2020] Automated Music Generation for Visual Art through Emotion [paper](https://computationalcreativity.net/iccc20/papers/137-iccc20.pdf)

  11. [2020/09] [Multimedia Tools and Applications 2021] Deep learning-based late fusion of multimodal information for emotion classification of music video [paper]

  12. [2020/12] Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements [paper]

  13. [2021/06] [NeurIPS 2021] How Does it Sound? Generation of Rhythmic Soundtracks for Human Movement Videos [paper]

  14. [2021/07] Dance2Music: Automatic Dance-driven Music Generation [paper]

  15. [2021/07] [ICCC 2021] MuSyFI - Music Synthesis From Images [paper]

  16. [2021/11] [ACMMM 2021] Video Background Music Generation with Controllable Music Transformer [paper]

  17. [2021/12] InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer [paper]

  18. [2021/12] [GCCE 2021] Semi-automatic music piece creation based on impression words extracted from object and background in color image [paper]

  19. [2022/04] [ECCV 2022] Quantized GAN for Complex Music Generation from Dance Videos [paper]

  20. [2022/05] [Applied Sciences 2022] Double Linear Transformer for Background Music Generation from Videos [paper]

  21. [2022/06] [ICLR 2023] Discrete Contrastive Diffusion for Crossmodal Music and Image Generation [paper]

  22. [2022/11] [ICCV 2023] Video Background Music Generation: Dataset, Method and Evaluation [paper]

  23. [2022/11] Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation [paper]

  24. [2023/01] MusicLM: Generating Music From Text [paper]

  25. [2023/01] [ISM 2022] Retaining Semantics in Image to Music Conversion [paper]

  26. [2023/01] [Computational Visual Media 2024] Dance2MIDI: Dance-driven multi-instruments music generation [paper]

  27. [2023/03] [Information Fusion 2023] EmoMV: Affective music-video correspondence learning datasets for classification and retrieval [paper]

  28. [2023/05] [AAAI 2024] V2Meow: Meowing to the Visual Beat via Video-to-Music Generation [paper]

  29. [2023/05] [ICML 2023] Long-Term Rhythmic Video Soundtracker [paper]

  30. [2023/11] MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models [paper]

  31. [2023/11] [Expert Systems with Applications 2024] Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [paper]

  32. [2023/11] [SIGGRAPH Asia 2023] Motion to Dance Music Generation using Latent Diffusion Model [paper]

  33. [2023/12] [TMM 2023] Continuous Emotion-Based Image-to-Music Generation [paper]

  34. [2023/12] CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling [paper]

  35. [2024/01] [WACV 2024] Let the Beat Follow You - Creating Interactive Drum Sounds From Body Rhythm [paper]

  36. [2024/01] [SIGGRAPH Asia 2024] Dance-to-Music Generation with Encoder-based Textual Inversion [paper]

  37. [2024/05] [CVPR 2024] Diff-BGM: A Diffusion Model for Video Background Music Generation [paper]

  38. [2024/05] [NeurIPS 2024] M3GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation [paper]

  39. [2024/05] [TMM 2024] DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator [paper]

  40. [2024/05] Mozart’s Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models [paper]

  41. [2024/06] [CVPR 2024] MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [paper]

  42. [2024/06] [CVPR 2025] VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [paper]

  43. [2024/07] [ICMEW 2024] Popular Hooks: A Multimodal Dataset of Musical Hooks for Music Understanding and Generation [paper]

  44. [2024/07] [Array 2024] D2MNet for music generation joint driven by facial expressions and dance movements [paper]

  45. [2024/07] MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [paper]

  46. [2024/08] [The Visual Computer 2024] Video-driven musical composition using large language model with memory-augmented state space [paper]

  47. [2024/09] [WACV 2025] VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos [paper]

  48. [2024/09] [EURASIP Journal on Audio, Speech, and Music Processing] Dance2Music-Diffusion: leveraging latent diffusion models for music generation from dance videos [paper]

  49. [2024/10] MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [paper]

  50. [2024/10] [ICASSP 2025] SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data [paper]

  51. [2024/10] [TCSS 2024] Video Echoed in Harmony: Learning and Sampling Video-Integrated Chord Progression Sequences for Controllable Video Background Music [paper]

  52. [2024/10] [TII 2025] Application and Research of Music Generation System Based on CVAE and Transformer-XL in Video Background Music [paper]

  53. [2024/10] M2M-Gen: A Multimodal Framework for Automated Background Music Generation in Japanese Manga Using Large Language Models [paper]

  54. [2024/10] UniMuMo: Unified Text, Music and Motion Generation [paper]

  55. [2024/11] Harmonizing Pixels and Melodies: Maestro-Guided Film Score Generation and Composition Style Transfer [paper]

  56. [2024/11] [NeurIPS 2024] MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence [paper]

  57. [2024/12] VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features [paper]

  58. [2024/12] Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation [paper]

  59. [2025/01] XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework [paper]

  60. [2025/01] GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions [paper]

  61. [2025/03] AudioX: Diffusion Transformer for Anything-to-Audio Generation [paper]

  62. [2025/03] [CVPR 2025] FilmComposer: LLM-Driven Music Production for Silent Film Clips [paper]

  63. [2025/03] [CVPR 2025] HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization [paper]

  64. [2025/04] Extending Visual Dynamics for Video-to-Music Generation [paper]

βœ‰οΈ Contacts

The repo is being updated activelyπŸš€. Please let us know if you notice any mistakes or would like any work to be included in our list through GitHub pull requests or e-mail: [email protected].

📎 Citation

If you find our work valuable for your research or applications, we would greatly appreciate a star ⭐ and a citation using the BibTeX entry provided below.

@article{Wang2025VisionToMusic,
  title={Vision-to-Music Generation: A Survey},
  author={Wang, Zhaokai and Bao, Chenxi and Zhuo, Le and Han, Jingrui and Yue, Yang and Tang, Yihong and Huang, Victor Shea-Jay and Liao, Yue},
  journal={arXiv preprint arXiv:2503.21254},
  year={2025}
}

About

[ISMIR 2025] A curated list of vision-to-music generation: methods, datasets, evaluation and challenges.
