Vision-to-Music Generation: A Survey (ISMIR 2025)
Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao
We provide a comprehensive survey on vision-to-music generation (V2M), including video-to-music and image-to-music generation. This survey aims to inspire further innovation in vision-to-music generation and the broader field of AI music generation in both academic research and industrial applications. In this repository, we have listed relevant papers related to methods, datasets, and evaluation of V2M. Notably, we list demo links for all papers. This collection will be continuously updated.
Methods taking general videos or images as input:

| Method | Paper Link | Demo Link | Date | Input Type | Modality | Music Length | Vision-Music Relationships | Vision Encoding | Vision-Music Projection | Music Generation |
|---|---|---|---|---|---|---|---|---|---|---|
| CMT | Paper | Demo | 2021/11 | General Video | Symbolic | 3min | Rhythm | - | Elements | AR (CP) |
| V-MusProd | Paper | Demo | 2022/11 | General Video | Symbolic | 6min | Semantics, Rhythm | CLIP2Video, Histogan | Feature | AR (CP) |
| V2Meow | Paper | Demo | 2023/05 | General Video | Audio | 10sec | Semantics, Rhythm | CLIP, I3D Flow, ViT-VQGAN | Feature | AR |
| MuMu-LLaMA | Paper | Demo | 2023/11 | General Video, Image | Audio | 30sec | Semantics | ViT, ViViT | Adapter | AR (LLaMA2) |
| Video2Music | Paper | Demo | 2023/11 | General Video | Symbolic | 5min | Semantics, Rhythm | CLIP | Feature | AR |
| EIMG | Paper | Demo | 2023/12 | Image | Symbolic | 15sec | Semantics | ALAE, β-VAE, VQ-VAE | Adapter | VAE (FNT, LSR) |
| Diff-BGM | Paper | Demo | 2024/05 | General Video | Symbolic | 5min | Semantics | VideoCLIP | Feature | Diff. (Polyffusion) |
| Mozart's Touch | Paper | Demo | 2024/05 | General Video, Image | Audio | 10sec | Semantics | BLIP | Text | AR (MusicGen) |
| MeLFusion | Paper | Demo | 2024/06 | Image | Audio | 10sec | Semantics | DDIM + T2I LDM | Feature | Diff. |
| VidMuse | Paper | Demo | 2024/06 | General Video | Audio | 20sec | Semantics | CLIP | Adapter | AR (MusicGen) |
| S2L2-V2M | Paper | Demo | 2024/08 | General Video | Audio | 10sec | Semantics | Enhanced Video Mamba | Adapter | AR (LLaMA2) |
| VMAS | Paper | Demo | 2024/09 | General Video | Audio | 10sec | Semantics, Rhythm | Hiera | Feature | AR |
| MuVi | Paper | Demo | 2024/10 | General Video | Audio | 20sec | Semantics, Rhythm | VideoMAE V2 | Adapter | Diff. (DiT) |
| SONIQUE | Paper | Demo | 2024/10 | General Video | Audio | 20sec | Semantics, Rhythm | Video-LLaMA, CLAP | Text | Diff. (Stable Audio) |
| VEH | Paper | - | 2024/10 | General Video | Symbolic | 30sec | Semantics | VideoChat | Text | AR (T5) |
| M2M-Gen | Paper | Demo | 2024/10 | Image (Manga) | Audio | 1min | Semantics | CLIP, GPT-4 | Text | AR (MusicLM) |
| HPM | Paper | Demo | 2024/11 | General Video | Audio | 10sec | Semantics | CLIP, TAVAR, WECL | Feature | Diff. (AudioLDM) |
| VidMusician | Paper | Demo | 2024/12 | General Video | Audio | 30sec | Semantics, Rhythm | CLIP, T5 | Adapter | AR (MusicGen) |
| MTM | Paper | Demo | 2024/12 | General Video, Image | Audio | 30sec | Semantics | InternVL2 | Text | Diff. (Stable Audio Open) |
| XMusic | Paper | Demo | 2025/01 | General Video, Image | Symbolic | 20sec | Semantics, Rhythm | ResNet, CLIP | Elements | AR (CP) |
| GVMGen | Paper | Demo | 2025/01 | General Video | Audio | 15sec | Semantics | CLIP | Adapter | AR (MusicGen) |
| AudioX | Paper | Demo | 2025/03 | General Video | Audio | 10sec | Semantics | CLIP | Feature | Diff. (Stable Audio Open) |
| FilmComposer | Paper | Demo | 2025/03 | General Video | Audio | 15sec | Semantics, Rhythm | Controllable Rhythm Transformer, GPT-4v, Motion Detector | Text | AR (MusicGen) |
| DyViM | Paper | - | 2025/04 | General Video | Audio | 10sec | Semantics, Dynamics | Optical flow, CLIP | Adapter, Cross-attention | AR (MusicGen) |
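Most rows in the table above share one pipeline shape: a frozen vision encoder produces per-frame features, a projection module ("Feature", "Adapter", or "Text" in the Vision-Music Projection column) maps them into the music model's space, and an autoregressive decoder generates music tokens conditioned on the result. The toy sketch below illustrates that shape with an adapter-style linear projection; all dimensions, weights, and the pooled decoding step are made up for illustration and correspond to no specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (purely illustrative): T frames, vision dim, music dim, vocab size.
T, D_VIS, D_MUS, VOCAB = 8, 512, 256, 1024

frame_features = rng.normal(size=(T, D_VIS))        # e.g. one CLIP feature per frame
W_adapter = rng.normal(size=(D_VIS, D_MUS)) * 0.02  # the "Adapter" projection
conditioning = frame_features @ W_adapter           # (T, D_MUS) conditioning sequence

def decode_step(history, conditioning, W_out):
    """One autoregressive step: score the next music token from pooled context."""
    context = conditioning.mean(axis=0) + history.mean(axis=0)
    logits = context @ W_out
    return int(np.argmax(logits))

W_out = rng.normal(size=(D_MUS, VOCAB))
history = np.zeros((1, D_MUS))  # embeddings of previously generated tokens
token = decode_step(history, conditioning, W_out)
```

Real systems replace the mean-pooling with cross-attention or token interleaving, and the decoder with a pretrained model such as MusicGen or LLaMA2, but the encode-project-decode factorization is the same.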
Methods taking human-centric videos (performance, dance, or movement) as input:

| Method | Paper Link | Demo Link | Date | Input Type | Modality | Music Length | Vision-Music Relationships | Vision Encoding | Vision-Music Projection | Music Generation |
|---|---|---|---|---|---|---|---|---|---|---|
| Audeo | Paper | Demo | 2020/06 | Performance Video | Symbolic | 30sec | Rhythm | ResNet | Feature | GAN |
| Foley Music | Paper | Demo | 2020/07 | Performance Video | Symbolic | 10sec | Rhythm | 2D Body Keypoints | Feature | AR |
| Multi-Instrument Net | Paper | - | 2020/12 | Performance Video | Audio | 10sec | Rhythm | 2D Body Keypoints | Feature | VAE |
| RhythmicNet | Paper | Demo | 2021/06 | Dance Video | Symbolic | 10sec | Rhythm | 2D Body Keypoints | Feature | AR (REMI) |
| Dance2Music | Paper | Demo | 2021/07 | Dance Video | Symbolic | 12sec | Rhythm | 2D Body Keypoints | Feature | AR |
| D2M-GAN | Paper | Demo | 2022/04 | Dance Video | Audio | 2sec | Rhythm | 2D Body Keypoints, I3D | Feature | GAN |
| CDCD | Paper | Demo | 2022/06 | Dance Video | Audio | 2sec | Rhythm | 2D Body Keypoints, I3D | Feature | Diff. |
| LORIS | Paper | Demo | 2023/05 | Movement Video | Audio | 50sec | Rhythm | 2D Body Keypoints, I3D | Feature | Diff. |
| VisBeatNet | Paper | - | 2024/01 | Dance Video | Symbolic | Realtime | Rhythm | 2D Body Keypoints | Feature | AR |
| UniMuMo | Paper | Demo | 2024/10 | Dance Video | Audio | 10sec | Rhythm | 2D Body Keypoints | Feature | Diff. |
Datasets of general videos:

| Dataset | Paper Link | Dataset Link | Date | Source | Modality | Size | Total Length (hr) | Avg. Length (sec) | Annotations |
|---|---|---|---|---|---|---|---|---|---|
| HIMV-200K | Paper | Link | 2017/04 | Music Video (YouTube-8M) | Audio | 200K | - | - | - |
| MVED | Paper | Link | 2020/09 | Music Video | Audio | 1.9K | 16.5 | 30 | Emotion |
| SymMV | Paper | Link | 2022/11 | Music Video | MIDI, Audio | 1.1K | 76.5 | 241 | Lyrics, Genre, Chord, Melody, Tonality, Beat |
| MusicCaps | Paper | Link | 2023/01 | Diverse Videos (AudioSet) | Audio | 5.5K | 15.3 | 10 | Genre, Caption, Emotion, Tempo, Instrument, ... |
| MV100K | Paper | - | 2023/05 | Music Video (YouTube-8M) | Audio | 110K | 5000 | 163 | Genre |
| EmoMV | Paper | Link | 2023/03 | Music Video (MVED, AudioSet) | Audio | 6K | 44.3 | 27 | Emotion |
| MUVideo | Paper | Link | 2023/11 | Diverse Videos (Balanced-AudioSet) | Audio | 14.5K | 40.3 | 10 | Instructions |
| MuVi-Sync | Paper | Link | 2023/11 | Music Video | MIDI, Audio | 784 | - | - | Scene Offset, Emotion, Motion, Semantic, Chord, Key, Loudness, Density, ... |
| BGM909 | Paper | Link | 2024/05 | Music Video | MIDI | 909 | - | - | Caption, Style, Chord, Melody, Beat, Shot |
| V2M | Paper | - | 2024/06 | Diverse Videos | Audio | 360K | 18000 | 180 | Genre |
| DISCO-MV | Paper | - | 2024/09 | Music Video (DISCO-10M) | Audio | 2200K | 47000 | 77 | Genre |
| FilmScoreDB | Paper | - | 2024/11 | Film Video | Audio | 32K | 90.3 | 10 | Movie Title |
| DVMSet | Paper | - | 2024/12 | Diverse Videos | Audio | 3.8K | - | - | - |
| HarmonySet | Paper | Link | 2025/03 | Diverse Videos | Audio | 48K | 458.8 | 32 | Description |
| MusicPro-7k | Paper | Link | 2025/03 | Film Video | Audio | 7K | - | - | Description, Melody, Rhythm Spots |
Datasets of human-centric videos:

| Dataset | Paper Link | Dataset Link | Date | Source | Modality | Size | Total Length (hr) | Avg. Length (sec) | Annotations |
|---|---|---|---|---|---|---|---|---|---|
| URMP | Paper | Link | 2016/12 | Performance Video | MIDI, Audio | 44 | 1.3 | 106 | Instruments |
| MUSIC | Paper | Link | 2018/04 | Performance Video | Audio | 685 | 45.7 | 239 | Instruments |
| AIST++ | Paper | Link | 2021/01 | Dance Video (AIST) | Audio | 1.4K | 5.2 | 13 | 3D Motion |
| TikTok Dance-Music | Paper | Link | 2022/04 | Dance Video | Audio | 445 | 1.5 | 12 | - |
| LORIS | Paper | Link | 2023/05 | Dance/Sports Video (AIST, FisV, FS1000) | Audio | 16K | 86.43 | 19 | 2D Pose |
Datasets of images:

| Dataset | Paper Link | Dataset Link | Date | Source | Modality | Size | Total Length (min) | Avg. Length (sec) | Annotations |
|---|---|---|---|---|---|---|---|---|---|
| Music-Image | Paper | Link | 2016/07 | Image (Music Video) | Audio | 22.6K | 377 | 60 | Lyrics |
| Shuttersong | Paper | Link | 2017/08 | Image (Shuttersong App) | Audio | 586 | - | - | Lyrics |
| IMAC | Paper | Link | 2019/04 | Image (FI) | Audio | 3.8K | 63.3 | 60 | Emotion |
| MUImage | Paper | Link | 2023/11 | Image (Balanced-AudioSet) | Audio | 14.5K | 40.3 | 10 | Instructions |
| EIMG | Paper | Link | 2023/12 | Image (IAPS, NAPS) | MIDI | 3K | 12.5 | 15 | VA Value |
| MeLBench | Paper | Link | 2024/06 | Image (Diverse Videos) | Audio | 11.2K | 31.2 | 10 | Genre, Caption |
Objective metrics for music quality:

| Metric | Modality | Type |
|---|---|---|
| Scale Consistency | MIDI | Pitch |
| Pitch Entropy | MIDI | Pitch |
| Pitch Class Histogram Entropy | MIDI | Pitch |
| Empty Beat Rate | MIDI | Rhythm |
| Average Inter-Onset Interval | MIDI | Rhythm |
| Grooving Pattern Similarity | MIDI | Rhythm |
| Structure Indicator | MIDI | Rhythm |
| Frechet Audio Distance (FAD) | Audio | Fidelity |
| Frechet Distance (FD) | Audio | Fidelity |
| Kullback-Leibler Divergence (KL) | Audio | Fidelity |
| Beats Coverage Score (BCS) | Audio | Rhythm |
| Beats Hit Score (BHS) | Audio | Rhythm |
| Inception Score (IS) | Audio | Fidelity |
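Two of the symbolic metrics above can be computed directly from a note list. The sketch below gives minimal, hypothetical implementations of Pitch Class Histogram Entropy (entropy of the 12-bin pitch-class distribution) and Empty Beat Rate (fraction of beats containing no note onset); the function names and input conventions are ours, not taken from any specific paper's toolkit.

```python
import math
from collections import Counter

def pitch_class_histogram_entropy(pitches):
    """Entropy (bits) of the 12-bin pitch-class histogram of MIDI pitch numbers."""
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def empty_beat_rate(onset_beats, num_beats):
    """Fraction of beats [0, num_beats) containing no note onset."""
    occupied = {int(b) for b in onset_beats}
    return sum(1 for b in range(num_beats) if b not in occupied) / num_beats

# A piece with onsets in beats 0, 1, 2 but none in beat 3:
rate = empty_beat_rate([0.0, 1.5, 2.0], 4)  # → 0.25
```

Lower entropy indicates a more tonally stable pitch distribution; a high empty beat rate suggests sparse, rhythmically inactive output.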
Objective metrics for vision-music correspondence:

| Metric | Modality | Type |
|---|---|---|
| ImageBind Score / Rank | Audio, Video/Image | Semantic |
| CLAP Score | Audio, Audio/Text | Semantic |
| Video-Music CLIP Precision (VMCP) | Audio, Video | Semantic |
| Video-Music Correspondence | Audio, Video | Semantic |
| Cross-modal Relevance | Audio, Video | Semantic |
| Temporal Alignment | Audio, Video | Rhythmic |
| Rhythm Alignment | Audio, Video | Rhythmic |
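The embedding-based semantic metrics above (ImageBind Score, CLAP Score, VMCP) all reduce to comparing video and audio embeddings in a shared space: a score is the cosine similarity of a pair, and a rank is the position of the ground-truth music among candidates. The sketch below shows that computation on placeholder embeddings; in practice the vectors would come from a pretrained model such as ImageBind or CLAP, which we do not invoke here.

```python
import numpy as np

def crossmodal_score(video_emb, audio_emb):
    """Cosine similarity between L2-normalized embeddings (ImageBind/CLAP-style score)."""
    v = video_emb / np.linalg.norm(video_emb)
    a = audio_emb / np.linalg.norm(audio_emb)
    return float(v @ a)

def retrieval_rank(query_emb, candidate_embs, target_idx):
    """1-based rank of the ground-truth audio among candidates for a video query."""
    sims = [crossmodal_score(query_emb, c) for c in candidate_embs]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    return order.index(target_idx) + 1

video = np.array([1.0, 0.0])                          # placeholder video embedding
candidates = [np.array([1.0, 0.1]), np.array([0.0, 1.0])]
rank = retrieval_rank(video, candidates, target_idx=0)  # → 1
```

Mean rank (or recall@k) over a test set then summarizes how often the generated or paired music is retrieved for its own video.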
Subjective metrics for music quality:

| Metric |
|---|
| Music Melody |
| Music Rhythm |
| Music Richness |
| Audio Quality |
| Overall Music Quality |
Subjective metrics for vision-music correspondence:

| Metric |
|---|
| Semantic Consistency |
| Rhythm Consistency |
| Emotion Consistency |
| Overall Correspondence |
- [2016/07] [TMM 2016] Bridging Music and Image via Cross-Modal Ranking Analysis [paper]
- [2016/12] [TMM 2018] Creating A Multi-track Classical Music Performance Dataset for Multi-modal Music Analysis: Challenges, Insights, and Applications [paper]
- [2017/04] [ICMR 2018] Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint [paper]
- [2017/08] [ICCV 2017] Image2song: Song Retrieval via Bridging Image Content and Lyric Words [paper]
- [2018/04] [ECCV 2018] The Sound of Pixels [paper]
- [2019/04] [ICASSP 2019] Learning Affective Correspondence between Music and Image [paper]
- [2020/04] [ICASSP 2020] Sight to Sound: An End-to-End Approach for Visual Piano Transcription [paper]
- [2020/06] [NeurIPS 2020] Audeo: Audio Generation for a Silent Performance Video [paper]
- [2020/07] [ECCV 2020] Foley Music: Learning to Generate Music from Videos [paper]
- [2020/07] [ICCC 2020] Automated Music Generation for Visual Art through Emotion [paper](https://computationalcreativity.net/iccc20/papers/137-iccc20.pdf)
- [2020/09] [Multimedia Tools and Applications 2021] Deep learning-based late fusion of multimodal information for emotion classification of music video [paper]
- [2020/12] Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements [paper]
- [2021/06] [NeurIPS 2021] How Does it Sound? Generation of Rhythmic Soundtracks for Human Movement Videos [paper]
- [2021/07] Dance2Music: Automatic Dance-driven Music Generation [paper]
- [2021/07] [ICCC 2021] MuSyFI - Music Synthesis From Images [paper]
- [2021/11] [ACMMM 2021] Video Background Music Generation with Controllable Music Transformer [paper]
- [2021/12] InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer [paper]
- [2021/12] [GCCE 2021] Semi-automatic music piece creation based on impression words extracted from object and background in color image [paper]
- [2022/04] [ECCV 2022] Quantized GAN for Complex Music Generation from Dance Videos [paper]
- [2022/05] [Applied Science 2022] Double Linear Transformer for Background Music Generation from Videos [paper]
- [2022/06] [ICLR 2023] Discrete Contrastive Diffusion for Crossmodal Music and Image Generation [paper]
- [2022/11] [ICCV 2023] Video Background Music Generation: Dataset, Method and Evaluation [paper]
- [2022/11] Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation [paper]
- [2023/01] MusicLM: Generating Music From Text [paper]
- [2023/01] [ISM 2022] Retaining Semantics in Image to Music Conversion [paper]
- [2023/01] [Computational Visual Media 2024] Dance2MIDI: Dance-driven multi-instruments music generation [paper]
- [2023/03] [Information Fusion 2023] EmoMV: Affective music-video correspondence learning datasets for classification and retrieval [paper]
- [2023/05] [AAAI 2024] V2Meow: Meowing to the Visual Beat via Video-to-Music Generation [paper]
- [2023/05] [ICML 2023] Long-Term Rhythmic Video Soundtracker [paper]
- [2023/11] MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models [paper]
- [2023/11] [Expert Systems with Applications 2024] Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [paper]
- [2023/11] [SIGGRAPH Asia 2023] Motion to Dance Music Generation using Latent Diffusion Model [paper]
- [2023/12] [TMM 2023] Continuous Emotion-Based Image-to-Music Generation [paper]
- [2023/12] CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling [paper]
- [2024/01] [WACV 2024] Let the Beat Follow You - Creating Interactive Drum Sounds From Body Rhythm [paper]
- [2024/01] [SIGGRAPH Asia 2024] Dance-to-Music Generation with Encoder-based Textual Inversion [paper]
- [2024/05] [CVPR 2024] Diff-BGM: A Diffusion Model for Video Background Music Generation [paper]
- [2024/05] [NeurIPS 2024] M3GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation [paper]
- [2024/05] [TMM 2024] DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator [paper]
- [2024/05] Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models [paper]
- [2024/06] [CVPR 2024] MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [paper]
- [2024/06] [CVPR 2025] VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [paper]
- [2024/07] [ICMEW 2024] Popular Hooks: A Multimodal Dataset of Musical Hooks for Music Understanding and Generation [paper]
- [2024/07] [Array 2024] D2MNet for music generation joint driven by facial expressions and dance movements [paper]
- [2024/07] MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [paper]
- [2024/08] [The Visual Computer 2024] Video-driven musical composition using large language model with memory-augmented state space [paper]
- [2024/09] [WACV 2025] VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos [paper]
- [2024/09] [EURASIP Journal on Audio, Speech, and Music Processing] Dance2Music-Diffusion: leveraging latent diffusion models for music generation from dance videos [paper]
- [2024/10] MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [paper]
- [2024/10] [ICASSP 2025] SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data [paper]
- [2024/10] [TCSS 2024] Video Echoed in Harmony: Learning and Sampling Video-Integrated Chord Progression Sequences for Controllable Video Background Music [paper]
- [2024/10] [TII 2025] Application and Research of Music Generation System Based on CVAE and Transformer-XL in Video Background Music [paper]
- [2024/10] M2M-Gen: A Multimodal Framework for Automated Background Music Generation in Japanese Manga Using Large Language Models [paper]
- [2024/10] UniMuMo: Unified Text, Music and Motion Generation [paper]
- [2024/11] Harmonizing Pixels and Melodies: Maestro-Guided Film Score Generation and Composition Style Transfer [paper]
- [2024/11] [NeurIPS 2024] MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence [paper]
- [2024/12] VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features [paper]
- [2024/12] Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation [paper]
- [2025/01] XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework [paper]
- [2025/01] GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions [paper]
- [2025/03] AudioX: Diffusion Transformer for Anything-to-Audio Generation [paper]
- [2025/03] [CVPR 2025] FilmComposer: LLM-Driven Music Production for Silent Film Clips [paper]
- [2025/03] [CVPR 2025] HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization [paper]
- [2025/04] Extending Visual Dynamics for Video-to-Music Generation [paper]
The repo is actively updated. If you notice any mistakes or would like a work to be included in our list, please let us know through GitHub pull requests or e-mail: [email protected].
If you find our work valuable for your research or applications, we would greatly appreciate a star ⭐ and a citation using the BibTeX entry provided below.
@article{Wang2025VisionToMusic,
title={Vision-to-Music Generation: A Survey},
author={Wang, Zhaokai and Bao, Chenxi and Zhuo, Le and Han, Jingrui and Yue, Yang and Tang, Yihong and Huang, Victor Shea-Jay and Liao, Yue},
journal={arXiv preprint arXiv:2503.21254},
year={2025}
}

