
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

🚀🚀 Welcome to the repo of video-SALMONN 2!

video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions. It is developed by the Department of Electronic Engineering at Tsinghua University and ByteDance.

🔥 News

  • 2025-09-26: A new version (Version-2509) of video-SALMONN 2+ is released, containing minor code revisions, updates to the 7B and 72B models, and the addition of a 3B model. The upgraded video-SALMONN 2+ further enhances audio-visual and visual-only understanding across various benchmarks.
  • 2025-07-17: We release the code and checkpoint of video-SALMONN 2+ (Version-2507). video-SALMONN 2+ achieves SOTA results on the Video-MME benchmark.
  • 2025-07-08: We release the 7B version of video-SALMONN 2.
  • 2025-06-18: We release the code of video-SALMONN 2.

⚡️ Results

We evaluate the models on audio-visual QA benchmarks including Video-MME, WorldSense, AVUT, Video-Holmes, and DailyOmni, and visual-only benchmarks including MLVU and LVBench. Our 3B and 7B models achieve SOTA results at comparable scales, while the 72B model surpasses all other open-source systems.


🌈 How to Use

How to train video-SALMONN 2

  1. Prepare the dataset following scripts/example_sft.json and scripts/example_dpo.json; a rough sketch of both entry formats follows this list.
  2. Download the LLaVA-OneVision model from Hugging Face.
  3. Modify the parameters in scripts/train_sft.sh and scripts/train_dpo.sh.
  4. Run bash scripts/train_sft.sh or bash scripts/train_dpo.sh.
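The exact schemas live in scripts/example_sft.json and scripts/example_dpo.json. As a rough orientation, the sketches below show what such records commonly look like for LLaVA-style models; the field names, the `<video>` placeholder, and the file paths are assumptions for illustration, not the repository's confirmed format. An SFT entry pairs a video with a target response:

```json
[
  {
    "video": "videos/sample_0001.mp4",
    "conversations": [
      {"from": "human", "value": "<video>\nDescribe the audio and visual content of this video."},
      {"from": "gpt", "value": "A person plays an acoustic guitar on a park bench while birds chirp nearby."}
    ]
  }
]
```

A DPO entry would additionally carry a preferred and a rejected response for the same prompt, which the DPO loss contrasts:

```json
[
  {
    "video": "videos/sample_0001.mp4",
    "prompt": "<video>\nDescribe the audio and visual content of this video.",
    "chosen": "A person plays an acoustic guitar on a park bench while birds chirp nearby.",
    "rejected": "A person sits outdoors."
  }
]
```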

How to evaluate a checkpoint

  1. Prepare the dataset following scripts/example_sft.json; a sketch follows this list.
  2. Modify the parameters in scripts/eval.sh.
  3. Run bash scripts/eval.sh.
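Evaluation reuses the SFT manifest format, so a test entry can mirror the sketch above, with the final turn holding the reference answer that model predictions are scored against (again, the field names are illustrative assumptions; scripts/example_sft.json is authoritative):

```json
[
  {
    "video": "videos/test_0001.mp4",
    "conversations": [
      {"from": "human", "value": "<video>\nWhat instrument is playing in the background?"},
      {"from": "gpt", "value": "An acoustic guitar."}
    ]
  }
]
```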

For video-SALMONN 2+, please refer to video_SALMONN2_plus.

👀 Team

Team Tsinghua: Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Chao Zhang

Team ByteDance: Wei Li, Zejun Ma

✨ Citation

If you find video-SALMONN 2 useful, please cite the paper:

```bibtex
@article{tang2025video,
    title={{video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models}},
    author={Changli Tang and Yixuan Li and Yudong Yang and Jimin Zhuang and Guangzhi Sun and Wei Li and Zejun Ma and Chao Zhang},
    journal={arXiv preprint arXiv:2506.15220},
    year={2025},
}
```
