# WhisperVideo: Visually grounded speaker transcription for long videos

Track who is speaking, and align speech to on-screen faces.
| WhisperX | WhisperVideo (Ours) |
|---|---|
| Text-only transcript | Visually grounded panel (`video_with_panel_compressed.mp4`) |
WhisperVideo is a demo pipeline for long-form, multi-speaker videos: it links speech to on-screen speakers and keeps speaker identities consistent over time. It is built for real conversations, not short clips.
- 🔎 SAM3 video segmentation for robust face masks
- 🗣️ Active speaker detection with TalkNet (audio-visual)
- 🧠 Identity memory with visual embeddings and track clustering
- 📝 Aligned subtitles with speaker IDs and panel overlays
- 🎥 Panel visualization for compact review and demo videos
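The active speaker detection step can be illustrated with a small post-processing sketch: smooth each face track's per-frame speaking scores, then pick the highest-scoring track per frame. This is a minimal illustration under assumed inputs, not TalkNet's actual API; the function names, track IDs, and threshold are hypothetical.

```python
def smooth_scores(scores, window=5):
    """Moving average over a track's per-frame speaking scores."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def active_speaker_per_frame(track_scores, threshold=0.0):
    """Per frame, pick the track with the highest smoothed score,
    or None if no track clears the threshold."""
    smoothed = {tid: smooth_scores(s) for tid, s in track_scores.items()}
    n_frames = len(next(iter(smoothed.values())))
    labels = []
    for i in range(n_frames):
        tid, best = max(((t, s[i]) for t, s in smoothed.items()),
                        key=lambda x: x[1])
        labels.append(tid if best >= threshold else None)
    return labels

# Two hypothetical face tracks: face0 speaks first, then face1 takes over.
labels = active_speaker_per_frame({
    "face0": [1.2, 1.0, 0.8, -0.5, -1.0],
    "face1": [-1.0, -0.8, 0.5, 1.1, 1.3],
})
```

Smoothing suppresses single-frame flips so the speaker label stays stable across brief score dips.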
- Visually grounded speaker attribution
- Long-video friendly
- Identity memory and clean speaker labels
- Panel view and subtitles for review
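The identity-memory idea above can be sketched as a running bank of face embeddings: each new track is matched against known speakers by cosine similarity and either reuses an existing ID or opens a new one. This is a toy sketch under assumptions (class name, threshold, and 2-D embeddings are all illustrative), not the project's actual implementation.

```python
import numpy as np

class IdentityMemory:
    """Toy identity bank: assign consistent speaker IDs to face-track
    embeddings by cosine similarity against stored references."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.bank = []  # list of (speaker_id, reference_embedding)

    @staticmethod
    def _cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def assign(self, embedding):
        """Return an existing speaker ID if similar enough, else a new one."""
        best_id, best_sim = None, -1.0
        for sid, ref in self.bank:
            sim = self._cosine(embedding, ref)
            if sim > best_sim:
                best_id, best_sim = sid, sim
        if best_id is not None and best_sim >= self.threshold:
            return best_id
        new_id = len(self.bank)
        self.bank.append((new_id, np.asarray(embedding, dtype=float)))
        return new_id

memory = IdentityMemory(threshold=0.7)
a = memory.assign(np.array([1.0, 0.0]))    # first track -> new ID
b = memory.assign(np.array([0.95, 0.05]))  # near-duplicate -> same ID
c = memory.assign(np.array([0.0, 1.0]))    # different face -> new ID
```

Clustering on top of such a bank is what keeps speaker labels stable when a face leaves and re-enters the frame.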
We recommend using the existing environment:
```
/home/siyuan/miniconda3/envs/whisperv/bin/python -V
```

If you need to (re)install packages, install the core stack:

```
pip install torch torchvision torchaudio
pip install whisperx pyannote.audio scenedetect opencv-python python_speech_features pysrt
```

TalkNet checkpoint auto-download uses gdown (included in whisperv/requirement.txt):

```
pip install gdown
```

Create a .env file at the repo root:

```
HF_TOKEN=your_huggingface_token
```

Run inference:

```
/home/siyuan/miniconda3/envs/whisperv/bin/python whisperv/inference_folder_sam3.py \
    --videoFolder demos/your_video_folder \
    --renderPanel \
    --panelTheme twitter \
    --panelCompose subtitles \
    --subtitle
```

The main results are written under:
```
<videoFolder>/pyavi/video_with_panel.mp4
<videoFolder>/pywork/*.pckl
```
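The `pywork/*.pckl` files are standard Python pickles, so they can be inspected with the standard library. A minimal sketch for loading them into a dict keyed by file name (the demo below uses a stand-in temporary directory and a made-up `tracks.pckl`; the real file names and contents depend on the pipeline run):

```python
import glob
import os
import pickle
import tempfile

def load_pywork(pywork_dir):
    """Load every .pckl result file in <videoFolder>/pywork into a dict."""
    results = {}
    for path in sorted(glob.glob(os.path.join(pywork_dir, "*.pckl"))):
        name = os.path.splitext(os.path.basename(path))[0]
        with open(path, "rb") as f:
            results[name] = pickle.load(f)
    return results

# Stand-in directory with a dummy file; point load_pywork at your real
# <videoFolder>/pywork directory in practice.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "tracks.pckl"), "wb") as f:
        pickle.dump([{"track": 0, "frames": [0, 1, 2]}], f)
    results = load_pywork(d)
```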
- The TalkNet checkpoint will auto-download if missing.
- A HuggingFace token is required for diarization.
- For best results, use a CUDA GPU.
- Speech and vision models: SAM 3, WhisperX, TalkNet, Pyannote
- Open-source video processing tools: FFmpeg, SceneDetect
If you find our work helpful, please consider citing our paper. Thank you!
```bibtex
@misc{whispervideo,
    title     = {WhisperVideo},
    author    = {Siyuan Hu* and Kevin Qinghong Lin* and Mike Zheng Shou},
    url       = {https://github.com/showlab/whisperVideo},
    publisher = {Zenodo},
    version   = {v0.1.0},
    month     = {January},
    year      = {2026}
}
```

