# WhisperVideo: Visually grounded speaker transcription for long videos

Track who is speaking, and align speech to on-screen faces.
| WhisperX | WhisperVideo (Ours) |
|---|---|
| Text-only transcript | Visually grounded panel (`video_with_panel_compressed.mp4`) |
WhisperVideo is a demo pipeline for long-form, multi-speaker videos: it links speech to on-screen speakers and keeps speaker identities consistent over time. It is built for real conversations, not short clips.
- 🔎 SAM3 video segmentation for robust face masks
- 🗣️ Active speaker detection with TalkNet (audio-visual)
- 🧠 Identity memory with visual embeddings and track clustering
- 📝 Aligned subtitles with speaker IDs and panel overlays
- 🎥 Panel visualization for compact review and demo videos
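The active speaker detection step can be illustrated with a small post-processing sketch: smooth each face track's per-frame speaking scores, then pick the highest-scoring track per frame. This is a minimal illustration under assumed inputs, not TalkNet's actual API; the function names, track IDs, and threshold are hypothetical.

```python
def smooth_scores(scores, window=5):
    """Moving average over a track's per-frame speaking scores."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def active_speaker_per_frame(track_scores, threshold=0.0):
    """Per frame, pick the track with the highest smoothed score,
    or None if no track clears the threshold."""
    smoothed = {tid: smooth_scores(s) for tid, s in track_scores.items()}
    n_frames = len(next(iter(smoothed.values())))
    labels = []
    for i in range(n_frames):
        tid, best = max(((t, s[i]) for t, s in smoothed.items()),
                        key=lambda x: x[1])
        labels.append(tid if best >= threshold else None)
    return labels

# Two hypothetical face tracks: face0 speaks first, then face1 takes over.
labels = active_speaker_per_frame({
    "face0": [1.2, 1.0, 0.8, -0.5, -1.0],
    "face1": [-1.0, -0.8, 0.5, 1.1, 1.3],
})
```

Smoothing suppresses single-frame flips so the speaker label stays stable across brief score dips.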
- Visually grounded speaker attribution
- Long-video friendly
- Identity memory and clean speaker labels
- Panel view and subtitles for review
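The identity-memory idea above can be sketched as a running bank of face embeddings: each new track is matched against known speakers by cosine similarity and either reuses an existing ID or opens a new one. This is a toy sketch under assumptions (class name, threshold, and 2-D embeddings are all illustrative), not the project's actual implementation.

```python
import numpy as np

class IdentityMemory:
    """Toy identity bank: assign consistent speaker IDs to face-track
    embeddings by cosine similarity against stored references."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.bank = []  # list of (speaker_id, reference_embedding)

    @staticmethod
    def _cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def assign(self, embedding):
        """Return an existing speaker ID if similar enough, else a new one."""
        best_id, best_sim = None, -1.0
        for sid, ref in self.bank:
            sim = self._cosine(embedding, ref)
            if sim > best_sim:
                best_id, best_sim = sid, sim
        if best_id is not None and best_sim >= self.threshold:
            return best_id
        new_id = len(self.bank)
        self.bank.append((new_id, np.asarray(embedding, dtype=float)))
        return new_id

memory = IdentityMemory(threshold=0.7)
a = memory.assign(np.array([1.0, 0.0]))    # first track -> new ID
b = memory.assign(np.array([0.95, 0.05]))  # near-duplicate -> same ID
c = memory.assign(np.array([0.0, 1.0]))    # different face -> new ID
```

Clustering on top of such a bank is what keeps speaker labels stable when a face leaves and re-enters the frame.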
We recommend using the existing environment:
```
/home/siyuan/miniconda3/envs/whisperv/bin/python -V
```

If you need to (re)install packages, install the core stack:

```
pip install torch torchvision torchaudio
pip install whisperx pyannote.audio scenedetect opencv-python python_speech_features pysrt
```

TalkNet checkpoint auto-download uses gdown (included in whisperv/requirement.txt):

```
pip install gdown
```

Create a .env file at the repo root:

```
HF_TOKEN=your_huggingface_token
```

Run inference:

```
/home/siyuan/miniconda3/envs/whisperv/bin/python whisperv/inference_folder_sam3.py \
    --videoFolder demos/your_video_folder \
    --renderPanel \
    --panelTheme twitter \
    --panelCompose subtitles \
    --subtitle
```

The main results are written under:
```
<videoFolder>/pyavi/video_with_panel.mp4
<videoFolder>/pywork/*.pckl
```
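The `pywork/*.pckl` files are standard Python pickles, so they can be inspected with the standard library. A minimal sketch for loading them into a dict keyed by file name (the demo below uses a stand-in temporary directory and a made-up `tracks.pckl`; the real file names and contents depend on the pipeline run):

```python
import glob
import os
import pickle
import tempfile

def load_pywork(pywork_dir):
    """Load every .pckl result file in <videoFolder>/pywork into a dict."""
    results = {}
    for path in sorted(glob.glob(os.path.join(pywork_dir, "*.pckl"))):
        name = os.path.splitext(os.path.basename(path))[0]
        with open(path, "rb") as f:
            results[name] = pickle.load(f)
    return results

# Stand-in directory with a dummy file; point load_pywork at your real
# <videoFolder>/pywork directory in practice.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "tracks.pckl"), "wb") as f:
        pickle.dump([{"track": 0, "frames": [0, 1, 2]}], f)
    results = load_pywork(d)
```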
- The TalkNet checkpoint will auto-download if missing.
- A HuggingFace token is required for diarization.
- For best results, use a CUDA GPU.
- Speech and vision models: SAM 3, WhisperX, TalkNet, Pyannote
- Open-source video processing tools: FFmpeg, SceneDetect
If you find our work helpful, please consider citing our paper. Thank you!
```bibtex
@misc{whispervideo,
    title     = {WhisperVideo},
    author    = {Siyuan Hu* and Kevin Qinghong Lin* and Mike Zheng Shou},
    url       = {https://github.com/showlab/whisperVideo},
    publisher = {Zenodo},
    version   = {v0.1.0},
    month     = {January},
    year      = {2026}
}
```

