
WhisperVideo Banner

🎬 WhisperVideo

Visually grounded speaker transcription for long videos

Track who speaks and align speech to faces

🖼️ Demo

| WhisperX | WhisperVideo (Ours) |
| --- | --- |
| Text-only transcript | Visually grounded panel |
| WhisperX text-only example | video_with_panel_compressed.mp4 |

🎙️ Overview

WhisperVideo is a demo pipeline for long-form, multi-speaker videos. It links speech to on-screen speakers and keeps speaker identities consistent across the whole video. It is built for real conversations, not short clips.

  • 🔎 SAM3 video segmentation for robust face masks
  • 🗣️ Active speaker detection with TalkNet (audio-visual)
  • 🧠 Identity memory with visual embeddings and track clustering
  • 📝 Aligned subtitles with speaker IDs and panel overlays
  • 🎥 Panel visualization for compact review and demo videos
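The identity-memory step above can be sketched as greedy matching of face-track embeddings against remembered speakers by cosine similarity. This is a minimal illustration only; the function name, the memory layout, and the 0.7 threshold are our assumptions, not the repo's API:

```python
import numpy as np

def assign_identity(track_embedding, memory, threshold=0.7):
    """Match a face-track embedding against stored identities by cosine
    similarity; enroll a new identity if nothing is close enough.

    memory: dict mapping speaker id -> unit-norm embedding (np.ndarray).
    Returns the matched or newly created speaker id.
    """
    v = track_embedding / np.linalg.norm(track_embedding)
    best_id, best_sim = None, threshold
    for speaker_id, ref in memory.items():
        sim = float(v @ ref)
        if sim >= best_sim:
            best_id, best_sim = speaker_id, sim
    if best_id is None:
        best_id = f"SPEAKER_{len(memory):02d}"
        memory[best_id] = v
    return best_id

memory = {}
a = assign_identity(np.array([1.0, 0.0]), memory)  # enrolls a new identity
b = assign_identity(np.array([0.9, 0.1]), memory)  # close to the first -> same id
c = assign_identity(np.array([0.0, 1.0]), memory)  # orthogonal -> new identity
print(a, b, c)  # SPEAKER_00 SPEAKER_00 SPEAKER_01
```

The actual pipeline additionally clusters tracks over time; this sketch only shows the core match-or-enroll decision.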

✨ Features

  • Visually grounded speaker attribution
  • Long-video friendly
  • Identity memory and clean speaker labels
  • Panel view and subtitles for review

🧩 Install and Run

1. Create / use environment

We recommend using a dedicated conda environment (the commands below assume it is named whisperv):

conda activate whisperv
python -V

2. Optional dependencies

If you need to (re)install packages, install the core stack:

pip install torch torchvision torchaudio
pip install whisperx pyannote.audio scenedetect opencv-python python_speech_features pysrt

Auto-downloading the TalkNet checkpoint uses gdown (already listed in whisperv/requirement.txt):

pip install gdown

3. Set HF token

Create a .env file at repo root:

HF_TOKEN=your_huggingface_token
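At runtime the token can be loaded with python-dotenv, or with a few lines of stdlib parsing. This is a sketch under the assumption that a plain KEY=VALUE .env file is all the pipeline needs; the repo may load it differently:

```python
import os

def load_dotenv_minimal(path=".env"):
    """Parse KEY=VALUE lines from a .env file into os.environ
    (skips blank lines and '#' comments; existing variables win)."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_dotenv_minimal()
token = os.environ.get("HF_TOKEN")  # None if no .env file is present
```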

🚀 Quick Start

python whisperv/inference_folder_sam3.py \
  --videoFolder demos/your_video_folder \
  --renderPanel \
  --panelTheme twitter \
  --panelCompose subtitles \
  --subtitle

📦 Outputs

The main results are written under:

<videoFolder>/pyavi/video_with_panel.mp4
<videoFolder>/pywork/*.pckl
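The .pckl files under pywork/ can be inspected with the standard pickle module. A sketch (the helper name is ours, and the exact object layout inside each file is not documented here):

```python
import glob
import os
import pickle

def load_results(video_folder):
    """Load every pickle file the pipeline wrote under <videoFolder>/pywork/,
    keyed by file name without the .pckl extension."""
    results = {}
    pattern = os.path.join(video_folder, "pywork", "*.pckl")
    for path in sorted(glob.glob(pattern)):
        name = os.path.splitext(os.path.basename(path))[0]
        with open(path, "rb") as f:
            results[name] = pickle.load(f)
    return results

# results = load_results("demos/your_video_folder")
# print(results.keys())
```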

📌 Notes

  • The TalkNet checkpoint will auto-download if missing.
  • A Hugging Face token (HF_TOKEN in .env) is required for diarization.
  • For best results, use a CUDA GPU.
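The checkpoint auto-download in the first note amounts to a check-then-fetch step, which can be sketched as below. This is illustrative only: the checkpoint path and URL are placeholders, and the injectable downloader callable is our assumption (in the real pipeline it would be something like gdown.download):

```python
import os

def ensure_checkpoint(path, url, downloader):
    """Download the TalkNet checkpoint only if it is not already on disk.

    downloader(url, path) is any callable that fetches url to path,
    e.g. gdown.download for Google Drive-hosted weights.
    """
    if os.path.exists(path):
        return path  # already downloaded; skip the fetch
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    downloader(url, path)
    return path

# In the real pipeline this might look like:
#   import gdown
#   ensure_checkpoint("weights/talknet.model", CKPT_URL, gdown.download)
```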

🙏 Acknowledgements

🎓 Citations

If you find our work helpful, please consider citing our paper. Thank you!

@misc{whispervideo,
    title     = {WhisperVideo},
    author    = {Siyuan Hu and Kevin Qinghong Lin and Mike Zheng Shou},
    url       = {https://github.com/showlab/whisperVideo},
    publisher = {Zenodo},
    version   = {v0.1.0},
    month     = {January},
    year      = {2026}
}
