Model description
We (BUT Speech@FIT) have recently developed DiCoW (Diarization-Conditioned Whisper), a target-speaker ASR model that enhances OpenAI’s Whisper by integrating speaker diarization for multi-talker, speaker-attributed ASR.
Unlike previous approaches, DiCoW conditions directly on diarization outputs and achieves state-of-the-art performance on multi-talker benchmarks such as AMI and Libri2Mix. The model recently secured second place in the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM) and received a jury award at the CHiME-8 challenge.
DiCoW employs Frame-Level Diarization-Dependent Transformations (FDDT), applying frame-wise projections of different embeddings—Silence, Target speaker, Non-target speaker, and Overlap with target—based on diarization outputs.
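To make the FDDT idea concrete, here is a minimal PyTorch sketch of frame-wise, diarization-dependent projections. It assumes per-frame diarization posteriors over the four classes named above and a convex combination of class-specific affine transforms; the class ordering, shapes, and mixing formulation are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of Frame-Level Diarization-Dependent Transformations (FDDT).
# Assumed diarization classes per frame: silence, target, non-target, overlap-with-target.
import torch
import torch.nn as nn


class FDDT(nn.Module):
    def __init__(self, d_model: int, num_classes: int = 4):
        super().__init__()
        # One affine transformation of the frame embedding per diarization class.
        self.transforms = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_classes)]
        )

    def forward(self, frames: torch.Tensor, diar_probs: torch.Tensor) -> torch.Tensor:
        """
        frames:     (batch, time, d_model) encoder frame embeddings
        diar_probs: (batch, time, num_classes) frame-wise diarization posteriors
        """
        # Apply every class-specific projection, then mix them frame by frame
        # according to that frame's diarization posteriors.
        projected = torch.stack([t(frames) for t in self.transforms], dim=-1)  # (B, T, D, C)
        return (projected * diar_probs.unsqueeze(2)).sum(dim=-1)               # (B, T, D)


# Toy usage: 100 frames of 1280-dim (Whisper-large-sized) embeddings.
fddt = FDDT(d_model=1280)
frames = torch.randn(1, 100, 1280)
diar_probs = torch.softmax(torch.randn(1, 100, 4), dim=-1)
print(fddt(frames, diar_probs).shape)  # torch.Size([1, 100, 1280])
```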
Designed for long-form, multi-speaker transcription, DiCoW excels in scenarios such as meetings, interviews, and spontaneous conversations. It also performs well for single-speaker ASR, achieving Word Error Rates (WER) of 2.1% on LibriSpeech test-clean, 4.3% on test-other, 5.3% on TED-LIUM, and 11.2% on VoxPopuli.
The model is based on Whisper, and the v3.2 release is already integrated with the Hugging Face Transformers AutoClasses.
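A hedged loading sketch via the AutoClasses, assuming the DiCoW v3.2 checkpoint is published on the Hugging Face Hub with custom code enabled; the repo id below is an assumption and should be replaced with the actual Hub page.

```python
# Sketch only: model_id is assumed, not confirmed by this issue.
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "BUT-FIT/DiCoW_v3_2"  # assumed Hub repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, trust_remote_code=True)
```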
Open source status
- The model implementation is available
- The model weights are available
Provide useful links for the implementation
Source Repositories
Related Publications
- DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition. Computer Speech & Language, 2025.
- Target Speaker ASR with Whisper. IEEE ICASSP 2025.
- BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge. CHiME 2024 Proceedings.
- BUT System for the MLC-SLM. arXiv:2506.13414.