This repository contains a Python script, `ghe_transcribe`, that transcribes audio files into text using Faster Whisper (a fast reimplementation of OpenAI's Whisper model) and Pyannote (for speaker diarization). The tool is especially useful for transcribing long audio recordings, improving transcription accuracy, and separating the audio by speaker.
On the Euler cluster, open https://jupyter.euler.hpc.ethz.ch/ and log in with your @ethz.ch account. Load the required modules, create a virtual environment, install the dependencies, and register the environment as a Jupyter kernel:

```bash
module load stack/2024-06 python/3.11.6
python3.11 -m venv venv3.11_ghe_transcribe --system-site-packages
source venv3.11_ghe_transcribe/bin/activate
pip3.11 install faster-whisper pyannote.audio ffmpeg-python huggingface-hub
ipython kernel install --user --name=venv3.11_ghe_transcribe
```
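Once the kernel is registered, you can select `venv3.11_ghe_transcribe` in JupyterHub. As a quick sanity check (an optional step beyond the documented instructions), confirm that a notebook is really running inside the virtual environment:

```python
# Run in a notebook cell using the venv3.11_ghe_transcribe kernel:
# the interpreter path should point into the virtual environment.
import sys

print(sys.executable)  # expect a path ending in venv3.11_ghe_transcribe/bin/python3.11
```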
To have all new JupyterHub instances start with the `venv3.11_ghe_transcribe` Python environment, edit the JupyterHub configuration file:

```bash
nano .config/euler/jupyterhub/jupyterlabrc
```

and write:

```bash
module load stack/2024-06 python/3.11.6
source venv3.11_ghe_transcribe/bin/activate
```
To install locally on macOS instead, use Homebrew for the system dependencies and set up the environment the same way:

```bash
brew install ffmpeg cmake python3.11
python3.11 -m venv venv3.11_ghe_transcribe --system-site-packages
source venv3.11_ghe_transcribe/bin/activate
pip3.11 install faster-whisper pyannote.audio ffmpeg-python huggingface-hub
ipython kernel install --user --name=venv3.11_ghe_transcribe
```
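As a quick sanity check (again, optional and not part of the repository's documented steps), confirm that the key packages are importable and report their installed versions:

```python
# Verify the main dependencies installed into the virtual environment.
from importlib.metadata import version

import faster_whisper  # transcription backend
import pyannote.audio  # speaker diarization backend

print(version("faster-whisper"))
print(version("pyannote.audio"))
```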
Let's say you have an audio file called `testing_audio_01.mp3` in the `media` folder that you want to transcribe into a `.csv` and a `.md` file.

- Set up a `config.json` file containing your HuggingFace access token (see the sketch after this list for how such a token is typically used):

```json
{
    "HF_TOKEN": "hf_*********************"
}
```

- Run the following command:

```bash
python ghe_transcribe.py media/testing_audio_01.mp3
```
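For context, the token in `config.json` is what grants access to the gated Pyannote model. A minimal sketch of how such a token can be read and handed to `pyannote.audio` (an illustration of the mechanism, not necessarily how `ghe_transcribe.py` is implemented internally):

```python
# Illustration only: load the HuggingFace token from config.json and use it
# to fetch the gated diarization pipeline. ghe_transcribe.py may differ.
import json

from pyannote.audio import Pipeline

with open("config.json") as f:
    hf_token = json.load(f)["HF_TOKEN"]

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=hf_token,
)
```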
Options for `ghe_transcribe`:

```python
ghe_transcribe(
    audio_file,
    device='cpu'|'cuda'|'mps',
    whisper_model='small.en'|'base.en'|'medium.en'|'small'|'base'|'medium'|'large'|'turbo',
    pyannote_model='pyannote/speaker-diarization@2.1'|'pyannote/speaker-diarization-3.1',
    save_output=True|False,
    semicolon=True|False,
    info=True|False,
)
```
- `audio_file`: The path to the audio file you want to transcribe. Accepted formats are `.mp3` and `.wav`.
- `device` (optional): The device on which to run the model (`cpu`|`cuda`|`mps`). By default, the device is automatically detected based on whether CUDA or MPS is available.
- `whisper_model` (optional): The size of the Faster Whisper model to use for transcription. Available options include `small.en`, `base.en`, `medium.en`, `small`, `base`, `medium`, `large`, `turbo`. By default, the English model `medium.en` is used.
- `pyannote_model` (optional): The Pyannote diarization model; defaults to `pyannote/speaker-diarization-3.1`.
- `save_output` (optional): Default is `True`, which creates both `output.csv` and `output.md`. If disabled, the transcription is only returned as a list of strings.
- `semicolon` (optional): Whether to use semicolons or commas as the column separator in the CSV output. The default is commas.
- `info` (optional): If `True`, print additional information about the detected language and its probability.
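For example, the command-line run above can also be done from Python (assuming `ghe_transcribe` is importable from `ghe_transcribe.py`; argument values follow the signature shown above):

```python
# Call ghe_transcribe directly from Python with explicit options.
from ghe_transcribe import ghe_transcribe

result = ghe_transcribe(
    "media/testing_audio_01.mp3",
    device="cpu",               # "cuda"/"mps" also accepted; auto-detected if omitted
    whisper_model="medium.en",  # the default English model
    save_output=True,           # write output.csv and output.md
    semicolon=False,            # comma-separated CSV (the default)
    info=True,                  # print detected language and its probability
)
```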
Timing tests are run using the `timing` function defined in `utils.py` and the audio file `media/testing_audio_01.mp3`:

| Device | Time (sec) |
|---|---|
| Euler Cluster (16 CPU cores, 16GB RAM) - `cpu` | 67.4988 |
| Euler Cluster (32 CPU cores, 16GB RAM) - `cpu` | 44.3622 |
| macOS (Apple M2, 16GB RAM) - `mps` | 41.2122 |
| macOS (Apple M2, 16GB RAM) - `cpu` | 64.7549 |
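The `timing` helper itself lives in `utils.py`; as a rough sketch of what such a helper typically looks like (the actual implementation in the repository may differ):

```python
# Hypothetical sketch of a timing decorator; the real utils.py may differ.
import time
from functools import wraps

def timing(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} sec")
        return result
    return wrapper
```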
Why Whisper? See Whisper, wav2vec2 and Kaldi.

- `faster-whisper` by Guillaume Klein builds on OpenAI's open-source transcription model Whisper.

Why Pyannote? See Pyannote vs NeMo.

- `pyannote.audio` by Hervé Bredin, the open-source diarization model from pyannoteAI, gated behind a HuggingFace access token (https://hf.co/settings/tokens).
- `NeMo` by Nvidia, an open-source diarization model.
Related tools:

- `WhisperX` ← `faster-whisper` + `pyannote.audio`
- `whisper-diarization` ← `faster-whisper` + `NeMo`
- `insanely-fast-whisper` ← `insanely-faster-whisper` + `pyannote.audio`
- `wscribe-editor`, works with word-level timestamps in a `.json` formatted like sample.json.
- `QualCoder`, a qualitative data analysis application written in Python.
- `noScribe` ← `faster-whisper` + `pyannote.audio`
- `TranscriboZH` ← `WhisperX`