This extension adds the capability to extract and save standalone transcriptions from the "jonatasgrosman/wav2vec2-large-xlsr-53-japanese" model when used with the whisperX_research repository.
The original whisperX_research repository enhances WhisperX ASR results using the wav2vec2 Japanese model, but it doesn't provide a way to access or save the independent wav2vec2 model transcriptions. This extension solves that problem.
- Extract and save standalone transcriptions from the wav2vec2 Japanese model
- Process long-form audio with automatic chunking
- Save transcriptions in multiple formats (TXT, SRT, JSON)
- Compare wav2vec2 and WhisperX transcription results
- Command-line interface for easy usage
- Clone the whisperX_research repository:

  ```bash
  git clone https://github.com/dgoryeo/whisperX_research
  cd whisperX_research
  ```

- Copy the extension files to the repository:

  ```bash
  # Copy the wav2vec2_transcriber.py file
  cp /path/to/wav2vec2_transcriber.py .
  # Copy the integration script
  cp /path/to/process_audio.py .
  ```

- Install additional dependencies if needed:

  ```bash
  pip install transformers librosa
  ```
```bash
python process_audio.py path/to/your/japanese_audio.mp3 --output-dir transcriptions
```
This will:
- Process the audio with both the wav2vec2 model and WhisperX
- Save transcriptions in TXT, SRT, and JSON formats
- Compare the results and save a comparison file
```bash
# Run only the wav2vec2 transcription
python process_audio.py path/to/your/japanese_audio.mp3 --wav2vec2-only

# Run only the WhisperX transcription
python process_audio.py path/to/your/japanese_audio.mp3 --whisperx-only

# Limit the output formats
python process_audio.py path/to/your/japanese_audio.mp3 --formats txt,json

# Force CPU processing
python process_audio.py path/to/your/japanese_audio.mp3 --device cpu
```
```text
usage: process_audio.py [-h] [--output-dir OUTPUT_DIR] [--formats FORMATS]
                        [--device DEVICE] [--wav2vec2-only] [--whisperx-only]
                        [--no-compare] [--language LANGUAGE]
                        [--model-name MODEL_NAME] [--batch-size BATCH_SIZE]
                        audio_path

Process Japanese audio with Wav2Vec2 and WhisperX

positional arguments:
  audio_path            Path to the audio file

optional arguments:
  -h, --help            show this help message and exit
  --output-dir OUTPUT_DIR
                        Directory to save transcription outputs
  --formats FORMATS     Output formats (comma-separated)
  --device DEVICE       Device to run models on (cuda or cpu)
  --wav2vec2-only       Only run Wav2Vec2 transcription
  --whisperx-only       Only run WhisperX transcription
  --no-compare          Skip comparison of results
  --language LANGUAGE   Language code for WhisperX
  --model-name MODEL_NAME
                        Whisper model name
  --batch-size BATCH_SIZE
                        Batch size for WhisperX
```
The integration script is designed to work with the existing whisperX_research repository structure. It imports the transcription function from the original repository and integrates it with the new wav2vec2 transcription capabilities.
You may need to adjust the import path in `process_audio.py` based on the actual structure of the whisperX_research repository:

```python
# Update this line based on the actual import path
from whisperx_research.transcribe import transcribe_with_whisperx
```
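If the repository layout varies between checkouts, one defensive option is to try a few candidate module paths at import time. This is a sketch, not part of the extension; the fallback path `transcribe` is an assumption, and only `whisperx_research.transcribe` comes from the snippet above:

```python
import importlib


def locate_transcribe_fn(candidates=("whisperx_research.transcribe", "transcribe")):
    """Return transcribe_with_whisperx from the first importable candidate module."""
    for mod_name in candidates:
        try:
            mod = importlib.import_module(mod_name)
        except ImportError:
            continue
        fn = getattr(mod, "transcribe_with_whisperx", None)
        if fn is not None:
            return fn
    raise ImportError(
        "transcribe_with_whisperx not found in any of: " + ", ".join(candidates)
    )
```

If none of the candidates resolve, the raised `ImportError` lists every path that was tried, which makes the fix obvious.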
For an input file `japanese_speech.mp3`, the script will generate:
```text
transcriptions/
├── japanese_speech.wav2vec2.txt    # Plain text transcription from wav2vec2
├── japanese_speech.wav2vec2.srt    # SRT subtitles from wav2vec2
├── japanese_speech.wav2vec2.json   # Detailed JSON from wav2vec2
├── japanese_speech.whisperx.txt    # Plain text from WhisperX
├── japanese_speech.whisperx.json   # Detailed JSON from WhisperX
├── japanese_speech.comparison.txt  # Comparison between both transcriptions
└── japanese_speech.summary.json    # Summary of processing
```
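The `.srt` files use standard SubRip timing (`HH:MM:SS,mmm`). If you post-process segment timestamps yourself, the conversion from seconds is a small helper like this (illustrative, not part of the extension):

```python
def seconds_to_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, e.g. 75.5 -> '00:01:15,500'."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)   # 3,600,000 ms per hour
    minutes, rem = divmod(rem, 60_000)       # 60,000 ms per minute
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
```

For example, `seconds_to_srt_timestamp(75.5)` returns `"00:01:15,500"`.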
- The `Wav2Vec2Transcriber` class handles processing audio with the wav2vec2 model:
  - Loads the Japanese wav2vec2 model
  - Processes audio in chunks to handle long-form content
  - Extracts text transcriptions with timestamps
  - Saves the results in multiple formats
- The integration script (`process_audio.py`):
  - Provides a command-line interface
  - Handles both wav2vec2 and WhisperX processing
  - Saves results from both models
  - Optionally compares the transcription results
The chunking parameters in the `transcribe_audio` method can be adjusted for different audio lengths:

```python
# For very long audio files, you might want to increase the chunk size
result = transcriber.transcribe_audio(
    audio_path,
    chunk_size_seconds=60.0,
    overlap_seconds=10.0,
)
```
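To see how these two parameters interact, here is a small, self-contained illustration of how overlapping chunk windows can be derived; it mirrors the idea, not necessarily the class's exact implementation:

```python
def chunk_windows(total_seconds: float, chunk_size: float, overlap: float):
    """Yield (start, end) windows covering the audio, each overlapping the next."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # each window starts this far after the previous one
    start = 0.0
    while start < total_seconds:
        yield (start, min(start + chunk_size, total_seconds))
        start += step
```

With `chunk_size=60.0` and `overlap=10.0`, a 150-second file yields windows `(0, 60)`, `(50, 110)`, `(100, 150)`: the 10-second overlap gives the model context across chunk boundaries so words cut in half by one window are seen whole in the next.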
You can also use the transcriber programmatically in your Python code:

```python
from wav2vec2_transcriber import Wav2Vec2Transcriber

# Initialize the transcriber
transcriber = Wav2Vec2Transcriber()

# Transcribe an audio file
result = transcriber.transcribe_audio("your_audio.mp3")

# Print the transcription
print(result.text)

# Save to files
transcriber.save_transcription(result, "output", "your_audio")
```
- The wav2vec2 model works best with clear, well-recorded Japanese audio
- Processing speed depends on your hardware and the length of the audio
- For very long audio files, consider adjusting the chunk size parameters