Description
I'm trying to translate some Japanese and German content using large-v3. If I don't use VAD, the translation often gets caught in a loop and repeats the same line for every segment. It looks something like this.
Command without VAD:
./build/bin/whisper-cli -m models/ggml-large-v3.bin -tr -osrt -of samples/p-test input.wav
[00:00:00.000 --> 00:00:29.980] Thank you.
[00:00:30.000 --> 00:00:59.980] Thank you.
[00:01:00.000 --> 00:01:29.980] Thank you.
[00:01:30.000 --> 00:01:59.980] Thank you.
[00:02:00.000 --> 00:02:29.980] Thank you.
[00:02:30.000 --> 00:02:59.980] Thank you.
[00:03:00.000 --> 00:03:29.980] Thank you.
[00:03:30.000 --> 00:03:59.980] Thank you.
[00:04:00.000 --> 00:04:29.980] Thank you.
[00:04:30.000 --> 00:04:59.980] Thank you.
[00:05:00.000 --> 00:05:29.980] Thank you.
[00:05:30.000 --> 00:05:59.980] Thank you.
[00:06:00.000 --> 00:06:26.000] Thank you.
[00:06:26.000 --> 00:06:29.980] Thank you.
[00:06:30.000 --> 00:06:59.980] Thank you.
Command with VAD:
./build/bin/whisper-cli -m models/ggml-large-v3.bin --vad --vad-model models/silero-v5.1.2-ggml.bin -tr -osrt -of samples/p-test input.wav
When it does not loop, the run without VAD tends to translate the dialogue a bit more accurately (at least I assume so), and the result also tends to be more complete. With VAD, changing the threshold, padding, and silence duration does not seem to make a difference, and if the threshold is too high I get the translation loop again.
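For reference, the three settings mentioned above correspond, as far as I can tell, to the `--vad-threshold`, `--vad-speech-pad-ms`, and `--vad-min-silence-duration-ms` options of whisper-cli. A sketch of such an invocation follows; the numeric values are illustrative placeholders, not necessarily the ones tested, and the paths are the same as in the commands above.

```shell
# Illustrative values only; adjust per recording.
# --vad-threshold: speech probability above which a frame counts as speech
# --vad-speech-pad-ms: padding added around each detected speech segment
# --vad-min-silence-duration-ms: silence length needed to split segments
./build/bin/whisper-cli -m models/ggml-large-v3.bin \
  --vad --vad-model models/silero-v5.1.2-ggml.bin \
  --vad-threshold 0.5 \
  --vad-speech-pad-ms 100 \
  --vad-min-silence-duration-ms 300 \
  -tr -osrt -of samples/p-test input.wav
```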
I'm not sure what needs to be addressed here, or whether this is just a current limitation of translation. It seems whisperX exposes a larger set of VAD options, but I can't currently find its documentation for comparison.
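To make the loop easier to spot when comparing runs, here is a minimal sketch of a check over the console output. `longest_repeat` is a hypothetical helper, not part of whisper.cpp; it assumes each line has the `[start --> end] text` shape shown above and reports the longest run of identical consecutive segment texts.

```shell
# Hypothetical helper: reads transcript lines on stdin and prints
# "<run length><TAB><text>" for the longest run of identical neighbours.
longest_repeat() {
  # Drop the leading "[start --> end]" timestamp and following whitespace,
  # then count how many consecutive lines carry the same text.
  sed -E 's/^\[[^]]*\][[:space:]]*//' | awk '
    $0 == prev { run++ }
    $0 != prev { run = 1; prev = $0 }
    run > best { best = run; text = $0 }
    END { printf "%d\t%s\n", best, text }
  '
}
```

For example, `./build/bin/whisper-cli ... input.wav | longest_repeat` would print a large count next to "Thank you." for the looping run above, and a small count for a healthy translation.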