Here is my PR
Description:
I was experimenting with the Whisper model to transcribe phone calls where each speaker is on their own channel. One of the speakers talks significantly less than the other, which led to some challenges.
I split the audio file into two separate tracks, upsampled them to 16 kHz, and noticed that the `compression_ratio_threshold` parameter of the `transcribe()` function doesn't handle hallucinations as effectively as it could. It technically works as intended, but I believe it can be improved.
Here's the code I used:
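A minimal sketch of this kind of setup, not necessarily the exact code: it assumes a 16-bit stereo WAV input and the `openai-whisper` package. The helper names `split_stereo_wav` and `transcribe_channel`, the file names, and the `base` model are illustrative assumptions. Passing a file path to `transcribe()` lets Whisper's audio loader resample to 16 kHz through ffmpeg, so no manual resampling is shown.

```python
import array
import wave


def split_stereo_wav(path: str, left_path: str, right_path: str) -> None:
    """Write each channel of a 16-bit stereo WAV to its own mono WAV file."""
    with wave.open(path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo WAV file")
        rate = src.getframerate()
        # Samples are interleaved: L0, R0, L1, R1, ...
        samples = array.array("h", src.readframes(src.getnframes()))

    channels = ((left_path, samples[0::2]), (right_path, samples[1::2]))
    for out_path, channel in channels:
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(channel.tobytes())


def transcribe_channel(path: str, model_name: str = "base") -> dict:
    """Transcribe one mono track (hypothetical helper; needs openai-whisper and ffmpeg)."""
    import whisper  # imported lazily so the splitting helper works without it

    model = whisper.load_model(model_name)
    # 2.4 is the library default; it is spelled out here only for visibility.
    return model.transcribe(path, compression_ratio_threshold=2.4)


# Example usage (paths are placeholders):
# split_stereo_wav("call.wav", "speaker_a.wav", "speaker_b.wav")
# result = transcribe_channel("speaker_b.wav")
```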
When I print the segments one by one:
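A small sketch of such a printing loop, assuming `result` is the dictionary returned by `transcribe()`, whose `"segments"` entries carry `start`, `end`, `text`, and `compression_ratio` keys. The helper names are illustrative.

```python
def format_segment(segment: dict) -> str:
    """Render one entry of result["segments"] together with its compression ratio."""
    return (
        f"[{segment['start']:.2f} -> {segment['end']:.2f}] "
        f"compression_ratio={segment['compression_ratio']:.2f} "
        f"{segment['text'].strip()}"
    )


def print_segments(result: dict) -> None:
    """Print every segment of a transcribe() result on its own line."""
    for segment in result["segments"]:
        print(format_segment(segment))
```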
I get some hallucinations with unusually high compression ratio values:
From my observations, segments with a `compression_ratio` below 2 are real speech, while anything above approximately 2 to 2.2 is likely a hallucination.
Issue with Current Implementation:
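For reference, the `compression_ratio` that these thresholds are compared against is the UTF-8 byte length of the decoded text divided by its zlib-compressed length, so looping or repetitive output scores high. A minimal standalone sketch of that calculation follows; the two sample strings are made up for illustration and are not taken from the recordings.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size; repetition raises it."""
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))


# Illustrative inputs only.
ordinary = "Hi, I'm calling about the invoice you sent last week."
looping = "Thank you. " * 20

print(f"ordinary: {compression_ratio(ordinary):.2f}")
print(f"looping:  {compression_ratio(looping):.2f}")
```

Short, varied sentences barely compress, while a phrase repeated many times compresses very well, which is why a high ratio is a useful signal for looped hallucinations.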
Upon reviewing the `transcribe()` function, I found that when the `compression_ratio_threshold` is exceeded, the function sets `needs_fallback = True` and retries decoding with the next temperature value. However, if every temperature yields a high `compression_ratio`, the function ultimately accepts the last result, even if it's clearly a hallucination.
Here's the relevant portion of the code:
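A simplified, standalone paraphrase of that fallback loop, not a verbatim copy of the library source. `DecodeResult` and the `decode` callable are stand-ins introduced here so the control flow can run without a model; the default values mirror the defaults of `transcribe()`.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class DecodeResult:
    """Stand-in for Whisper's decoding result, reduced to the fields used below."""
    text: str
    temperature: float
    avg_logprob: float
    compression_ratio: float


def decode_with_fallback(
    decode: Callable[[float], DecodeResult],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
) -> Optional[DecodeResult]:
    result = None
    for t in temperatures:
        result = decode(t)

        needs_fallback = False
        if (
            compression_ratio_threshold is not None
            and result.compression_ratio > compression_ratio_threshold
        ):
            needs_fallback = True  # too repetitive
        if logprob_threshold is not None and result.avg_logprob < logprob_threshold:
            needs_fallback = True  # average log probability is too low

        if not needs_fallback:
            break

    # If every temperature failed the checks, the last attempt is still returned.
    return result
```

The final `return` is the behavior in question: nothing distinguishes a result that passed the checks from one that merely happened to be the last attempt.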
Proposed Enhancement:
I suggest adding a new parameter, `compression_ratio_hallucination_threshold`, to the `transcribe()` function. This parameter would set a hard limit on the `compression_ratio`. If, after trying all temperatures, the `compression_ratio` still exceeds this threshold, the segment would be discarded as a hallucination.
Here's how the modified code could look:
Implementing this enhancement would make the `transcribe()` function more robust against segments whose high `compression_ratio` marks them as likely hallucinations, which are Whisper's biggest problem. It is a logical extension of the existing threshold parameters and improves transcription quality, especially when one speaker is significantly less active.