Enhancement Proposal: Add compression_ratio_hallucination_threshold to Discard High Compression Ratio Segments in transcribe() #2420

Ko4ka · 2024-11-01T17:10:51Z

Description:

I was experimenting with the Whisper model to transcribe phone calls where each speaker is on their own channel. One of the speakers talks significantly less than the other, which led to some challenges.

I split the audio file into two separate tracks, resampled them to 16 kHz, and noticed that the compression_ratio_threshold parameter of transcribe() does not suppress hallucinations as effectively as it could. It works as designed, but I believe it can be improved to handle hallucinations better.

Here's the code I used:

result_speaker_0 = model.transcribe(
    channel_0_path,
    language="ru",
    initial_prompt='Звонок в компанию, это колл центр застройщика, разговор ведет сотрудник Ольга',
    temperature=(0.0, 0.1),
    logprob_threshold=-0.6,
    no_speech_threshold=0.0,
    compression_ratio_threshold=2.1,
    condition_on_previous_text=True,
    word_timestamps=True,
    hallucination_silence_threshold=1
)

When I print the segments one by one:

for segment in result_speaker_0["segments"]:
    print(f"{segment['start']}s - {segment['end']}s: {segment['text']} {segment['compression_ratio']}")

I get some hallucinations with unusually high compression ratio values:

379.44s - 379.94s: Двадцать минут. 14.170212765957446
379.94s - 379.96s: Двадцать минут. 14.170212765957446
379.96s - 379.96s: 14.170212765957446
379.96s - 379.96s: 14.170212765957446
379.96s - 380.04s: Двадцать минут. 14.170212765957446
398.18s - 399.02s: Двадцать минут. 11.577777777777778
399.42s - 399.42s: 11.577777777777778
403.88s - 404.92s: Двадцать минут. 11.577777777777778
404.92s - 406.26s: Двадцать минут. 11.577777777777778
406.26s - 406.42s: Двадцать минут. 11.577777777777778
406.42s - 406.9s: Двадцать минут. 11.577777777777778
406.9s - 407.64s: Двадцать минут. 11.577777777777778

In my recordings, segments with a compression_ratio below 2 are real speech, while anything above roughly 2.0 to 2.2 is likely a hallucination.
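For context, here is a minimal sketch of how a compression ratio of this kind can be computed. To my understanding it mirrors the helper Whisper uses (UTF-8 byte length divided by zlib-compressed length), but treat that as an assumption and check whisper/utils.py for the exact definition. The two sample strings are made up for illustration. As far as I can tell, the ratio is computed once per decoded window rather than per segment, which would explain why neighbouring segments in the output above share the same value.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; repetition pushes it up."""
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))


# Illustrative inputs (not taken from a real transcript).
varied = "Добрый день, компания слушает. Чем могу помочь? Подскажите номер договора."
looped = "Двадцать минут. " * 40  # the same phrase repeated, as in a hallucination loop

print(f"varied: {compression_ratio(varied):.2f}")
print(f"looped: {compression_ratio(looped):.2f}")
```

The looped string compresses far better than the varied one, so its ratio is much higher. That is the signal the threshold relies on.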

Issue with Current Implementation:

Upon reviewing the transcribe() function, I found that when the compression_ratio_threshold is exceeded, the function sets needs_fallback = True and retries decoding with the next temperature value. However, if all temperatures result in a high compression_ratio, the function ultimately accepts the last result—even if it's clearly a hallucination.

Here's the relevant portion of the code:

if (
    compression_ratio_threshold is not None
    and decode_result.compression_ratio > compression_ratio_threshold
):
    needs_fallback = True  # Too repetitive
if (
    logprob_threshold is not None
    and decode_result.avg_logprob < logprob_threshold
):
    needs_fallback = True  # Average log probability is too low
if (
    no_speech_threshold is not None
    and decode_result.no_speech_prob > no_speech_threshold
):
    needs_fallback = False  # Silence
if not needs_fallback:
    break
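To make the fall-through concrete, here is a small self-contained simulation of that loop. FakeResult, decode_with_fallback as written here, and always_looping are hypothetical stand-ins for illustration, not Whisper's actual code. Only the threshold checks follow the snippet above.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for a decoding result."""
    temperature: float
    compression_ratio: float
    avg_logprob: float
    no_speech_prob: float


def decode_with_fallback(decode, temperatures, compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0, no_speech_threshold=0.6):
    result = None
    for t in temperatures:
        result = decode(t)
        needs_fallback = False
        if (compression_ratio_threshold is not None
                and result.compression_ratio > compression_ratio_threshold):
            needs_fallback = True  # Too repetitive
        if logprob_threshold is not None and result.avg_logprob < logprob_threshold:
            needs_fallback = True  # Average log probability is too low
        if no_speech_threshold is not None and result.no_speech_prob > no_speech_threshold:
            needs_fallback = False  # Silence
        if not needs_fallback:
            break
    # If every temperature failed the checks, the last attempt is returned anyway.
    return result


def always_looping(temperature):
    """Fake decoder that yields a highly repetitive result at any temperature."""
    return FakeResult(temperature, 14.17, -0.3, 0.1)


final = decode_with_fallback(always_looping, (0.0, 0.1), compression_ratio_threshold=2.1)
print(final)
```

Both temperatures exceed the threshold, yet the loop simply runs out of temperatures and the result from the last one (0.1) is returned with its ratio of 14.17.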

Proposed Enhancement:

I suggest adding a new parameter, compression_ratio_hallucination_threshold, to the transcribe() function. This parameter would set a hard limit on the compression_ratio. If, after trying all temperatures, the compression_ratio still exceeds this threshold, the segment would be discarded as a hallucination.

Here's how the modified code could look:

if (
    compression_ratio_threshold is not None
    and decode_result.compression_ratio > compression_ratio_threshold
):
    needs_fallback = True  # Too repetitive
if (
    logprob_threshold is not None
    and decode_result.avg_logprob < logprob_threshold
):
    needs_fallback = True  # Average log probability is too low
if (
    no_speech_threshold is not None
    and decode_result.no_speech_prob > no_speech_threshold
):
    needs_fallback = False  # Silence
if (
    compression_ratio_hallucination_threshold is not None
    and decode_result.compression_ratio > compression_ratio_hallucination_threshold
    and t == temperatures[-1]
):
    # Discard the segment
    return None # Skip to the next segment
if not needs_fallback:
    break

Implementing this enhancement would make transcribe() more robust against segments whose high compression_ratio marks them as likely hallucinations, which in my experience is Whisper's biggest problem. It is a logical extension of the existing threshold parameters and improves transcription quality, especially when one speaker is significantly less active.
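Until something like this exists in transcribe(), a similar effect can be approximated by filtering the returned segments afterwards. The sketch below is a workaround under assumptions, not the proposed change itself: drop_repetitive_segments and its default limit of 2.2 are hypothetical, and the sample segments are illustrative (the first one is invented, the others reuse values from the output above). It relies only on the compression_ratio key that each segment already carries.

```python
def drop_repetitive_segments(segments, max_compression_ratio=2.2):
    """Split segments into (kept, dropped) by their compression_ratio field."""
    kept, dropped = [], []
    for segment in segments:
        if segment["compression_ratio"] > max_compression_ratio:
            dropped.append(segment)
        else:
            kept.append(segment)
    return kept, dropped


# Illustrative data shaped like result["segments"].
sample_segments = [
    {"start": 12.0, "end": 14.5, "text": "Здравствуйте, слушаю вас.", "compression_ratio": 1.31},
    {"start": 379.44, "end": 379.94, "text": "Двадцать минут.", "compression_ratio": 14.170212765957446},
    {"start": 398.18, "end": 399.02, "text": "Двадцать минут.", "compression_ratio": 11.577777777777778},
]

kept, dropped = drop_repetitive_segments(sample_segments)
for segment in kept:
    print(f"{segment['start']}s - {segment['end']}s: {segment['text']}")
```

With real output this would be called as drop_repetitive_segments(result_speaker_0["segments"]). Note that this only removes segments after the fact. With condition_on_previous_text=True the hallucinated text may already have been used as context for later windows, which is one reason a check inside transcribe() would be preferable.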

Ko4ka · 2024-11-02T08:40:22Z

Here is my PR: #2421
