Description
When the response modality is set to `AUDIO` and `output_audio_transcription` is enabled, the transcription streams noticeably faster than the audio plays back. However, if the model is interrupted, only the spoken part is added to context; the rest of the output audio transcription is discarded. This makes sense for an audio-only use case: if the user hasn't heard something, it shouldn't be added to context.
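
For reference, here is a minimal sketch of the setup described above, assuming the google-genai Python SDK; the model name is a placeholder, and exact field names may differ in other SDKs:

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment
MODEL = "gemini-2.0-flash-live-001"  # placeholder model name

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    # Enable transcription of the model's audio output; the transcript
    # typically streams in well ahead of the audio actually playing back.
    output_audio_transcription=types.AudioTranscriptionConfig(),
)

async def main():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text="Hello")])
        )
        async for message in session.receive():
            sc = message.server_content
            if sc and sc.output_transcription:
                # The transcript text arrives faster than the audio chunks.
                print(sc.output_transcription.text)

asyncio.run(main())
```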
However, with `output_audio_transcription` enabled, the user also becomes aware of the response faster than they can listen to it. If the user has already read the complete text transcription and then interrupts the audio output, the model is not aware of the unspoken parts and will repeat information the user has already seen.
This does not occur when the modality is set to `TEXT`.
It would be nice to add a feature that lets us opt into adding to context based on the transcription output rather than on the audio actually played back.
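
Something along these lines, where `context_from_transcription` is a hypothetical option invented here purely to illustrate the request:

```python
# Hypothetical sketch of the requested feature; the flag below does not exist
# in the SDK and is named here only for illustration.
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig(),
    # hypothetical: on interruption, keep the full output transcript in
    # context instead of only the portion of audio that was played
    context_from_transcription=True,
)
```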