Description
Some models may have an audio duration above which the real-time factor (RTF) starts to degrade badly, e.g. anything longer than a context window. If our current speech segment starts to approach that length, we may want to break off a chunk of audio from the start of the segment and remove it from the buffer (a forced non-interim result).
In this case we want all the VAD parameters to stay the same, plus the ability to identify a less-permissive break point within a silence, so we can slightly shorten the segment without setting our VAD to cut more eagerly.
I'm still puzzling out how to do this, but maybe we want to track the midpoint of the longest run of silent frames in the audio stream, and maybe how long that silence is as well? It's a bit of added state. Along with that, we'd want a way to signal that we're evacuating part of the prior buffer without triggering any speech-end handling or resetting the VAD state in a weird way.
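To make the idea concrete, here is a rough sketch of the extra state in Python. Everything in it is hypothetical: the names `SilenceRun`, `LongestSilenceTracker`, `push` and `evacuate` are made up for illustration and are not part of any existing API, and it assumes the VAD already gives us a per-frame speech/silence decision. It only tracks indices, so the caller would still drop the audio itself, and it deliberately touches no speech-start/speech-end state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SilenceRun:
    start: int   # buffer-relative index of the first silent frame
    length: int  # number of consecutive silent frames

    @property
    def midpoint(self) -> int:
        return self.start + self.length // 2


class LongestSilenceTracker:
    """Hypothetical helper: remembers the longest silent run in the buffer."""

    def __init__(self) -> None:
        self.frames_seen = 0                        # frames currently buffered
        self.current: Optional[SilenceRun] = None   # run still in progress
        self.longest: Optional[SilenceRun] = None   # best run seen so far

    def push(self, is_speech: bool) -> None:
        """Record the VAD decision for one new frame."""
        if is_speech:
            self.current = None
        else:
            if self.current is None:
                self.current = SilenceRun(self.frames_seen, 0)
            self.current.length += 1
            if self.longest is None or self.current.length > self.longest.length:
                self.longest = SilenceRun(self.current.start, self.current.length)
        self.frames_seen += 1

    def evacuate(self) -> Optional[int]:
        """Return how many frames to drop from the start of the buffer.

        Returns None when there is no useful break point. Only the index
        bookkeeping is rebased here; VAD state is left alone.
        """
        if self.longest is None or self.longest.midpoint == 0:
            return None
        cut = self.longest.midpoint
        self.frames_seen -= cut
        if self.current is not None:
            # The in-progress run either starts after the cut or straddles it.
            end = self.current.start + self.current.length - cut
            start = max(self.current.start - cut, 0)
            self.current = SilenceRun(start, end - start)
        # Simplification: finished runs after the cut are forgotten, so the
        # next candidate is whatever silent run is still in progress.
        self.longest = (
            SilenceRun(self.current.start, self.current.length)
            if self.current is not None
            else None
        )
        return cut
```

With this shape, the caller would check the segment length against the model's limit, call `evacuate()`, and emit the first `cut` frames as the forced non-interim result. A minimum silence length could be added as a guard so we never split on a one-frame gap.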
If anyone has any other thoughts I'm all ears 👂 👁️ 👁️ 👂