[Feature] Adding Input Audio Streaming Provider (S2T + VAD) #35

Open

fvisticot opened this issue Dec 22, 2024 · 11 comments
@fvisticot

Is it possible to add a SpeechToText audio streaming provider as an input to the "Audio recording" button?
It would be nice to be able to plug in different S2T audio providers and a Voice Activity Detection (VAD) component.
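
As a rough illustration of what such a pluggable provider could look like, here is a minimal Dart sketch; the names (SpeechToTextProvider, VoiceActivityDetector, VadEvent) are hypothetical and are not part of the toolkit today.

import 'dart:typed_data';

/// Events emitted by a voice-activity detector.
enum VadEvent { speechStarted, silenceDetected }

/// A pluggable speech-to-text engine fed by streamed audio chunks.
abstract class SpeechToTextProvider {
  /// Feeds raw audio chunks and yields partial/final transcripts as they
  /// become available.
  Stream<String> transcribe(Stream<Uint8List> audioChunks);
}

/// A pluggable voice-activity detector watching the same audio stream.
abstract class VoiceActivityDetector {
  Stream<VadEvent> detect(Stream<Uint8List> audioChunks);
}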

@csells
Contributor

csells commented Dec 22, 2024

May I ask what problem you're trying to solve?

@fvisticot
Author

  • The user clicks the record button, or user voice is detected, to start the recording
  • Audio recording starts
  • The audio stream is sent to a local or remote SpeechToText engine
  • The user clicks Stop, or silence is detected
  • The current text (from the stream) is ready to be sent as the LLM input prompt (see the sketch below)
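
A rough sketch of that flow, reusing the hypothetical interfaces sketched earlier in this issue; the micAudio broadcast stream and onSendPrompt callback are assumptions, not existing toolkit APIs.

import 'dart:typed_data';

Future<void> runVoiceTurn({
  required Stream<Uint8List> micAudio, // broadcast stream of recorder chunks
  required SpeechToTextProvider stt,
  required VoiceActivityDetector vad,
  required void Function(String text) onSendPrompt,
}) async {
  final transcript = StringBuffer();

  // Recording has started: stream the audio chunks to the S2T engine and
  // accumulate the partial transcripts as they arrive.
  final sttSub = stt.transcribe(micAudio).listen(transcript.write);

  // Stop when the VAD reports silence (a Stop button press would cancel
  // the subscription the same way).
  await vad.detect(micAudio).firstWhere((e) => e == VadEvent.silenceDetected);
  await sttSub.cancel();

  // The accumulated text is ready to be sent as the LLM input prompt.
  onSendPrompt(transcript.toString());
}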

@csells
Contributor

csells commented Dec 22, 2024

How is that different from what happens now?

@fvisticot
Author

If I'm correct, the current implementation starts audio recording when clicking the record button, and the audio is available as a binary file attachment when clicking the "Send" button.
The attachment is sent to the LLMProvider as an (audio) input.

If I'm using another custom LLMProvider and that LLM does not take audio as an input, I need to run SpeechToText on the recorded audio before sending the text (from the audio) to the LLM.

@csells
Contributor

csells commented Dec 22, 2024

The behavior is as you are hoping for: the audio is translated and turned into text for the user to edit. It's not provided as an audio file when the user presses the Submit button. Give it a try and see what you think.

@fvisticot
Author

fvisticot commented Dec 22, 2024

I gave it a try using the EchoProvider and my own LLM provider implementation.
With the EchoProvider I get the following image; the text is not extracted from the audio.

[screenshot]

With my own provider, the sendMessageStream method is fired with the following inputs:

  • Prompt: "translate the attached audio to text; provide the result of that translation as just the text of the translation itself. be careful to separate the background audio from the foreground audio and only provide the result of translating the foreground audio."
  • Attachments: 1 audio file

I see this code from the _generateStream method:

final content = Content('user', [
  TextPart(prompt),
  ...attachments.map(_partFrom),
]);

I do not use the google_generative_ai package (because I'm using my own provider). I assume that the S2T should be done by this package?
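
For what it's worth, here is a minimal sketch of the direction I have in mind: a custom provider whose generateStream does its own speech-to-text when it sees an audio attachment, instead of relying on google_generative_ai. All types here (AudioAttachment, MySttEngine, MyTextOnlyLlm) are simplified stand-ins, not the toolkit's real interface.

import 'dart:typed_data';

/// Simplified stand-ins for the real attachment and engine types.
class AudioAttachment {
  AudioAttachment(this.bytes);
  final Uint8List bytes;
}

abstract class MySttEngine {
  Future<String> transcribe(Uint8List audio);
}

abstract class MyTextOnlyLlm {
  Stream<String> complete(String prompt);
}

class MyProvider {
  MyProvider(this.stt, this.llm);

  final MySttEngine stt;   // my own local or remote S2T engine
  final MyTextOnlyLlm llm; // an LLM that only accepts text

  /// Mirrors the shape of the toolkit's generateStream call: a prompt plus
  /// optional attachments.
  Stream<String> generateStream(
    String prompt, {
    Iterable<Object> attachments = const [],
  }) async* {
    final audio = attachments.whereType<AudioAttachment>();
    if (audio.isNotEmpty) {
      // The "translate the attached audio" request: run my own S2T instead
      // of relying on google_generative_ai, then yield just the transcript.
      yield await stt.transcribe(audio.first.bytes);
      return;
    }
    // Plain text requests go straight to the text-only model.
    yield* llm.complete(prompt);
  }
}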

@csells
Contributor

csells commented Dec 22, 2024

Is it sendMessageStream being called to translate the audio or generateStream? It should be the latter.

@fvisticot
Author

fvisticot commented Dec 22, 2024

generateStream.

Can you confirm that I need to use my own audio translation library/code in the generateStream method if I'm NOT using the google_generative_ai package (i.e. that package is what does the audio translation)?

If I'm correct, the speech-to-text is managed from the ChatInput widget and the translation starts when the Stop button is clicked.
That means we could reduce latency if the translation were started in real time during the audio streaming (see the sketch after the code below).

Future<void> onRecordingStopped() async {
  final file = _waveController.file;

  if (file == null) {
    AdaptiveSnackBar.show(context, 'Unable to record audio');
    return;
  }

  // will come back as initialMessage
  widget.onTranslateStt(file);
}
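
Here is a rough sketch of the real-time idea, assuming a hypothetical chunk-level SpeechToTextProvider like the one sketched earlier in this issue; the current ChatInput only hands over the finished file in onRecordingStopped.

import 'dart:async';
import 'dart:typed_data';

/// Accumulates a transcript while recording is still in progress.
class StreamingTranscriber {
  StreamingTranscriber(this.stt);

  final SpeechToTextProvider stt; // hypothetical chunk-level S2T interface
  final _transcript = StringBuffer();
  StreamSubscription<String>? _sub;

  /// Call when recording starts, with the live audio chunk stream.
  void start(Stream<Uint8List> audioChunks) {
    _sub = stt.transcribe(audioChunks).listen(_transcript.write);
  }

  /// Call when Stop is pressed or silence is detected; the returned text
  /// can be used directly as the LLM input prompt, with no post-recording
  /// translation step.
  Future<String> stop() async {
    await _sub?.cancel();
    return _transcript.toString();
  }
}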

@csells
Contributor

csells commented Dec 22, 2024

The way the chat works is that it uses the provider's generateStream implementation to translate audio. So far I haven't noticed a large latency that requires streaming the audio.

@fvisticot
Author

Thanks a lot for the details.
Does that mean we should use the same model for audio-to-text and for running the inference on the question?
Do you think it would be possible to decouple the two actions?

@csells
Contributor

csells commented Dec 23, 2024

You can do that today with your own custom provider that forwards to your model of choice.
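
A minimal sketch of that decoupling, assuming a simplified call shape: one delegate serves the audio-to-text request (generateStream) and another serves the chat turns (sendMessageStream). The field names and signatures are assumptions, not the toolkit's exact interface; MyProvider refers to the sketch earlier in this thread.

class SplitProvider {
  SplitProvider({required this.sttProvider, required this.chatProvider});

  final MyProvider sttProvider;  // e.g. a model that is good at transcription
  final MyProvider chatProvider; // e.g. the model you want for inference

  /// One-off requests such as the audio-to-text translation.
  Stream<String> generateStream(
    String prompt, {
    Iterable<Object> attachments = const [],
  }) =>
      sttProvider.generateStream(prompt, attachments: attachments);

  /// Conversational turns with chat history.
  Stream<String> sendMessageStream(
    String prompt, {
    Iterable<Object> attachments = const [],
  }) =>
      chatProvider.generateStream(prompt, attachments: attachments);
}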
