[Feature] Adding Input Audio Streaming Provider (S2T + VAD) #35

Open

fvisticot opened this issue Dec 22, 2024 · 11 comments
@fvisticot

Is it possible to add a SpeechToText audio streaming provider as an input to the "Audio recording" button?
It would be nice to be able to plug in different S2T audio providers and a Voice Activity Detection (VAD) component.
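
As a rough illustration of what such a pluggable provider could look like, here is a minimal Dart sketch; the names (SpeechToTextProvider, VoiceActivityDetector, VadEvent) are hypothetical and are not part of the toolkit today.

import 'dart:typed_data';

/// Events emitted by a voice-activity detector.
enum VadEvent { speechStarted, silenceDetected }

/// A pluggable speech-to-text engine fed by streamed audio chunks.
abstract class SpeechToTextProvider {
  /// Feeds raw audio chunks and yields partial/final transcripts as they
  /// become available.
  Stream<String> transcribe(Stream<Uint8List> audioChunks);
}

/// A pluggable voice-activity detector watching the same audio stream.
abstract class VoiceActivityDetector {
  Stream<VadEvent> detect(Stream<Uint8List> audioChunks);
}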

@csells
Contributor

csells commented Dec 22, 2024

May I ask what problem you're trying to solve?

@fvisticot
Author

  • The user clicks the record button, or user voice is detected, to start the recording
  • Audio recording starts
  • The audio stream is sent to a local or remote SpeechToText engine
  • The user clicks Stop, or silence is detected
  • The current text (from the stream) is ready to be sent as the LLM input prompt (see the sketch below)
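
A rough sketch of that flow, reusing the hypothetical interfaces sketched earlier in this issue; the micAudio broadcast stream and onSendPrompt callback are assumptions, not existing toolkit APIs.

import 'dart:typed_data';

Future<void> runVoiceTurn({
  required Stream<Uint8List> micAudio, // broadcast stream of recorder chunks
  required SpeechToTextProvider stt,
  required VoiceActivityDetector vad,
  required void Function(String text) onSendPrompt,
}) async {
  final transcript = StringBuffer();

  // Recording has started: stream the audio chunks to the S2T engine and
  // accumulate the partial transcripts as they arrive.
  final sttSub = stt.transcribe(micAudio).listen(transcript.write);

  // Stop when the VAD reports silence (a Stop button press would cancel
  // the subscription the same way).
  await vad.detect(micAudio).firstWhere((e) => e == VadEvent.silenceDetected);
  await sttSub.cancel();

  // The accumulated text is ready to be sent as the LLM input prompt.
  onSendPrompt(transcript.toString());
}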

@csells
Contributor

csells commented Dec 22, 2024

How is that different from what happens now?

@fvisticot
Author

If I'm correct, the current implementation starts audio recording when clicking the record button, and the audio is available as a binary file attachment when clicking the "Send" button.
The attachment is sent to the LLMProvider as an (audio) input.

If I'm using another custom LLMProvider and that LLM does not take audio as an input, I need to run SpeechToText on the recorded audio before sending the text (from the audio) to the LLM.

@csells
Contributor

csells commented Dec 22, 2024

The behavior is as you are hoping for: the audio is translated and turned into text for the user to edit. It's not provided as an audio file when the user presses the Submit button. Give it a try and see what you think.

@fvisticot
Author

fvisticot commented Dec 22, 2024

I gave it a try using the EchoProvider and my own LLM provider implementation.
With the EchoProvider I get the following image; the text is not extracted from the audio.

[screenshot]

With my own provider, the sendMessageStream method is fired with the following inputs:

  • Prompt: "translate the attached audio to text; provide the result of that translation as just the text of the translation itself. be careful to separate the background audio from the foreground audio and only provide the result of translating the foreground audio."
  • Attachments: 1 audio file

I see this code from the _generateStream method:

final content = Content('user', [
  TextPart(prompt),
  ...attachments.map(_partFrom),
]);

I do not use the google_generative_ai package (because I'm using my own provider). I assume that the S2T should be done by this package?
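
For what it's worth, here is a minimal sketch of the direction I have in mind: a custom provider whose generateStream does its own speech-to-text when it sees an audio attachment, instead of relying on google_generative_ai. All types here (AudioAttachment, MySttEngine, MyTextOnlyLlm) are simplified stand-ins, not the toolkit's real interface.

import 'dart:typed_data';

/// Simplified stand-ins for the real attachment and engine types.
class AudioAttachment {
  AudioAttachment(this.bytes);
  final Uint8List bytes;
}

abstract class MySttEngine {
  Future<String> transcribe(Uint8List audio);
}

abstract class MyTextOnlyLlm {
  Stream<String> complete(String prompt);
}

class MyProvider {
  MyProvider(this.stt, this.llm);

  final MySttEngine stt;   // my own local or remote S2T engine
  final MyTextOnlyLlm llm; // an LLM that only accepts text

  /// Mirrors the shape of the toolkit's generateStream call: a prompt plus
  /// optional attachments.
  Stream<String> generateStream(
    String prompt, {
    Iterable<Object> attachments = const [],
  }) async* {
    final audio = attachments.whereType<AudioAttachment>();
    if (audio.isNotEmpty) {
      // The "translate the attached audio" request: run my own S2T instead
      // of relying on google_generative_ai, then yield just the transcript.
      yield await stt.transcribe(audio.first.bytes);
      return;
    }
    // Plain text requests go straight to the text-only model.
    yield* llm.complete(prompt);
  }
}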

@csells
Contributor

csells commented Dec 22, 2024

Is it sendMessageStream being called to translate the audio or generateStream? It should be the latter.

@fvisticot
Author

fvisticot commented Dec 22, 2024

generateStream.

Can you confirm that I need to use my own audio translation library/code in the generateStream method if I'm NOT using the google_generative_ai package (i.e. that package is what does the audio translation)?

If I'm correct, the speech-to-text is managed from the ChatInput widget and the translation starts when the Stop button is clicked.
That means we could reduce latency if the translation were started in real time during the audio streaming (see the sketch after the code below).

Future<void> onRecordingStopped() async {
  final file = _waveController.file;

  if (file == null) {
    AdaptiveSnackBar.show(context, 'Unable to record audio');
    return;
  }

  // will come back as initialMessage
  widget.onTranslateStt(file);
}
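
Here is a rough sketch of the real-time idea, assuming a hypothetical chunk-level SpeechToTextProvider like the one sketched earlier in this issue; the current ChatInput only hands over the finished file in onRecordingStopped.

import 'dart:async';
import 'dart:typed_data';

/// Accumulates a transcript while recording is still in progress.
class StreamingTranscriber {
  StreamingTranscriber(this.stt);

  final SpeechToTextProvider stt; // hypothetical chunk-level S2T interface
  final _transcript = StringBuffer();
  StreamSubscription<String>? _sub;

  /// Call when recording starts, with the live audio chunk stream.
  void start(Stream<Uint8List> audioChunks) {
    _sub = stt.transcribe(audioChunks).listen(_transcript.write);
  }

  /// Call when Stop is pressed or silence is detected; the returned text
  /// can be used directly as the LLM input prompt, with no post-recording
  /// translation step.
  Future<String> stop() async {
    await _sub?.cancel();
    return _transcript.toString();
  }
}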

@csells
Contributor

csells commented Dec 22, 2024

The way the chat works is that it uses the provider's generateStream implementation to translate audio. So far I haven't noticed a large latency that requires streaming the audio.

@fvisticot
Author

Thanks a lot for the details.
Does that mean we should use the same model for audio-to-text and for running the inference on the question?
Do you think it would be possible to decouple the two actions?

@csells
Contributor

csells commented Dec 23, 2024

You can do that today with your own custom provider that forwards to your model of choice.
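
A minimal sketch of that decoupling, assuming a simplified call shape: one delegate serves the audio-to-text request (generateStream) and another serves the chat turns (sendMessageStream). The field names and signatures are assumptions, not the toolkit's exact interface; MyProvider refers to the sketch earlier in this thread.

class SplitProvider {
  SplitProvider({required this.sttProvider, required this.chatProvider});

  final MyProvider sttProvider;  // e.g. a model that is good at transcription
  final MyProvider chatProvider; // e.g. the model you want for inference

  /// One-off requests such as the audio-to-text translation.
  Stream<String> generateStream(
    String prompt, {
    Iterable<Object> attachments = const [],
  }) =>
      sttProvider.generateStream(prompt, attachments: attachments);

  /// Conversational turns with chat history.
  Stream<String> sendMessageStream(
    String prompt, {
    Iterable<Object> attachments = const [],
  }) =>
      chatProvider.generateStream(prompt, attachments: attachments);
}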
