Real-Time Speech-to-Text Translation Support #58

Open
hu-ke opened this issue Jan 24, 2025 · 3 comments
Comments

@hu-ke
hu-ke commented Jan 24, 2025

Description of the feature request:

Instead of waiting for a turn of speech to complete (VAD mode), would it be possible to stream the generated results in real-time?

What problem are you trying to solve with this feature?

Suppose I am in a Japanese-language interview, but my Japanese is not very strong. I would like to build an app with the Gemini Multimodal API to assist me with real-time speech-to-text translation.

Any other information you'd like to share?

No response

@ViaAnthroposBenevolentia

Currently, the easiest (and free) way to do this is:

  1. Get a free API key from Deepgram (it comes with $200 in credit);
  2. Open a WebSocket connection to wss://api.deepgram.com/v1/listen;
  3. Send the base64 audio from Gemini to the WebSocket;
  4. Read the real-time transcript that Deepgram sends back.
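
The steps above can be sketched roughly as follows. This is a minimal sketch under several assumptions, not a tested integration: it assumes the audio is raw 16-bit PCM (so the base64 chunks are decoded to bytes before sending), that the sample rate is 16000 Hz, and that `ws` is an already-connected WebSocket client object exposing `send()` and async iteration (for example, one from the `websockets` package). The helper names are hypothetical. Authentication (an `Authorization: Token <key>` header on the handshake) and opening the connection are left to whichever WebSocket client you use.

```python
import asyncio
import base64
import json
from urllib.parse import urlencode

DEEPGRAM_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(sample_rate: int = 16000) -> str:
    """Build the streaming URL with options for raw 16-bit PCM (assumed format)."""
    params = {
        "encoding": "linear16",
        "sample_rate": sample_rate,   # assumption: adjust to the real audio rate
        "channels": 1,
        "interim_results": "true",    # ask for partial results while speech continues
    }
    return f"{DEEPGRAM_URL}?{urlencode(params)}"


def extract_transcript(message: str) -> str:
    """Return the top transcript from one JSON message, or '' if there is none."""
    data = json.loads(message)
    channel = data.get("channel")
    if not isinstance(channel, dict):  # non-result messages carry no transcript
        return ""
    alternatives = channel.get("alternatives", [])
    return alternatives[0].get("transcript", "") if alternatives else ""


async def send_audio(ws, base64_chunks) -> None:
    """Decode each base64 chunk and send the raw bytes over the socket."""
    for chunk in base64_chunks:
        await ws.send(base64.b64decode(chunk))


async def collect_transcripts(ws) -> list:
    """Print and collect every non-empty transcript until the socket closes."""
    transcripts = []
    async for message in ws:
        text = extract_transcript(message)
        if text:
            print(text)
            transcripts.append(text)
    return transcripts
```

In a real app, `send_audio` and `collect_transcripts` would run concurrently (for instance with `asyncio.gather`) so that transcripts arrive while audio is still being sent.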

@OptionIA

Or, for free (at the cost of latency): define a function call with two parameters, input and output, and add to the model's instructions that every response must go through that function call. Then add a handler that receives the call and prints it.
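
A rough sketch of that idea, using plain dictionaries rather than any particular SDK. Everything named here is hypothetical: the function name `report_translation`, the wording of the instruction, and the handler. The declaration follows the general JSON-schema style used by function-calling APIs; the exact field names and casing the Gemini SDK expects are not verified here.

```python
# Hypothetical function declaration with the two parameters described above.
TRANSLATION_FUNCTION = {
    "name": "report_translation",
    "description": "Report what was heard and its translation.",
    "parameters": {
        "type": "object",
        "properties": {
            "input": {
                "type": "string",
                "description": "Speech as heard, in the source language.",
            },
            "output": {
                "type": "string",
                "description": "Translation of the input.",
            },
        },
        "required": ["input", "output"],
    },
}

# Illustrative instruction telling the model to always answer via the function.
SYSTEM_INSTRUCTION = (
    "For every response, call report_translation with the transcribed speech "
    "as `input` and its translation as `output`. Never reply without calling it."
)


def handle_function_call(name: str, args: dict) -> str:
    """Format one function call from the model as a printable line."""
    if name != TRANSLATION_FUNCTION["name"]:
        return ""  # ignore calls to any other function
    return f"{args.get('input', '')} -> {args.get('output', '')}"
```

The app would pass `TRANSLATION_FUNCTION` and `SYSTEM_INSTRUCTION` when setting up the session, then call `print(handle_function_call(name, args))` for each function call the model emits.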

@OptionIA
OptionIA commented Apr 22, 2025

Well, they have since added native support for this, but there is no documentation available (or at least none that is easy to find).