Description of the feature request:
Instead of waiting for a turn of speech to complete (VAD mode), would it be possible to stream the generated results in real time, while the speaker is still talking?
What problem are you trying to solve with this feature?
Suppose I am in a Japanese interview, but my Japanese skills are not very strong. I would like to build an app with the Gemini Multimodal API to assist me with real-time speech-to-text translation.
Any other information you'd like to share?
No response