Looking at https://guide.pycord.dev/voice/receiving, it appears it's also possible for the bot to receive audio. With this, it's possible to create a back-and-forth voice mode for Gemini models.
The outline of this implementation would be:
- Use TTS and STT engines, preferably as fast and cost-effective as possible if using cloud services, ideally natural-sounding, and with as little latency as possible
- Use wavelink as the voice engine by streaming the TTS output, in a separate Cog (a sketch follows this list)
- Handle multiple requests per server, if possible
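A rough sketch of the separate Cog idea, assuming wavelink 3.x with a locally running Lavalink node and that the TTS output is exposed as an HTTP URL the node can fetch; the node URI, password, and the `stream_tts` helper are placeholders, and wavelink's compatibility with Pycord is not verified here:

```python
import discord
import wavelink
from discord.ext import commands


class VoiceStream(commands.Cog):
    """Separate Cog that owns the wavelink/Lavalink connection used to stream TTS audio."""

    def __init__(self, bot: discord.Bot):
        self.bot = bot

    @commands.Cog.listener()
    async def on_ready(self):
        # Assumption: a Lavalink node is already running locally with these credentials.
        node = wavelink.Node(uri="http://localhost:2333", password="youshallnotpass")
        await wavelink.Pool.connect(nodes=[node], client=self.bot)

    async def stream_tts(self, channel: discord.VoiceChannel, audio_url: str) -> None:
        # Reuse an existing voice connection if present, otherwise connect as a wavelink Player.
        player = channel.guild.voice_client or await channel.connect(cls=wavelink.Player)
        # Lavalink can resolve direct HTTP audio URLs, so the TTS output is handed over as a URL.
        tracks = await wavelink.Playable.search(audio_url)
        if tracks:
            await player.play(tracks[0])
```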
The flow would be:
1. Initiate, possibly through a slash command like /call or something, and lock the session to the specific user who invoked the command (see the first sketch after this list)
2. Record the voice conversation with a timeout
3. On the callback function:
   - The recorded voice is sent to a speech-to-text engine such as Whisper, either via the OpenAI API (paid, faster), Azure Speech Services (free in most cases, requires Azure dependencies), or Hugging Face Spaces (free, slow)... OR use Gemini's native multimodality (see the second sketch after this list)
   - The transcription is then used as a prompt to reason and engage (either with GPT or Gemini, with a different system prompt optimized for speech)
   - Checks are performed: if an error occurs in the model, still proceed but speak the error; if an error occurs in the speech APIs, abort and ping the user
   - The output is then sent through a dedicated TTS program and recorded
   - When no errors occurred, stream it
4. Unlock, and the command is ready to be used by anyone again
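A minimal sketch of the lock-and-record portion, using Pycord's voice receiving API from the guide linked above; the 30-second timeout, the single global lock, and the `transcribe_and_reply` helper are assumptions, not a final design:

```python
import asyncio

import discord
from discord.ext import commands


class VoiceCall(commands.Cog):
    def __init__(self, bot: discord.Bot):
        self.bot = bot
        self.active_user: int | None = None  # user id currently holding the session lock

    @discord.slash_command(name="call", description="Start a voice conversation with the model")
    async def call(self, ctx: discord.ApplicationContext):
        if self.active_user is not None:
            await ctx.respond("A voice session is already in progress.", ephemeral=True)
            return
        if ctx.author.voice is None:
            await ctx.respond("Join a voice channel first.", ephemeral=True)
            return

        # Lock the session to the invoking user.
        self.active_user = ctx.author.id
        vc = await ctx.author.voice.channel.connect()
        vc.start_recording(discord.sinks.WaveSink(), self.finished_callback, ctx)
        await ctx.respond("Listening... speak now.")

        # Assumed timeout: stop recording after 30 seconds.
        await asyncio.sleep(30)
        vc.stop_recording()

    async def finished_callback(self, sink: discord.sinks.WaveSink, ctx: discord.ApplicationContext):
        try:
            audio = sink.audio_data.get(self.active_user)
            if audio is not None:
                # Hypothetical helper: STT -> LLM -> TTS -> stream (see the other sketches).
                await self.transcribe_and_reply(ctx, audio.file)
        finally:
            await sink.vc.disconnect()
            self.active_user = None  # unlock so anyone can use /call again
```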
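For the speech-to-text step, the two paths mentioned above could look roughly like this; the model names, environment variable, and client library versions (`openai` v1.x, `google-generativeai`) are assumptions:

```python
import os

import google.generativeai as genai
from openai import AsyncOpenAI


async def transcribe_with_whisper(path: str) -> str:
    # OpenAI Whisper API path (paid, faster); reads OPENAI_API_KEY from the environment.
    client = AsyncOpenAI()
    with open(path, "rb") as f:
        result = await client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text


def reply_with_gemini_audio(path: str) -> str:
    # Gemini native multimodality path: skip a separate STT engine and hand the
    # recorded audio straight to the model together with a speech-oriented prompt.
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumed key variable name
    model = genai.GenerativeModel("gemini-1.5-flash")
    audio = genai.upload_file(path)
    response = model.generate_content(
        ["Listen to the user's message and answer conversationally:", audio]
    )
    return response.text
```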
Possible limitations and outcomes:
- Possibility of blocking and higher latency if asynchronous tools are not used
- This command may be limited to one person at a time globally, not per user nor per guild, while the flow is still being prototyped and tested
- Prone to errors
- Chat history/context handling: this would also require redundant code from the /ask command (a shared helper is sketched below)
- Multimodality: a parameter could be added to the slash command, but it would mean copying code from the /ask command, with more lines of code
- It cannot be initiated through voice; it has to be invoked manually via slash command... This defeats the purpose of a voice mode, but it should be considered a building block for such an implementation

Can these be resolved? Yes, with approximately an 80% success rate.
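To avoid duplicating the /ask command's history handling, one option is to factor the context building into a shared helper that both commands call; the module, function, and field names below are purely illustrative assumptions about the project layout:

```python
# core/history.py (hypothetical shared module used by both /ask and /call)
def build_chat_history(stored_turns: list[dict], system_prompt: str) -> list[dict]:
    """Convert stored conversation turns into the message list the model expects.

    Both the text-based /ask command and the voice-mode /call command can reuse
    this instead of each keeping its own copy of the history-formatting code.
    """
    history = [{"role": "system", "content": system_prompt}]
    for turn in stored_turns:
        history.append({"role": "user", "content": turn["prompt"]})
        history.append({"role": "assistant", "content": turn["reply"]})
    return history
```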
Goals:
#11
#12
#13