feat(speech-to-speech) #463
PR Summary
Introduces comprehensive speech-to-speech capabilities across the chat interface with streaming TTS and real-time voice interaction support.
- Implements secure server-side text-to-speech streaming in `apps/sim/app/api/proxy/tts/stream/route.ts` using the ElevenLabs API with optimized buffering
- Adds voice-first mode with 3D visualization in `apps/sim/app/chat/[subdomain]/components/voice-interface` using Three.js for audio feedback
- Introduces real-time speech recognition with `voice-input.tsx`, supporting both standard chat and voice-first modes
- Implements the `useAudioStreaming` hook in `apps/sim/app/chat/[subdomain]/hooks/use-audio-streaming.ts` for efficient TTS streaming with the MediaSource API and fallback handling
- Refactors chat architecture by removing the old `chat-client.tsx` and implementing a new voice-capable version with proper state management
12 file(s) reviewed, 18 comment(s)
// Check if speech-to-text is available in the browser
const isSttAvailable =
  typeof window !== 'undefined' && !!(window.SpeechRecognition || window.webkitSpeechRecognition)
style: Browser compatibility check done on every render. Move to useMemo hook to optimize performance.
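One way to run the check only once, sketched without React so it stands alone. The names `SpeechGlobals`, `detectSpeechRecognition`, and `cachedSttAvailable` are assumptions for illustration and are not taken from the PR; inside a component the same pure check could be wrapped in `useMemo` instead.

```typescript
// Hypothetical minimal view of the globals being probed; in a browser this is `window`.
interface SpeechGlobals {
  SpeechRecognition?: unknown
  webkitSpeechRecognition?: unknown
}

// Pure feature check: true when either the standard or the webkit-prefixed constructor exists.
function detectSpeechRecognition(globals: SpeechGlobals | undefined): boolean {
  return !!globals && !!(globals.SpeechRecognition || globals.webkitSpeechRecognition)
}

let cachedSttAvailable: boolean | undefined

// Runs the detection at most once and reuses the answer on later calls.
function isSttAvailable(): boolean {
  if (cachedSttAvailable === undefined) {
    // `window` is absent during server-side rendering, so read it defensively.
    const maybeWindow = (globalThis as unknown as { window?: SpeechGlobals }).window
    cachedSttAvailable = detectSpeechRecognition(maybeWindow)
  }
  return cachedSttAvailable
}
```

Because browser support does not change during a session, caching the result outside the render path is usually enough.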
 export const ChatInput: React.FC<{
-  onSubmit?: (value: string) => void
+  onSubmit?: (value: string, isVoiceInput?: boolean) => void
   isStreaming?: boolean
   onStopStreaming?: () => void
-}> = ({ onSubmit, isStreaming = false, onStopStreaming }) => {
+  onVoiceStart?: () => void
+  voiceOnly?: boolean
+  onInterrupt?: () => void
+}> = ({
style: Consider grouping optional props into a config object.
// Enhanced auto voice trigger for conversation mode
useEffect(() => {
  if (
    isVoiceFirstMode &&
    DEFAULT_VOICE_SETTINGS.conversationMode &&
    !isLoading &&
    !isStreamingResponse &&
    !isPlayingAudio &&
    messages.length > 1 && // Ensure we have at least one exchange
    messages[messages.length - 1].type === 'assistant' // Last message is from assistant
  ) {
    // Clear any existing timeout
    if (conversationTimeoutRef.current) {
      clearTimeout(conversationTimeoutRef.current)
    }

    // Auto-start voice input after audio ends with a short delay
    conversationTimeoutRef.current = setTimeout(() => {
      // Only trigger if the user hasn't started typing or interacting
      if (!inputValue.trim()) {
        // This would need to be implemented in the ChatInput component
        // For now, we'll dispatch a custom event
        window.dispatchEvent(new CustomEvent('auto-trigger-voice'))
      }
    }, 800) // Shorter delay for more natural conversation

    return () => {
      if (conversationTimeoutRef.current) {
        clearTimeout(conversationTimeoutRef.current)
      }
    }
  }
}, [
  isVoiceFirstMode,
  DEFAULT_VOICE_SETTINGS.conversationMode,
  isLoading,
  isStreamingResponse,
  isPlayingAudio,
  messages,
  inputValue,
])
style: Complex useEffect with multiple dependencies. Consider extracting voice trigger logic into a custom hook for better separation of concerns.
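A first step toward such a hook could be pulling the guard condition into a pure function that the effect (or a future custom hook) calls. This is only a sketch: `VoiceTriggerState`, `ChatMessage`, and `shouldAutoTriggerVoice` are hypothetical names, and the fields simply mirror the values the effect above reads.

```typescript
// Hypothetical message shape; only the field the condition needs.
interface ChatMessage {
  type: 'user' | 'assistant'
}

// Snapshot of the values the effect currently reads one by one.
interface VoiceTriggerState {
  isVoiceFirstMode: boolean
  conversationMode: boolean
  isLoading: boolean
  isStreamingResponse: boolean
  isPlayingAudio: boolean
  messages: ChatMessage[]
}

// True when voice input may be auto-started: voice-first conversation mode,
// nothing in flight, and the assistant spoke last after at least one exchange.
function shouldAutoTriggerVoice(state: VoiceTriggerState): boolean {
  const last = state.messages[state.messages.length - 1]
  return (
    state.isVoiceFirstMode &&
    state.conversationMode &&
    !state.isLoading &&
    !state.isStreamingResponse &&
    !state.isPlayingAudio &&
    state.messages.length > 1 &&
    last?.type === 'assistant'
  )
}
```

Keeping the predicate free of timers and refs makes it testable on its own, while the hook keeps only the `setTimeout` scheduling and cleanup.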
if (contentType.includes('text/plain')) {
  // Handle streaming response - pass the current userHasScrolled value
  // Prepare audio streaming handler if voice mode is enabled
  // Play audio if: voice input was used OR (in voice-first mode with auto-play enabled)
  const shouldPlayAudio =
    isVoiceReady &&
    !ttsDisabled &&
    (isVoiceInput || (isVoiceFirstMode && DEFAULT_VOICE_SETTINGS.autoPlayResponses))

  const audioStreamHandler = shouldPlayAudio
    ? async (text: string) => {
        try {
          await streamTextToAudio(text, {
            voiceId: DEFAULT_VOICE_SETTINGS.voiceId,
            // Use optimized streaming for conversation mode
            onAudioStart: () => {
              lastAudioEndTimeRef.current = 0 // Reset end time
            },
            onAudioEnd: () => {
              lastAudioEndTimeRef.current = Date.now()
            },
            onAudioChunkStart: () => {
              // Reset interruption flag for each new audio chunk to allow multiple interruptions
              // Reset the interruption flag in the voice interface
              if (resetInterruptionRef.current) {
                resetInterruptionRef.current()
              }
            },
            onError: (error) => {
              logger.error('Audio streaming error:', error)
              // Disable TTS on authentication errors
              if (error.message.includes('401')) {
                ttsFailureCountRef.current++
                if (ttsFailureCountRef.current >= 3) {
                  logger.warn('Disabling TTS due to repeated authentication failures')
                  setTtsDisabled(true)
                }
              }
            },
          })
          // Reset failure count on success
          ttsFailureCountRef.current = 0
        } catch (error) {
          logger.error('TTS error:', error)
        }
      }
    : undefined
style: Audio stream handler options should be grouped into a configuration object for better maintainability.
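One possible shape for that grouping is a small factory that builds the callback set from explicit dependencies, so the handler itself only passes `voiceId` plus the returned object. This is a sketch under assumptions: `CallbackDeps`, `createAudioStreamCallbacks`, `maxAuthFailures`, and `now` are hypothetical, and logging is left out to keep the block self-contained.

```typescript
// Same ref shape React's useRef produces.
interface MutableRef<T> {
  current: T
}

interface AudioStreamCallbacks {
  onAudioStart: () => void
  onAudioEnd: () => void
  onAudioChunkStart: () => void
  onError: (error: Error) => void
}

// Hypothetical bundle of everything the callbacks touch.
interface CallbackDeps {
  lastAudioEndTime: MutableRef<number>
  resetInterruption: MutableRef<(() => void) | null>
  failureCount: MutableRef<number>
  disableTts: () => void
  maxAuthFailures: number // the snippet above uses 3
  now?: () => number // injectable clock, defaults to Date.now
}

function createAudioStreamCallbacks(deps: CallbackDeps): AudioStreamCallbacks {
  const now = deps.now ?? (() => Date.now())
  return {
    onAudioStart: () => {
      deps.lastAudioEndTime.current = 0 // reset end time
    },
    onAudioEnd: () => {
      deps.lastAudioEndTime.current = now()
    },
    onAudioChunkStart: () => {
      // Allow a fresh interruption for every new chunk
      deps.resetInterruption.current?.()
    },
    onError: (error) => {
      // Count only authentication failures, as the original handler does
      if (error.message.includes('401')) {
        deps.failureCount.current++
        if (deps.failureCount.current >= deps.maxAuthFailures) {
          deps.disableTts()
        }
      }
    },
  }
}
```

The call site could then spread the result into the options, for example `{ voiceId, ...createAudioStreamCallbacks(deps) }`.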
  return () => {
    cleanup()
  }
}, []) // Remove ALL dependencies to prevent re-initialization
logic: Dependencies array empty but using onVoiceStart() inside effect. Should include onVoiceStart or refactor to avoid dependency.
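If re-initialization must be avoided, one common refactor is to read the callback through a ref that always holds the latest function, so the effect can keep an empty dependency array without capturing a stale `onVoiceStart`. The sketch below imitates that pattern in plain TypeScript; `Ref` stands in for the object `useRef` returns, and `setUpOnce` is a hypothetical stand-in for the effect body.

```typescript
// Minimal stand-in for React's ref object (`useRef` returns the same shape).
interface Ref<T> {
  current: T
}

// Stands in for an effect body that runs once (empty dependency array).
// It reads the callback through the ref at call time instead of closing over it.
function setUpOnce(onVoiceStartRef: Ref<() => void>): () => void {
  return () => onVoiceStartRef.current()
}
```

In the component, the ref would be reassigned on every render (`onVoiceStartRef.current = onVoiceStart`), so the long-lived listener always calls the current prop.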
abortControllerRef.current?.abort()
pendingRequestsRef.current.forEach((request) => {
  // Requests will be aborted by the AbortController
})
style: Empty forEach block with commented purpose. Either implement abort handling or remove the forEach loop
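A sketch of what the cleanup could look like without the no-op loop. It assumes the pending requests are tracked in a `Set` (the PR does not show the actual type), and `Abortable` and `abortPendingRequests` are hypothetical names.

```typescript
// Anything with an abort() method, such as an AbortController.
interface Abortable {
  abort(): void
}

// Aborts in-flight work and forgets the tracked requests.
// Returns how many requests were being tracked, which can be handy for logging.
function abortPendingRequests<T>(controller: Abortable | null, pending: Set<T>): number {
  controller?.abort()
  const count = pending.size
  pending.clear()
  return count
}
```

Clearing the collection makes the intent explicit: the controller cancels the requests, and the bookkeeping is reset in the same place.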
interface AudioStreamingOptions {
  voiceId: string
  modelId?: string
  onAudioStart?: () => void
  onAudioEnd?: () => void
  onError?: (error: Error) => void
  onAudioChunkStart?: () => void
}
style: Consider grouping modelId and voiceId into a TTSConfig object parameter since they're related TTS settings
voiceSettings?: {
  isVoiceEnabled: boolean
  voiceId: string
  autoPlayResponses: boolean
  voiceFirstMode?: boolean
  textStreamingInVoiceMode?: 'hidden' | 'synced' | 'normal'
  conversationMode?: boolean
}
style: Consider grouping voiceSettings into an exported interface for reuse across components
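As a sketch, the inline type could become an exported interface that other components import. The field list is copied from the snippet above; the names `VoiceSettings`, `TextStreamingMode`, and the helper `withVoiceDefaults` are assumptions added for illustration.

```typescript
export type TextStreamingMode = 'hidden' | 'synced' | 'normal'

// Shared shape for voice configuration, reusable across components.
export interface VoiceSettings {
  isVoiceEnabled: boolean
  voiceId: string
  autoPlayResponses: boolean
  voiceFirstMode?: boolean
  textStreamingInVoiceMode?: TextStreamingMode
  conversationMode?: boolean
}

// Hypothetical helper: layer per-chat overrides on top of defaults.
export function withVoiceDefaults(
  defaults: VoiceSettings,
  overrides: Partial<VoiceSettings> = {}
): VoiceSettings {
  return { ...defaults, ...overrides }
}
```

The prop would then read `voiceSettings?: VoiceSettings`, and any default constant can be typed against the same interface.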
@@ -73,6 +73,7 @@ export const env = createEnv({
     NODE_ENV: z.string().optional(),
     GITHUB_TOKEN: z.string().optional(),
     CHONKIE_API_KEY: z.string().min(1).optional(),
+    ELEVENLABS_API_KEY: z.string().min(1).optional(),
style: Group with other 3rd party API keys (around line 40-48) for better organization
// Use sentence-based streaming for natural audio flow
const sentenceEndings = ['. ', '! ', '? ', '.\n', '!\n', '?\n', '.', '!', '?']
let sentenceEnd = -1
style: Move sentenceEndings to a constant outside the function to avoid recreating array on each iteration
-// Use sentence-based streaming for natural audio flow
-const sentenceEndings = ['. ', '! ', '? ', '.\n', '!\n', '?\n', '.', '!', '?']
-let sentenceEnd = -1
+const SENTENCE_ENDINGS = ['. ', '! ', '? ', '.\n', '!\n', '?\n', '.', '!', '?']
+const logger = createLogger('UseChatStreaming')
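With the list hoisted to module scope, the lookup itself could live in a small helper. The PR does not show how `sentenceEnd` is computed, so the search below is an assumption: it picks the earliest terminator and, when two candidates start at the same position, prefers the longer one so trailing whitespace stays with the sentence. `findSentenceEnd` is a hypothetical name.

```typescript
// Created once at module load instead of on every streamed chunk.
const SENTENCE_ENDINGS = ['. ', '! ', '? ', '.\n', '!\n', '?\n', '.', '!', '?'] as const

// Returns the index just past the first sentence terminator in `buffer`,
// or -1 when the buffer holds no complete sentence yet.
function findSentenceEnd(buffer: string): number {
  let bestStart = -1
  let bestEnd = -1
  for (const ending of SENTENCE_ENDINGS) {
    const start = buffer.indexOf(ending)
    if (start === -1) continue
    const end = start + ending.length
    // Earliest terminator wins; on a tie, keep the longer match
    if (bestStart === -1 || start < bestStart || (start === bestStart && end > bestEnd)) {
      bestStart = start
      bestEnd = end
    }
  }
  return bestEnd
}
```

A caller could then send `buffer.slice(0, end)` to TTS and keep `buffer.slice(end)` for the next chunk. Note that a bare `.` also matches inside numbers or abbreviations, which this sketch does not try to handle.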
042c95c to 1ef64c4
Description
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.
Fixes # (issue)
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Checklist:
bun run test
Security Considerations:
Additional Information:
Any additional information, configuration or data that might be necessary to reproduce the issue or use the feature.