feat(speech-to-speech) #463

Closed

wants to merge 6 commits into from

Conversation

emir-karabeg
Collaborator

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.

Fixes # (issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Security enhancement
  • Performance improvement
  • Code refactoring (no functional changes)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally and in CI (bun run test)
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • I have updated version numbers (if needed)
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Security Considerations:

  • My changes do not introduce any new security vulnerabilities
  • I have considered the security implications of my changes

Additional Information:

Any additional information, configuration or data that might be necessary to reproduce the issue or use the feature.


vercel bot commented Jun 8, 2025

The latest updates on your projects.

Name  Status      Updated (UTC)
sim   ✅ Ready     Jun 9, 2025 6:07pm

1 Skipped Deployment

Name  Status      Updated (UTC)
docs  ⬜️ Skipped   Jun 9, 2025 6:07pm

Contributor

@greptile-apps bot left a comment

PR Summary

Introduces comprehensive speech-to-speech capabilities across the chat interface with streaming TTS and real-time voice interaction support.

  • Implements secure server-side text-to-speech streaming in apps/sim/app/api/proxy/tts/stream/route.ts using ElevenLabs API with optimized buffering
  • Adds voice-first mode with 3D visualization in apps/sim/app/chat/[subdomain]/components/voice-interface using Three.js for audio feedback
  • Introduces real-time speech recognition with voice-input.tsx supporting both standard chat and voice-first modes
  • Implements useAudioStreaming hook in apps/sim/app/chat/[subdomain]/hooks/use-audio-streaming.ts for efficient TTS streaming with MediaSource API and fallback handling
  • Refactors chat architecture by removing old chat-client.tsx and implementing new voice-capable version with proper state management

12 file(s) reviewed, 18 comment(s)

Comment on lines +45 to +47
// Check if speech-to-text is available in the browser
const isSttAvailable =
  typeof window !== 'undefined' && !!(window.SpeechRecognition || window.webkitSpeechRecognition)

style: Browser compatibility check done on every render. Move to useMemo hook to optimize performance.
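A minimal sketch of that change, assuming the check currently runs inside the component body and that the project already has ambient typings for the webkit-prefixed API:

import { useMemo } from 'react'

// Memoized so the feature detection runs once per mount instead of on every render.
const isSttAvailable = useMemo(
  () =>
    typeof window !== 'undefined' &&
    !!(window.SpeechRecognition || window.webkitSpeechRecognition),
  []
)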

Comment on lines 24 to +31
  export const ChatInput: React.FC<{
-   onSubmit?: (value: string) => void
+   onSubmit?: (value: string, isVoiceInput?: boolean) => void
    isStreaming?: boolean
    onStopStreaming?: () => void
- }> = ({ onSubmit, isStreaming = false, onStopStreaming }) => {
+   onVoiceStart?: () => void
+   voiceOnly?: boolean
+   onInterrupt?: () => void
+ }> = ({

style: Consider grouping optional props into a config object.
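A possible shape for that grouping, sketched with a hypothetical voice config object (the interface names and the `voice` prop are illustrative, not taken from the PR):

import type React from 'react'

interface ChatInputVoiceConfig {
  onVoiceStart?: () => void
  voiceOnly?: boolean
  onInterrupt?: () => void
}

interface ChatInputProps {
  onSubmit?: (value: string, isVoiceInput?: boolean) => void
  isStreaming?: boolean
  onStopStreaming?: () => void
  voice?: ChatInputVoiceConfig
}

export const ChatInput: React.FC<ChatInputProps> = ({
  onSubmit,
  isStreaming = false,
  onStopStreaming,
  voice,
}) => {
  // ...component body unchanged, reading voice?.onVoiceStart, voice?.voiceOnly, etc.
  return null
}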

Comment on lines 189 to 229
// Enhanced auto voice trigger for conversation mode
useEffect(() => {
  if (
    isVoiceFirstMode &&
    DEFAULT_VOICE_SETTINGS.conversationMode &&
    !isLoading &&
    !isStreamingResponse &&
    !isPlayingAudio &&
    messages.length > 1 && // Ensure we have at least one exchange
    messages[messages.length - 1].type === 'assistant' // Last message is from assistant
  ) {
    // Clear any existing timeout
    if (conversationTimeoutRef.current) {
      clearTimeout(conversationTimeoutRef.current)
    }

    // Auto-start voice input after audio ends with a short delay
    conversationTimeoutRef.current = setTimeout(() => {
      // Only trigger if the user hasn't started typing or interacting
      if (!inputValue.trim()) {
        // This would need to be implemented in the ChatInput component
        // For now, we'll dispatch a custom event
        window.dispatchEvent(new CustomEvent('auto-trigger-voice'))
      }
    }, 800) // Shorter delay for more natural conversation

    return () => {
      if (conversationTimeoutRef.current) {
        clearTimeout(conversationTimeoutRef.current)
      }
    }
  }
}, [
  isVoiceFirstMode,
  DEFAULT_VOICE_SETTINGS.conversationMode,
  isLoading,
  isStreamingResponse,
  isPlayingAudio,
  messages,
  inputValue,
])

style: Complex useEffect with multiple dependencies. Consider extracting voice trigger logic into a custom hook for better separation of concerns.
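A rough sketch of what that extraction could look like; the hook name useAutoVoiceTrigger and its argument shape are hypothetical, while the body mirrors the effect above:

import { useEffect, useRef } from 'react'

interface AutoVoiceTriggerArgs {
  enabled: boolean // voice-first + conversation mode, and not loading/streaming/playing
  lastMessageIsAssistant: boolean
  inputValue: string
  delayMs?: number
}

// Hypothetical extraction of the auto voice trigger logic above.
function useAutoVoiceTrigger({
  enabled,
  lastMessageIsAssistant,
  inputValue,
  delayMs = 800,
}: AutoVoiceTriggerArgs) {
  const timeoutRef = useRef<ReturnType<typeof setTimeout> | null>(null)

  useEffect(() => {
    if (!enabled || !lastMessageIsAssistant) return

    if (timeoutRef.current) clearTimeout(timeoutRef.current)
    timeoutRef.current = setTimeout(() => {
      // Only trigger if the user hasn't started typing
      if (!inputValue.trim()) {
        window.dispatchEvent(new CustomEvent('auto-trigger-voice'))
      }
    }, delayMs)

    return () => {
      if (timeoutRef.current) clearTimeout(timeoutRef.current)
    }
  }, [enabled, lastMessageIsAssistant, inputValue, delayMs])
}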

Comment on lines 372 to 417
if (contentType.includes('text/plain')) {
  // Handle streaming response - pass the current userHasScrolled value
  // Prepare audio streaming handler if voice mode is enabled
  // Play audio if: voice input was used OR (in voice-first mode with auto-play enabled)
  const shouldPlayAudio =
    isVoiceReady &&
    !ttsDisabled &&
    (isVoiceInput || (isVoiceFirstMode && DEFAULT_VOICE_SETTINGS.autoPlayResponses))

  const audioStreamHandler = shouldPlayAudio
    ? async (text: string) => {
        try {
          await streamTextToAudio(text, {
            voiceId: DEFAULT_VOICE_SETTINGS.voiceId,
            // Use optimized streaming for conversation mode
            onAudioStart: () => {
              lastAudioEndTimeRef.current = 0 // Reset end time
            },
            onAudioEnd: () => {
              lastAudioEndTimeRef.current = Date.now()
            },
            onAudioChunkStart: () => {
              // Reset interruption flag for each new audio chunk to allow multiple interruptions
              // Reset the interruption flag in the voice interface
              if (resetInterruptionRef.current) {
                resetInterruptionRef.current()
              }
            },
            onError: (error) => {
              logger.error('Audio streaming error:', error)
              // Disable TTS on authentication errors
              if (error.message.includes('401')) {
                ttsFailureCountRef.current++
                if (ttsFailureCountRef.current >= 3) {
                  logger.warn('Disabling TTS due to repeated authentication failures')
                  setTtsDisabled(true)
                }
              }
            },
          })
          // Reset failure count on success
          ttsFailureCountRef.current = 0
        } catch (error) {
          logger.error('TTS error:', error)
        }
      }
    : undefined

style: Audio stream handler options should be grouped into a configuration object for better maintainability.
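Sketching that suggestion against the snippet above (this is a fragment in the same scope as that handler, so it reuses the refs and settings shown there; the config name and the extracted handleTtsError helper are hypothetical):

// Hypothetical: hoist the callback options into a single config object
// built once per response, then pass it to streamTextToAudio.
const ttsStreamConfig = {
  voiceId: DEFAULT_VOICE_SETTINGS.voiceId,
  onAudioStart: () => {
    lastAudioEndTimeRef.current = 0
  },
  onAudioEnd: () => {
    lastAudioEndTimeRef.current = Date.now()
  },
  onAudioChunkStart: () => resetInterruptionRef.current?.(),
  onError: handleTtsError, // hypothetical extracted handler for the 401/back-off logic
}

await streamTextToAudio(text, ttsStreamConfig)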

  return () => {
    cleanup()
  }
}, []) // Remove ALL dependencies to prevent re-initialization

logic: The dependency array is empty, but the effect calls onVoiceStart(). Either include onVoiceStart in the dependencies or refactor (for example, read it through a ref) to avoid the stale closure.
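One common refactor that keeps the mount-only effect while avoiding the stale closure is to read the callback through a ref; a generic sketch (the hook name is hypothetical):

import { useEffect, useRef } from 'react'

function useVoiceLifecycle(onVoiceStart?: () => void) {
  // Always points at the latest callback without retriggering the effect.
  const onVoiceStartRef = useRef(onVoiceStart)
  onVoiceStartRef.current = onVoiceStart

  useEffect(() => {
    onVoiceStartRef.current?.()
    return () => {
      // cleanup() from the original effect would run here
    }
  }, []) // intentionally mount-only; the callback is read via the ref
}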

Comment on lines 179 to 182
abortControllerRef.current?.abort()
pendingRequestsRef.current.forEach((request) => {
  // Requests will be aborted by the AbortController
})

style: The forEach body is empty and its purpose is only stated in a comment. Either implement per-request abort handling or remove the loop.

Comment on lines 8 to 15
interface AudioStreamingOptions {
  voiceId: string
  modelId?: string
  onAudioStart?: () => void
  onAudioEnd?: () => void
  onError?: (error: Error) => void
  onAudioChunkStart?: () => void
}

style: Consider grouping modelId and voiceId into a TTSConfig object parameter since they're related TTS settings
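Applied to the interface above, the suggestion could look like this (TTSConfig is the reviewer's proposed name, not something the PR defines):

interface TTSConfig {
  voiceId: string
  modelId?: string
}

interface AudioStreamingOptions {
  tts: TTSConfig
  onAudioStart?: () => void
  onAudioEnd?: () => void
  onAudioChunkStart?: () => void
  onError?: (error: Error) => void
}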

Comment on lines +10 to +17
voiceSettings?: {
  isVoiceEnabled: boolean
  voiceId: string
  autoPlayResponses: boolean
  voiceFirstMode?: boolean
  textStreamingInVoiceMode?: 'hidden' | 'synced' | 'normal'
  conversationMode?: boolean
}

style: Consider grouping voiceSettings into an exported interface for reuse across components
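Roughly, that means promoting the inline object type above to an exported interface in a shared module (the module path is an assumption):

// e.g. in apps/sim/app/chat/[subdomain]/types.ts (hypothetical location)
export interface VoiceSettings {
  isVoiceEnabled: boolean
  voiceId: string
  autoPlayResponses: boolean
  voiceFirstMode?: boolean
  textStreamingInVoiceMode?: 'hidden' | 'synced' | 'normal'
  conversationMode?: boolean
}

// ...and in the component props:
// voiceSettings?: VoiceSettings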

@@ -73,6 +73,7 @@ export const env = createEnv({
   NODE_ENV: z.string().optional(),
   GITHUB_TOKEN: z.string().optional(),
   CHONKIE_API_KEY: z.string().min(1).optional(),
+  ELEVENLABS_API_KEY: z.string().min(1).optional(),

style: Group with other 3rd party API keys (around line 40-48) for better organization

Comment on lines +230 to +232
// Use sentence-based streaming for natural audio flow
const sentenceEndings = ['. ', '! ', '? ', '.\n', '!\n', '?\n', '.', '!', '?']
let sentenceEnd = -1

style: Move sentenceEndings to a constant outside the function to avoid recreating the array on each iteration.

Suggested change
- // Use sentence-based streaming for natural audio flow
- const sentenceEndings = ['. ', '! ', '? ', '.\n', '!\n', '?\n', '.', '!', '?']
- let sentenceEnd = -1
+ const SENTENCE_ENDINGS = ['. ', '! ', '? ', '.\n', '!\n', '?\n', '.', '!', '?']
+ const logger = createLogger('UseChatStreaming')
