---
title: Voice Transcription
description: Speech-to-text transcription for audio attachments using Whisper-compatible APIs.
---

# Voice Transcription

Spacebot converts audio attachments (Telegram voice messages, Discord audio clips, etc.) to text using Whisper-compatible speech-to-text APIs. The transcript is injected into the conversation before the channel LLM processes it.

## How It Works

When a user sends an audio attachment, Spacebot:

1. Downloads the audio bytes from the messaging platform
2. Resolves the STT provider and model from routing config
3. Sends a multipart `POST` to the provider's `/v1/audio/transcriptions` endpoint
4. Injects the transcript into the conversation as a structured XML tag

The channel LLM sees the transcript, not raw audio:

```xml
<voice_transcript name="voice_message.ogg" mime="audio/ogg">
Hello, this is what the user said in their voice message.
</voice_transcript>
```

When translation mode is enabled, the tag changes:

```xml
<voice_translation name="voice_message.ogg" mime="audio/ogg">
Hello, this is the English translation of what the user said.
</voice_translation>
```

## Configuration

All voice settings live under `[defaults.routing]` or per-agent `[agents.routing]`.

```toml
[defaults.routing]
voice = "groq/whisper-large-v3-turbo"
voice_language = "en"   # optional
voice_translate = false # optional
stt_provider = "groq"   # optional
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `voice` | string | Provider-dependent | STT model in `provider/model` format. Empty string disables voice transcription. |
| `voice_language` | string | None | ISO 639-1 language hint for accuracy (e.g. `en`, `es`, `fr`, `ja`). Ignored in translation mode. |
| `voice_translate` | bool | `false` | When `true`, uses the translations endpoint to translate audio to English. |
| `stt_provider` | string | None | Override which provider handles STT. When absent, provider is extracted from the `voice` model prefix. |

### Provider Defaults

When no explicit `voice` is set, Spacebot applies a default based on the primary provider:

| Primary Provider | Default `voice` | Notes |
|------------------|-----------------|-------|
| OpenAI | `openai/whisper-1` | Native Whisper API |
| Groq | `groq/whisper-large-v3-turbo` | Fast and cheap |
| Gemini | `gemini/gemini-2.5-flash` | OpenAI-compatible endpoint |
| OpenRouter | *(empty)* | No native STT — configure `stt_provider` separately |
| Anthropic | *(empty)* | No STT — configure `stt_provider` separately |
| All others | *(empty)* | Must configure `voice` explicitly |

### Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `SPACEBOT_VOICE_MODEL` | STT model | `groq/whisper-large-v3-turbo` |
| `SPACEBOT_VOICE_LANGUAGE` | Language hint | `en` |
| `SPACEBOT_VOICE_TRANSLATE` | Translation mode | `true` |
| `SPACEBOT_STT_PROVIDER` | Provider override | `groq` |

Resolution order: **environment variable > config file > provider default**.

## Supported Providers

Voice transcription requires a provider that supports the OpenAI-compatible Whisper API (`/v1/audio/transcriptions` with multipart form data).

| Provider | Models | Transcription Endpoint | Translation Endpoint |
|----------|--------|------------------------|----------------------|
| **OpenAI** | `whisper-1`, `gpt-4o-transcribe`, `gpt-4o-mini-transcribe` | `/v1/audio/transcriptions` | `/v1/audio/translations` |
| **Groq** | `whisper-large-v3`, `whisper-large-v3-turbo` | `/openai/v1/audio/transcriptions` | `/openai/v1/audio/translations` |
| **Gemini** | `gemini-2.5-flash` (and other Gemini models) | `/v1/audio/transcriptions` | Not supported |

Providers that do **not** have a transcription endpoint (Anthropic, OpenRouter, DeepSeek, Together, xAI, Mistral, etc.) cannot be used directly for voice. Configure a separate STT provider instead.

### Supported Audio Formats

The Whisper API accepts: `flac`, `m4a`, `mp3`, `mp4`, `mpeg`, `mpga`, `oga`, `ogg`, `wav`, `webm`.

Telegram voice messages (OGG/Opus) are natively supported with no conversion needed.
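
A minimal extension check against that list might look like this (illustrative only; Spacebot's actual validation may differ):

```python
# Extensions accepted by the Whisper API, per the list above.
WHISPER_FORMATS = {"flac", "m4a", "mp3", "mp4", "mpeg", "mpga",
                   "oga", "ogg", "wav", "webm"}

def is_supported_audio(filename: str) -> bool:
    """True if the file extension is one the Whisper API accepts."""
    return filename.rsplit(".", 1)[-1].lower() in WHISPER_FORMATS
```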

## Examples

### Groq for chat and transcription

```toml
[llm]
groq_key = "gsk_xxx"

[defaults.routing]
channel = "groq/llama-3.3-70b-versatile"
voice = "groq/whisper-large-v3-turbo"
```

### OpenRouter for chat, Groq for transcription

```toml
[llm]
openrouter_key = "sk-or-xxx"
groq_key = "gsk_xxx"

[defaults.routing]
channel = "openrouter/anthropic/claude-sonnet-4"
voice = "groq/whisper-large-v3-turbo"
voice_language = "en"
```

### Anthropic for chat, OpenAI for transcription with translation

```toml
[llm]
anthropic_key = "sk-ant-xxx"
openai_key = "sk-xxx"

[defaults.routing]
channel = "anthropic/claude-sonnet-4"
voice = "openai/whisper-1"
voice_translate = true
stt_provider = "openai"
```

### Multilingual transcription with language hint

```toml
[llm]
openai_key = "sk-xxx"

[defaults.routing]
channel = "openai/gpt-4.1"
voice = "openai/whisper-1"
voice_language = "ja"
```

### Gemini for everything

```toml
[llm]
gemini_key = "xxx"

[defaults.routing]
channel = "gemini/gemini-2.5-pro"
voice = "gemini/gemini-2.5-flash"
```

## Error Handling

Errors are returned as inline text in the conversation so the channel LLM can inform the user:

| Condition | Message |
|-----------|---------|
| No voice model configured | `[Audio attachment received but no voice model is configured...]` |
| STT provider not found | `[Audio transcription failed: provider 'xxx' is not configured]` |
| Provider doesn't support Whisper | `[Audio transcription not supported by provider 'xxx'...]` |
| API error | `[Audio transcription failed for filename.ogg: Whisper API error (400): ...]` |
| Download failure | `[Failed to download audio: filename.ogg]` |

There is no fallback to alternative transcription methods. If transcription fails, the error is returned directly.
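
The pattern amounts to: transcription either yields text or an inline bracketed message. A sketch under hypothetical names:

```python
def transcript_or_error(filename: str, transcribe) -> str:
    """Return the transcript, or an inline bracketed error for the channel LLM.

    `transcribe` stands in for the actual STT call; there is no fallback path,
    so any failure becomes inline text in the conversation.
    """
    try:
        return transcribe(filename)
    except Exception as exc:
        return f"[Audio transcription failed for {filename}: {exc}]"
```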

## API

### Runtime Configuration

Voice settings are included in the agent config API:

```
GET /api/config?agent_id=main
PATCH /api/config { "agent_id": "main", "routing": { "voice": "...", ... } }
```

### Model Discovery

Filter models to transcription-capable providers:

```
GET /api/models?capability=voice_transcription
```

Returns models from providers that support the Whisper-compatible transcription endpoint (currently: OpenAI, Groq, Gemini).
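
A client could build that query like so (the base URL here is an assumption, not a documented default):

```python
from urllib.parse import urlencode

def models_url(base: str, capability: str) -> str:
    """Build a capability-filtered model discovery URL."""
    return f"{base}/api/models?{urlencode({'capability': capability})}"

# models_url("http://localhost:8080", "voice_transcription")
# == "http://localhost:8080/api/models?capability=voice_transcription"
```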