Commit e37dc26

docs: add voice transcription documentation

1 parent 96759b2

File tree

4 files changed: +247 −1 lines changed

docs/content/docs/(configuration)/config.mdx

Lines changed: 12 additions & 0 deletions
@@ -28,6 +28,10 @@ These environment variables control instance-level behavior and are not set in `
 | `SPACEBOT_USER_TIMEZONE` | inherits cron | Default timezone for channel/worker temporal context. Overridden by config equivalents. |
 | `SPACEBOT_CHANNEL_MODEL` | `anthropic/claude-sonnet-4-20250514` | Default channel model (env-only mode). |
 | `SPACEBOT_WORKER_MODEL` | `anthropic/claude-haiku-4.5-20250514` | Default worker model (env-only mode). |
+| `SPACEBOT_VOICE_MODEL` | None | Voice transcription model (e.g. `groq/whisper-large-v3-turbo`). |
+| `SPACEBOT_VOICE_LANGUAGE` | None | Language hint for voice transcription (ISO 639-1, e.g. `en`, `es`). |
+| `SPACEBOT_VOICE_TRANSLATE` | `false` | Set to `true` to translate audio to English instead of transcribing. |
+| `SPACEBOT_STT_PROVIDER` | None | Override which provider handles speech-to-text (e.g. `groq`, `openai`). |

 ## Full Reference

@@ -85,6 +89,10 @@ branch = "anthropic/claude-sonnet-4-20250514"
 worker = "anthropic/claude-haiku-4.5-20250514"
 compactor = "anthropic/claude-haiku-4.5-20250514"
 cortex = "anthropic/claude-haiku-4.5-20250514"
+voice = "groq/whisper-large-v3-turbo" # STT model (provider/model)
+voice_language = "en"                 # optional language hint
+# voice_translate = false             # set true to translate to English
+# stt_provider = "groq"               # optional provider override
 rate_limit_cooldown_secs = 60

 # Task-type overrides for workers/branches.
@@ -462,6 +470,10 @@ At least one provider (legacy key or custom provider) must be configured.
 | `worker` | string | `anthropic/claude-haiku-4.5-20250514` | Model for task workers |
 | `compactor` | string | `anthropic/claude-haiku-4.5-20250514` | Model for summarization |
 | `cortex` | string | `anthropic/claude-haiku-4.5-20250514` | Model for system observation |
+| `voice` | string | Provider-dependent | STT model for audio transcription (e.g. `groq/whisper-large-v3-turbo`). Empty disables voice. See [Voice Transcription](/docs/voice-transcription). |
+| `voice_language` | string | None | ISO 639-1 language hint for transcription accuracy (e.g. `en`, `es`, `ja`). Ignored in translation mode. |
+| `voice_translate` | bool | `false` | When `true`, translates audio to English via `/v1/audio/translations` instead of transcribing in the source language. |
+| `stt_provider` | string | None | Override which provider handles STT. When absent, the provider is extracted from the `voice` model prefix. |
 | `rate_limit_cooldown_secs` | integer | 60 | How long to deprioritize a rate-limited model |

Routing selects providers by the prefix before the first `/` in the model name.
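That prefix rule can be sketched as a small helper. This is a hypothetical illustration (`provider_prefix` is not necessarily the real function name in the codebase):

```rust
/// Extract the provider prefix before the first `/` in a model name.
/// Hypothetical helper illustrating the documented routing rule.
fn provider_prefix(model: &str) -> Option<&str> {
    model.split_once('/').map(|(provider, _)| provider)
}

fn main() {
    // `groq/whisper-large-v3-turbo` routes to the `groq` provider.
    assert_eq!(provider_prefix("groq/whisper-large-v3-turbo"), Some("groq"));
    // Only the first `/` matters, so OpenRouter-style names keep their full model path.
    assert_eq!(
        provider_prefix("openrouter/anthropic/claude-sonnet-4"),
        Some("openrouter")
    );
    // A bare model name has no provider prefix.
    assert_eq!(provider_prefix("whisper-1"), None);
}
```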

docs/content/docs/(core)/routing.mdx

Lines changed: 40 additions & 0 deletions
@@ -111,6 +111,10 @@ pub struct RoutingConfig {
     pub worker: String,
     pub compactor: String,
     pub cortex: String,
+    pub voice: String,
+    pub voice_language: Option<String>,
+    pub voice_translate: bool,
+    pub stt_provider: Option<String>,
     pub task_overrides: HashMap<String, String>,
     pub fallbacks: HashMap<String, Vec<String>>,
     pub rate_limit_cooldown_secs: u64,
@@ -212,6 +216,42 @@ pub struct LlmManager {

 Rate limit state is shared across all agents (it's provider-level, not agent-level). When a 429 is received, the model is marked with the current timestamp. Future routing decisions can check `is_rate_limited()` to proactively skip models in cooldown.

+## Voice Transcription Routing
+
+Voice transcription (speech-to-text) uses a separate routing path from the main LLM models. When a user sends an audio attachment (e.g. a Telegram voice message), Spacebot transcribes it to text using a Whisper-compatible API before the channel LLM ever sees it.
+
+Voice routing is independent of the main process-type routing: you can use Anthropic for chat and Groq for transcription.
+
+```toml
+[defaults.routing]
+channel = "anthropic/claude-sonnet-4-20250514" # chat
+voice = "groq/whisper-large-v3-turbo"          # transcription (different provider)
+```
+
+### How It Routes
+
+1. The `stt_provider` override (if set) determines the provider.
+2. Otherwise, the provider prefix in `voice` is used (e.g. `groq/` in `groq/whisper-large-v3-turbo`).
+3. The provider must support the Whisper-compatible `/v1/audio/transcriptions` endpoint.
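A minimal sketch of that resolution order (hypothetical helper name; assumes the prefix-before-first-`/` rule described above):

```rust
/// Resolve which provider handles speech-to-text.
/// Hypothetical sketch of the documented order: an explicit
/// `stt_provider` override wins; otherwise fall back to the
/// provider prefix of the `voice` model name.
fn resolve_stt_provider(stt_provider: Option<&str>, voice: &str) -> Option<String> {
    if let Some(provider) = stt_provider {
        return Some(provider.to_string());
    }
    voice
        .split_once('/')
        .map(|(provider, _)| provider.to_string())
}

fn main() {
    // No override: provider comes from the `voice` prefix.
    assert_eq!(
        resolve_stt_provider(None, "groq/whisper-large-v3-turbo"),
        Some("groq".to_string())
    );
    // Explicit override wins regardless of the model prefix.
    assert_eq!(
        resolve_stt_provider(Some("openai"), "groq/whisper-large-v3-turbo"),
        Some("openai".to_string())
    );
    // Empty `voice` and no override: no STT provider is selected.
    assert_eq!(resolve_stt_provider(None, ""), None);
}
```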
+### Supported STT Providers
+
+| Provider | Default Voice Model | Endpoint |
+|----------|---------------------|----------|
+| OpenAI | `openai/whisper-1` | `/v1/audio/transcriptions` |
+| Groq | `groq/whisper-large-v3-turbo` | `/openai/v1/audio/transcriptions` |
+| Gemini | `gemini/gemini-2.5-flash` | `/v1/audio/transcriptions` (OpenAI-compatible) |
+
+Providers without native STT (Anthropic, OpenRouter, DeepSeek, etc.) require configuring a separate STT provider:
+
+```toml
+[defaults.routing]
+channel = "openrouter/anthropic/claude-sonnet-4" # chat via OpenRouter
+voice = "groq/whisper-large-v3-turbo"            # STT via Groq
+```
+
+See [Voice Transcription](/docs/voice-transcription) for the full feature reference, including language hints, translation mode, and configuration examples.

 ## What We Don't Do

 **No prompt-level content analysis.** We know the process type and task type at spawn time.
Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
 {
   "title": "Features",
-  "pages": ["workers", "tasks", "opencode", "tools", "mcp", "browser", "cron", "skills", "ingestion"]
+  "pages": ["workers", "tasks", "opencode", "tools", "mcp", "browser", "cron", "skills", "ingestion", "voice-transcription"]
 }
Lines changed: 194 additions & 0 deletions

@@ -0,0 +1,194 @@

---
title: Voice Transcription
description: Speech-to-text transcription for audio attachments using Whisper-compatible APIs.
---

# Voice Transcription

Spacebot converts audio attachments (Telegram voice messages, Discord audio clips, etc.) to text using Whisper-compatible speech-to-text APIs. The transcript is injected into the conversation before the channel LLM processes it.

## How It Works

When a user sends an audio attachment, Spacebot:

1. Downloads the audio bytes from the messaging platform
2. Resolves the STT provider and model from the routing config
3. Sends a multipart `POST` to the provider's `/v1/audio/transcriptions` endpoint
4. Injects the transcript into the conversation as a structured XML tag

The channel LLM sees the transcript, not the raw audio:

```xml
<voice_transcript name="voice_message.ogg" mime="audio/ogg">
Hello, this is what the user said in their voice message.
</voice_transcript>
```

When translation mode is enabled, the tag changes:

```xml
<voice_translation name="voice_message.ogg" mime="audio/ogg">
Hello, this is the English translation of what the user said.
</voice_translation>
```
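Assembling such a tag is straightforward string formatting. A hypothetical sketch (`format_voice_tag` is an illustrative name, not the real injection code; real code would also need to escape the transcript text):

```rust
/// Build the XML tag injected into the conversation.
/// Hypothetical sketch; tag names follow the documented
/// `voice_transcript` / `voice_translation` convention.
fn format_voice_tag(name: &str, mime: &str, text: &str, translated: bool) -> String {
    let tag = if translated { "voice_translation" } else { "voice_transcript" };
    format!("<{tag} name=\"{name}\" mime=\"{mime}\">\n{text}\n</{tag}>")
}

fn main() {
    let transcript = format_voice_tag("voice_message.ogg", "audio/ogg", "Hello there.", false);
    assert!(transcript.starts_with("<voice_transcript name=\"voice_message.ogg\""));
    assert!(transcript.ends_with("</voice_transcript>"));

    // Translation mode swaps the tag name.
    let translated = format_voice_tag("voice_message.ogg", "audio/ogg", "Hello there.", true);
    assert!(translated.starts_with("<voice_translation"));
}
```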
## Configuration

All voice settings live under `[defaults.routing]` or per-agent `[agents.routing]`.

```toml
[defaults.routing]
voice = "groq/whisper-large-v3-turbo"
voice_language = "en"   # optional
voice_translate = false # optional
stt_provider = "groq"   # optional
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `voice` | string | Provider-dependent | STT model in `provider/model` format. An empty string disables voice transcription. |
| `voice_language` | string | None | ISO 639-1 language hint for accuracy (e.g. `en`, `es`, `fr`, `ja`). Ignored in translation mode. |
| `voice_translate` | bool | `false` | When `true`, uses the translations endpoint to translate audio to English. |
| `stt_provider` | string | None | Override which provider handles STT. When absent, the provider is extracted from the `voice` model prefix. |
### Provider Defaults

When no explicit `voice` is set, Spacebot applies a default based on the primary provider:

| Primary Provider | Default `voice` | Notes |
|------------------|-----------------|-------|
| OpenAI | `openai/whisper-1` | Native Whisper API |
| Groq | `groq/whisper-large-v3-turbo` | Fast and cheap |
| Gemini | `gemini/gemini-2.5-flash` | OpenAI-compatible endpoint |
| OpenRouter | *(empty)* | No native STT — configure `stt_provider` separately |
| Anthropic | *(empty)* | No STT — configure `stt_provider` separately |
| All others | *(empty)* | Must configure `voice` explicitly |
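The table above amounts to a simple mapping. A hypothetical sketch (`default_voice_model` is an illustrative name):

```rust
/// Default `voice` model per primary provider, per the table above.
/// Hypothetical sketch; an empty string means voice stays disabled
/// until configured explicitly.
fn default_voice_model(primary_provider: &str) -> &'static str {
    match primary_provider {
        "openai" => "openai/whisper-1",
        "groq" => "groq/whisper-large-v3-turbo",
        "gemini" => "gemini/gemini-2.5-flash",
        // OpenRouter, Anthropic, and everything else: no native STT.
        _ => "",
    }
}

fn main() {
    assert_eq!(default_voice_model("groq"), "groq/whisper-large-v3-turbo");
    assert_eq!(default_voice_model("anthropic"), "");
    assert_eq!(default_voice_model("deepseek"), "");
}
```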
### Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `SPACEBOT_VOICE_MODEL` | STT model | `groq/whisper-large-v3-turbo` |
| `SPACEBOT_VOICE_LANGUAGE` | Language hint | `en` |
| `SPACEBOT_VOICE_TRANSLATE` | Translation mode | `true` |
| `SPACEBOT_STT_PROVIDER` | Provider override | `groq` |

Resolution order: **environment variable > config file > provider default**.
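That precedence maps cleanly onto `Option` chaining. A hypothetical sketch (the argument values below are illustrative, not real defaults):

```rust
/// Resolve the effective `voice` model using the documented
/// precedence: environment variable > config file > provider default.
/// Hypothetical sketch for illustration.
fn resolve_voice_model(
    env: Option<&str>,
    config: Option<&str>,
    provider_default: &str,
) -> String {
    env.or(config).unwrap_or(provider_default).to_string()
}

fn main() {
    // Env var wins over both config and default.
    assert_eq!(
        resolve_voice_model(Some("openai/whisper-1"), Some("groq/whisper-large-v3"), ""),
        "openai/whisper-1"
    );
    // No env var: the config file value applies.
    assert_eq!(
        resolve_voice_model(None, Some("groq/whisper-large-v3-turbo"), "openai/whisper-1"),
        "groq/whisper-large-v3-turbo"
    );
    // Neither set: fall back to the provider default.
    assert_eq!(resolve_voice_model(None, None, "openai/whisper-1"), "openai/whisper-1");
}
```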
## Supported Providers

Voice transcription requires a provider that supports the OpenAI-compatible Whisper API (`/v1/audio/transcriptions` with multipart form data).

| Provider | Models | Transcription Endpoint | Translation Endpoint |
|----------|--------|------------------------|----------------------|
| **OpenAI** | `whisper-1`, `gpt-4o-transcribe`, `gpt-4o-mini-transcribe` | `/v1/audio/transcriptions` | `/v1/audio/translations` |
| **Groq** | `whisper-large-v3`, `whisper-large-v3-turbo` | `/openai/v1/audio/transcriptions` | `/openai/v1/audio/translations` |
| **Gemini** | `gemini-2.5-flash` (and other Gemini models) | `/v1/audio/transcriptions` | Not supported |

Providers that do **not** have a transcription endpoint (Anthropic, OpenRouter, DeepSeek, Together, xAI, Mistral, etc.) cannot be used directly for voice. Configure a separate STT provider instead.

### Supported Audio Formats

The Whisper API accepts: `flac`, `m4a`, `mp3`, `mp4`, `mpeg`, `mpga`, `oga`, `ogg`, `wav`, `webm`.

Telegram voice messages (OGG/Opus) are natively supported; no conversion is needed.
98+
## Examples
99+
100+
### Groq for chat and transcription
101+
102+
```toml
103+
[llm]
104+
groq_key = "gsk_xxx"
105+
106+
[defaults.routing]
107+
channel = "groq/llama-3.3-70b-versatile"
108+
voice = "groq/whisper-large-v3-turbo"
109+
```
110+
111+
### OpenRouter for chat, Groq for transcription
112+
113+
```toml
114+
[llm]
115+
openrouter_key = "sk-or-xxx"
116+
groq_key = "gsk_xxx"
117+
118+
[defaults.routing]
119+
channel = "openrouter/anthropic/claude-sonnet-4"
120+
voice = "groq/whisper-large-v3-turbo"
121+
voice_language = "en"
122+
```
123+
124+
### Anthropic for chat, OpenAI for transcription with translation
125+
126+
```toml
127+
[llm]
128+
anthropic_key = "sk-ant-xxx"
129+
openai_key = "sk-xxx"
130+
131+
[defaults.routing]
132+
channel = "anthropic/claude-sonnet-4"
133+
voice = "openai/whisper-1"
134+
voice_translate = true
135+
stt_provider = "openai"
136+
```
137+
138+
### Multilingual transcription with language hint
139+
140+
```toml
141+
[llm]
142+
openai_key = "sk-xxx"
143+
144+
[defaults.routing]
145+
channel = "openai/gpt-4.1"
146+
voice = "openai/whisper-1"
147+
voice_language = "ja"
148+
```
149+
150+
### Gemini for everything
151+
152+
```toml
153+
[llm]
154+
gemini_key = "xxx"
155+
156+
[defaults.routing]
157+
channel = "gemini/gemini-2.5-pro"
158+
voice = "gemini/gemini-2.5-flash"
159+
```
160+
## Error Handling

Errors are returned as inline text in the conversation so the channel LLM can inform the user:

| Condition | Message |
|-----------|---------|
| No voice model configured | `[Audio attachment received but no voice model is configured...]` |
| STT provider not found | `[Audio transcription failed: provider 'xxx' is not configured]` |
| Provider doesn't support Whisper | `[Audio transcription not supported by provider 'xxx'...]` |
| API error | `[Audio transcription failed for filename.ogg: Whisper API error (400): ...]` |
| Download failure | `[Failed to download audio: filename.ogg]` |

There is no fallback to alternative transcription methods: if transcription fails, the error is returned directly.
175+
## API
176+
177+
### Runtime Configuration
178+
179+
Voice settings are included in the agent config API:
180+
181+
```
182+
GET /api/config?agent_id=main
183+
PATCH /api/config { "agent_id": "main", "routing": { "voice": "...", ... } }
184+
```
185+
186+
### Model Discovery
187+
188+
Filter models to transcription-capable providers:
189+
190+
```
191+
GET /api/models?capability=voice_transcription
192+
```
193+
194+
Returns models from providers that support the Whisper-compatible transcription endpoint (currently: OpenAI, Groq, Gemini).
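Given the current provider list, the capability filter boils down to an allow-list. A hypothetical sketch (the real API may derive this from provider metadata instead):

```rust
/// Providers that currently expose a Whisper-compatible
/// transcription endpoint, per the list above. Hypothetical sketch.
fn supports_voice_transcription(provider: &str) -> bool {
    matches!(provider, "openai" | "groq" | "gemini")
}

fn main() {
    let providers = ["openai", "anthropic", "groq", "gemini", "openrouter"];
    let capable: Vec<&str> = providers
        .into_iter()
        .filter(|&p| supports_voice_transcription(p))
        .collect();
    assert_eq!(capable, vec!["openai", "groq", "gemini"]);
}
```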
