---
title: Voice Transcription
description: Speech-to-text transcription for audio attachments using Whisper-compatible APIs.
---

# Voice Transcription

Spacebot converts audio attachments (Telegram voice messages, Discord audio clips, etc.) to text using Whisper-compatible speech-to-text APIs. The transcript is injected into the conversation before the channel LLM processes it.

## How It Works

When a user sends an audio attachment, Spacebot:

1. Downloads the audio bytes from the messaging platform
2. Resolves the STT provider and model from routing config
3. Sends a multipart `POST` to the provider's `/v1/audio/transcriptions` endpoint
4. Injects the transcript into the conversation as a structured XML tag

The channel LLM sees the transcript, not raw audio:

```xml
<voice_transcript name="voice_message.ogg" mime="audio/ogg">
Hello, this is what the user said in their voice message.
</voice_transcript>
```

When translation mode is enabled, the tag changes:

```xml
<voice_translation name="voice_message.ogg" mime="audio/ogg">
Hello, this is the English translation of what the user said.
</voice_translation>
```

## Configuration

All voice settings live under `[defaults.routing]` or per-agent `[agents.routing]`.

```toml
[defaults.routing]
voice = "groq/whisper-large-v3-turbo"
voice_language = "en"   # optional
voice_translate = false # optional
stt_provider = "groq"   # optional
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `voice` | string | Provider-dependent | STT model in `provider/model` format. Empty string disables voice transcription. |
| `voice_language` | string | None | ISO 639-1 language hint for accuracy (e.g. `en`, `es`, `fr`, `ja`). Ignored in translation mode. |
| `voice_translate` | bool | `false` | When `true`, uses the translations endpoint to translate audio to English. |
| `stt_provider` | string | None | Override which provider handles STT. When absent, provider is extracted from the `voice` model prefix. |

### Provider Defaults

When no explicit `voice` is set, Spacebot applies a default based on the primary provider:

| Primary Provider | Default `voice` | Notes |
|------------------|-----------------|-------|
| OpenAI | `openai/whisper-1` | Native Whisper API |
| Groq | `groq/whisper-large-v3-turbo` | Fast and cheap |
| Gemini | `gemini/gemini-2.5-flash` | OpenAI-compatible endpoint |
| OpenRouter | *(empty)* | No native STT — configure `stt_provider` separately |
| Anthropic | *(empty)* | No STT — configure `stt_provider` separately |
| All others | *(empty)* | Must configure `voice` explicitly |

### Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `SPACEBOT_VOICE_MODEL` | STT model | `groq/whisper-large-v3-turbo` |
| `SPACEBOT_VOICE_LANGUAGE` | Language hint | `en` |
| `SPACEBOT_VOICE_TRANSLATE` | Translation mode | `true` |
| `SPACEBOT_STT_PROVIDER` | Provider override | `groq` |

Resolution order: **environment variable > config file > provider default**.

## Supported Providers

Voice transcription requires a provider that supports the OpenAI-compatible Whisper API (`/v1/audio/transcriptions` with multipart form data).

| Provider | Models | Transcription Endpoint | Translation Endpoint |
|----------|--------|------------------------|----------------------|
| **OpenAI** | `whisper-1`, `gpt-4o-transcribe`, `gpt-4o-mini-transcribe` | `/v1/audio/transcriptions` | `/v1/audio/translations` |
| **Groq** | `whisper-large-v3`, `whisper-large-v3-turbo` | `/openai/v1/audio/transcriptions` | `/openai/v1/audio/translations` |
| **Gemini** | `gemini-2.5-flash` (and other Gemini models) | `/v1/audio/transcriptions` | Not supported |

Providers that do **not** have a transcription endpoint (Anthropic, OpenRouter, DeepSeek, Together, xAI, Mistral, etc.) cannot be used directly for voice. Configure a separate STT provider instead.

### Supported Audio Formats

The Whisper API accepts: `flac`, `m4a`, `mp3`, `mp4`, `mpeg`, `mpga`, `oga`, `ogg`, `wav`, `webm`.

Telegram voice messages (OGG/Opus) are natively supported with no conversion needed.
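
A minimal extension check against that list might look like this (illustrative only; Spacebot's actual validation may differ):

```python
# Extensions accepted by the Whisper API, per the list above.
WHISPER_FORMATS = {"flac", "m4a", "mp3", "mp4", "mpeg", "mpga",
                   "oga", "ogg", "wav", "webm"}

def is_supported_audio(filename: str) -> bool:
    """True if the file extension is one the Whisper API accepts."""
    return filename.rsplit(".", 1)[-1].lower() in WHISPER_FORMATS
```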

## Examples

### Groq for chat and transcription

```toml
[llm]
groq_key = "gsk_xxx"

[defaults.routing]
channel = "groq/llama-3.3-70b-versatile"
voice = "groq/whisper-large-v3-turbo"
```

### OpenRouter for chat, Groq for transcription

```toml
[llm]
openrouter_key = "sk-or-xxx"
groq_key = "gsk_xxx"

[defaults.routing]
channel = "openrouter/anthropic/claude-sonnet-4"
voice = "groq/whisper-large-v3-turbo"
voice_language = "en"
```

### Anthropic for chat, OpenAI for transcription with translation

```toml
[llm]
anthropic_key = "sk-ant-xxx"
openai_key = "sk-xxx"

[defaults.routing]
channel = "anthropic/claude-sonnet-4"
voice = "openai/whisper-1"
voice_translate = true
stt_provider = "openai"
```

### Multilingual transcription with language hint

```toml
[llm]
openai_key = "sk-xxx"

[defaults.routing]
channel = "openai/gpt-4.1"
voice = "openai/whisper-1"
voice_language = "ja"
```

### Gemini for everything

```toml
[llm]
gemini_key = "xxx"

[defaults.routing]
channel = "gemini/gemini-2.5-pro"
voice = "gemini/gemini-2.5-flash"
```

## Error Handling

Errors are returned as inline text in the conversation so the channel LLM can inform the user:

| Condition | Message |
|-----------|---------|
| No voice model configured | `[Audio attachment received but no voice model is configured...]` |
| STT provider not found | `[Audio transcription failed: provider 'xxx' is not configured]` |
| Provider doesn't support Whisper | `[Audio transcription not supported by provider 'xxx'...]` |
| API error | `[Audio transcription failed for filename.ogg: Whisper API error (400): ...]` |
| Download failure | `[Failed to download audio: filename.ogg]` |

There is no fallback to alternative transcription methods. If transcription fails, the error is returned directly.
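
The pattern amounts to: transcription either yields text or an inline bracketed message. A sketch under hypothetical names:

```python
def transcript_or_error(filename: str, transcribe) -> str:
    """Return the transcript, or an inline bracketed error for the channel LLM.

    `transcribe` stands in for the actual STT call; there is no fallback path,
    so any failure becomes inline text in the conversation.
    """
    try:
        return transcribe(filename)
    except Exception as exc:
        return f"[Audio transcription failed for {filename}: {exc}]"
```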

## API

### Runtime Configuration

Voice settings are included in the agent config API:

```
GET /api/config?agent_id=main
PATCH /api/config { "agent_id": "main", "routing": { "voice": "...", ... } }
```

### Model Discovery

Filter models to transcription-capable providers:

```
GET /api/models?capability=voice_transcription
```

Returns models from providers that support the Whisper-compatible transcription endpoint (currently: OpenAI, Groq, Gemini).
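
A client could build that query like so (the base URL here is an assumption, not a documented default):

```python
from urllib.parse import urlencode

def models_url(base: str, capability: str) -> str:
    """Build a capability-filtered model discovery URL."""
    return f"{base}/api/models?{urlencode({'capability': capability})}"

# models_url("http://localhost:8080", "voice_transcription")
# == "http://localhost:8080/api/models?capability=voice_transcription"
```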