
Commit 9fcdf2f

feat: add PocketTTS backend for lightweight text-to-speech (#273)
## Summary

- Add PocketTTS as a new TTS backend — flow-matching language model with autoregressive streaming synthesis
- Pure Swift implementation using 4 CoreML models (cond_step, flowlm_step, flow_decoder, mimi_decoder)
- iOS 17 compatible — no `scaled_dot_product_attention` ops (avoids BNNS crash)
- Add audio post-processor with de-esser for reducing sibilant harshness

## Test plan

- [x] Short sentence: WER 0, 3.44s audio
- [x] Long sentence: WER 0, 6.64s audio
- [x] Fresh HuggingFace download works end-to-end
- [x] iOS build succeeds (`xcodebuild -destination 'generic/platform=iOS'`)
- [x] macOS build succeeds (`swift build -c release`)
1 parent 980a7e5 commit 9fcdf2f

24 files changed (+2319, -137 lines)

Documentation/Models.md

Lines changed: 3 additions & 1 deletion
@@ -43,7 +43,8 @@ TDT models process audio in chunks (~15s with overlap) as batch operations. Fast

| Model | Description | Context |
|-------|-------------|---------|
-| **Kokoro TTS** | Text-to-speech synthesis (82M params), 48 voices, minimal RAM usage on iOS. | First TTS backend added. |
+| **Kokoro TTS** | Text-to-speech synthesis (82M params), 48 voices, minimal RAM usage on iOS. Generates all frames at once via flow matching over mel spectrograms + Vocos vocoder. Requires espeak for phonemization. | First TTS backend added. |
+| **PocketTTS** | Second TTS backend (~155M params). Upgrade over Kokoro with much better dynamic audio chunking. No espeak dependency. | |

## Model Sources

@@ -58,3 +59,4 @@ TDT models process audio in chunks (~15s with overlap) as batch operations. Fast
| Diarization (Pyannote) | [FluidInference/speaker-diarization-coreml](https://huggingface.co/FluidInference/speaker-diarization-coreml) |
| Sortformer | [FluidInference/diar-streaming-sortformer-coreml](https://huggingface.co/FluidInference/diar-streaming-sortformer-coreml) |
| Kokoro TTS | [FluidInference/kokoro-82m-coreml](https://huggingface.co/FluidInference/kokoro-82m-coreml) |
+| PocketTTS | [FluidInference/pocket-tts-coreml](https://huggingface.co/FluidInference/pocket-tts-coreml) |

Documentation/TTS/Kokoro.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Kokoro: High-Quality Text-to-Speech

## Overview

Kokoro is a high-quality, English-only TTS backend. It generates the entire audio representation in one pass (all frames at once) using flow matching over mel spectrograms, then converts to audio with the Vocos vocoder.

## Quick Start

### CLI

```bash
swift run fluidaudio tts "Welcome to FluidAudio text to speech" \
    --output ~/Desktop/demo.wav \
    --voice af_heart
```

The first invocation downloads Kokoro models, phoneme dictionaries, and voice embeddings; later runs reuse the cached assets.

### Swift

```swift
import FluidAudioTTS

let manager = TtSManager()
try await manager.initialize()

let audioData = try await manager.synthesize(text: "Hello from FluidAudio!")

let outputURL = URL(fileURLWithPath: "/tmp/demo.wav")
try audioData.write(to: outputURL)
```

Swap in `manager.initialize(models:)` when you want to preload only the long-form `.fifteenSecond` variant.
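
For example (a minimal sketch; the exact element type that `models:` accepts is an assumption here, not confirmed by this doc):

```swift
import FluidAudioTTS

let manager = TtSManager()
// Hypothetical: preload only the long-form variant.
try await manager.initialize(models: [.fifteenSecond])
```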

## Inspecting Chunk Metadata

```swift
let manager = TtSManager()
try await manager.initialize()

let detailed = try await manager.synthesizeDetailed(
    text: "FluidAudio can report chunk splits for you.",
    variantPreference: .fifteenSecond
)

for chunk in detailed.chunks {
    print("Chunk #\(chunk.index) -> variant: \(chunk.variant), tokens: \(chunk.tokenCount)")
    print(" text: \(chunk.text)")
}
```

`KokoroSynthesizer.SynthesisResult` also exposes `diagnostics` for per-run variant and audio footprint totals.

## SSML Support

Kokoro supports a subset of SSML tags for controlling pronunciation. See [SSML.md](SSML.md) for details.

## How It Differs From PocketTTS

| | Kokoro | PocketTTS |
|---|---|---|
| Text input | Phonemes (IPA via espeak) | Raw text (SentencePiece) |
| Voice conditioning | Style embedding vector | 125 audio prompt tokens |
| Generation | All frames at once | Frame-by-frame autoregressive |
| Flow matching target | Mel spectrogram | 32-dim latent per frame |
| Audio synthesis | Vocos vocoder | Mimi streaming codec |
| Latency to first audio | Must wait for full generation | ~80ms after prefill |

Kokoro parallelizes across time (fast total, but must wait for everything). PocketTTS is sequential across time (slower total, but audio starts immediately).

## Enable TTS in Your Project

### App/Library Development (Xcode & SwiftPM)

When adding FluidAudio to your Xcode project or Package.swift, select the **`FluidAudioWithTTS`** product:

**Xcode:**
1. File > Add Package Dependencies
2. Enter the FluidAudio repository URL
3. Choose **`FluidAudioWithTTS`**
4. Add it to your app target

**Package.swift:**
```swift
dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.7.7"),
],
targets: [
    .target(
        name: "YourTarget",
        dependencies: [
            .product(name: "FluidAudioWithTTS", package: "FluidAudio")
        ]
    )
]
```

**Import in your code:**
```swift
import FluidAudio     // Core functionality (ASR, diarization, VAD)
import FluidAudioTTS  // TTS features
```

### CLI Development

TTS support is enabled by default in the CLI:

```bash
swift run fluidaudio tts "Welcome to FluidAudio" --output ~/Desktop/demo.wav
```

Documentation/TTS/PocketTTS.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# PocketTTS Swift Inference

How the Swift code generates speech from text.

## Files

| File | Role |
|------|------|
| `PocketTtsManager.swift` | Public API — `initialize()`, `synthesize()`, `synthesizeToFile()` |
| `PocketTtsModelStore.swift` | Loads and stores the 4 CoreML models + constants + voice data |
| `PocketTtsSynthesizer.swift` | Main synthesis loop — chunking, prefill, generation, output |
| `PocketTtsSynthesizer+KVCache.swift` | KV cache state, `prefillKVCache()`, `runCondStep()`, `runFlowLMStep()` |
| `PocketTtsSynthesizer+Flow.swift` | Flow decoder loop, `denormalize()`, `quantize()`, SeededRNG |
| `PocketTtsSynthesizer+Mimi.swift` | Mimi decoder state, `runMimiDecoder()`, `loadMimiInitialState()` |
| `PocketTtsConstantsLoader.swift` | Loads binary constants (embeddings, tokenizer, quantizer weights) |
| `PocketTtsConstants.swift` | All numeric constants (dimensions, thresholds, etc.) |

## Call Flow

```
PocketTtsManager.synthesize(text:)
  |
  v
PocketTtsSynthesizer.synthesize(text:voice:temperature:)
  |
  |-- chunkText()                  split text into <=50 token chunks
  |-- loadMimiInitialState()       load 23 streaming state tensors from disk
  |
  |-- FOR EACH CHUNK:
  |     |
  |     |-- tokenizer.encode()     SentencePiece text → token IDs
  |     |-- embedTokens()          table lookup: token ID → [1024] vector
  |     |-- prefillKVCache()       feed 125 voice + N text tokens through cond_step
  |     |     |
  |     |     |-- emptyKVCacheState()   fresh cache (6 layers × [2,1,512,16,64])
  |     |     |-- runCondStep() × ~141  one token per call, updates cache
  |     |
  |     |-- GENERATE LOOP (until EOS or max frames):
  |     |     |
  |     |     |-- runFlowLMStep()       → transformer_out [1,1024] + eos_logit
  |     |     |-- flowDecode()          → 32-dim latent
  |     |     |     |-- randn(32) * sqrt(temperature)
  |     |     |     |-- runFlowDecoderStep() × 8 Euler steps
  |     |     |     |-- latent += velocity * dt each step
  |     |     |
  |     |     |-- denormalize()         latent * std + mean
  |     |     |-- quantize()            matmul [32] × [32,512] → [512]
  |     |     |-- runMimiDecoder()      [512] → 1920 audio samples
  |     |     |                         updates 23 streaming state tensors
  |     |     |
  |     |     |-- createSequenceFromLatent()  feed latent back for next frame
  |
  |-- concatenate all frames
  |-- applyTtsPostProcessing()     (optional de-essing)
  |-- AudioWAV.data()              wrap in WAV header (24kHz mono)
```
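
The flow-decode step in the diagram is just a few vector operations per frame. Below is a hedged, self-contained sketch of that math (noise scaled by temperature, 8 Euler steps, denormalize, quantize); `runFlowDecoderStep`, `mean`, `std`, and `projection` stand in for the real CoreML call and the constants loaded by `PocketTtsConstantsLoader`, and are passed in rather than taken from the actual implementation:

```swift
import Foundation

func flowDecodeSketch(
    conditioning: [Float],                                     // transformer_out, 1024 values
    temperature: Float,
    runFlowDecoderStep: ([Float], [Float], Float) -> [Float],  // (latent, conditioning, t) -> velocity [32]
    mean: [Float], std: [Float],                               // denormalization constants, 32 values each
    projection: [[Float]]                                      // quantizer weights, [32][512]
) -> [Float] {
    // Start from Gaussian noise scaled by sqrt(temperature): randn(32) * sqrt(temperature)
    var latent = (0..<32).map { _ in gaussianSample() * temperature.squareRoot() }

    // 8 Euler steps: latent += velocity * dt
    let steps = 8
    let dt = Float(1) / Float(steps)
    for step in 0..<steps {
        let t = Float(step) * dt
        let velocity = runFlowDecoderStep(latent, conditioning, t)
        for i in 0..<32 { latent[i] += velocity[i] * dt }
    }

    // denormalize(): latent * std + mean
    for i in 0..<32 { latent[i] = latent[i] * std[i] + mean[i] }

    // quantize(): [32] x [32,512] matmul -> [512] input for the Mimi decoder
    var quantized = [Float](repeating: 0, count: 512)
    for i in 0..<32 {
        for j in 0..<512 { quantized[j] += latent[i] * projection[i][j] }
    }
    return quantized
}

// Box-Muller sample from a standard normal distribution (the real code uses a SeededRNG).
func gaussianSample() -> Float {
    let u1 = Double.random(in: Double.ulpOfOne..<1)
    let u2 = Double.random(in: 0..<1)
    return Float((-2 * log(u1)).squareRoot() * cos(2 * .pi * u2))
}
```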

## Key State

### KV Cache (`KVCacheState`)

- 6 cache tensors `[2, 1, 512, 16, 64]` + 6 position counters
- Written during prefill (voice + text tokens)
- Read and extended during generation (one position per frame)
- **Reset per chunk** — each chunk gets a fresh cache
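
A hedged sketch of what a fresh per-chunk cache could look like, using the shapes listed above; the struct, its field names, the data type, and the reading of each dimension are assumptions, not taken from the implementation:

```swift
import CoreML
import Foundation

// Hypothetical sketch of a per-chunk KV cache with the shapes described above.
// [2, 1, 512, 16, 64] is read here as [key/value, batch, position, heads, headDim],
// which is an assumption; the .float32 data type is also assumed.
struct KVCacheSketch {
    var layers: [MLMultiArray]  // 6 cache tensors, one per transformer layer
    var positions: [Int]        // next write position per layer

    static func empty() throws -> KVCacheSketch {
        let shape: [NSNumber] = [2, 1, 512, 16, 64]
        let layers: [MLMultiArray] = try (0..<6).map { _ in
            let cache = try MLMultiArray(shape: shape, dataType: .float32)
            // MLMultiArray memory is not guaranteed to be zeroed, so clear it explicitly.
            memset(cache.dataPointer, 0, cache.count * MemoryLayout<Float32>.stride)
            return cache
        }
        return KVCacheSketch(layers: layers, positions: Array(repeating: 0, count: 6))
    }
}
```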

### Mimi State (`MimiState`)

- 23 tensors: convolution history, attention caches, overlap-add buffers
- Loaded once from `mimi_init_state/*.bin` files via `manifest.json`
- Updated after every `runMimiDecoder()` call — outputs feed back as next input
- **Continuous across chunks** — never reset, keeps audio seamless
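
A hedged sketch of how that feedback could look with plain CoreML calls: each prediction takes the previous state tensors alongside the frame input, and the outputs replace them for the next call (and the next chunk). The feature names (`"embeddings"`, `"audio"`, the `"_out"` suffix) are placeholders, not the model's actual input/output names:

```swift
import CoreML
import Foundation

// Hypothetical: run one Mimi decoder step, threading the 23 streaming tensors through.
// The `state` dictionary is loaded once from mimi_init_state and never reset across chunks.
func runMimiStepSketch(
    mimiDecoder: MLModel,
    quantized: MLMultiArray,             // [512] projected latent for this frame
    state: inout [String: MLMultiArray]  // streaming state, persists across chunks
) throws -> MLMultiArray {
    var inputs: [String: Any] = ["embeddings": quantized]  // placeholder input name
    for (name, tensor) in state { inputs[name] = tensor }

    let output = try mimiDecoder.prediction(from: MLDictionaryFeatureProvider(dictionary: inputs))

    // Updated state tensors feed back as the next call's inputs.
    for name in Array(state.keys) {
        if let updated = output.featureValue(for: name + "_out")?.multiArrayValue {  // placeholder naming
            state[name] = updated
        }
    }
    // 1920 audio samples for this frame (placeholder output name).
    guard let audio = output.featureValue(for: "audio")?.multiArrayValue else {
        throw NSError(domain: "MimiSketch", code: 1)
    }
    return audio
}
```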

## Text Chunking

Long text is split into chunks of <=50 tokens to fit the KV cache (512 positions, minus ~125 voice + ~25 overhead).

Splitting priority:
1. Sentence boundaries (`.!?`)
2. Clause boundaries (`,;:`)
3. Word boundaries (fallback)

`normalizeText()` also capitalizes, adds terminal punctuation, and pads short text with leading spaces for better prosody.
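
A hedged sketch of that priority order (not the actual `chunkText()` implementation); `tokenCount` stands in for the SentencePiece tokenizer, and punctuation handling is simplified (separators are dropped here):

```swift
import Foundation

// Hypothetical sketch of the splitting priority: sentences, then clauses, then words,
// greedily re-packed into chunks whose (assumed) token count stays within the budget.
func chunkTextSketch(_ text: String, maxTokens: Int = 50, tokenCount: (String) -> Int) -> [String] {
    func pieces(_ s: String, separators: CharacterSet) -> [String] {
        s.components(separatedBy: separators)
            .map { $0.trimmingCharacters(in: .whitespaces) }
            .filter { !$0.isEmpty }
    }

    // 1. sentence boundaries  2. clause boundaries  3. word boundaries (fallback)
    var parts = pieces(text, separators: CharacterSet(charactersIn: ".!?"))
    if parts.contains(where: { tokenCount($0) > maxTokens }) {
        parts = parts.flatMap { pieces($0, separators: CharacterSet(charactersIn: ",;:")) }
    }
    if parts.contains(where: { tokenCount($0) > maxTokens }) {
        parts = parts.flatMap { $0.split(separator: " ").map(String.init) }
    }

    // Greedily pack the pieces back into <= maxTokens chunks.
    var chunks: [String] = []
    var current = ""
    for part in parts {
        let candidate = current.isEmpty ? part : current + " " + part
        if tokenCount(candidate) <= maxTokens {
            current = candidate
        } else {
            if !current.isEmpty { chunks.append(current) }
            current = part
        }
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}
```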

## EOS Detection

`runFlowLMStep()` returns an `eos_logit`. When it exceeds `-4.0`, the code generates a few extra frames (3 for short text, 1 for long) then stops.
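
As a hedged sketch (the function and parameter names are placeholders, not the real ones), the stopping rule amounts to:

```swift
// Hypothetical: returns how many trailing frames to generate once EOS fires, or nil to keep going.
func trailingFramesAfterEOS(eosLogit: Float, isShortText: Bool, threshold: Float = -4.0) -> Int? {
    guard eosLogit > threshold else { return nil }  // EOS not reached yet
    return isShortText ? 3 : 1                      // a few extra frames, then stop
}
```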

## CoreML Details

- All 4 models loaded with `.cpuAndGPU` compute units (ANE float16 causes artifacts in Mimi state feedback)
- Models compiled from `.mlpackage` → `.mlmodelc` on first load, cached on disk
- `PocketTtsModelStore` is an actor — thread-safe access to loaded models
- Voice data cached per voice name to avoid reloading
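
The compile-and-load step uses standard CoreML APIs; here is a hedged sketch with a placeholder cache layout (the real `PocketTtsModelStore` logic may differ):

```swift
import CoreML
import Foundation

// Hypothetical: compile a .mlpackage to .mlmodelc once, cache it on disk,
// and load it with CPU+GPU compute units.
func loadPocketTtsModel(packageURL: URL, cacheDirectory: URL) async throws -> MLModel {
    let compiledURL = cacheDirectory
        .appendingPathComponent(packageURL.deletingPathExtension().lastPathComponent)
        .appendingPathExtension("mlmodelc")

    if !FileManager.default.fileExists(atPath: compiledURL.path) {
        // CoreML compiles to a temporary location; move the result into the cache.
        let tempURL = try await MLModel.compileModel(at: packageURL)
        try FileManager.default.createDirectory(at: cacheDirectory, withIntermediateDirectories: true)
        try FileManager.default.moveItem(at: tempURL, to: compiledURL)
    }

    let configuration = MLModelConfiguration()
    configuration.computeUnits = .cpuAndGPU  // ANE float16 causes artifacts in Mimi state feedback
    return try MLModel(contentsOf: compiledURL, configuration: configuration)
}
```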

## Usage

```swift
import FluidAudioTTS

let manager = PocketTtsManager()
try await manager.initialize()

let audioData = try await manager.synthesize(text: "Hello, world!")

try await manager.synthesizeToFile(
    text: "Hello, world!",
    outputURL: URL(fileURLWithPath: "/tmp/output.wav")
)
```

## License

CC-BY-4.0, inherited from [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts).

Documentation/TTS/README.md

Lines changed: 0 additions & 108 deletions
This file was deleted.
