
Commit 9fcdf2f

feat: add PocketTTS backend for lightweight text-to-speech (#273)
## Summary

- Add PocketTTS as a new TTS backend — flow-matching language model with autoregressive streaming synthesis
- Pure Swift implementation using 4 CoreML models (cond_step, flowlm_step, flow_decoder, mimi_decoder)
- iOS 17 compatible — no `scaled_dot_product_attention` ops (avoids BNNS crash)
- Add audio post-processor with de-esser for reducing sibilant harshness

## Test plan

- [x] Short sentence: WER 0, 3.44s audio
- [x] Long sentence: WER 0, 6.64s audio
- [x] Fresh HuggingFace download works end-to-end
- [x] iOS build succeeds (`xcodebuild -destination 'generic/platform=iOS'`)
- [x] macOS build succeeds (`swift build -c release`)
1 parent 980a7e5 commit 9fcdf2f

24 files changed (+2319, -137 lines)

Documentation/Models.md

Lines changed: 3 additions & 1 deletion
@@ -43,7 +43,8 @@ TDT models process audio in chunks (~15s with overlap) as batch operations. Fast

| Model | Description | Context |
|-------|-------------|---------|
-| **Kokoro TTS** | Text-to-speech synthesis (82M params), 48 voices, minimal RAM usage on iOS. | First TTS backend added. |
+| **Kokoro TTS** | Text-to-speech synthesis (82M params), 48 voices, minimal RAM usage on iOS. Generates all frames at once via flow matching over mel spectrograms + Vocos vocoder. Requires espeak for phonemization. | First TTS backend added. |
+| **PocketTTS** | Second TTS backend (~155M params). Upgrade over Kokoro with much better dynamic audio chunking. No espeak dependency. | |

## Model Sources

@@ -58,3 +59,4 @@ TDT models process audio in chunks (~15s with overlap) as batch operations. Fast
| Diarization (Pyannote) | [FluidInference/speaker-diarization-coreml](https://huggingface.co/FluidInference/speaker-diarization-coreml) |
| Sortformer | [FluidInference/diar-streaming-sortformer-coreml](https://huggingface.co/FluidInference/diar-streaming-sortformer-coreml) |
| Kokoro TTS | [FluidInference/kokoro-82m-coreml](https://huggingface.co/FluidInference/kokoro-82m-coreml) |
+| PocketTTS | [FluidInference/pocket-tts-coreml](https://huggingface.co/FluidInference/pocket-tts-coreml) |

Documentation/TTS/Kokoro.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Kokoro: High-Quality Text-to-Speech

## Overview

Kokoro is a high-quality, English-only TTS backend. It generates the entire audio representation in one pass (all frames at once) using flow matching over mel spectrograms, then converts to audio with the Vocos vocoder.

## Quick Start

### CLI

```bash
swift run fluidaudio tts "Welcome to FluidAudio text to speech" \
    --output ~/Desktop/demo.wav \
    --voice af_heart
```

The first invocation downloads Kokoro models, phoneme dictionaries, and voice embeddings; later runs reuse the cached assets.

### Swift

```swift
import FluidAudioTTS

let manager = TtSManager()
try await manager.initialize()

let audioData = try await manager.synthesize(text: "Hello from FluidAudio!")

let outputURL = URL(fileURLWithPath: "/tmp/demo.wav")
try audioData.write(to: outputURL)
```

Swap in `manager.initialize(models:)` when you want to preload only the long-form `.fifteenSecond` variant.
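
For example (a minimal sketch; the exact element type that `models:` accepts is an assumption here, not confirmed by this doc):

```swift
import FluidAudioTTS

let manager = TtSManager()
// Hypothetical: preload only the long-form variant.
try await manager.initialize(models: [.fifteenSecond])
```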

## Inspecting Chunk Metadata

```swift
let manager = TtSManager()
try await manager.initialize()

let detailed = try await manager.synthesizeDetailed(
    text: "FluidAudio can report chunk splits for you.",
    variantPreference: .fifteenSecond
)

for chunk in detailed.chunks {
    print("Chunk #\(chunk.index) -> variant: \(chunk.variant), tokens: \(chunk.tokenCount)")
    print(" text: \(chunk.text)")
}
```

`KokoroSynthesizer.SynthesisResult` also exposes `diagnostics` for per-run variant and audio footprint totals.

## SSML Support

Kokoro supports a subset of SSML tags for controlling pronunciation. See [SSML.md](SSML.md) for details.

## How It Differs From PocketTTS

| | Kokoro | PocketTTS |
|---|---|---|
| Text input | Phonemes (IPA via espeak) | Raw text (SentencePiece) |
| Voice conditioning | Style embedding vector | 125 audio prompt tokens |
| Generation | All frames at once | Frame-by-frame autoregressive |
| Flow matching target | Mel spectrogram | 32-dim latent per frame |
| Audio synthesis | Vocos vocoder | Mimi streaming codec |
| Latency to first audio | Must wait for full generation | ~80ms after prefill |

Kokoro parallelizes across time (fast total, but must wait for everything). PocketTTS is sequential across time (slower total, but audio starts immediately).

## Enable TTS in Your Project

### App/Library Development (Xcode & SwiftPM)

When adding FluidAudio to your Xcode project or Package.swift, select the **`FluidAudioWithTTS`** product:

**Xcode:**
1. File > Add Package Dependencies
2. Enter the FluidAudio repository URL
3. Choose **`FluidAudioWithTTS`**
4. Add it to your app target

**Package.swift:**
```swift
dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.7.7"),
],
targets: [
    .target(
        name: "YourTarget",
        dependencies: [
            .product(name: "FluidAudioWithTTS", package: "FluidAudio")
        ]
    )
]
```

**Import in your code:**
```swift
import FluidAudio     // Core functionality (ASR, diarization, VAD)
import FluidAudioTTS  // TTS features
```

### CLI Development

TTS support is enabled by default in the CLI:

```bash
swift run fluidaudio tts "Welcome to FluidAudio" --output ~/Desktop/demo.wav
```

Documentation/TTS/PocketTTS.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# PocketTTS Swift Inference

How the Swift code generates speech from text.

## Files

| File | Role |
|------|------|
| `PocketTtsManager.swift` | Public API — `initialize()`, `synthesize()`, `synthesizeToFile()` |
| `PocketTtsModelStore.swift` | Loads and stores the 4 CoreML models + constants + voice data |
| `PocketTtsSynthesizer.swift` | Main synthesis loop — chunking, prefill, generation, output |
| `PocketTtsSynthesizer+KVCache.swift` | KV cache state, `prefillKVCache()`, `runCondStep()`, `runFlowLMStep()` |
| `PocketTtsSynthesizer+Flow.swift` | Flow decoder loop, `denormalize()`, `quantize()`, SeededRNG |
| `PocketTtsSynthesizer+Mimi.swift` | Mimi decoder state, `runMimiDecoder()`, `loadMimiInitialState()` |
| `PocketTtsConstantsLoader.swift` | Loads binary constants (embeddings, tokenizer, quantizer weights) |
| `PocketTtsConstants.swift` | All numeric constants (dimensions, thresholds, etc.) |

## Call Flow

```
PocketTtsManager.synthesize(text:)
  |
  v
PocketTtsSynthesizer.synthesize(text:voice:temperature:)
  |
  |-- chunkText()                  split text into <=50 token chunks
  |-- loadMimiInitialState()       load 23 streaming state tensors from disk
  |
  |-- FOR EACH CHUNK:
  |     |
  |     |-- tokenizer.encode()     SentencePiece text → token IDs
  |     |-- embedTokens()          table lookup: token ID → [1024] vector
  |     |-- prefillKVCache()       feed 125 voice + N text tokens through cond_step
  |     |     |
  |     |     |-- emptyKVCacheState()   fresh cache (6 layers × [2,1,512,16,64])
  |     |     |-- runCondStep() × ~141  one token per call, updates cache
  |     |
  |     |-- GENERATE LOOP (until EOS or max frames):
  |     |     |
  |     |     |-- runFlowLMStep()       → transformer_out [1,1024] + eos_logit
  |     |     |-- flowDecode()          → 32-dim latent
  |     |     |     |-- randn(32) * sqrt(temperature)
  |     |     |     |-- runFlowDecoderStep() × 8 Euler steps
  |     |     |     |-- latent += velocity * dt each step
  |     |     |
  |     |     |-- denormalize()         latent * std + mean
  |     |     |-- quantize()            matmul [32] × [32,512] → [512]
  |     |     |-- runMimiDecoder()      [512] → 1920 audio samples
  |     |     |                         updates 23 streaming state tensors
  |     |     |
  |     |     |-- createSequenceFromLatent()  feed latent back for next frame
  |
  |-- concatenate all frames
  |-- applyTtsPostProcessing()     (optional de-essing)
  |-- AudioWAV.data()              wrap in WAV header (24kHz mono)
```
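
The flow-decode step in the diagram is just a few vector operations per frame. Below is a hedged, self-contained sketch of that math (noise scaled by temperature, 8 Euler steps, denormalize, quantize); `runFlowDecoderStep`, `mean`, `std`, and `projection` stand in for the real CoreML call and the constants loaded by `PocketTtsConstantsLoader`, and are passed in rather than taken from the actual implementation:

```swift
import Foundation

func flowDecodeSketch(
    conditioning: [Float],                                     // transformer_out, 1024 values
    temperature: Float,
    runFlowDecoderStep: ([Float], [Float], Float) -> [Float],  // (latent, conditioning, t) -> velocity [32]
    mean: [Float], std: [Float],                               // denormalization constants, 32 values each
    projection: [[Float]]                                      // quantizer weights, [32][512]
) -> [Float] {
    // Start from Gaussian noise scaled by sqrt(temperature): randn(32) * sqrt(temperature)
    var latent = (0..<32).map { _ in gaussianSample() * temperature.squareRoot() }

    // 8 Euler steps: latent += velocity * dt
    let steps = 8
    let dt = Float(1) / Float(steps)
    for step in 0..<steps {
        let t = Float(step) * dt
        let velocity = runFlowDecoderStep(latent, conditioning, t)
        for i in 0..<32 { latent[i] += velocity[i] * dt }
    }

    // denormalize(): latent * std + mean
    for i in 0..<32 { latent[i] = latent[i] * std[i] + mean[i] }

    // quantize(): [32] x [32,512] matmul -> [512] input for the Mimi decoder
    var quantized = [Float](repeating: 0, count: 512)
    for i in 0..<32 {
        for j in 0..<512 { quantized[j] += latent[i] * projection[i][j] }
    }
    return quantized
}

// Box-Muller sample from a standard normal distribution (the real code uses a SeededRNG).
func gaussianSample() -> Float {
    let u1 = Double.random(in: Double.ulpOfOne..<1)
    let u2 = Double.random(in: 0..<1)
    return Float((-2 * log(u1)).squareRoot() * cos(2 * .pi * u2))
}
```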

## Key State

### KV Cache (`KVCacheState`)

- 6 cache tensors `[2, 1, 512, 16, 64]` + 6 position counters
- Written during prefill (voice + text tokens)
- Read and extended during generation (one position per frame)
- **Reset per chunk** — each chunk gets a fresh cache
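
A hedged sketch of what a fresh per-chunk cache could look like, using the shapes listed above; the struct, its field names, the data type, and the reading of each dimension are assumptions, not taken from the implementation:

```swift
import CoreML
import Foundation

// Hypothetical sketch of a per-chunk KV cache with the shapes described above.
// [2, 1, 512, 16, 64] is read here as [key/value, batch, position, heads, headDim],
// which is an assumption; the .float32 data type is also assumed.
struct KVCacheSketch {
    var layers: [MLMultiArray]  // 6 cache tensors, one per transformer layer
    var positions: [Int]        // next write position per layer

    static func empty() throws -> KVCacheSketch {
        let shape: [NSNumber] = [2, 1, 512, 16, 64]
        let layers: [MLMultiArray] = try (0..<6).map { _ in
            let cache = try MLMultiArray(shape: shape, dataType: .float32)
            // MLMultiArray memory is not guaranteed to be zeroed, so clear it explicitly.
            memset(cache.dataPointer, 0, cache.count * MemoryLayout<Float32>.stride)
            return cache
        }
        return KVCacheSketch(layers: layers, positions: Array(repeating: 0, count: 6))
    }
}
```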

### Mimi State (`MimiState`)

- 23 tensors: convolution history, attention caches, overlap-add buffers
- Loaded once from `mimi_init_state/*.bin` files via `manifest.json`
- Updated after every `runMimiDecoder()` call — outputs feed back as next input
- **Continuous across chunks** — never reset, keeps audio seamless
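
A hedged sketch of how that feedback could look with plain CoreML calls: each prediction takes the previous state tensors alongside the frame input, and the outputs replace them for the next call (and the next chunk). The feature names (`"embeddings"`, `"audio"`, the `"_out"` suffix) are placeholders, not the model's actual input/output names:

```swift
import CoreML
import Foundation

// Hypothetical: run one Mimi decoder step, threading the 23 streaming tensors through.
// The `state` dictionary is loaded once from mimi_init_state and never reset across chunks.
func runMimiStepSketch(
    mimiDecoder: MLModel,
    quantized: MLMultiArray,             // [512] projected latent for this frame
    state: inout [String: MLMultiArray]  // streaming state, persists across chunks
) throws -> MLMultiArray {
    var inputs: [String: Any] = ["embeddings": quantized]  // placeholder input name
    for (name, tensor) in state { inputs[name] = tensor }

    let output = try mimiDecoder.prediction(from: MLDictionaryFeatureProvider(dictionary: inputs))

    // Updated state tensors feed back as the next call's inputs.
    for name in Array(state.keys) {
        if let updated = output.featureValue(for: name + "_out")?.multiArrayValue {  // placeholder naming
            state[name] = updated
        }
    }
    // 1920 audio samples for this frame (placeholder output name).
    guard let audio = output.featureValue(for: "audio")?.multiArrayValue else {
        throw NSError(domain: "MimiSketch", code: 1)
    }
    return audio
}
```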

## Text Chunking

Long text is split into chunks of <=50 tokens to fit the KV cache (512 positions, minus ~125 voice + ~25 overhead).

Splitting priority:
1. Sentence boundaries (`.!?`)
2. Clause boundaries (`,;:`)
3. Word boundaries (fallback)

`normalizeText()` also capitalizes, adds terminal punctuation, and pads short text with leading spaces for better prosody.
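
A hedged sketch of that priority order (not the actual `chunkText()` implementation); `tokenCount` stands in for the SentencePiece tokenizer, and punctuation handling is simplified (separators are dropped here):

```swift
import Foundation

// Hypothetical sketch of the splitting priority: sentences, then clauses, then words,
// greedily re-packed into chunks whose (assumed) token count stays within the budget.
func chunkTextSketch(_ text: String, maxTokens: Int = 50, tokenCount: (String) -> Int) -> [String] {
    func pieces(_ s: String, separators: CharacterSet) -> [String] {
        s.components(separatedBy: separators)
            .map { $0.trimmingCharacters(in: .whitespaces) }
            .filter { !$0.isEmpty }
    }

    // 1. sentence boundaries  2. clause boundaries  3. word boundaries (fallback)
    var parts = pieces(text, separators: CharacterSet(charactersIn: ".!?"))
    if parts.contains(where: { tokenCount($0) > maxTokens }) {
        parts = parts.flatMap { pieces($0, separators: CharacterSet(charactersIn: ",;:")) }
    }
    if parts.contains(where: { tokenCount($0) > maxTokens }) {
        parts = parts.flatMap { $0.split(separator: " ").map(String.init) }
    }

    // Greedily pack the pieces back into <= maxTokens chunks.
    var chunks: [String] = []
    var current = ""
    for part in parts {
        let candidate = current.isEmpty ? part : current + " " + part
        if tokenCount(candidate) <= maxTokens {
            current = candidate
        } else {
            if !current.isEmpty { chunks.append(current) }
            current = part
        }
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}
```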

## EOS Detection

`runFlowLMStep()` returns an `eos_logit`. When it exceeds `-4.0`, the code generates a few extra frames (3 for short text, 1 for long) then stops.
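
As a hedged sketch (the function and parameter names are placeholders, not the real ones), the stopping rule amounts to:

```swift
// Hypothetical: returns how many trailing frames to generate once EOS fires, or nil to keep going.
func trailingFramesAfterEOS(eosLogit: Float, isShortText: Bool, threshold: Float = -4.0) -> Int? {
    guard eosLogit > threshold else { return nil }  // EOS not reached yet
    return isShortText ? 3 : 1                      // a few extra frames, then stop
}
```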

## CoreML Details

- All 4 models loaded with `.cpuAndGPU` compute units (ANE float16 causes artifacts in Mimi state feedback)
- Models compiled from `.mlpackage` → `.mlmodelc` on first load, cached on disk
- `PocketTtsModelStore` is an actor — thread-safe access to loaded models
- Voice data cached per voice name to avoid reloading
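
The compile-and-load step uses standard CoreML APIs; here is a hedged sketch with a placeholder cache layout (the real `PocketTtsModelStore` logic may differ):

```swift
import CoreML
import Foundation

// Hypothetical: compile a .mlpackage to .mlmodelc once, cache it on disk,
// and load it with CPU+GPU compute units.
func loadPocketTtsModel(packageURL: URL, cacheDirectory: URL) async throws -> MLModel {
    let compiledURL = cacheDirectory
        .appendingPathComponent(packageURL.deletingPathExtension().lastPathComponent)
        .appendingPathExtension("mlmodelc")

    if !FileManager.default.fileExists(atPath: compiledURL.path) {
        // CoreML compiles to a temporary location; move the result into the cache.
        let tempURL = try await MLModel.compileModel(at: packageURL)
        try FileManager.default.createDirectory(at: cacheDirectory, withIntermediateDirectories: true)
        try FileManager.default.moveItem(at: tempURL, to: compiledURL)
    }

    let configuration = MLModelConfiguration()
    configuration.computeUnits = .cpuAndGPU  // ANE float16 causes artifacts in Mimi state feedback
    return try MLModel(contentsOf: compiledURL, configuration: configuration)
}
```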

## Usage

```swift
import FluidAudioTTS

let manager = PocketTtsManager()
try await manager.initialize()

let audioData = try await manager.synthesize(text: "Hello, world!")

try await manager.synthesizeToFile(
    text: "Hello, world!",
    outputURL: URL(fileURLWithPath: "/tmp/output.wav")
)
```

## License

CC-BY-4.0, inherited from [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts).

Documentation/TTS/README.md

Lines changed: 0 additions & 108 deletions
This file was deleted.
