A FastAPI wrapper for NVIDIA's Parakeet-TDT 0.6B v2 model, a 600-million-parameter ASR model designed for high-quality English speech recognition.

Parakeet-TDT 0.6B v2 FastAPI STT Service

A production-ready FastAPI service for high-accuracy English speech-to-text using NVIDIA's Parakeet-TDT 0.6B v2 model. It implements both REST and WebSocket endpoints, following the interface of the OpenAI Audio API.

Features

  • RESTful transcription

    • POST /transcribe with multipart audio uploads
    • Word/character/segment timestamps
    • OpenAI-compatible response schema
  • WebSocket streaming

    • Real-time voice activity detection via Silero VAD
    • Partial/final transcription delivery
    • Supports 16kHz mono PCM input
  • Batch processing

    • Micro-batching for efficient GPU utilization
    • Configurable batch size and processing timeout
  • Production-ready deployment

    • Docker and Docker Compose support
    • Health checks and configuration endpoints
    • Environment variable configuration
  • Audio preprocessing

    • Automatic downmixing and resampling
    • File validation and chunking

Table of Contents

  • Prerequisites
  • Installation
  • Configuration
  • Running the Server
  • Usage
  • Architecture Overview
  • Environment Variables
  • Contributing

Prerequisites

  • Python 3.10+
  • NVIDIA GPU with CUDA 12.1+ (recommended)
  • Docker Engine 24.0+ (for container deployment)

Installation

Local Development

git clone https://github.com/Shadowfita/parakeet-tdt-0.6b-v2-fastapi.git
cd parakeet-tdt-0.6b-v2-fastapi

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Docker Deployment

docker build -t parakeet-stt .
docker run -d -p 8000:8000 --gpus all parakeet-stt

Docker Compose

docker-compose up --build

Configuration

All configuration is managed through environment variables. Create a .env file with your preferences:

# Model configuration
MODEL_PRECISION=fp16
DEVICE=cuda
BATCH_SIZE=4

# Audio processing
TARGET_SR=16000
MAX_AUDIO_DURATION=30
VAD_THRESHOLD=0.5

# System
LOG_LEVEL=INFO
PROCESSING_TIMEOUT=60
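
The project's config.py is not reproduced here; as a minimal sketch (assuming pydantic-settings, which may not match the actual implementation), the variables above could be loaded like this:

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # protected_namespaces=() lets the MODEL_PRECISION field start with "model_"
    model_config = SettingsConfigDict(env_file=".env", protected_namespaces=())

    model_precision: str = "fp16"
    device: str = "cuda"
    batch_size: int = 4
    target_sr: int = 16000
    max_audio_duration: int = 30
    vad_threshold: float = 0.5
    log_level: str = "INFO"
    processing_timeout: int = 60

settings = Settings()  # reads the environment and .env, falling back to these defaults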

Running the Server

Local Development

uvicorn parakeet_service.main:app --host 0.0.0.0 --port 8000

Production

docker-compose up --build -d

Usage

REST API

Health Check

curl http://localhost:8000/healthz
# {"status":"ok"}

Transcription

curl -X POST http://localhost:8000/transcribe \
  -F file="@audio.wav" \
  -F include_timestamps=true \
  -F should_chunk=true

Parameters:

Name                Type     Default   Description
file                audio/*  Required  Audio file (wav, mp3, flac)
include_timestamps  bool     false     Return word/segment timestamps
should_chunk        bool     true      Enable audio chunking for long files

Response:

{
  "text": "Transcribed text content",
  "timestamps": {
    "words": [
      {"text": "Hello", "start": 0.2, "end": 0.5},
      {"text": "world", "start": 0.6, "end": 0.9}
    ],
    "segments": [
      {"text": "Hello world", "start": 0.2, "end": 0.9}
    ]
  }
}
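
The same request can be issued from Python with the requests library. A minimal client sketch using the documented endpoint and form fields:

import requests

# Send a local audio file to /transcribe with timestamps enabled.
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8000/transcribe",
        files={"file": ("audio.wav", f, "audio/wav")},
        data={"include_timestamps": "true", "should_chunk": "true"},
        timeout=120,
    )

response.raise_for_status()
result = response.json()
print(result["text"])
for word in result.get("timestamps", {}).get("words", []):
    print(f'{word["text"]}: {word["start"]:.2f}s - {word["end"]:.2f}s')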

WebSocket Streaming

Connect to ws://localhost:8000/ws to stream audio:

  • Input: 16kHz mono PCM frames (int16)
  • Output: JSON messages with partial/final transcriptions

JavaScript Example:

const ws = new WebSocket("ws://localhost:8000/ws");
const audioContext = new AudioContext({ sampleRate: 16000 }); // request 16 kHz where supported
const processor = audioContext.createScriptProcessor(1024, 1, 1);

// Convert Web Audio float samples ([-1, 1]) to 16-bit PCM.
function convertFloat32ToInt16(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return int16;
}

processor.onaudioprocess = e => {
  const pcmData = e.inputBuffer.getChannelData(0);
  const int16Data = convertFloat32ToInt16(pcmData);
  ws.send(int16Data.buffer); // raw int16 PCM frames
};

ws.onmessage = evt => {
  const data = JSON.parse(evt.data);
  console.log("Transcription:", data.text);
};

// Connect a microphone source (e.g. from getUserMedia) to the processor to start streaming.
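
A Python streaming client is sketched below. It assumes the third-party websockets and soundfile packages (not project requirements) and a file that is already 16 kHz mono:

import asyncio
import json

import soundfile as sf
import websockets

async def stream_file(path):
    # Load the file as int16 samples; it must already be 16 kHz mono PCM.
    audio, sample_rate = sf.read(path, dtype="int16")
    assert sample_rate == 16000, "resample to 16 kHz mono before streaming"

    async with websockets.connect("ws://localhost:8000/ws") as ws:
        # Send the audio in small frames, mimicking a live microphone stream.
        frame = 1024
        for start in range(0, len(audio), frame):
            await ws.send(audio[start:start + frame].tobytes())

        # Read partial/final transcription messages until the server goes quiet.
        try:
            while True:
                message = json.loads(await asyncio.wait_for(ws.recv(), timeout=5))
                print(message.get("text", ""))
        except asyncio.TimeoutError:
            pass

asyncio.run(stream_file("audio.wav"))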

Architecture Overview

graph LR
A[Client] -->|HTTP| B[REST API]
A -->|WebSocket| C[Streaming API]
B --> D[Batch Worker]
C --> E[VAD Processing]
E --> F[Chunker]
F --> D
D --> G[ASR Model]
G --> H[Response Formatter]
H --> A

Components:

  1. main.py - App initialization and lifecycle management
  2. routes.py - REST endpoints implementation
  3. stream_routes.py - WebSocket endpoint handler
  4. streaming_vad.py - Voice activity detection
  5. chunker.py - Audio segmentation
  6. batchworker.py - Micro-batch processing (see the sketch after this list)
  7. model.py - ASR model interface
  8. audio.py - Audio preprocessing utilities
  9. config.py - Configuration management
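
batchworker.py appears above only by name; the sketch below illustrates the general micro-batching idea with hypothetical names, not the project's actual code: collect requests until BATCH_SIZE items are queued or a short wait expires, then run a single batched inference call.

import asyncio

# Illustrative micro-batching loop; names and behaviour are hypothetical.
async def batch_worker(queue: asyncio.Queue, transcribe_batch, batch_size=4, max_wait=0.05):
    while True:
        # Wait for the first pending request, then top the batch up for at most max_wait seconds.
        batch = [await queue.get()]
        try:
            while len(batch) < batch_size:
                batch.append(await asyncio.wait_for(queue.get(), timeout=max_wait))
        except asyncio.TimeoutError:
            pass

        # One batched ASR call serves every request collected above.
        texts = transcribe_batch([item["audio"] for item in batch])
        for item, text in zip(batch, texts):
            item["future"].set_result(text)  # hand the result back to the waiting endpoint

In this scheme, each REST request would enqueue a dict holding its audio and an asyncio.Future, then await that future for its transcription.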

Environment Variables

Variable            Default  Description
MODEL_PRECISION     fp16     Model precision (fp16/fp32)
DEVICE              cuda     Computation device
BATCH_SIZE          4        Processing batch size
TARGET_SR           16000    Target sample rate (Hz)
MAX_AUDIO_DURATION  30       Max audio length in seconds
VAD_THRESHOLD       0.5      Voice activity threshold
LOG_LEVEL           INFO     Logging verbosity
PROCESSING_TIMEOUT  60       Processing timeout in seconds

Contributing

  1. Fork the repository and create your feature branch
  2. Submit a pull request with a detailed description of your changes
