A macOS command-line utility that wraps the FluidAudio library to provide speech-to-text transcription with speaker diarization, powered by the Parakeet v3 model.
- Audio Processing: Supports MP3 and other common audio formats
- Speech Recognition: Uses Parakeet TDT v3 (0.6b) for high-accuracy transcription
- Speaker Diarization: Identifies different speakers and segments their speech
- On-Device Processing: Everything runs locally on the Apple Neural Engine (ANE)
- Multiple Output Formats: Individual transcripts, diarization results, and combined output
- macOS 13.0 or later
- Apple Silicon Mac (M1, M2, M3, M4) recommended for optimal performance
- Swift 5.10 or later
The easiest way to install STT is using Homebrew:
# Add the tap
brew tap jvsteiner/tap
# Install STT
brew install stt

That's it! The stt command will be available in your terminal.
git clone https://github.com/jvsteiner/stt.git
cd stt
swift build -c release

The compiled binary will be available at .build/release/stt
# Copy to a directory in your PATH
cp .build/release/stt /usr/local/bin/stt

If you installed via Homebrew:
brew uninstall stt
brew untap jvsteiner/tap

# Transcribe and diarize an audio file
stt input.mp3
# Transcribe only (skip diarization)
stt input.mp3 --transcribe-only
# Verbose output
stt input.mp3 --verbose
# Custom output directory
stt input.mp3 --output ./results/
# Custom diarization threshold (0.0-1.0, default: 0.8)
stt input.mp3 --threshold 0.6

USAGE: stt <input-file> [--output <output>] [--verbose] [--transcribe-only] [--threshold <threshold>]
ARGUMENTS:
<input-file> Input MP3 file path
OPTIONS:
-o, --output <output> Output directory for results (default: same as input file)
-v, --verbose Enable verbose output
--transcribe-only Skip diarization and only perform transcription
--threshold <threshold> Diarization clustering threshold (0.0-1.0, default: 0.8)
-h, --help Show help information.
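For reference, the options above map naturally onto a Swift Argument Parser declaration (the CLI library the tool is built with). The following is a hedged sketch that mirrors the help text, not the tool's actual source:

```swift
import ArgumentParser

// Illustrative declaration matching the help text above; property
// names and structure are assumptions, not the tool's real code.
@main
struct STT: ParsableCommand {
    @Argument(help: "Input MP3 file path")
    var inputFile: String

    @Option(name: .shortAndLong, help: "Output directory for results")
    var output: String?

    @Flag(name: .shortAndLong, help: "Enable verbose output")
    var verbose = false

    @Flag(help: "Skip diarization and only perform transcription")
    var transcribeOnly = false

    @Option(help: "Diarization clustering threshold (0.0-1.0)")
    var threshold: Float = 0.8

    mutating func run() throws {
        // The transcription/diarization pipeline would be invoked here.
    }
}
```

Argument Parser derives the kebab-case flag names (e.g. `--transcribe-only`) and the generated help output automatically from declarations like these.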
The tool generates up to three output files:
Pure transcription text without speaker information.
I think we have finally got a real competitor for anthropic...
Detailed diarization results with timing and speaker information.
SPEAKER DIARIZATION RESULTS
==========================
Audio Duration: 296.2 seconds
Speaker Count: 2
Segments: 15
Processing Time: 2.14 seconds
Real-time Factor: 0.14x
SPEAKER SEGMENTS:
-----------------
Speaker 1: 00:00.240 - 00:45.680 (45.4s) [Quality: 85.2%]
Speaker 2: 00:45.800 - 01:32.120 (46.3s) [Quality: 91.7%]
Speaker 1: 01:32.240 - 02:15.800 (43.6s) [Quality: 88.9%]
...
Transcription combined with speaker information and timing.
COMBINED TRANSCRIPT WITH SPEAKER DIARIZATION
===========================================
TRANSCRIPT BY SPEAKER:
---------------------
[00:00.240 - 00:45.680] Speaker 1:
I think we have finally got a real competitor for anthropic. The new model seems to be performing really well in our tests.
[00:45.800 - 01:32.120] Speaker 2:
Yes, I agree. The performance metrics are impressive, especially for code generation tasks.
...
FULL TRANSCRIPTION:
-------------------
I think we have finally got a real competitor for anthropic. The new model seems to be performing really well in our tests. Yes, I agree. The performance metrics are impressive, especially for code generation tasks...
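The sample outputs above print timestamps as MM:SS.mmm. A minimal Swift sketch of that formatting (the exact format string is an assumption):

```swift
import Foundation

// Formats a time in seconds as MM:SS.mmm, matching the style of the
// sample output above (assumption about the exact format).
func timestamp(_ seconds: Double) -> String {
    let minutes = Int(seconds) / 60
    let secs = seconds - Double(minutes * 60)
    return String(format: "%02d:%06.3f", minutes, secs)
}

// timestamp(0.24)  -> "00:00.240"
// timestamp(92.12) -> "01:32.120"
```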
- Real-time Factor: Typically 0.05x - 0.2x (processes 1 minute of audio in 3-12 seconds)
- Memory Usage: Optimized for Apple Neural Engine with minimal CPU/GPU usage
- Accuracy: Competitive with state-of-the-art models
- WER: ~2.7% on LibriSpeech test-clean
- DER: ~17.7% on AMI benchmark for diarization
Parakeet v3 supports 25 European languages:
- English, French, German, Italian, Spanish, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Slovak, Hungarian, Romanian, Bulgarian, Croatian, Serbian, Slovenian, Estonian, Latvian, Lithuanian, Finnish, Danish, Swedish, Norwegian
# Process a meeting recording with custom threshold
stt meeting_recording.mp3 --threshold 0.6 --verbose

# Process multiple podcast episodes
for file in *.mp3; do
echo "Processing $file..."
stt "$file" --output ./podcast_transcripts/
done

# Fast transcription without speaker identification
stt interview.mp3 --transcribe-only

- Models are downloaded automatically on first use
- Ensure internet connectivity for initial setup
- Models are cached locally (~2GB total)
- Use Apple Silicon Macs for optimal performance
- Ensure sufficient free memory (4GB+ recommended)
- Close other intensive applications during processing
- The tool automatically converts audio to the required format (16kHz mono); see the conversion sketch after this list
- Most common formats are supported (MP3, WAV, M4A, FLAC, etc.)
- Adjust the --threshold parameter (lower = more speakers, higher = fewer speakers); see the clustering sketch after this list
- The default of 0.8 works well for most cases
- Try 0.6 for conversations with many speakers
- Try 0.8 for interviews with distinct speakers
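The 16kHz-mono conversion mentioned above can be done with AVFoundation. Here is a minimal sketch using AVAudioConverter; it illustrates the approach, not the tool's actual implementation:

```swift
import AVFoundation

// Illustrative sketch: resample and down-mix an audio file to 16 kHz
// mono Float32 using AVAudioConverter.
func convertTo16kMono(url: URL) throws -> AVAudioPCMBuffer {
    let file = try AVAudioFile(forReading: url)
    let target = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                               sampleRate: 16_000,
                               channels: 1,
                               interleaved: false)!
    let converter = AVAudioConverter(from: file.processingFormat, to: target)!

    // Read the whole file into a buffer in its native processing format.
    let input = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                                 frameCapacity: AVAudioFrameCount(file.length))!
    try file.read(into: input)

    // Size the output buffer for the new sample rate.
    let ratio = target.sampleRate / file.processingFormat.sampleRate
    let output = AVAudioPCMBuffer(
        pcmFormat: target,
        frameCapacity: AVAudioFrameCount(Double(file.length) * ratio) + 1)!

    var consumed = false
    var conversionError: NSError?
    converter.convert(to: output, error: &conversionError) { _, status in
        if consumed {
            status.pointee = .endOfStream
            return nil
        }
        consumed = true
        status.pointee = .haveData
        return input
    }
    if let conversionError { throw conversionError }
    return output
}
```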
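To illustrate why a lower threshold yields more speakers, here is a simplified, greedy sketch of threshold-based speaker clustering. This is an assumption-laden illustration, not FluidAudio's actual (Pyannote-based) clustering:

```swift
import Foundation

// Cosine distance between two speaker embeddings (0 = identical direction).
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, na: Float = 0, nb: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        na  += a[i] * a[i]
        nb  += b[i] * b[i]
    }
    return 1 - dot / (sqrt(na) * sqrt(nb))
}

// Greedy clustering: assign each segment to the nearest speaker centroid,
// or start a new speaker when every centroid is farther than `threshold`.
// A lower threshold therefore yields more speakers, a higher one fewer.
func clusterSpeakers(embeddings: [[Float]], threshold: Float) -> [Int] {
    var centroids: [[Float]] = []
    var labels: [Int] = []
    for embedding in embeddings {
        let nearest = centroids.enumerated()
            .map { ($0.offset, cosineDistance($0.element, embedding)) }
            .min { $0.1 < $1.1 }
        if let (index, distance) = nearest, distance < threshold {
            labels.append(index)
            // Crude centroid update: average with the new embedding.
            centroids[index] = zip(centroids[index], embedding).map { ($0 + $1) / 2 }
        } else {
            centroids.append(embedding)
            labels.append(centroids.count - 1)
        }
    }
    return labels
}
```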
- ASR: Parakeet TDT v3 (0.6b) - NVIDIA's FastConformer-based transducer model
- Diarization: Pyannote-based speaker segmentation and embedding
- VAD: Silero VAD v2 for voice activity detection
- Audio format conversion (16kHz mono)
- Voice Activity Detection (VAD)
- Speaker embedding extraction
- Speaker clustering and diarization
- Speech-to-text transcription
- Output generation and formatting
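A hedged sketch of how those stages chain together. Every stage function here is an illustrative stub (reusing clusterSpeakers from the sketch above), not FluidAudio's actual API:

```swift
// Illustrative stubs only -- the real stages run Silero VAD, a speaker
// embedding model, and Parakeet TDT via CoreML.
func detectVoiceActivity(_ samples: [Float]) -> [Range<Int>] {
    samples.isEmpty ? [] : [0..<samples.count]   // stub: everything is speech
}

func extractEmbedding(_ samples: ArraySlice<Float>) -> [Float] {
    [samples.reduce(0, +)]                       // stub: real model returns a vector
}

func transcribe(_ samples: ArraySlice<Float>) -> String {
    "(transcript)"                               // stub: ASR inference goes here
}

// Chains the stages in the order listed above; clusterSpeakers is the
// threshold-based sketch shown earlier.
func runPipeline(samples: [Float], threshold: Float,
                 transcribeOnly: Bool) -> [(speaker: Int?, text: String)] {
    let voiced = detectVoiceActivity(samples)                // VAD
    var labels: [Int?] = Array(repeating: nil, count: voiced.count)
    if !transcribeOnly {
        let embeddings = voiced.map { extractEmbedding(samples[$0]) }
        labels = clusterSpeakers(embeddings: embeddings,     // diarization
                                 threshold: threshold).map { Optional($0) }
    }
    let texts = voiced.map { transcribe(samples[$0]) }       // ASR
    return zip(labels, texts).map { (speaker: $0.0, text: $0.1) }
}
```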
Built with:
- FluidAudio - Native Swift SDK for local audio AI
- Swift Argument Parser - Command line parsing
- Apple's CoreML and AVFoundation frameworks
This project is licensed under the MIT License. The FluidAudio library it depends on is licensed separately under Apache 2.0.