A phased-parallel voice assistant system with multiple microservices communicating via gRPC.
The system consists of 7 processes running on localhost:
- Main: Bootstrap process
- Loader (port 5002): Orchestrator for phased-parallel startup
- Logger (port 5001): Centralized logging service
- KWD (port 5003): Keyword detection (wake word)
- STT (port 5004): Speech-to-text
- LLM (port 5005): Language model (Ollama)
- TTS (port 5006): Text-to-speech
Logger Service (port 5001)
- Application and dialog logging
- Log rotation support
- gRPC RPCs: WriteApp, NewDialog, WriteDialog
- Health check implementation
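For illustration, a `WriteApp` call might look like the sketch below. The stub module names follow from `proto/services.proto`, but the servicer name (`LoggerStub`) and request fields are assumptions; only the RPC names are documented here.

```python
# Hypothetical Logger client. LoggerStub and the request message/fields
# are assumptions; only the RPC names (WriteApp, NewDialog, WriteDialog)
# are documented.
import grpc
import services_pb2
import services_pb2_grpc

with grpc.insecure_channel("127.0.0.1:5001") as channel:
    stub = services_pb2_grpc.LoggerStub(channel)
    stub.WriteApp(services_pb2.AppLogRequest(  # assumed message type
        level="INFO",
        message="client connected",
    ))
```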
KWD Service (port 5003)
- OpenWakeWord integration with "Alexa" wake word
- 0.6 confidence threshold, 1s cooldown
- Real-time audio processing at 16kHz
- gRPC RPCs: Events (stream), Enable, Disable
- Health check implementation
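Consuming the server-streamed `Events` RPC could look like this sketch; again, the stub and message names beyond the documented RPC names (`Events`, `Enable`, `Disable`) are assumptions.

```python
# Hypothetical KWD event consumer. KwdStub, Empty, and the event's
# confidence field are assumptions beyond the documented RPC names.
import grpc
import services_pb2
import services_pb2_grpc

with grpc.insecure_channel("127.0.0.1:5003") as channel:
    stub = services_pb2_grpc.KwdStub(channel)
    stub.Enable(services_pb2.Empty())
    # Events is a server stream: block and print each detection.
    for event in stub.Events(services_pb2.Empty()):
        print(f"wake word detected (confidence={event.confidence:.2f})")
```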
STT Service (port 5004)
- Whisper integration (small.en model)
- WebRTC VAD for automatic finalization (~2s silence; sketched after this list)
- CUDA acceleration for fast transcription
- gRPC RPCs: Start, Stop, Results (stream)
- Multi-session support for concurrent dialogs
- LLM Service (port 5005): Ollama bridge
- TTS Service (port 5006): Kokoro integration
- Loader Service (port 5002): Phased orchestration
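The ~2s silence finalization mentioned under STT can be pictured with the webrtcvad package: classify fixed-size audio frames and finalize once about 66 consecutive 30ms frames contain no speech. This is a sketch of the idea; the service's real thresholds and logic may differ.

```python
# Sketch of VAD-driven finalization, assuming 16kHz mono 16-bit PCM in
# 30 ms frames (480 samples / 960 bytes each); the STT service's actual
# logic may differ.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
SILENCE_FRAMES = 2000 // FRAME_MS   # ~2 s of consecutive non-speech

def utterance_ended(frames, aggressiveness=2):
    """Return True once ~2 s of uninterrupted silence is observed."""
    vad = webrtcvad.Vad(aggressiveness)
    silent = 0
    for frame in frames:
        silent = 0 if vad.is_speech(frame, SAMPLE_RATE) else silent + 1
        if silent >= SILENCE_FRAMES:
            return True
    return False
```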
Requirements:
- Python 3.11
- CUDA-capable GPU with 8GB+ VRAM
- PortAudio (for audio capture)
- Ollama (for LLM)
```bash
# Create virtual environment with uv
uv venv
source .venv/bin/activate
# Install dependencies
uv pip install grpcio grpcio-tools grpcio-health-checking
uv pip install aiofiles pyyaml nvidia-ml-py3 psutil
uv pip install sounddevice numpy scipy
uv pip install onnxruntime openwakeword tqdm
```

Configuration is in `config/config.ini`:
- VRAM guardrail: 8000MB minimum
- Ports: 5001-5006 (localhost only)
- Wake word: "Alexa" (threshold 0.6)
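These values can be read with Python's standard configparser. The section and key names below are assumptions for illustration; the real schema is whatever `config/config.ini` and `common/config_loader.py` define.

```python
# Illustrative read of config/config.ini; the [kwd] / [gpu] sections
# and key names are assumptions, not the project's actual schema.
import configparser

config = configparser.ConfigParser()
config.read("config/config.ini")

kwd_port = config.getint("kwd", "port", fallback=5003)
wake_threshold = config.getfloat("kwd", "threshold", fallback=0.6)
min_vram_mb = config.getint("gpu", "min_vram_mb", fallback=8000)
print(kwd_port, wake_threshold, min_vram_mb)
```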
Services are managed with `manage_services.py`:

```bash
# Check status of all services
python manage_services.py status
# Start individual service
python manage_services.py start logger
python manage_services.py start kwd
# Stop service
python manage_services.py stop kwd
# Restart service
python manage_services.py restart logger
# Start all services
python manage_services.py start all
# Stop all services
python manage_services.py stop all
```

Test the Logger service:

```bash
# Start logger
python manage_services.py start logger
# View logs
cat logs/app.log
```

Test wake word detection:

```bash
# Start KWD service
python manage_services.py start kwd
# Run test client
python tests/test_kwd.py
# Say "Alexa" to trigger wake word detection# Start STT service (and logger)
python manage_services.py start logger
python manage_services.py start stt
# Run test client
python tests/test_stt.py
# Speak after the prompt - recognition finalizes after 2s of silence
# Test continuous recognition
python tests/test_stt.py --continuous
```

Each service writes to its own log file (e.g. `logger_service.log`, `kwd_service.log`):
- Application logs: `logs/app.log`
- Dialog logs: `logs/dialog_*.log`
```
Alexa_W/
├── services/              # Service implementations
│   ├── logger/
│   ├── kwd/
│   ├── stt/
│   ├── llm/
│   ├── tts/
│   └── loader/
├── common/                # Shared modules
│   ├── base_service.py
│   ├── config_loader.py
│   ├── health_client.py
│   └── gpu_monitor.py
├── proto/                 # gRPC definitions
│   ├── services.proto
│   └── generated files
├── config/                # Configuration
│   ├── config.ini
│   └── Modelfile
├── models/                # ML models
│   └── alexa_v0.1.onnx
├── logs/                  # Log files
└── tests/                 # Test scripts
```
Adding a new service:
- Create a service directory under `services/`
- Inherit from the `BaseService` class
- Implement service-specific RPCs
- Add the service to `manage_services.py`
- Test with health checks (a standalone probe is sketched after this list)
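The services advertise health checks and `grpcio-health-checking` is in the dependency list, so presumably they speak the standard gRPC health protocol. A standalone probe using the stock stubs might look like this; the port (5003 = KWD) is the only project-specific assumption.

```python
# Standalone gRPC health probe using the stock health-checking stubs;
# the target port (5003 = KWD) is the only project-specific assumption.
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

with grpc.insecure_channel("127.0.0.1:5003") as channel:
    stub = health_pb2_grpc.HealthStub(channel)
    response = stub.Check(health_pb2.HealthCheckRequest(service=""))
    # Prints SERVING, NOT_SERVING, or UNKNOWN.
    print(health_pb2.HealthCheckResponse.ServingStatus.Name(response.status))
```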
If a port is already in use:

```bash
# Find process using port
lsof -i :5001
# Kill process
kill -9 <PID>
```

Audio issues:
- Ensure microphone permissions are granted
- Check the audio device with `python -m sounddevice`
- Verify sample rate compatibility (16kHz required)

GPU issues:
- Check the GPU with `nvidia-smi`
- Ensure 8GB+ VRAM is available
- Monitor usage with `common/gpu_monitor.py` (a standalone check is sketched below)
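A minimal VRAM probe using `nvidia-ml-py3` (already in the dependency list), assuming the first GPU (index 0); this mirrors the 8000MB guardrail idea and is not the actual `common/gpu_monitor.py` code.

```python
# Minimal VRAM check with nvidia-ml-py3 (pynvml); mirrors the 8000 MB
# guardrail idea, not the project's actual gpu_monitor.py logic.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU assumed
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)   # sizes in bytes
    free_mb = mem.free // (1024 * 1024)
    if free_mb < 8000:
        raise RuntimeError(f"Only {free_mb} MB VRAM free; need 8000+")
    print(f"VRAM OK: {free_mb} MB free")
finally:
    pynvml.nvmlShutdown()
```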
Performance targets:
- Wake detection latency: <200ms
- First token latency (LLM): <800ms
- First audio latency (TTS): <150ms
- Dialog follow-up window: 4s
Security:
- All services bind to localhost (127.0.0.1) only
- No external network calls
- Config validation on startup
- VRAM guardrails enforced
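As a minimal sketch of the loopback-only binding (not the project's actual server bootstrap), a gRPC server restricted to 127.0.0.1 looks like this:

```python
# Sketch: bind a gRPC server to the loopback interface only, so it is
# unreachable from other hosts; servicer registration is omitted.
from concurrent import futures

import grpc

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
server.add_insecure_port("127.0.0.1:5001")  # localhost only, not 0.0.0.0
server.start()
server.wait_for_termination()
```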