Commit 2eec5f9

Added live stream recognition functionality and a shared service API for voice recognition.

1 parent 9561d32

File tree

12 files changed: +840, -320 lines

.gitignore

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+# Python artifacts
+__pycache__/
+*.py[cod]
+*.pyo
+
+# Virtual environments
+.venv/
+venv/
+
+# Editor configs
+.idea/
+.vscode/
+
+# OS files
+.DS_Store
+
+# Audio/model outputs
+audio_files/
+test_environment/
+
+# Data/cache
+data/bst_data.pkl
+*.log

README.md

Lines changed: 84 additions & 2 deletions
@@ -29,8 +29,9 @@ The Speaker Recognition Engine supports several commands for managing speaker au
 
 1. **Enroll a Speaker**: Enroll a new speaker using an audio file.
 2. **Recognize a Speaker**: Identify a speaker from a given audio file.
-3. **List Enrolled Speakers**: Display a list of all enrolled speakers.
-4. **Delete a Speaker**: Remove a speaker's data from the system.
+3. **Recognize a Stream**: Feed audio chunks in near real-time and observe interim matches.
+4. **List Enrolled Speakers**: Display a list of all enrolled speakers.
+5. **Delete a Speaker**: Remove a speaker's data from the system.
 
 Each command can be executed from the command line with the appropriate arguments.
 The general syntax for using the tool is:
@@ -60,3 +61,84 @@ python cli.py enroll <speaker_name> <audio_file_path> [optional parameters]
 ```bash
 python cli.py enroll gena /home/gena/audio_files/gena.wav --sample_rate 16000 --num_filters 40 --num_ceps 13 --n_fft 512 --frame_size 0.025 --frame_step 0.01 --n_mixtures 8
 ```
+
+## Recognize a Speaker
+
+Run the `recognize` command with a WAV file. The CLI prints the best match and the log-likelihood scores reported by the shared `VoiceRecognitionService`.
+
+```bash
+python cli.py recognize /home/gena/audio_files/gena.wav --sample_rate 16000
+```
+
+## Recognize a Stream (Real-Time Simulation)
+
+The `recognize_stream` command reuses the same service façade but feeds the audio file in chunks (default 0.5 s). This mimics real-time capture and prints interim matches as soon as the likelihoods are high enough.
+
+```bash
+python cli.py recognize_stream /home/gena/audio_files/gena.wav --chunk_duration 0.25
+```
+
+## Live Microphone Demo
+
+Use `src/live_recognition.py` to capture audio from the default input device and route it directly through the streaming API. Ensure `sounddevice` sees your microphone, then run:
+
+```bash
+python src/live_recognition.py
+```
+
+Speak into the microphone; interim matches will appear as the engine accumulates enough audio. Press `Ctrl+C` to stop.
+
+## Embedding the Service API
+
+For tighter integration with other applications (e.g., the upcoming voice engine), import `VoiceRecognitionService` and the request/response models:
+
+```python
+from file_management.bst import BinarySearchTree
+from service.api import VoiceRecognitionService, EnrollmentRequest, EnrollmentConfig
+from service.audio_sources import BufferAudioSource
+
+bst = BinarySearchTree()
+service = VoiceRecognitionService(bst=bst, base_directory="test_environment")
+
+# Enroll using in-memory buffers
+req = EnrollmentRequest(
+    speaker_id="alice",
+    audio_source=BufferAudioSource(buffers=[pcm_chunk_1, pcm_chunk_2]),
+    config=EnrollmentConfig(sample_rate=16000),
+)
+service.enroll(req)
+```
+
+The same façade exposes `recognize`, `start_session`, `list_speakers`, and `delete_speaker`, allowing other repositories to depend on this module without invoking the CLI.
+
+## Recording a Test WAV on Raspberry Pi with Jabra Speak 410
+
+Use this workflow to capture a 16 kHz mono WAV file on the Raspberry Pi 5 connected to the Jabra speaker/mic. All commands assume the repository lives under `/home/gena/PROJECTS`.
+
+1. Set the Jabra device as the default PipeWire sink/source:
+   ```bash
+   ./roomba_stack/audio_jabra_default.sh
+   ```
+2. Confirm the capture device name (needed in the next step):
+   ```bash
+   pactl list short sources | grep -i jabra
+   ```
+   You should see something like `alsa_input.usb-0b0e_Jabra_SPEAK_410_USB_...-mono-fallback` running at 16 kHz.
+3. Make sure there is a place to store recordings:
+   ```bash
+   mkdir -p voice-recognition-engine/audio_files
+   ```
+4. Record a short sample (5–10 seconds) using the PipeWire/ALSA device discovered in step 2:
+   ```bash
+   parecord \
+     --device=alsa_input.usb-0b0e_Jabra_SPEAK_410_USB_50C2ED166881x011200-00.mono-fallback \
+     --rate=16000 --channels=1 --format=s16le \
+     voice-recognition-engine/audio_files/gmm_test.wav
+   ```
+   Speak while the command runs and press `Ctrl+C` when finished.
+5. Validate the recording before using it with the GMM engine:
+   ```bash
+   aplay voice-recognition-engine/audio_files/gmm_test.wav
+   ```
+
+The resulting `gmm_test.wav` resides in `voice-recognition-engine/audio_files/` and can be supplied to the CLI commands (e.g., `python src/cli.py recognize voice-recognition-engine/audio_files/gmm_test.wav --sample_rate 16000`).
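For completeness, the streaming session can also be driven directly from Python, without the CLI. The sketch below mirrors the `start_session`/`consume` loop used by `src/live_recognition.py` in this commit; treat it as a minimal example, not the `RecognizeStreamCommand` implementation (which lives in `service/commands.py` and is not part of this diff). The WAV path is illustrative.

```python
import librosa  # already pinned in requirements.txt

from file_management.bst import BinarySearchTree
from service.api import RecognitionConfig, VoiceRecognitionService

bst = BinarySearchTree()
service = VoiceRecognitionService(bst=bst, base_directory="test_environment")
session = service.start_session(config=RecognitionConfig(sample_rate=16000), threshold=None)

# Load mono float32 PCM and feed it in 0.25 s chunks, as recognize_stream does.
samples, sr = librosa.load("audio_files/gmm_test.wav", sr=16000, mono=True)
chunk = int(0.25 * sr)
for start in range(0, len(samples), chunk):
    result = session.consume(samples[start:start + chunk])
    if result and result.speaker_id and not result.rejected:
        print(f"Interim match: {result.speaker_id} (score {result.score:.2f})")

session.close()
bst.serialize_bst()  # persist BST state, as the CLI does after each command
```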

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ librosa==0.10.2.post1
 numpy==2.0.2
 scikit-learn==1.5.2
 matplotlib==3.9.2
+sounddevice==0.4.7
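Note that `sounddevice` binds to the system PortAudio library at runtime; on Debian-based systems such as Raspberry Pi OS it is typically provided by `sudo apt-get install libportaudio2`.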

src/cli.py

Lines changed: 65 additions & 46 deletions
@@ -1,15 +1,16 @@
 import argparse
 import os
 
+from file_management.bst import BinarySearchTree
+from service.api import EnrollmentConfig, RecognitionConfig, VoiceRecognitionService
 from service.commands import (
     EnrollSpeakerCommand,
     RecognizeSpeakerCommand,
+    RecognizeStreamCommand,
     ListSpeakersCommand,
     DeleteSpeakerCommand,
     CommandHandler
 )
-from file_management.bst import BinarySearchTree
-from file_management.file_management import FileManagementInterface
 
 def setup_environment(base_directory):
     # Ensure the base directory for models, audio files, and metadata exists
@@ -52,6 +53,19 @@ def main(command_line_args=None):
     recognize_parser.add_argument('--fft_size', type=int, default=512, help='FFT size for audio processing')
     recognize_parser.add_argument('--num_filters', type=int, default=26, help='Number of Mel filters')
     recognize_parser.add_argument('--num_ceps', type=int, default=13, help='Number of MFCC coefficients')
+    recognize_parser.add_argument('--score_threshold', type=float, default=None, help='Minimum log-likelihood to accept a speaker match')
+
+    # Streaming recognition command
+    recognize_stream_parser = subparsers.add_parser('recognize_stream', help='Stream audio chunks to recognize a speaker in near real-time')
+    recognize_stream_parser.add_argument('audio_file', type=str, help='Path to the audio file')
+    recognize_stream_parser.add_argument('--sample_rate', type=int, default=16000, help='Sample rate of the audio stream')
+    recognize_stream_parser.add_argument('--frame_size', type=float, default=0.025, help='Frame size in seconds')
+    recognize_stream_parser.add_argument('--frame_step', type=float, default=0.01, help='Frame step (overlap) in seconds')
+    recognize_stream_parser.add_argument('--fft_size', type=int, default=512, help='FFT size for audio processing')
+    recognize_stream_parser.add_argument('--num_filters', type=int, default=26, help='Number of Mel filters')
+    recognize_stream_parser.add_argument('--num_ceps', type=int, default=13, help='Number of MFCC coefficients')
+    recognize_stream_parser.add_argument('--score_threshold', type=float, default=None, help='Minimum log-likelihood to accept a speaker match')
+    recognize_stream_parser.add_argument('--chunk_duration', type=float, default=0.5, help='Duration (seconds) of each streamed chunk')
 
     # List Speakers Command
     subparsers.add_parser('list_speakers', help='List all enrolled speakers')
@@ -72,79 +86,84 @@ def main(command_line_args=None):
     # Ensure environment setup
     setup_environment(base_directory)
 
-    # Initialize Binary Search Tree
-    bst = BinarySearchTree()  # Placeholder for actual binary search tree implementation
+    # Initialize Binary Search Tree and shared service
+    bst = BinarySearchTree()
+    service = VoiceRecognitionService(bst=bst, base_directory=base_directory)
 
     # Process the command based on the parsed arguments
     if args.command == 'enroll':
-        command = EnrollSpeakerCommand(
-            speaker_name=args.speaker_name,
-            audio_file=args.audio_file,
-            bst=bst,
-            base_directory=base_directory,
+        enroll_config = EnrollmentConfig(
             sample_rate=args.sample_rate,
             num_filters=args.num_filters,
             num_ceps=args.num_ceps,
-            n_fft=args.n_fft,
+            fft_size=args.n_fft,
             frame_size=args.frame_size,
             frame_step=args.frame_step,
-            n_mixtures=args.n_mixtures
+            mixtures=args.n_mixtures,
+        )
+        command = EnrollSpeakerCommand(
+            service=service,
+            speaker_name=args.speaker_name,
+            audio_file=args.audio_file,
+            config=enroll_config,
         )
         handler.run(command)
-
-        # Serialize the BST before exiting the program
-        bst.serialize_bst()
 
     elif args.command == 'recognize':
+        recognize_config = RecognitionConfig(
+            sample_rate=args.sample_rate,
+            frame_size=args.frame_size,
+            frame_step=args.frame_step,
+            fft_size=args.fft_size,
+            num_filters=args.num_filters,
+            num_ceps=args.num_ceps,
+        )
         command = RecognizeSpeakerCommand(
-            bst=bst,
+            service=service,
             audio_file=args.audio_file,
-            base_directory=base_directory,
+            config=recognize_config,
+            score_threshold=args.score_threshold
+        )
+        handler.run(command)
+
+    elif args.command == 'recognize_stream':
+        recognize_config = RecognitionConfig(
             sample_rate=args.sample_rate,
             frame_size=args.frame_size,
             frame_step=args.frame_step,
             fft_size=args.fft_size,
             num_filters=args.num_filters,
-            num_ceps=args.num_ceps
+            num_ceps=args.num_ceps,
+        )
+        command = RecognizeStreamCommand(
+            service=service,
+            audio_file=args.audio_file,
+            config=recognize_config,
+            score_threshold=args.score_threshold,
+            chunk_duration=args.chunk_duration,
        )
         handler.run(command)
 
     elif args.command == 'list_speakers':
-        file_management = FileManagementInterface(bst=bst, base_directory=base_directory)
-        command = ListSpeakersCommand(file_management)
+        command = ListSpeakersCommand(service=service)
         handler.run(command)
 
     elif args.command == 'delete_speaker':
-        file_management = FileManagementInterface(bst=bst, base_directory=base_directory)
-        command = DeleteSpeakerCommand(args.speaker_name, file_management)
+        command = DeleteSpeakerCommand(service=service, speaker_name=args.speaker_name)
         handler.run(command)
 
     else:
         parser.print_help()
 
+    # Persist BST state after command execution
+    bst.serialize_bst()
+
 if __name__ == "__main__":
-    #debug_args = [
-    #    'enroll',
-    #    'maria',
-    #    '/home/gena/PROJECTS/voice-recognition-engine/audio_files/maria.wav',
-    #    '--sample_rate', '16000',
-    #    '--num_filters', '40',
-    #    '--num_ceps', '13',
-    #    '--n_fft', '512',
-    #    '--frame_size', '0.025',
-    #    '--frame_step', '0.01',
-    #    '--n_mixtures', '8'
-    #]
-
-    debug_args = [
-        'recognize',
-        '/home/gena/PROJECTS/voice-recognition-engine/audio_files/leah_recognize.wav',
-        '--sample_rate', '16000',
-        '--frame_size', '0.025',
-        '--frame_step', '0.01',
-        '--fft_size', '512',
-        '--num_filters', '40',
-        '--num_ceps', '13',
-    ]
-
-    main(debug_args)
+    # To run with ad-hoc arguments during development, pass them explicitly, e.g.:
+    # debug_args = [
+    #     'recognize',
+    #     '/path/to/audio.wav',
+    #     '--sample_rate', '16000',
+    # ]
+    # main(debug_args)
+    main()
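Since `main` accepts an explicit `command_line_args` list (as the removed debug block relied on), the new subcommand can be exercised from a test or a notebook without touching `sys.argv`. A small usage sketch, assuming it runs from `src/` so `cli` is importable; the WAV path is illustrative:

```python
from cli import main

# Equivalent to the README's shell invocation of recognize_stream.
main([
    "recognize_stream",
    "audio_files/gmm_test.wav",
    "--sample_rate", "16000",
    "--chunk_duration", "0.25",
])
```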

src/live_recognition.py

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+"""Simple live recognition runner using the shared service façade."""
+from __future__ import annotations
+
+import queue
+import sys
+from typing import Optional
+
+import numpy as np
+import sounddevice as sd
+
+from file_management.bst import BinarySearchTree
+from service.api import RecognitionConfig, VoiceRecognitionService
+
+
+def run_live_recognition(
+    base_directory: str = "test_environment",
+    sample_rate: int = 16000,
+    chunk_duration: float = 0.25,
+    threshold: Optional[float] = None,
+):
+    """Capture microphone audio and stream it to the recognition service."""
+
+    bst = BinarySearchTree()
+    service = VoiceRecognitionService(bst=bst, base_directory=base_directory)
+    config = RecognitionConfig(sample_rate=sample_rate)
+    session = service.start_session(config=config, threshold=threshold)
+
+    chunk_samples = max(1, int(chunk_duration * sample_rate))
+    audio_queue: "queue.Queue[np.ndarray]" = queue.Queue()
+
+    def _callback(indata, frames, time, status):  # pylint: disable=unused-argument
+        if status:
+            print(status, file=sys.stderr)
+        audio_queue.put(indata.copy().reshape(-1))
+
+    print("Listening... Press Ctrl+C to stop.")
+    try:
+        with sd.InputStream(
+            samplerate=sample_rate,
+            channels=1,
+            blocksize=chunk_samples,
+            dtype="float32",
+            callback=_callback,
+        ):
+            latest = None
+            while True:
+                chunk = audio_queue.get()
+                result = session.consume(chunk)
+                if not result:
+                    continue
+                latest = result
+                if result.speaker_id and not result.rejected:
+                    print(f"Recognized {result.speaker_id} (score {result.score:.2f})")
+
+    except KeyboardInterrupt:
+        print("Stopping live recognition...")
+    finally:
+        session.close()
+        bst.serialize_bst()
+
+
+if __name__ == "__main__":
+    run_live_recognition()
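If the default input device is not the microphone you want, `sounddevice` can list and override devices before `run_live_recognition()` is called. A minimal sketch using the standard `sounddevice` defaults API; the device index below is machine-specific and purely illustrative:

```python
import sounddevice as sd

# Inspect available devices; capture-capable entries show max_input_channels > 0.
print(sd.query_devices())

# Route input to device 1 (illustrative index); None keeps the default output.
sd.default.device = (1, None)
sd.default.samplerate = 16000
```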
