Low-end speech-to-text running on $3 of hardware (Bluepill board + electret mic)
Most modern speech recognition systems use multi-gigabyte 'deep learning' models running on high-end, multi-core, multi-gigahertz processors.
This is fine. But I wanted to see what was possible at the other end of the scale, using low-end embedded boards that make the Raspberry Pi look like a mainframe. So this project demonstrates a 'voice keyboard': it emulates a standard USB keyboard, but the keystrokes are generated by speaking letters into a microphone, with simple lightweight code running on the cheapest hardware available.
The specs of the system create severe limitations. We have 20K RAM, 64K flash storage for all the code and reference tables, no wifi or bluetooth, and no floating-point hardware, so (emulated) FP math has to be kept to a minimum for speed.
We sample up to 1 second of speech at a time, stored as 8-bit values at 8000 samples/sec. This is done with an analog read of pin A1 on the Bluepill, which the output of the mic feeds into. Samples have a maximum of 12-bit resolution and are pared down to a signed 8-bit delta for storage. Old-school speech processing (LPC10) combined with a narrow 32/17 FFT is then used to generate feature data from which phonemes are estimated.
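To make the storage step concrete: one second at 8000 samples/sec with one byte per sample is 8000 bytes, a large slice of the 20K RAM. The sketch below shows one way a 12-bit reading could be pared down to a signed 8-bit delta. It is an illustration under my own assumptions, not the project's actual code: the function names, the mid-scale starting value of 2048 and the choice to clamp large jumps are all hypothetical, and it is written in Python for readability rather than as integer code for the board.

```python
def delta_encode(samples, start=2048):
    """Encode 12-bit ADC readings (0..4095) as clamped signed 8-bit deltas.

    Assumption: a jump too large for 8 bits is clamped, and the encoder
    tracks the value a decoder would reconstruct so errors do not pile up.
    """
    deltas = []
    predicted = start  # what the decoder believes the previous sample was
    for s in samples:
        d = max(-128, min(127, s - predicted))  # clamp to the int8 range
        deltas.append(d)
        predicted += d  # follow the decoder's reconstruction, not the raw input
    return deltas


def delta_decode(deltas, start=2048):
    """Rebuild approximate 12-bit samples by summing the stored deltas."""
    out = []
    value = start
    for d in deltas:
        value += d
        out.append(value)
    return out


if __name__ == "__main__":
    readings = [2048, 2060, 2100, 2500, 2400]
    stored = delta_encode(readings)
    print(stored)                # one byte per sample
    print(delta_decode(stored))  # lags behind the jump to 2500, then catches up
```

Slowly varying audio survives the round trip exactly; fast, loud transients get slew-limited, which is one plausible source of the low quality of the captured audio.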
The sequence of phonemes is then mapped to keystrokes, which are sent to the host as if typed on a USB keyboard.
The code has 2 modes. In 'dev mode', the feature data for each sample is dumped out to the serial monitor to allow development of the lookup tables; an example of this output is shown in 'alphabet.txt'. In 'run mode', the system acts as a keyboard when plugged in and generates the relevant keystrokes.
Electret or MEMS analog mic - the one I'm using is MAX9812L-based, but a MAX4466 should be OK. Search ebay/aliexpress for 'electret microphone' or 'mems microphone'. Boards can be found for around $1.50.
Bluepill board - search ebay/aliexpress for 'stm32f103c8t6'. Price is generally $1.50 to $2.
Google 'bluepill specs' for details. The processor on these boards is ARM Cortex M3-based. It has 64K flash for program storage and 20K RAM. Hardware integer multiply and divide are available, but there is no FPU, so floating-point use needs to be minimal for reasonable speed.
It needs to have the Arduino bootloader installed to allow easy programming via the Arduino environment: https://www.onetransistor.eu/2017/11/stm32-bluepill-arduino-ide.html
The audio processing algorithm uses a 32/17 FFT of a 10th-order LPC filter of Hann-windowed 4000Hz samples as feature data. FFT bins are log-power quantized to 4 bits. The bottom and top FFT bins are ignored.
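The feature pipeline described above can be sketched on a desktop roughly as follows. This is a floating-point Python illustration, not the device code, and it leans on several assumptions of mine: that '32/17' means a 32-point transform keeping bins 0..16, that the envelope is taken as the power of 1/A for the LPC polynomial A, that a frame is 256 samples long, and that the 4-bit quantization is scaled between the frame's own minimum and maximum log power. All function names are hypothetical.

```python
import math

ORDER = 10      # LPC order, from the text
FFT_SIZE = 32   # assumed reading of "32/17": 32 points in, bins 0..16 out


def hann_window(frame):
    n = len(frame)
    return [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def autocorrelation(x, max_lag):
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for LPC coefficients a[0..order] with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:
            break  # numerically exhausted; keep the coefficients found so far
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def lpc_spectrum(a):
    """Power of 1/A at bins 0..16 of a 32-point DFT of the coefficients."""
    powers = []
    for k in range(FFT_SIZE // 2 + 1):
        re = sum(c * math.cos(2 * math.pi * k * n / FFT_SIZE) for n, c in enumerate(a))
        im = -sum(c * math.sin(2 * math.pi * k * n / FFT_SIZE) for n, c in enumerate(a))
        powers.append(1.0 / (re * re + im * im + 1e-12))
    return powers


def quantize_4bit(values):
    """Map values linearly onto 0..15 between the frame's min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [min(15, int((v - lo) * 16 / (hi - lo))) for v in values]


def frame_features(frame):
    """Return 15 four-bit values: bins 1..15, bottom and top bins dropped."""
    r = autocorrelation(hann_window(frame), ORDER)
    if r[0] == 0:
        return [0] * (FFT_SIZE // 2 - 1)  # silent frame
    powers = lpc_spectrum(levinson_durbin(r, ORDER))[1:-1]
    return quantize_4bit([math.log2(p) for p in powers])


if __name__ == "__main__":
    # A 1000 Hz test tone sampled at 8000 samples/sec.
    tone = [round(100 * math.sin(2 * math.pi * 1000 * n / 8000)) for n in range(256)]
    print(frame_features(tone))
```

Dropping the bottom and top bins of a 17-bin spectrum leaves 15 values per frame, which lines up with the 15-vector codebook. On the board itself this would need to be reworked with integer or fixed-point math, given the lack of an FPU.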
A basic min-max 15-vector FFT-to-phoneme codebook is used, followed by a first-2-phonemes-to-keystroke codebook.
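As a rough illustration of the two lookup stages, here is a small Python sketch. Everything in the tables is made up for the example: the phoneme labels, the bin ranges and the single keystroke entry are hypothetical and are not taken from the project's hand-tweaked tables. It also assumes that 'min-max' means each phoneme stores an allowed (min, max) range per bin and that a frame matches the first phoneme whose ranges all contain the frame's values.

```python
def ranges(overrides):
    """Build 15 wide-open (0, 15) ranges, narrowed at the given bin indices."""
    entry = [(0, 15)] * 15
    for idx, lo_hi in overrides.items():
        entry[idx] = lo_hi
    return entry


# Hypothetical phoneme codebook: label -> 15 (min, max) pairs of 4-bit values.
PHONEME_CODEBOOK = {
    "ee": ranges({0: (10, 15), 8: (8, 15)}),
    "eh": ranges({1: (10, 15), 6: (8, 15)}),
    "f": ranges({0: (0, 5), 12: (9, 15)}),
}

# Hypothetical keystroke codebook: first two distinct phonemes -> key.
KEYSTROKE_CODEBOOK = {
    ("eh", "f"): "f",
}


def classify_frame(features):
    """Return the first phoneme whose ranges contain every bin, else None."""
    for label, entry in PHONEME_CODEBOOK.items():
        if all(lo <= v <= hi for v, (lo, hi) in zip(features, entry)):
            return label
    return None


def phonemes_to_key(frame_labels):
    """Collapse repeated labels, then look up the first two distinct phonemes."""
    distinct = []
    for label in frame_labels:
        if label is not None and (not distinct or distinct[-1] != label):
            distinct.append(label)
    return KEYSTROKE_CODEBOOK.get(tuple(distinct[:2]))


if __name__ == "__main__":
    labels = ["eh", "eh", None, "f", "f"]  # per-frame results for a spoken "F"
    print(phonemes_to_key(labels))
```

A range test like this needs only integer compares per bin, which suits a processor without an FPU, and the tables stay small enough to live in flash.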
Demo at:
https://www.youtube.com/watch?v=BnJ6KpWVH4o
It kind-of sort-of works; more time spent on refining the hand-tweaked lookup tables would help.
The limitations on storing the audio samples mean the audio being captured is less than ideal for processing, being low resolution and noisy.
This is more a proof-of-concept than anything and is not meant to be a practical device. But it was something to mess about with during the lock-down.