voice2json is a collection of command-line tools for offline speech/intent recognition on Linux. It is free, open source (MIT), and supports 18 human languages.
From the command-line:
$ voice2json -p en transcribe-wav \
< turn-on-the-light.wav | \
voice2json -p en recognize-intent | \
jq .
produces a JSON event like:
{
  "text": "turn on the light",
  "intent": {
    "name": "LightState"
  },
  "slots": {
    "state": "on"
  }
}
when trained with this template:
[LightState]
states = (on | off)
turn (<states>){state} [the] light
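An application that consumes this event would typically dispatch on the intent name and read the slot values. The following is a minimal sketch of that idea, assuming the event arrives as a single line of JSON shaped like the example above. The `set_light` handler and the `HANDLERS` table are hypothetical names used for illustration, not part of voice2json.

```python
import json

# Example event, copied from the JSON shown above.
EVENT_JSON = (
    '{"text": "turn on the light", '
    '"intent": {"name": "LightState"}, '
    '"slots": {"state": "on"}}'
)


def set_light(state: str) -> str:
    """Hypothetical handler: a real application would switch a device here."""
    return f"light is now {state}"


# Map intent names to handlers (illustrative; one entry per intent you define).
HANDLERS = {
    "LightState": lambda slots: set_light(slots["state"]),
}


def handle_event(line: str) -> str:
    """Parse one JSON event and call the handler for its intent, if any."""
    event = json.loads(line)
    name = event.get("intent", {}).get("name", "")
    handler = HANDLERS.get(name)
    if handler is None:
        return f"unhandled intent: {name!r}"
    return handler(event.get("slots", {}))


print(handle_event(EVENT_JSON))
```

Keeping the handlers in a table means a new intent in the template only needs one new entry on the application side.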
voice2json is optimized for:
- Sets of voice commands that are described well by a grammar
- Commands with uncommon words or pronunciations
- Commands or intents that can vary at runtime
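To make "described well by a grammar" concrete: the LightState template shown earlier describes a small, finite set of sentences. The sketch below enumerates them by hand in Python. It does not parse the template and is not how voice2json works internally; the alternatives and the optional word are simply written out as lists for illustration.

```python
from itertools import product

# Hand-written expansion of: turn (<states>){state} [the] light
STATES = ["on", "off"]       # states = (on | off)
OPTIONAL_THE = ["the", ""]   # [the] may be present or absent


def expand_light_state():
    """Return every (sentence, slots) pair the LightState template describes."""
    sentences = []
    for state, the in product(STATES, OPTIONAL_THE):
        words = ["turn", state, the, "light"]
        text = " ".join(word for word in words if word)
        sentences.append((text, {"state": state}))
    return sentences


for text, slots in expand_light_state():
    print(text, slots)
```

Each sentence carries the slot value it was generated with, which is the same pairing that the JSON event reports after recognition.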
It can be used to:
- Add voice commands to existing applications or Unix-style workflows
- Provide basic voice assistant functionality completely offline on modest hardware
- Bootstrap more sophisticated speech/intent recognition systems
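As one way to add voice commands to an existing application, the command-line pipeline shown at the top can be driven from Python with `subprocess`. This is a sketch under assumptions: it assumes `voice2json` is on the `PATH`, that the profile is already trained, and that each command prints one JSON line per input. The helper names `voice2json_cmd` and `recognize_wav` are hypothetical.

```python
import json
import subprocess
from typing import List


def voice2json_cmd(profile: str, command: str, *args: str) -> List[str]:
    """Build the argument list for one voice2json sub-command."""
    return ["voice2json", "-p", profile, command, *args]


def recognize_wav(wav_path: str, profile: str = "en") -> dict:
    """Run transcribe-wav, then recognize-intent, and return the parsed event."""
    # Feed the WAV file on stdin, as the shell example does with "<".
    with open(wav_path, "rb") as wav_file:
        transcription = subprocess.run(
            voice2json_cmd(profile, "transcribe-wav"),
            stdin=wav_file,
            stdout=subprocess.PIPE,
            check=True,
        ).stdout

    # Pipe the transcription JSON into recognize-intent.
    intent_output = subprocess.run(
        voice2json_cmd(profile, "recognize-intent"),
        input=transcription,
        stdout=subprocess.PIPE,
        check=True,
    ).stdout

    # Assumption: the first non-empty output line is the event for this file.
    lines = [line for line in intent_output.splitlines() if line.strip()]
    return json.loads(lines[0])
```

A call such as `recognize_wav("turn-on-the-light.wav")` would then be expected to return a dictionary like the JSON event shown earlier.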
Supported speech to text systems include:
- CMU's pocketsphinx
- Dan Povey's Kaldi
- Mozilla's DeepSpeech 0.9
- Kyoto University's Julius
Supported languages:
- Catalan (ca)
- Czech (cs)
- German (de)
- Greek (el)
- English (en)
- Spanish (es)
- French (fr)
- Hindi (hi)
- Italian (it)
- Korean (ko)
- Kazakh (kz)
- Dutch (nl)
  - nl_kaldi-cgn (default)
  - nl_kaldi-rhasspy
  - nl_pocketsphinx-cmu
- Polish (pl)
  - pl_deepspeech-jaco (default)
  - pl_julius-github
- Portuguese (pt)
- Russian (ru)
  - ru_kaldi-rhasspy (default)
  - ru_pocketsphinx-cmu
- Swedish (sv)
  - sv_kaldi-montreal
  - sv_kaldi-rhasspy (default)
- Vietnamese (vi)
- Mandarin (zh)
voice2json is more than just a wrapper around open source speech to text systems!
- Training produces both a speech and intent recognizer. By describing your voice commands with voice2json's templating language, you get more than just transcriptions for free.
- Re-training is fast enough to be done at runtime (usually < 5s), even up to millions of possible voice commands. This means you can change referenced slot values or add/remove intents on the fly.
- All of the available commands are designed to work well in Unix pipelines, typically consuming/emitting plaintext or newline-delimited JSON. Audio input/output is file-based, so you can receive audio from any source.
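Because the commands emit newline-delimited JSON, a downstream program can process events one line at a time. The sketch below shows one way to do that in Python. The sample lines are made up for illustration and only use fields that appear in the example event earlier on this page; in a real pipeline the reader would be `sys.stdin` rather than an in-memory buffer.

```python
import io
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        yield json.loads(line)


# Illustrative input; replace SAMPLE with sys.stdin to read from a pipe.
SAMPLE = io.StringIO(
    '{"text": "turn on the light", "intent": {"name": "LightState"}, "slots": {"state": "on"}}\n'
    "\n"
    '{"text": "turn off the light", "intent": {"name": "LightState"}, "slots": {"state": "off"}}\n'
)

for event in iter_events(SAMPLE):
    print(event["intent"]["name"], event["slots"])
```

Reading line by line means events are handled as they arrive, without waiting for the upstream command to exit.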
Available commands:
- download-profile - Download missing files for a profile
- train-profile - Generate speech/intent artifacts
- transcribe-wav - Transcribe WAV file to text
  - Add --open for unrestricted speech to text
- transcribe-stream - Transcribe live audio stream to text
  - Add --open for unrestricted speech to text
- recognize-intent - Recognize intent from JSON or text
- wait-wake - Listen to live audio stream for wake word
- record-command - Record voice command from live audio stream
- pronounce-word - Look up or guess how a word is pronounced
- generate-examples - Generate random intents
- record-examples - Generate and record speech examples
- test-examples - Test recorded speech examples
- show-documentation - Run HTTP server locally with documentation
- print-profile - Print profile settings
- print-downloads - Print profile file download information
- print-files - Print user profile files for backup