
Allow singleUtterance: false #63

Open

rhclayto opened this issue Sep 24, 2017 · 5 comments

@rhclayto

rhclayto commented Sep 24, 2017

Setting singleUtterance: true in recognizer.createRecognizeStream leaves it to Google to detect when an utterance has stopped & to end the recognition stream. A problem I encountered is that Google is often too eager to stop detection, i.e., it treats slight pauses in speech as the end of an utterance. This is good for detecting short command-type utterances, such as 'turn off light', but for longer free-form dictation it's problematic.
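
For context, singleUtterance is a flag on Google's streaming request itself, not something Sonus adds. A minimal sketch of the underlying call (mirroring the streamingRecognize shape quoted later in this thread; 'recognizer' is the Cloud Speech client):

// singleUtterance controls who decides when the utterance is over:
// true  - Google ends the stream at the first detected end of utterance.
// false - the stream stays open until the client tears it down.
const recognitionStream = recognizer.streamingRecognize({
  config: {
    encoding: 'LINEAR16',
    sampleRateHertz: 16000,
    languageCode: 'en-US'
  },
  singleUtterance: false, // keep streaming across pauses for long dictation
  interimResults: true
})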

Adding an option to Sonus to make singleUtterance configurable would give developers the flexibility to implement their own end-of-utterance detection & stream teardown. So I guess this is a feature request.

But I also wanted to post this here because I made some changes to my fork of Sonus to allow this, & thought I'd share them, even though they have some stuff specific to my use case hard-coded in & are therefore not pull-request-worthy. But they might contain the seed of something that can be put into Sonus, if desired.

CloudSpeechRecognizer.startStreaming = (options, audioStream, cloudSpeechRecognizer) => {
  // . . .
  const recognitionStream = recognizer.createRecognizeStream({
    // . . .
    singleUtterance: options.noSingleUtterance ? false : true,
    // . . .
  })
  // . . .
  // Add recognitionStream to the cloudSpeechRecognizer object so it can be shut down from
  // Sonus if the noSingleUtterance option has been passed in.
  if (options.noSingleUtterance) cloudSpeechRecognizer.recognitionStream = recognitionStream
  // . . .
}

Sonus.init = (options, recognizer) => {
  // . . .
  sonus.trigger = (index, hotword) => {
    // . . .
    let triggerHotword = (index == 0) ? hotword : models.lookup(index)
    // If the trigger hotword is 'FreeDescription', set an option to send to CloudSpeechRecognizer
    // so that it will start a stream with singleUtterance: false, & we'll handle our own silence
    // detection & recognitionStream teardown.
    if (triggerHotword === 'FreeDescription') opts.noSingleUtterance = true
    // . . .
  }
  // . . .
  // Add a sonus.on listener to receive events from an instantiated sonus in order to shut down
  // a recognitionStream.
  sonus.on('recognitionStreamShutdown', function() {
    // recognitionStream will be made available on the csr object, so we can shut it down as follows:
    if (csr.listening && csr.recognitionStream) {
      csr.listening = false
      sonus.mic.unpipe(csr.recognitionStream)
      csr.recognitionStream.end()
      delete csr.recognitionStream
    }
  })
}

These changes make it possible to set singleUtterance to true or false based on the hotword used to trigger Sonus. They also make it possible to stop the recognition stream by emitting a recognitionStreamShutdown event from anywhere in an app. So deciding when an utterance is over & when to stop the recognition stream with Google is now in the developer's hands. Developers will still need to handle the fact that Google issues results with isFinal set whenever it thinks the utterance is over, so if you want your transcription to span the length of your configured timeout period, you'll need to concatenate the various isFinal results that Google issues.

As an example, I used the silence & sound events emitted by Sonus (as determined by the Snowboy detector), along with a setTimeout, to determine when an utterance had ended (i.e., after a certain amount of sustained silence, the utterance is determined to be over). Here's the code I used:

// Set Snowboy hotwords.
var hotwords = [
  {file: '/home/benja/app/ListenRobot.pmdl', hotword: 'ListenRobot', sensitivity: '0.5'},
  {file: '/home/benja/app/FreeDescription.pmdl', hotword: 'FreeDescription', sensitivity: '0.5'}
];
// Create an instance of Sonus. Note the option name Sonus reads is 'language'.
var language = 'en-US';
var sonus = Sonus.init({hotwords, language, recordProgram: 'arecord', device: 'bluealsa:HCI=hci0,DEV=00:6A:8E:16:C5:F2,PROFILE=sco'}, speech);
// Start the Sonus instance.
Sonus.start(sonus);
var silenceTimeout = null, bufferSonus = false, sonusBuffer = '';
// Event: When Snowboy informs us a hotword has been detected.
sonus.on('hotword', function(index, keyword) {
  console.log('!');
  if (keyword === 'FreeDescription') bufferSonus = true;
});
// Event: When Snowboy detects silence.
sonus.on('silence', function() {
  // After an appropriate period of uninterrupted silence, send an event to sonus
  // to tear down the recognitionStream, but only while streaming from a
  // 'FreeDescription' trigger hotword.
  if (!silenceTimeout) silenceTimeout = setTimeout(function() {
    if (bufferSonus && sonusBuffer !== '') {
      console.log('Complete sonusBuffer flushed: ', sonusBuffer);
      Sonus.annyang.trigger('sonusBuffer ' + sonusBuffer);
      sonusBuffer = '';
      bufferSonus = false;
    }
    sonus.emit('recognitionStreamShutdown');
  }, 4300);
});
// Event: When Snowboy detects sound.
sonus.on('sound', function() {
  clearTimeout(silenceTimeout);
  silenceTimeout = null;
});
// Event: When a final transcript has been received from Google Cloud Speech.
sonus.on('final-result', function(result) {
  console.log(result);
  // Append a space so consecutive final results don't run together.
  if (bufferSonus) sonusBuffer += result + ' ';
});

Perhaps this can help someone.

@evancohen
Owner

Thanks for the super detailed suggestion! This is a good idea :) I'm working on a refactor of Sonus that allows for more cloud recognizers and a bit more control of their configuration, which should address this.

Another thought I've had is to add the ability to change configuration after a hotword is detected, but before speech begins streaming (essentially a synchronous config update via a callback function that gets passed to the 'hotword' event).
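
(A rough sketch of that idea, purely hypothetical and not part of any release: the 'hotword' event would hand listeners a mutable config object that Sonus reads back synchronously before it starts streaming.)

// Hypothetical callback-based config update - not an actual Sonus API:
sonus.on('hotword', (index, hotword, config) => {
  if (hotword === 'FreeDescription') {
    config.singleUtterance = false // long-form dictation for this session only
  }
})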

@mrdago

mrdago commented Oct 5, 2017

@rhclayto - That's exactly what I'm looking for!
I've successfully implemented Sonus in my house automation solution and it works perfectly (reliable and fast) at detecting and translating short commands to control my devices.
Now I'm planning to use Sonus to create "yellow sticker" messages and to display them on my family dashboard (MagicMirror without the mirror). But this requires a bit more control over the Google API, to optionally switch off interimResults or to influence the singleUtterance parameter.
As I'm a novice in programming, I will wait for @evancohen's Sonus source code modifications.
@evancohen - Great work, thank you for this piece of software!

@mrdago

mrdago commented Oct 7, 2017

@evancohen
I've installed [email protected] and additionally extended index.js with the code proposed by @rhclayto to be able to process longer text phrases with the Google Speech API.
It works reliably and stably, and I would be pleased if the proposal from @rhclayto became part of the next Sonus release. To be more flexible, I propose using the second hotword to switch to opts.noSingleUtterance = true (see my code snippet below).

sonus.trigger = (index, hotword) => {
  if (sonus.started) {
    try {
      let triggerHotword = (index == 0) ? hotword : models.lookup(index)
      // *** added
      // If there are 2 hotwords configured, set an option to send to CloudSpeechRecognizer
      // so that it will start a stream with singleUtterance: false, & we'll handle our own silence
      // detection & recognitionStream teardown.
      opts.noSingleUtterance = false
      if (models.lookupTable.length > 1) {
        if (triggerHotword === models.lookup(2)) opts.noSingleUtterance = true
      }
      // ***
      sonus.emit('hotword', index, triggerHotword)
      CloudSpeechRecognizer.startStreaming(opts, sonus.mic, csr)
    } catch (e) {
      // Log the error before throwing; a statement after 'throw' would never execute.
      console.log(e)
      throw ERROR.INVALID_INDEX
    }
  }
}
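
With that change, a setup along these lines (file paths and hotword names are only examples) would use the second configured hotword to start a long-form stream:

// Example: the second hotword ('Dictate') triggers singleUtterance: false
// in the snippet above. Paths and names are illustrative only.
const hotwords = [
  {file: 'resources/Command.pmdl', hotword: 'Command', sensitivity: '0.5'},
  {file: 'resources/Dictate.pmdl', hotword: 'Dictate', sensitivity: '0.5'}
]
const sonus = Sonus.init({hotwords, language: 'en-US', recordProgram: 'arecord'}, speech)
Sonus.start(sonus)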

@evancohen
Owner

Awesome :) I'll definitely integrate this into the next release. One change though: noSingleUtterance should be a parameter on the hotword - it shouldn't be hard-coded for a specific index. I'm traveling this week, but next week I'll release the update (unless you want to send a PR for it).
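
(A sketch of that suggestion, not an actual Sonus API: the flag lives on each hotword definition, and trigger() looks it up on the matched hotword instead of hard-coding an index.)

// Hypothetical per-hotword flag - not in any release:
const hotwords = [
  {file: 'Command.pmdl', hotword: 'Command', sensitivity: '0.5'},
  {file: 'Dictate.pmdl', hotword: 'Dictate', sensitivity: '0.5', noSingleUtterance: true}
]

// Inside sonus.trigger, read the flag from the matched hotword:
const model = hotwords.find(h => h.hotword === triggerHotword)
opts.noSingleUtterance = !!(model && model.noSingleUtterance)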

@MrSponti

Hi Evan, yesterday I checked out your latest release and tried to implement the discussed option for Sonus to make singleUtterance configurable. I had it implemented and running stably in sonus v0.1.9-1, but my knowledge is not sufficient to implement it in your latest release, and I would be pleased if you could help. The proposed change is an add-on and will not impact the existing functionality.

The approach I took in v0.1.9-1 was to define an option longText holding the index of the hotword that switches to singleUtterance = false.

const longText = 2;      // index pointing to hotword for use of 'long text form' in Google Speech API
const sonus = Sonus.init({ hotwords, language, audioGain, longText, recordProgram: "arecord" }, speech);

And here are the modifications (marked *** added) in index.js:

  const recognitionStream = recognizer.streamingRecognize({
    config: {
      encoding: 'LINEAR16',
      sampleRateHertz: 16000,
      languageCode: options.language,
      speechContexts: options.speechContexts || null
    },
    singleUtterance: options.noSingleUtterance ? false : true,        // *** changed
    interimResults: true,
  })
  // *** added
  // console.log('options.noSingleUtterance=' + options.noSingleUtterance)
  if (options.noSingleUtterance) {
    cloudSpeechRecognizer.recognitionStream = recognitionStream
  }
  // ***

...
opts.language = opts.language || 'en-US' // https://cloud.google.com/speech/docs/languages
// *** added
opts.longText = opts.longText || 0
// ***

...

sonus.trigger = (index, hotword) => {
  if (sonus.started) {
    try {
      let triggerHotword = (index == 0) ? hotword : models.lookup(index)
      // *** added
      // If the value of 'opts.longText' is the index of the current hotword, set an option to
      // start a stream with singleUtterance: false, and we'll handle our own silence detection
      // & recognitionStream teardown.
      opts.noSingleUtterance = false
      if (opts.longText > 0 && models.lookupTable.length >= opts.longText) {
        if (triggerHotword === models.lookup(opts.longText)) opts.noSingleUtterance = true
      }
      // ***
      sonus.emit('hotword', index, triggerHotword)
      CloudSpeechRecognizer.startStreaming(opts, sonus.mic, csr)

...
  // ***
  // Add a sonus.on listener to receive events from an instantiated sonus in order to shut down 
  //  a recognitionStream.
  sonus.on('recognitionStreamShutdown', function() {
    // recognitionStream will be made available on the csr object, so we can shut it down as follows:
    if (csr.listening && csr.recognitionStream) {
      csr.listening = false
      sonus.mic.unpipe(csr.recognitionStream)
      csr.recognitionStream.end()
      delete csr.recognitionStream
    }
  })
  // ***
...
