Skip to content

Support force activated pipelines #144

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

chatziko
Copy link

@chatziko chatziko commented Mar 27, 2024

This PR contains the client-side of a "push-to-talk"-like feature that allows the server to "force activate" the satellite, going directly to ASR. The change is simple: when the server sends run-satellite with start_stage = asr then we send back a run-satellite (as usual) with that stage and start streaming immediately. More details will be given in the corresponding PR in home-assistant/core.

#143 is recommended for this to work properly, otherwise the awake sound and debug recording will not be triggered for "force activated" pipelines. I made separate PRs for easier review, if you merge #143 first I can take care of the conflicts.

Note also that this wyoming change is needed by this PR.

Finally note that this is about a server-side "push-to-talk", not client side as in #82.

@founderio
Copy link

I feel like a remote-activated function like this needs a config option to enable/disable it, ideally defaulting to OFF. This feels like a privacy issue waiting to happen.

@chatziko
Copy link
Author

It could be easily made optional. But I'm not sure what scenario you have in mind, who is the adversary?
If your HA server is compromised you should have no privacy expectation at all, the server controls the satellites and pretty much everything else.

@founderio
Copy link

In general I'm leaning towards the "prank" level of privacy, so someone with access to the network.

However, from my understanding so far, there is no authentication/authorization in the wyoming protocol? (Please correct me if I'm wrong, but I did not find any reference that shows any kind of authentication happening)

This means, by extension, that the satellite could be controlled by someone who is NOT the HA server. Which means anyone in the network could trigger audio streaming - bypassing voice/hotword activation - to basically anywhere in the network.

@chatziko
Copy link
Author

This is an interesting security discussion in general, maybe encryption/authentication could be added to wyoming (similarly to esphome).

But it's completely orthogonal to this PR. Even without this PR a malicious user on the local network can connect to the satellite and stream audio. If wake-word detection happens locally on the satellite the malicious user would simply need to wait until the first detection, and then he could stream audio forever (pretending the pipeline never ends).

Or, to avoid waiting, he could even do a more exotic attack like send an audio-chunk to the satellite containing playback audio of the wake word (a wav file saying "ok nabu"), which the satellite will happily playback (thinking it's a TTS response) and wake itself up 😄

@founderio
Copy link

Thank you for this insight! I was not aware the protocol allowed that much freedom :D

I've filed an issue in the protocol repo here, as this looks like a more foundational topic: OHF-Voice/wyoming#11

@llluis
Copy link
Contributor

llluis commented Apr 3, 2024

Hey @chatziko, very interesting approach. Thanks for the contributiuon.
We did the same feature, but differently. :)

Don't know if you saw this one, it's a very long thread, but you can go bottom-up: #81

I did that and a few other "exploration" changes, however, PRs are not going through.
I don't want to create a parallel repo, so I'm waiting on those to be merged or rejected to continue adding new features. :(

@chatziko
Copy link
Author

chatziko commented Apr 4, 2024

Hey @llluis , wow, that's a long discussion in #81, I certainly missed it otherwise I would have participated. I hadn't followed wyoming-satellite for a while, I only posted a couple of issues back when it was first released then kept using the old homeassistant-satellite until a couple of weeks ago. Then I checked the PRs for recent developments, but not the issues (apart from #4 in which I saw no action). But I'm very happy to find out that people are pushing interesting ideas forward!

I had a quick look at the code in llluis/wyoming-satellite, if I understand correctly the general design for the "remote trigger" is that the server sends a "fake" detect event. It's certainly a simple and natural design. The reason I didn't follow this approach is that I thought it has a few drawbacks:

  1. Just handling detect in the usual way doesn't work with VadStreamingSatellite, cause streaming is triggered by VAD, not detect. So it needs a different implementation there, complicating the code.

  2. Conceptually, "faking" an event is a bit messy. detect has specific semantics and can happen at specific moments in the pipeline. Adding new semantics and allowing it to happen at any moment will make the code more complicated. (And then adding the question_id param which has nothing to do with a real detection makes things even more "hacky".)

  3. Sending detect will not work if the satellite is paused, cause the server is not connected (in my use-case I want to pause the satellite, eg when the TV is on, and then be able to trigger it while paused). So you'd need to unpause first and coordinate the detect afterwards, which again would complicate the code.

So in the end I found the approach of adding start_stage to run-satellite cleaner, but of course it's not a huge difference and in the end any approach that works is fine. Hopefully @synesthesiam will find some time soon to have a look and let us know what he thinks and whether he plans to merge just a feature.

PS. Concerning question_id, if I understood correctly your goal was to only do ASR? This sounds like an interesting goal (although I find the question_id solution messy). In my design one could easily achieve this by also adding end_stage to run-satellite (fits well with start_stage).

@Mincka
Copy link

Mincka commented May 15, 2024

Very nice PRs. I've tested them, including the button in HA.
Everything works fine, including the awake sound which is kept when the button.press service is called.

Thank you very much. I sincerely hope that this work can be reviewed and merged in the future.

I wrote a small guide there for anyone interested to test and use it in their setup.

@chatziko
Copy link
Author

Awesome, thanks for testing and writing a guide.

@Mincka
Copy link

Mincka commented May 18, 2024

Not sure at that point if that's related to one of the PR, but the button entity was named button.my_satellite_none, and now I see that after removing the integration and adding the satellite again that none of the entities has a name related to their setting.

image

select.my_satellite_none
select.my_satellite_none_2
switch.my_satellite_none
number.my_satellite_none
number.my_satellite_none_2

@Mincka
Copy link

Mincka commented May 28, 2024

Do you know if this activation can be associated with one of the current events like --detection-command?
I tried to use it but it does not seem to be called when using the button press.
My goal is to keep a chime or send a tts to let the user know that the satellite is ready to listen.

@chatziko
Copy link
Author

It should work, but of course with the events that are actually happening. No wakeword was detected so it makes sense that --detection--command does not fire. But --stt-start-command should work (haven't tested it though).

@Mincka
Copy link

Mincka commented May 28, 2024

I think I've tested with all the events around the detection without success, but I'll try again with that one and report here for the results. It could also an issue in the original project so I'll test with the standard workflow with the wake word. Thank you for the quick answer.

@Mincka
Copy link

Mincka commented Jun 3, 2024

Quite sad that OHF-Voice/wyoming#10 did not make it to 1.5.4 since it was ready for a long time.
However, it's good to see a little bit of progress around this. Now the requirements for this PR are fulfilled. :)

@qJake
Copy link

qJake commented Jul 6, 2024

Can't wait for this! I'm really hoping that after this is implemented, we can tweak the LLM call to include a "Follow up?" parameter which can just trigger an immediate detection event on the satellite so it can start listening right away.

Basically, this gets us one step closer to true two-way, multi-step communication with LLMs. Very exciting.

@slyticoon
Copy link

Excited for this to be implemented. This would allow me to completely replace my commercial voice assistants. Actionable notification and complex conversations will be great to have.

@jjdenhertog
Copy link

@chatziko I noticed that changes to the core are made referencing this pr. But it's not fully integrated. Is the current code still usable? Would love to see this working.

@qJake
Copy link

qJake commented Feb 7, 2025

Home Assistant's native assist_satellite entity type recently got support for announcing a prompt and listening for a response, initiated from Home Assistant. I think that's a comparable replacement to this...

@jjdenhertog
Copy link

@qJake Ar those features not used to initiate a conversation only? This PR could help for keeping a conversation running. An example of this would be to activate the satellite when the response from the voice assistant contains a question. Than we wouldn't want to re-announce the question to start the conversation.

@qJake
Copy link

qJake commented Feb 10, 2025

@jjdenhertog Yes, that is true - and I think the issue there is, how do you get an LLM to respond with metadata that flags "I am asking the user a question?" reliably? There isn't a way to do that with LLM APIs currently as far as I know.

My thoughts for this PR (as well as the new upcoming assist_satellite action) are more for automations - you could make a TTS announcement, ask a question, then listen for a response. Even if the action requires a prompt, you could probably just put some inaudible text into the prompt that the TTS engine ignores or produces little to no output for.

@jjdenhertog
Copy link

@qJake The most basic implementation could be some automation to let an LLM determine if it's a question. And if so activate it. For example when the LLM is not able to determine the context. It now sometimes responds with "Which lights should I turn on?". It would be nice if that would actually work. I think that LLM should be able to determine that.

I could also see fail safe scenarios. For example when the user says "Can you turn the volume to fifteen" but the LLM thinks it "turn to fifty" 😂. Would be nice if you could be able to insert an "are you sure about that" question when some values are being set - to which you can reply with "yes" or "no".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants