add "FSM" satellite type #258

Open
wants to merge 1 commit into
base: master

Conversation

emmaconnor

Adds a new satellite type designed for advanced back-and-forth conversations. The satellite uses a finite state machine to track conversation state and relies on a combination of VAD and wake word detection to detect state changes.

States:

  • Paused: initial state. No audio streaming, no wake word detection, no VAD. When a server connects, the satellite enters the Monitor state.
  • Monitor: listens for wake word detection events. When a wake word is detected, the satellite enters the Stream state and starts streaming audio to the server.
  • Stream: streams mic input to the server. When VAD detects that the user is done speaking, the satellite enters the Playback state.
  • Playback: plays the TTS response from the server and awaits an AudioStop event from the server. Mic input is not processed in this state. This prevents TTS playback from activating VAD, but it has the downside that the user cannot interrupt playback; interruption support could be added in the future. When TTS playback ends, the satellite enters the Followup state.
  • Followup: uses VAD to detect whether the user starts speaking again. No wake word is required. If the user starts speaking, the satellite enters the Stream state again. Otherwise, if VAD detects no speech for 10s, it returns to the Monitor state.

State diagram is roughly:

+----------------+                                                                                                
|    Paused      |                                                                                                
+----------------+                                                                                                
       |                                                                                                
 (server connects)                                                                                                
       |                                                                                                
       V                                                                                                
+----------------+                                                                                                
|    Monitor     |<--------------------<-----------<                                                                                                
+----------------+                                 |                                                               
       |                                           |                                                     
(wakeword detected)                                |                                                                
       |                                           |                                                     
       V                                           |                                                     
+----------------+                                 ^                                                               
|    Stream      |<----------<----------<          |                                                               
+----------------+                      |          |                                                               
       |                                |          |                                                     
(VAD no longer detected)                |          |                                                                     
       |                                |          |                                                     
       V                                |          |                                                     
+----------------+                      ^          ^                                                               
|    Playback    |                      |          |                                                               
+----------------+                      |          |                                                               
       |                                |          |                                                     
(TTS playback finished)                 |          |                                                                    
       |                                |          |                                                     
       V                                |          ^
+----------------+                      |          |
|    Followup    |-->-(VAD detected)----^          |
+----------------+                                 |
       |                                           |
(no VAD detected for 10s)                          |
       |                                           |
       +---->--------------------------------------^
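
The states and transitions described above can be sketched as a small transition table. This is only an illustration of the described behavior, not the code from this PR: the names `State`, `Event`, `TRANSITIONS`, `next_state`, `run`, and `FOLLOWUP_TIMEOUT_SECONDS` are hypothetical, and the events are assumed to be produced elsewhere by the wake word detector, VAD, audio playback, and a timer.

```python
from enum import Enum, auto
from typing import Iterable


class State(Enum):
    PAUSED = auto()
    MONITOR = auto()
    STREAM = auto()
    PLAYBACK = auto()
    FOLLOWUP = auto()


class Event(Enum):
    # Hypothetical event names; the PR may model these differently.
    SERVER_CONNECTED = auto()
    WAKE_WORD_DETECTED = auto()
    SPEECH_STARTED = auto()     # VAD reports that the user started speaking
    SPEECH_ENDED = auto()       # VAD reports that the user stopped speaking
    PLAYBACK_FINISHED = auto()  # TTS playback ended
    FOLLOWUP_TIMEOUT = auto()   # no speech detected during the followup window


# Followup window from the description (10s); the constant name is an assumption.
FOLLOWUP_TIMEOUT_SECONDS = 10.0

# One entry per arrow in the state diagram.
TRANSITIONS = {
    (State.PAUSED, Event.SERVER_CONNECTED): State.MONITOR,
    (State.MONITOR, Event.WAKE_WORD_DETECTED): State.STREAM,
    (State.STREAM, Event.SPEECH_ENDED): State.PLAYBACK,
    (State.PLAYBACK, Event.PLAYBACK_FINISHED): State.FOLLOWUP,
    (State.FOLLOWUP, Event.SPEECH_STARTED): State.STREAM,
    (State.FOLLOWUP, Event.FOLLOWUP_TIMEOUT): State.MONITOR,
}


def next_state(state: State, event: Event) -> State:
    """Return the state after `event`; events with no matching arrow are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(events: Iterable[Event], state: State = State.PAUSED) -> State:
    """Feed a sequence of events through the table and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state
```

Because unknown (state, event) pairs leave the state unchanged, speech detected during Playback has no effect in this sketch, which mirrors the point that mic input is not processed in that state.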

@mitrokun

mitrokun commented Feb 3, 2025

As long as we don't have a mechanism to interrupt the response, your solution looks quite interesting.
Do these changes work with the latest release?

I've tried replacing files and restarting the service, but saw no change.

@mercuryin

Sorry, but I would love to use this. Is it already supported in some way?

@emmaconnor
Author

The changes are implemented in this PR; we're just waiting on the project owner to respond about accepting them into the main repo. For now, you should be able to apply these commits locally if you need this feature.

@emmaconnor
Author

> As long as we don't have a mechanism to interrupt the response, your solution looks quite interesting
> Do these changes work with the latest release?
>
> I've tried replacing files and restarting the service, but got no change.

To use the FSM satellite, you must enable both VAD and wake word detection from the CLI.

@mercuryin

Hi Emma,

I was wondering if you’d mind sharing some example values that work well for the FSM system you’ve implemented?

I’ve tried various combinations, but I just can’t seem to get a smooth conversation flow. For example, when I give a command and the wake word is triggered, sometimes I don’t even have time to say anything before I get a response from Home Assistant. It all feels too fast — with multiple overlapping sounds: the wake sound, the confirmation sound, the response… and I struggle to figure out the right timing for when I’m supposed to talk, when the system is listening, and when it’s Home Assistant’s turn to reply.

It’s all a bit chaotic, and I feel like I’m missing that dynamic flow. Maybe you have some default values or a setup that just works for you? I’d really appreciate any tips or examples you could share — I’m a bit stuck right now.

Thanks so much in advance!

Best,
Joseba

@mitrokun

mitrokun commented Jul 2, 2025

> I’ve tried various combinations, but I just can’t seem to get a smooth conversation flow.

I agree that the Pi Zero 2 W is very unstable.
Even without this modification, the board has problems with the wireless network.

I would advise you to switch to an ESP32-S3. I have a configuration on GitHub that implements the same idea, and I've been using it for 3 months without any problems.

3 participants