Sad news: improvements not possible with this hardware revision #100

tetele (Maintainer) · 2024-12-13T15:57:43Z

Hello everyone!

I've been repeatedly trying to update the Onju Voice config in order to make it take advantage of the newest developments in ESPHome and Home Assistant, but I've hit snags every time and now I've come to realize why.

Objectives

What I've tried to achieve is:

- have the Onju act as both a voice assistant and a media player
- have the voice assistant use micro_wake_word in order to avoid constant streaming. Plus, it's by far the better-supported wake word variant for ESP devices in HA
- be able to pause music playback or lower the media player volume when the wake word is uttered
- have good quality audio output

Facts

- the I2S bus clock is shared between the microphones and the speaker on the Onju board. This means that they both have to use the same sampling rate when running simultaneously
- the microphones and the speaker can each be run at either 16kHz or 48kHz
- micro_wake_word and the voice_assistant pipeline in ESPHome both need audio input at a 16kHz sampling rate to function
- to maintain quality, the sampling rate of an audio signal must be at least twice the highest frequency within the signal (the Nyquist criterion). As a result, resampling a 48kHz signal to 16kHz first requires eliminating all frequencies above 8kHz
- 16kHz for audio output means (very) bad audio quality
- software echo cancellation is very computationally expensive
- there is no hardware active echo cancellation on the Onju Voice
- software low-pass filtering is very computationally expensive
- there is no hardware low-pass filter on the Onju Voice
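
A quick worked example of the sampling-rate arithmetic behind these points, using only the 16kHz and 48kHz rates mentioned above (the symbol M for the downsampling factor is introduced here just for illustration):

```latex
\begin{align*}
f_s &\ge 2 f_{\max} \;\Longrightarrow\; f_{\max} \le \tfrac{f_s}{2} \\
f_s = 48\,\mathrm{kHz} &\;\Longrightarrow\; f_{\max} \le 24\,\mathrm{kHz} \\
f_s = 16\,\mathrm{kHz} &\;\Longrightarrow\; f_{\max} \le 8\,\mathrm{kHz} \\
M &= \frac{48\,\mathrm{kHz}}{16\,\mathrm{kHz}} = 3
\end{align*}
```

So going from 48kHz to 16kHz means keeping every third sample, and any content between 8kHz and 24kHz that is not filtered out first folds back (aliases) into the 0 to 8kHz band.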

If you add all these together, you will realize that the ESPHome config needs to make some compromises, as it cannot attain all objectives simultaneously.

Potential drawbacks

One of the following has to be accepted to make the Onju work with ESPHome:

- the device cannot listen for wake words during media playback. Either it plays or it listens, so that the I2S bus is used alternately by the microphones and the speaker, each with its own sampling rate. This is how the device functioned initially, ~1 year ago
- the audio output quality is crap, as we need to use 16kHz for both components. Even if that is bearable, the audio output will still mess with the Micro Wake Word detection, since there is no echo cancellation
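
For illustration, the second option boils down to declaring one shared I2S bus and pinning both the microphone and the speaker to 16kHz. This is a minimal sketch rather than this repo's actual config: the pin numbers and the `i2s_shared` id are placeholders, and it assumes an ESPHome release in which the `i2s_audio` microphone and speaker platforms accept `sample_rate`.

```yaml
i2s_audio:
  - id: i2s_shared              # hypothetical id; one bus means one shared clock
    i2s_lrclk_pin: GPIO13       # placeholder pin
    i2s_bclk_pin: GPIO18        # placeholder pin

microphone:
  - platform: i2s_audio
    i2s_audio_id: i2s_shared
    adc_type: external
    i2s_din_pin: GPIO17         # placeholder pin
    pdm: false
    sample_rate: 16000          # the rate micro_wake_word and voice_assistant need

speaker:
  - platform: i2s_audio
    i2s_audio_id: i2s_shared
    dac_type: external
    i2s_dout_pin: GPIO12        # placeholder pin
    sample_rate: 16000          # has to match the microphone, hence the poor output quality
```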

Solutions

In terms of software, the solutions are outside the scope of this config. Either MWW and VA need to be reworked to work at 48kHz, or a very computationally cheap resampler needs to be developed for ESPHome to downsample the audio input to 16kHz. And I know for a fact that neither of these is a priority for the Nabu Casa guys at the moment.

In terms of hardware, there are things that could improve the Onju dramatically (or, at the very least, make software development much easier for it), but the developer of the Onju Voice PCB repeatedly said that he is not interested in improving the board any further.

And even if a new hardware revision were available, with separate I2S buses and a chip for active noise cancelling, everyone would still need to go out and buy new PCBs.

Conclusion

I will need to think about what the best approach is here. First thought is multiple configs, each for a single purpose. But I could use the community's help on this. Please drop your opinion or comment below.

Thanks!
Tudor

andrew-codechimp · 2024-12-14T17:42:39Z

First of all, thanks for all your work on getting ESPHome working on the Onju Voice. I went into this knowing it was bleeding edge and that the hardware could be a limitation for long-term use, but my 5 devices are performing extremely well within the known limitations.

Perhaps a way to do this would be to adopt a pattern similar to the Everything Presence One where there are common packages that are then imported to gain the desired set of features.
https://github.com/EverythingSmartHome/everything-presence-one
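
A rough sketch of what that pattern could look like here, using ESPHome's `packages` feature. The file names below are made up for illustration and do not exist in this repo; the idea is that each device YAML pulls in only the feature set it wants.

```yaml
# onju-voice-mww.yaml (hypothetical device file)
substitutions:
  name: onju-voice

packages:
  # shared hardware definitions such as LEDs, touch buttons and I2S pins (hypothetical file)
  base: !include common/base.yaml
  # one of several interchangeable feature sets (hypothetical files)
  wake_word: !include common/micro_wake_word.yaml
  # media_player: !include common/media_player.yaml
```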

I'm a tinkerer with ESPHome, so I've never attempted this level of packaging, but I'd certainly try to help out where time allows. I'm using a custom modification of the mww implementation with some additional convenience features that I've created myself, such as a clumsy flip of the device.

5 replies

tetele Dec 14, 2024
Maintainer Author

Very nice way to architect things. I will try that, thanks!

tetele Jan 15, 2025
Maintainer Author

Here we go with a first try 🤷‍♂️ 🤞 #108

dreimer1986 Jan 15, 2025

OMG!! Something to tinker with later at home 😄

dreimer1986 Jan 20, 2025

Flashed both of my Onjus over USB, since a bigger partition size seems to be needed. The end result is not ready for use yet, but I see where you want to go with this. Nice work so far :D The sensor buttons are a bit random in use, and the MWW voice detection seems to work better on one device than on the other, even though both have the same firmware base. (Yes, I saw the "not really working yet" part, of course ^^) I will follow closely, now that I can do OTA again.

tetele Jan 20, 2025
Maintainer Author

It is very much a work in progress 😁 I'm trying to get help to overcome the issues.

I might publish the first beta (or maybe alpha?) and go from there.

tbrasser · 2024-12-16T16:33:12Z

Personally I prefer streaming ww over mww, and I was thinking that, as a consequence, the 48 to 16 resampling could be done HA-side.

It would be nice to support audio ducking etc. I believe that for non-mww to work with that, the same va "stages" need to be exposed?

Edit: I'm already using openww with StreamAssist, so the extra couple of streams from the Onjus are not a burden & all can use the same wake word.

I guess what I'm trying to say is that I'd be most interested in a firmware (variant) where mww is sacrificed for audio quality, and I would hope non-mww gets the same va treatment as mww in the future.

The above mentioned source structure looks nice. Another option would be how the voice-pe repo is structured?

0 replies

dreimer1986 · 2024-12-18T19:02:08Z

Well, justLV said he thinks that AEC might be possible on the software side. Still... if the board lacks power... we all have a Home Assistant with some power left. (I have an i5 NUC for it, so it's bored all the time.) Why not pass some of the work on to it over Wi-Fi?

And if this is not possible, or rather not feasible, I like the idea of choosing where you want your drawback and where not. There are people who prefer the discussion with the AI over music streaming, some prefer music, and then there are the ones like me... I like music and talking with my AI catgirl with sarcasm included, but if she does not hear me while the music plays... so be it.

Tbh... my Amazon Echo does not hear me all the time when anything except me is in the room. I think they made it less useful through firmware updates, as this was clearly better in the past...

Right now I must say that the Onju Voice does a better job most of the time compared to the Echo devices all over my house. The only things still missing are the reminders feature, maybe with HA calendar integration, and a fix for the Onju telling you that you have no timers running even when you do.

In understanding, in playing music and in AI matters it's far better. Try to tell your Echo to do two things at once... BOOM, dead. I can tell this thing a whole list of things to do and it just does all of it. The platform in the background makes it great, and even if the assistant itself has a little flaw... it's still way better than anything you can buy right now, because the current assistant already lets the platform shine more than the big competitors out there. It can only get better, flaws or not.

Believe me when I say that, however you decide here, anything is better than stopping the project. I've never had so much fun with a voice assistant before, and that's with its flaws, which are negligible IMO.

1 reply

tetele Jan 15, 2025
Maintainer Author

> Well, justLV said he thinks that AEC might be possible on software side

I've heard that from Olivier, the owner of Raspiaudio, too. But that would need to be implemented in ESPHome. We'll see if either I or someone else can do it.

tbrasser · 2024-12-20T23:18:15Z

Honestly I love the nice design of the minis (have them floating wall mounted), and the ww sensitivity is ok-ish without xmos, let's work out together the optimum firmware (flavors) for these and keep 'em going!

0 replies

dreimer1986 · 2024-12-21T12:15:33Z

My two Home Assistant Voice devices just arrived a few minutes ago. I've tested some basic things by now, and even though they run fine and are more stable than my n00b attempt to get the Onju Voice config running on newer ESPHome versions, they still have flaws where the Onju shines. Biggest one: audio. HA Voice sounds a bit like a tin can compared to the Onju. Second one: looks... transparent hockey puck vs nice black cloth. Third one: you built it yourself. Never underestimate the self-made factor! When I look more closely into the new devices I will find even more, that's for sure! One thing is great though: HA Voice is open source, too. So maybe we can borrow some ideas from there?

1 reply

tetele Dec 21, 2024
Maintainer Author

> HA Voice is open source, too. So maybe we can borrow some ideas from there?

As a language leader, I've had the HA Voice PE for a while now. That is where I tried to draw inspiration from, and I do have some ideas for improvement. But hardware is a limiting factor.

rmeissn · 2024-12-26T21:45:51Z

I'm voting for the option of having separate configuration files for different purposes.

I'm not entirely sure about the config I wish for. Is the following possible?

- mww: active while the media player is paused; deactivated while the media player is playing (songs)
- play/pause button on top toggles the media player; a long press (stops the media player and) triggers the va
- thus, the input and output pipelines are never active at the same time, which allows for:
  - 48kHz audio output (va output is an mp3/wav file that is streamed to the media player, which is optionally upsampled?)
  - 48kHz audio input, which needs to be downsampled as input for mww and va

Since echo cancellation is not available, detecting a wake word while a song is playing is prone to failure, and even when it is detected, the recorded audio contains music. So it is not feasible at all -> don't go for it, unless someone implements an efficient method for echo cancellation.
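
The "mww only while nothing is playing" part of this could be sketched with the media player's own triggers. This is an untested illustration, not the repo's config: the `onju_out` id and the pin are placeholders, and it assumes the `i2s_audio` media player platform. The `voice_assistant.is_running` check is there because the assistant's spoken reply also goes through the media player.

```yaml
media_player:
  - platform: i2s_audio
    id: onju_out                   # hypothetical id
    dac_type: external
    i2s_dout_pin: GPIO12           # placeholder pin
    mode: mono
    on_play:
      - micro_wake_word.stop:      # free the bus for playback
    on_pause:
      - micro_wake_word.start:     # listen again once playback is paused
    on_idle:
      # only resume listening if no voice pipeline is still in progress
      - if:
          condition:
            not:
              voice_assistant.is_running:
          then:
            - micro_wake_word.start:
```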

4 replies

rmeissn Dec 27, 2024

As an addition to my post: having an equalizer for the device (either server-side or client-side) would be nice.
Compared to the original Google speaker, the Onju lacks some bass and has too much in the mids.

rmeissn Dec 27, 2024

This is the config for the upper touch button that implements my wish from above. I'm not sure everything in it is needed, but it works perfectly, whereas the code from this repository causes the speaker to reboot when the button is pressed.

# Entry for the binary_sensor: list
- platform: esp32_touch
  id: action
  pin: GPIO3
  threshold: 751000
  on_press:
    then:
      # Short press: stop the assistant if it is running, otherwise toggle play/pause
      - if:
          condition: voice_assistant.is_running
          then:
            - voice_assistant.stop:
            - wait_until:
                condition:
                  not:
                    media_player.is_playing
            - delay: 200ms
            - micro_wake_word.start
          else:
            - if:
                condition: media_player.is_playing
                then:
                  - media_player.pause
                  - wait_until:
                      condition:
                        not:
                          media_player.is_playing
                else:
                  - if:
                      condition: media_player.is_paused
                      then:
                        - media_player.play
      # Still touched after 750ms: treat it as a long press and start the assistant
      - delay: 750ms
      - if:
          condition:
            binary_sensor.is_on: action
          then:
            - media_player.stop
            - micro_wake_word.stop
            - voice_assistant.start:

dreimer1986 Dec 27, 2024

I borrowed your button config to stop the reboot when you accidentally touch the device the wrong way, and yes, in first tests it does the job you intended for it. The feature it was originally intended to have would still be nice, though, even if activated from the Home Assistant UI. I miss a way to continue a conversation after the AI has answered: it would wait for more input after speaking the reply, so that you can go on without needing to trigger the wake word again. I will try to get this working later, as I think it's a nice idea which should not be removed.

rmeissn Dec 27, 2024

You can also use something like the following code (tested on my Onju) instead of my code from above. In my opinion, this code is clearer. Conversation mode starts with a double click. Remember to lift your finger properly: if it remains too close to the sensor, the release won't be recognized as "off".

# Entry for the binary_sensor: list; expects a template binary sensor with id conversation_mode
- platform: esp32_touch
  id: action
  pin: GPIO3
  threshold: 751000
  on_multi_click:
    - timing: # double click: enter conversation mode
        - ON for at most 0.5s
        - OFF for at most 0.5s
        - ON for at most 0.5s
        - OFF for at least 0.2s
      then:
        - media_player.stop
        - micro_wake_word.stop
        - binary_sensor.template.publish:
            id: conversation_mode
            state: ON
        - delay: 250ms
        - voice_assistant.start_continuous:
    - timing: # single click: toggle play/pause
        - ON for at most 1s
        - OFF for at least 0.5s
      then:
        - if:
            condition: media_player.is_playing
            then:
              - media_player.pause
            else:
              - if:
                  condition: media_player.is_paused
                  then:
                    - media_player.play
    - timing: # long press: leave conversation mode and go back to wake word detection
        - ON for 1s to 3s
        - OFF for at least 0.2s
      then:
        - voice_assistant.stop:
        - media_player.stop
        - wait_until:
            condition:
              not:
                media_player.is_playing
        - binary_sensor.template.publish:
            id: conversation_mode
            state: OFF
        - delay: 200ms
        - script.execute: reset_led
        - script.wait: reset_led
        - micro_wake_word.start

And you need to change the va on_end action to:

on_end:
  - wait_until:
      condition:
        not:
          media_player.is_playing
  - script.execute: reset_led
  - if:
      condition:
        and:
          - switch.is_on: use_wake_word
          - binary_sensor.is_off: mute_switch
          # In conversation mode, do not restart micro_wake_word after a pipeline finishes;
          # conversation mode is ended with a long press of the upper button
          - binary_sensor.is_off: conversation_mode
      then:
        - micro_wake_word.start
