
Conversation

@GiviMAD (Member) commented Jan 26, 2024

This is a WIP PR for issue #2275.

The PR works with the latest snapshot (Nov 30, 2025).

How it works

Under bundles/org.openhab.ui/web/src/js/voice I've added a library that uses the Web Worker and Web Audio APIs to connect to openHAB over WebSocket and stream audio.

These are the main classes inside:

  • AudioWorker: Web Worker implementation that connects to the openHAB server's /ws/audio-pcm endpoint.
  • AudioMain: Main orchestrator that handles the interaction between the AudioWorker and the Web Audio API (e.g. when a sound needs to be played, it registers an AudioWorklet and transfers its MessagePort instance to the AudioWorker so it can stream the data without impacting the browser's main thread; see the sketch after this list).
  • AudioSink: Used by AudioMain to set up audio playback and register the audio worklet implementation that handles audio coming from the AudioWorker thread.
  • AudioSource: Used by AudioMain to set up access to the microphone and register the audio worklet implementation that sends audio to the AudioWorker thread.
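To illustrate the wiring described above, here is a minimal sketch of the pattern: the main thread registers an AudioWorklet and transfers its MessagePort to a Web Worker that owns the WebSocket, so PCM chunks never pass through the main thread. This is not the PR's actual code; only the /ws/audio-pcm path comes from the PR, and names like audio-worker.js, playback-processor and the host are made up for the example.

```js
// Minimal sketch of the worker/worklet wiring (illustrative names, not the PR's classes).

// --- audio-worker.js (runs inside the Web Worker) ---
// let workletPort
// const ws = new WebSocket('wss://openhab.local:8443/ws/audio-pcm') // host is illustrative
// ws.binaryType = 'arraybuffer'
// ws.onmessage = (e) => workletPort?.postMessage(e.data, [e.data]) // forward PCM chunks
// onmessage = (e) => { if (e.data.type === 'port') workletPort = e.data.port }

// --- main thread ---
async function setupPlayback () {
  const worker = new Worker(new URL('./audio-worker.js', import.meta.url))
  const ctx = new AudioContext()
  await ctx.audioWorklet.addModule(new URL('./playback-processor.js', import.meta.url))
  const node = new AudioWorkletNode(ctx, 'playback-processor')
  node.connect(ctx.destination)
  // Transfer the worklet node's MessagePort to the worker so it can stream
  // audio data to the worklet directly, bypassing the main thread.
  worker.postMessage({ type: 'port', port: node.port }, [node.port])
  return { worker, ctx }
}
```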

Requirements

It requires accessing the web UI over HTTPS (the localhost domain is exempt).
To make it work over HTTP on other domains with Chrome, you can go to chrome://flags/#unsafely-treat-insecure-origin-as-secure and add the openHAB URL there.
This is because the browser requires a secure connection to allow access to media devices (mic, webcam, ...).
The options on the About page are hidden if access to media devices is not possible.
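For reference, a minimal sketch of the kind of check that can decide whether to show those options (not necessarily the exact check used in the PR):

```js
// Minimal sketch: microphone access needs a secure context (HTTPS or localhost)
// plus the mediaDevices/getUserMedia API; hide the voice options otherwise.
function canUseVoiceDialog () {
  return window.isSecureContext &&
    navigator.mediaDevices !== undefined &&
    typeof navigator.mediaDevices.getUserMedia === 'function'
}
```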

The default speech-to-text, text-to-speech and interpreter services must be configured in the openHAB server's voice settings.

Current state:

Options added in the About section (they are not shown if the getUserMedia API is not available in the browser):

[Screenshot 2025-11-30 at 14 55 09: voice dialog options in the About section]

If you enable the voice dialog, a button is shown at the top right of the overview page, and its icon indicates the state of the dialog (a sketch of the state precedence follows the list):

[Screenshot 2025-11-30 at 14 54 41] → Mic not initialized: the initial state before the first user interaction with the page (a user event is required to set up microphone access), or disconnected due to a network error or an error on the server.

[Screenshot 2025-11-30 at 14 54 14] → Ready to use; click it to trigger a dialog interaction.

[Screenshot 2025-11-30 at 14 53 50] → Microphone audio is being streamed to openHAB; this takes precedence over the audio-playing icon.

[Screenshot 2025-11-30 at 14 54 03] → Audio is playing.
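As a rough illustration of the precedence between those states (state and icon names below are made up, not the PR's identifiers):

```js
// Illustrative only: derive the button icon from the dialog state.
// Mic streaming takes precedence over audio playback, as described above.
function dialogIcon ({ connected, listening, playing }) {
  if (!connected) return 'mic-off'   // not initialized or disconnected
  if (listening) return 'mic-active' // mic audio being streamed to openHAB
  if (playing) return 'speaker'      // audio playing
  return 'mic'                       // ready, click to trigger a dialog interaction
}
```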

@GiviMAD requested a review from a team as a code owner on January 26, 2024 17:49
@GiviMAD force-pushed the feature/voice_dialog branch 2 times, most recently from 89bd614 to 355c5fc on January 31, 2024 19:31
@relativeci (bot) commented Jan 31, 2024

#3915 Bundle Size — 12.33MiB (+0.2%).

56567a8 (current) vs 967c616 (main #3911, baseline)

Warning

Bundle contains 2 duplicate packages – View duplicate packages

Bundle metrics (5 changes, 1 regression):

| Metric | Current (#3915) | Baseline (#3911) |
| --- | --- | --- |
| Initial JS | 1.52MiB (+0.19%) ⚠ regression | 1.52MiB |
| Initial CSS | 0B | 0B |
| Cache Invalidation | 7.07% | 7.06% |
| Chunks | 620 (+0.16%) | 619 |
| Assets | 705 (+0.57%) | 701 |
| Modules | 2422 (+0.33%) | 2414 |
| Duplicate Modules | 0 | 0 |
| Duplicate Code | 0% | 0% |
| Packages | 126 | 126 |
| Duplicate Packages | 1 | 1 |

Bundle size by type (2 changes, 2 regressions):

| Type | Current (#3915) | Baseline (#3911) |
| --- | --- | --- |
| JS | 10.66MiB (+0.23%) ⚠ regression | 10.64MiB |
| CSS | 845.36KiB (~+0.01%) ⚠ regression | 845.29KiB |
| Fonts | 526.1KiB | 526.1KiB |
| Media | 295.6KiB | 295.6KiB |
| IMG | 45.73KiB | 45.73KiB |
| Other | 847B | 847B |

Bundle analysis report · Branch: GiviMAD:feature/voice_dialog · Project dashboard

Generated by RelativeCI · Documentation · Report issue

@florian-h05 added the enhancement (New feature or request), main ui (Main UI) and awaiting other PR (Depends on another PR) labels Feb 28, 2024
@GiviMAD force-pushed the feature/voice_dialog branch from 355c5fc to 0091917 on January 8, 2025 23:01
@GiviMAD force-pushed the feature/voice_dialog branch 3 times, most recently from ce0a9e6 to 25d363c on January 10, 2025 00:08
@florian-h05 removed the awaiting other PR (Depends on another PR) label Nov 29, 2025
@GiviMAD force-pushed the feature/voice_dialog branch from da88089 to 23126c5 on November 30, 2025 12:16
@GiviMAD force-pushed the feature/voice_dialog branch from 550622d to ffb3b9d on November 30, 2025 13:39
@GiviMAD (Member, Author) commented Nov 30, 2025

Hello @florian-h05.

After the core PR was merged, I've updated the branch to the recent web UI changes.
Everything seems to work correctly, and I also managed to fix some small audio playback glitches that I had before and couldn't track down.

I have added some descriptions to the main comment about the current state of the PR. Do you think the PR is OK for a first version? Let me know what you think when you have a moment.

In case you want to test it:

  • I downloaded the server snapshot and installed the Whisper and Piper add-ons.
  • For Whisper, I downloaded the ggml-large-v3-turbo-q5_0.bin model from https://huggingface.co/ggerganov/whisper.cpp/tree/main and placed it in the whisper folder under userdata; then you need to select that model in the Whisper configuration in the UI.
  • For Piper, I downloaded a voice from https://huggingface.co/rhasspy/piper-voices/tree/main and placed it in the piper folder under userdata (both file names need to start with the same prefix, in my case es_ES-sharvard-medium.onnx and es_ES-sharvard-medium.onnx.json).
  • Then, in the voice settings, I set them as the default speech-to-text and text-to-speech services and changed the default human language interpreter to the built-in one, so it answers your commands.
  • Then launch the web UI in dev mode, enable the dialog option, and it should work.

@digitaldan (Contributor) commented:

@GiviMAD excited to try this out as well, I'll give it a go today.

@GiviMAD (Member, Author) commented Nov 30, 2025

@GiviMAD excited to try this out as well, I'll give it a go today.

Nice to hear that; it will be good to have it tested on different devices.

For testing, once the connection is established, the openHAB CLI can be used to record an audio clip and play it back:

openhab> openhab:audio sinks
* PCM Audio WebSocket (ui-77-80) (pcm::ui-77-80::sink)
  System Speaker (enhancedjavasound)
  Web Audio (webaudio)
openhab> openhab:audio sources
* PCM Audio WebSocket (ui-77-80) (pcm::ui-77-80::source)
  System Microphone (javasound)
openhab> openhab:audio record pcm::ui-77-80::source 5 test_audio.wav
Recording completed
openhab> openhab:audio play pcm::ui-77-80::sink test_audio.wav

@GiviMAD (Member, Author) commented Nov 30, 2025

Also, to use it over HTTP on domains other than localhost, you need to go to chrome://flags/#unsafely-treat-insecure-origin-as-secure and add the openHAB server URL there. Otherwise there is no access to the media devices and the options are hidden. I forgot to mention this in the first comment.

@digitaldan (Contributor) commented Dec 1, 2025

Ok, I have upgraded my build of openHAB to the latest nightly Docker image, which I think has the right parts:

openhab> list -s|grep websock
196 │ Active │  80 │ 5.1.0.202511300254    │ org.openhab.core.io.websocket
197 │ Active │  80 │ 5.1.0.202511300258    │ org.openhab.core.io.websocket.audio

I have Whisper and Piper configured. I also have this turned on in the Main UI (which I updated with this PR):

[screenshot]

I'm connecting to OH using SSL and a real Let's Encrypt cert, but the mic is disabled:

[screenshot]

And I don't think I see anything in the logs. Not sure if I'm missing a step? I'll try debugging a bit myself tonight and tomorrow when I have a free moment, but maybe there's something obvious you can suggest, @GiviMAD?

@GiviMAD (Member, Author) commented Dec 1, 2025

Yes, I tested it last night over HTTPS and it seems to be a problem with the content security policy, I think because of the way the worklet code is packaged. I'll try to solve it this afternoon.
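For context, a common source of this kind of CSP failure is loading the worklet module from a Blob or data URL, which a strict script-src blocks, while loading it as a regular same-origin bundled asset usually passes. A minimal sketch of the difference (illustrative only; not necessarily the fix applied in this PR, and the file name is made up):

```js
// Illustrative only; not necessarily the fix used in this PR.
async function addPlaybackWorklet (ctx) {
  // Often blocked under a strict Content-Security-Policy:
  //   const blob = new Blob([processorSource], { type: 'application/javascript' })
  //   await ctx.audioWorklet.addModule(URL.createObjectURL(blob))
  // Loading a same-origin bundled asset instead usually passes
  // ('playback-processor.js' is an illustrative name):
  await ctx.audioWorklet.addModule(new URL('./playback-processor.js', import.meta.url))
}
```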

@GiviMAD (Member, Author) commented Dec 1, 2025

The problem with the content security policy seems to be solved by the last commit. @digitaldan, let me know if it works for you now.

@digitaldan (Contributor) commented:

Excellent, I just deployed and did a real quick test and now audio is working 👍 I'll spend some time with it this afternoon. Thanks!

@digitaldan (Contributor) commented:

@GiviMAD excellent work, it's been working great all morning, can't wait to play with this more.

I think it's about time to start looking at building a "real" AI/LLM human language interpreter for our back end that can act as an Alexa/Siri/Google replacement. We have a ChatGPT binding, but its HLI functionality is very limited.

I was planning on writing this a year or more ago, but got wrapped up in Matter and decided that was a better use of my time for openHAB. The other, maybe larger, deterrent was that all the end-to-end voice plumbing was very overwhelming to tackle at the time... which is exactly what you have done, and that was really the hardest part. Bravo!

@GiviMAD (Member, Author) commented Dec 1, 2025

@GiviMAD excellent work, it's been working great all morning, can't wait to play with this more.

Thank you for giving it a try; nice to know it's working correctly.

One question: do you think it would be better to only load the worker and connect to the WebSocket when clicking the disabled mic icon, instead of doing it on the first user event anywhere on the page as is done right now? I did it this way, but now I think it is wasteful to set all of that up if you are not going to use it. Maybe I should change the behavior, or make it switchable from the options.

I think it's about time to start looking at building a "real" AI/LLM human language interpreter for our back end that can act as an Alexa/Siri/Google replacement. We have a ChatGPT binding, but its HLI functionality is very limited.

There is an issue created for adding a chat to the web UI.
I was thinking about it: we could add a class in the server called Conversation that holds a history of messages, and pass this conversation as context to the interpreter, as is done right now with the location item. That way the AI interpreters (such as the one in the ChatGPT binding) can feed that history to the LLM or LLM API in order to have a conversation with it. I think this conversation could also be used as the "backend" of the chat in the UI, by adding an API to retrieve the conversation history and a new parameter in the interpret API to send the conversation id along with the new message, avoiding sending the entire conversation from the UI. This way the chat and the voice dialog can also be part of the same conversation.

I hope I have explained it correctly. I think it could be a good starting point.

bravo !

Thank you!

@digitaldan (Contributor) commented:

do you think it would be better to only load the worker and connect to the WebSocket when clicking the disabled mic icon, instead of doing it on the first user event anywhere on the page as is done right now?

I think at a minimum, if the first thing the user does is click on the mic, it should connect and then actually activate the mic. Right now I think you have to click twice if it's the first thing interacted with (once to activate, then again to start the stream)?

I hope I have explained it correctly. I think it could be a good starting point.

So I think you have definitely touched on part of this, and to be fair I have not looked at the interpreter framework to really understand what's built in right now, but I was thinking this LLM HLI would support:

  1. Generic LLM interaction (like the ChatGPT binding, so "What's the capital of France?", "Top 10 John Hughes movies?")
  2. Tool calling (or the equivalent functionality with whatever framework we use)
    1. Item control (with semantic support, synonyms, room location awareness, etc.)
    2. Item reporting ("What's the temperature in the Kitchen?")
    3. Rule control, execution and even the ability to create rules and maybe one-time rules (like "turn the lights off in 10 mins")
  3. Memory, history and awareness of previous interactions
  4. LLM agnostic: can use OpenAI, Anthropic, Gemini... or even better, local models through Ollama and other services that support the common API many local inference servers use.
  5. Probably other things I am not remembering right now...

I need to spend some time to really think about it, but again, you solved the hardest part in my mind. While chat is neat and a great way to test (and I could see hooking it up to SMS or Slack for chatting with your home while away), voice is really the killer application for this.

@GiviMAD (Member, Author) commented Dec 2, 2025

So I think you have definitely touched on part of this, and to be fair I have not looked at the interpreter framework to really understand what's built in right now, but I was thinking this LLM HLI would support:

  1. Generic LLM interaction (like the ChatGPT binding, so "What's the capital of France?", "Top 10 John Hughes movies?")
  2. Tool calling (or the equivalent functionality with whatever framework we use)
    1. Item control (with semantic support, synonyms, room location awareness, etc.)
    2. Item reporting ("What's the temperature in the Kitchen?")
    3. Rule control, execution and even the ability to create rules and maybe one-time rules (like "turn the lights off in 10 mins")
  3. Memory, history and awareness of previous interactions
  4. LLM agnostic: can use OpenAI, Anthropic, Gemini... or even better, local models through Ollama and other services that support the common API many local inference servers use.
  5. Probably other things I am not remembering right now...

I have a similar vision, but I think not exactly the same. What I think is that most of those capabilities should be encapsulated in the core (history, tools, memory, ...) and the HLI interface should remain a black box that allows connecting to whatever implementation you want (OpenAI, Ollama, Gemini... or the current built-in interpreter). I don't have much experience with this, but I was looking around and that seems to be a good way to go, and I think it's not hard to go from the current state to that.

I think at a minimum, if the first thing the user does is click on the mic, it should connect and then actually activate the mic. Right now I think you have to click twice if it's the first thing interacted with (once to activate, then again to start the stream)?

Right now, once you enable it, a click anywhere on the page loads the Web Worker and connects to the WebSocket (a small sketch of this pattern follows below).
About triggering it on the first click: I see it can be useful depending on how you use it, but sometimes I want to launch it without using it at that moment; I just want the sink and source to be registered on the server. And if a keyword spotter is added (which I hope to do in the future), it makes sense to have the setup and the trigger in different stages.

I think I will add those two things as options and move the PR to ready. Thank you for the feedback.
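For illustration, the "set up on the first user event" behavior described above can be reduced to a one-shot listener; a minimal sketch, where setupVoiceDialog is a made-up name standing in for loading the worker and opening the WebSocket:

```js
// Minimal sketch, not the PR's actual code: browsers only allow AudioContext /
// getUserMedia after a user gesture, so defer setup to the first click anywhere
// on the page and run it once.
function armVoiceDialogSetup (setupVoiceDialog) {
  document.addEventListener('click', () => setupVoiceDialog(), { once: true })
}
```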

@florian-h05 (Contributor) commented:

Can't wait to try this out, but unfortunately I am quite busy at the moment ...

Just wanted to share a few issues/PRs with you wrt HLI:

  • Add a intelligent smart home chat bot to the UI #2995 (https://github.com/openhab/openhab-webui/issues/2995)
  • Add HLI bundle with EnhancedHLIInterpreter and shared Card/Component model openhab-core#5016 (https://github.com/openhab/openhab-core/pull/5016)
  • [ChatGPT] enhanced HLI service openhab-addons#19267 (https://github.com/openhab/openhab-addons/pull/19267)

@GiviMAD (Member, Author) commented Dec 2, 2025

Can't wait to try this out, but unfortunately I am quite busy at the moment ...

Just wanted to share a few issues/PRs with you wrt HLI:

  • Add a intelligent smart home chat bot to the UI #2995 (https://github.com/openhab/openhab-webui/issues/2995)
  • Add HLI bundle with EnhancedHLIInterpreter and shared Card/Component model openhab-core#5016 (https://github.com/openhab/openhab-core/pull/5016)
  • [ChatGPT] enhanced HLI service openhab-addons#19267 (https://github.com/openhab/openhab-addons/pull/19267)

Thank you for the links, @florian-h05.

I have taken a quick look, and I don't think changing the interpreter's response is the best choice.

I think that if you want to allow the LLM to display cards in the UI chat, you should expose that as a tool to the model instead of making it part of the model's response. For example, if you want to implement the capability of generating images from the chat, you would make a new tool available to the model, and when it calls that tool the image would be sent to the chat; it doesn't need to be integrated into the interpreter response. Also, changing the interface would make those responses unusable by audio.

@GiviMAD force-pushed the feature/voice_dialog branch from 8fdee10 to 56567a8 on December 3, 2025 00:28
@florian-h05 (Contributor) commented:

I have taken a quick look, and I don't think changing the interpreter's response is the best choice.

My proposal would be to extend the response with tool calls, so a model can return both text and tool calls. That’s what you suggest as well.
I have to admit I haven’t looked at the PRs I linked yet (lack of time).

@GiviMAD (Member, Author) commented Dec 3, 2025

I have taken a quick look, and I don't think changing the interpreter's response is the best choice.

My proposal would be to extend the response with tool calls, so a model can return both text and tool calls. That’s what you suggest as well. I have to admit I haven’t looked at the PRs I linked yet (lack of time).

I think that adding a "Conversation" the interpreter can interact with in real time, without waiting for the interpretation to end, makes more sense. There you can have a conversation between different roles: the basic ones like the user and openHAB, but also thinking and tool calling. The chat can then receive changes to the conversation in real time, make the user aware of the tool usage and thinking, and also implement the nice text streaming that is common in AI chats.

I think it solves most of the problems without requiring any breaking change. I'll try to write a draft PR to see if it makes sense or whether I'm missing something.

@digitaldan (Contributor) commented:

I have a similar vision, but I think not exactly the same. What I think is that most of those capabilities should be encapsulated in the core (history, tools, memory, ...) and the HLI interface should remain a black box that allows connecting to whatever implementation you want (OpenAI, Ollama, Gemini... or the current built-in interpreter). I don't have much experience with this, but I was looking around and that seems to be a good way to go, and I think it's not hard to go from the current state to that.

Yes, that sounds like a solid plan; I like the idea of breaking out the functionality into separate features or even bundles.

Just wanted to share a few issues/PRs with you wrt HLI:

Well, it's always strange when you start reading a thread and stumble upon your own posts which you totally don't remember writing ;-) ... so thanks for the reminder. Like Groundhog Day.

I think it solves most of the problems without requiring any breaking change. I'll try to write a draft PR to see if it makes sense or whether I'm missing something.

That would be great. I would love to see a high-level design of how this would all fit together end-to-end; I think that would really speed things up. Right now it's a bit ambiguous to me (hoping to dive into the PRs mentioned as well as the HLI code this weekend, as I am totally not up to speed).

@GiviMAD changed the title from [WIP] [UI] Add voice dialog to [UI] Add voice dialog on Dec 3, 2025