
Conversation

@GiviMAD (Member) commented Jan 26, 2024

This is a WIP PR for issue #2275.

The PR works with the latest snapshot (Nov 30, 2025).

How it works

Under bundles/org.openhab.ui/web/src/js/voice I've added a library that uses the Web Worker and Web Audio APIs to connect to openHAB over WebSocket and stream audio.

These are the main classes inside:

  • AudioWorker: Web Worker implementation that connects to the openHAB server's /ws/audio-pcm endpoint.
  • AudioMain: Main orchestrator that handles the interaction between the AudioWorker and the Web Audio API (e.g. when a sound needs to be played, it registers an AudioWorklet and transfers its MessagePort instance to the AudioWorker so it can stream the data without impacting the browser's main thread; see the sketch after this list).
  • AudioSink: Used by AudioMain to set up audio playback and register the audio worklet implementation that handles audio coming from the AudioWorker thread.
  • AudioSource: Used by AudioMain to set up access to the microphone and register the audio worklet implementation that sends audio to the AudioWorker thread.
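To illustrate the wiring described above, here is a minimal sketch of the pattern: the main thread registers an AudioWorklet and transfers its MessagePort to a Web Worker that owns the WebSocket, so PCM chunks never pass through the main thread. This is not the PR's actual code; only the /ws/audio-pcm path comes from the PR, and names like audio-worker.js, playback-processor and the host are made up for the example.

```js
// Minimal sketch of the worker/worklet wiring (illustrative names, not the PR's classes).

// --- audio-worker.js (runs inside the Web Worker) ---
// let workletPort
// const ws = new WebSocket('wss://openhab.local:8443/ws/audio-pcm') // host is illustrative
// ws.binaryType = 'arraybuffer'
// ws.onmessage = (e) => workletPort?.postMessage(e.data, [e.data]) // forward PCM chunks
// onmessage = (e) => { if (e.data.type === 'port') workletPort = e.data.port }

// --- main thread ---
async function setupPlayback () {
  const worker = new Worker(new URL('./audio-worker.js', import.meta.url))
  const ctx = new AudioContext()
  await ctx.audioWorklet.addModule(new URL('./playback-processor.js', import.meta.url))
  const node = new AudioWorkletNode(ctx, 'playback-processor')
  node.connect(ctx.destination)
  // Transfer the worklet node's MessagePort to the worker so it can stream
  // audio data to the worklet directly, bypassing the main thread.
  worker.postMessage({ type: 'port', port: node.port }, [node.port])
  return { worker, ctx }
}
```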

Requirements

It requires accessing the web UI over HTTPS (the localhost domain is exempt).
To make it work over HTTP on other domains with Chrome, you can go to chrome://flags/#unsafely-treat-insecure-origin-as-secure and add the openHAB URL there.
This is because the browser requires a secure connection to allow access to media devices (mic, webcam, ...).
The options on the About page are hidden if access to media devices is not possible.
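For reference, a minimal sketch of the kind of check that can decide whether to show those options (not necessarily the exact check used in the PR):

```js
// Minimal sketch: microphone access needs a secure context (HTTPS or localhost)
// plus the mediaDevices/getUserMedia API; hide the voice options otherwise.
function canUseVoiceDialog () {
  return window.isSecureContext &&
    navigator.mediaDevices !== undefined &&
    typeof navigator.mediaDevices.getUserMedia === 'function'
}
```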

The default speech-to-text, text-to-speech and interpreter services must be configured in the openHAB server's voice settings.

Current state:

Options added in the About section (they are not shown if the getUserMedia API is not available in the browser):

[Screenshot 2025-11-30 at 14 55 09: voice dialog options in the About section]

If you enable the voice dialog, a button is shown at the top right of the overview page, and its icon indicates the state of the dialog (a sketch of the state precedence follows the list):

[Screenshot 2025-11-30 at 14 54 41] → Mic not initialized: the initial state before the first user interaction with the page (a user event is required to set up microphone access), or disconnected due to a network error or an error on the server.

[Screenshot 2025-11-30 at 14 54 14] → Ready to use; click it to trigger a dialog interaction.

[Screenshot 2025-11-30 at 14 53 50] → Microphone audio is being streamed to openHAB; this takes precedence over the audio-playing icon.

[Screenshot 2025-11-30 at 14 54 03] → Audio is playing.
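As a rough illustration of the precedence between those states (state and icon names below are made up, not the PR's identifiers):

```js
// Illustrative only: derive the button icon from the dialog state.
// Mic streaming takes precedence over audio playback, as described above.
function dialogIcon ({ connected, listening, playing }) {
  if (!connected) return 'mic-off'   // not initialized or disconnected
  if (listening) return 'mic-active' // mic audio being streamed to openHAB
  if (playing) return 'speaker'      // audio playing
  return 'mic'                       // ready, click to trigger a dialog interaction
}
```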

@GiviMAD requested a review from a team as a code owner on January 26, 2024 17:49
@GiviMAD force-pushed the feature/voice_dialog branch 2 times, most recently from 89bd614 to 355c5fc on January 31, 2024 19:31
@relativeci (bot) commented Jan 31, 2024

#3915 Bundle Size — 12.33MiB (+0.2%).

56567a8 (current) vs 967c616 (main #3911, baseline)

Warning

Bundle contains 2 duplicate packages – View duplicate packages

Bundle metrics (5 changes, 1 regression):

| Metric | Current (#3915) | Baseline (#3911) |
| --- | --- | --- |
| Initial JS | 1.52MiB (+0.19%) ⚠ regression | 1.52MiB |
| Initial CSS | 0B | 0B |
| Cache Invalidation | 7.07% | 7.06% |
| Chunks | 620 (+0.16%) | 619 |
| Assets | 705 (+0.57%) | 701 |
| Modules | 2422 (+0.33%) | 2414 |
| Duplicate Modules | 0 | 0 |
| Duplicate Code | 0% | 0% |
| Packages | 126 | 126 |
| Duplicate Packages | 1 | 1 |

Bundle size by type (2 changes, 2 regressions):

| Type | Current (#3915) | Baseline (#3911) |
| --- | --- | --- |
| JS | 10.66MiB (+0.23%) ⚠ regression | 10.64MiB |
| CSS | 845.36KiB (~+0.01%) ⚠ regression | 845.29KiB |
| Fonts | 526.1KiB | 526.1KiB |
| Media | 295.6KiB | 295.6KiB |
| IMG | 45.73KiB | 45.73KiB |
| Other | 847B | 847B |

Bundle analysis report · Branch: GiviMAD:feature/voice_dialog · Project dashboard

Generated by RelativeCI · Documentation · Report issue

@florian-h05 added the enhancement (New feature or request), main ui (Main UI) and awaiting other PR (Depends on another PR) labels Feb 28, 2024
@GiviMAD force-pushed the feature/voice_dialog branch from 355c5fc to 0091917 on January 8, 2025 23:01
@GiviMAD force-pushed the feature/voice_dialog branch 3 times, most recently from ce0a9e6 to 25d363c on January 10, 2025 00:08
@florian-h05 removed the awaiting other PR (Depends on another PR) label Nov 29, 2025
@GiviMAD force-pushed the feature/voice_dialog branch from da88089 to 23126c5 on November 30, 2025 12:16
@GiviMAD force-pushed the feature/voice_dialog branch from 550622d to ffb3b9d on November 30, 2025 13:39
@GiviMAD (Member, Author) commented Nov 30, 2025

Hello @florian-h05.

After the core PR was merged, I've updated the branch to the recent web UI changes.
Everything seems to work correctly, and I also managed to fix some small audio playback glitches that I had before and couldn't track down.

I have added some descriptions to the main comment about the current state of the PR. Do you think the PR is OK for a first version? Let me know what you think when you have a moment.

In case you want to test it:

  • I downloaded the server snapshot and installed the Whisper and Piper add-ons.
  • For Whisper, I downloaded the ggml-large-v3-turbo-q5_0.bin model from https://huggingface.co/ggerganov/whisper.cpp/tree/main and placed it in the whisper folder under userdata; then you need to select that model in the Whisper configuration in the UI.
  • For Piper, I downloaded a voice from https://huggingface.co/rhasspy/piper-voices/tree/main and placed it in the piper folder under userdata (both file names need to start with the same prefix, in my case es_ES-sharvard-medium.onnx and es_ES-sharvard-medium.onnx.json).
  • Then, in the voice settings, I set them as the default speech-to-text and text-to-speech services and changed the default human language interpreter to the built-in one, so it answers your commands.
  • Then launch the web UI in dev mode, enable the dialog option, and it should work.

@digitaldan (Contributor) commented:

@GiviMAD excited to try this out as well, I'll give it a go today.

@GiviMAD (Member, Author) commented Nov 30, 2025

@GiviMAD excited to try this out as well, I'll give it a go today.

Nice to hear that; it will be good to have it tested on different devices.

For testing, once the connection is established, the openHAB CLI can be used to record an audio clip and play it back:

openhab> openhab:audio sinks
* PCM Audio WebSocket (ui-77-80) (pcm::ui-77-80::sink)
  System Speaker (enhancedjavasound)
  Web Audio (webaudio)
openhab> openhab:audio sources
* PCM Audio WebSocket (ui-77-80) (pcm::ui-77-80::source)
  System Microphone (javasound)
openhab> openhab:audio record pcm::ui-77-80::source 5 test_audio.wav
Recording completed
openhab> openhab:audio play pcm::ui-77-80::sink test_audio.wav

@GiviMAD (Member, Author) commented Nov 30, 2025

Also, to use it over HTTP on domains other than localhost, you need to go to chrome://flags/#unsafely-treat-insecure-origin-as-secure and add the openHAB server URL there. Otherwise there is no access to the media devices and the options are hidden. I forgot to mention this in the first comment.

@digitaldan (Contributor) commented Dec 1, 2025

Ok, I have upgraded my build of openHAB to the latest nightly Docker image, which I think has the right parts:

openhab> list -s|grep websock
196 │ Active │  80 │ 5.1.0.202511300254    │ org.openhab.core.io.websocket
197 │ Active │  80 │ 5.1.0.202511300258    │ org.openhab.core.io.websocket.audio

I have Whisper and Piper configured. I also have this turned on in the Main UI (which I updated with this PR):

[screenshot]

I'm connecting to OH using SSL and a real Let's Encrypt cert, but the mic is disabled:

[screenshot]

And I don't think I see anything in the logs. Not sure if I'm missing a step? I'll try debugging a bit myself tonight and tomorrow when I have a free moment, but maybe there's something obvious you can suggest, @GiviMAD?

@GiviMAD (Member, Author) commented Dec 1, 2025

Yes, I tested it last night over HTTPS and it seems to be a problem with the content security policy, I think because of the way the worklet code is packaged. I'll try to solve it this afternoon.
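For context, a common source of this kind of CSP failure is loading the worklet module from a Blob or data URL, which a strict script-src blocks, while loading it as a regular same-origin bundled asset usually passes. A minimal sketch of the difference (illustrative only; not necessarily the fix applied in this PR, and the file name is made up):

```js
// Illustrative only; not necessarily the fix used in this PR.
async function addPlaybackWorklet (ctx) {
  // Often blocked under a strict Content-Security-Policy:
  //   const blob = new Blob([processorSource], { type: 'application/javascript' })
  //   await ctx.audioWorklet.addModule(URL.createObjectURL(blob))
  // Loading a same-origin bundled asset instead usually passes
  // ('playback-processor.js' is an illustrative name):
  await ctx.audioWorklet.addModule(new URL('./playback-processor.js', import.meta.url))
}
```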

@GiviMAD (Member, Author) commented Dec 1, 2025

The problem with the content security policy seems to be solved by the last commit. @digitaldan, let me know if it works for you now.

@digitaldan (Contributor) commented:

Excellent, I just deployed and did a real quick test and now audio is working 👍 I'll spend some time with it this afternoon. Thanks!

@digitaldan (Contributor) commented:

@GiviMAD excellent work, it's been working great all morning, can't wait to play with this more.

I think it's about time to start looking at building a "real" AI/LLM human language interpreter for our back end that can act as an Alexa/Siri/Google replacement. We have a ChatGPT binding, but its HLI functionality is very limited.

I was planning on writing this a year or more ago, but got wrapped up in Matter and decided that was a better use of my time for openHAB. The other, maybe larger, deterrent was that all the end-to-end voice plumbing was very overwhelming to tackle at the time... which is exactly what you have done, and that was really the hardest part. Bravo!

@GiviMAD (Member, Author) commented Dec 1, 2025

@GiviMAD excellent work, it's been working great all morning, can't wait to play with this more.

Thank you for giving it a try; nice to know it's working correctly.

One question: do you think it would be better to only load the worker and connect to the WebSocket when clicking the disabled mic icon, instead of doing it on the first user event anywhere on the page as is done right now? I did it this way, but now I think it is wasteful to set all of that up if you are not going to use it. Maybe I should change the behavior, or make it switchable from the options.

I think it's about time to start looking at building a "real" AI/LLM human language interpreter for our back end that can act as an Alexa/Siri/Google replacement. We have a ChatGPT binding, but its HLI functionality is very limited.

There is an issue created for adding a chat to the web UI.
I was thinking about it: we could add a class in the server called Conversation that holds a history of messages, and pass this conversation as context to the interpreter, as is done right now with the location item. That way the AI interpreters (such as the one in the ChatGPT binding) can feed that history to the LLM or LLM API in order to have a conversation with it. I think this conversation could also be used as the "backend" of the chat in the UI, by adding an API to retrieve the conversation history and a new parameter in the interpret API to send the conversation id along with the new message, avoiding sending the entire conversation from the UI. This way the chat and the voice dialog can also be part of the same conversation.

I hope I have explained it correctly. I think it could be a good starting point.

bravo !

Thank you!

@digitaldan (Contributor) commented:

do you think it would be better to only load the worker and connect to the WebSocket when clicking the disabled mic icon, instead of doing it on the first user event anywhere on the page as is done right now?

I think at a minimum, if the first thing the user does is click on the mic, it should connect and then actually activate the mic. Right now I think you have to click twice if it's the first thing interacted with (once to activate, then again to start the stream)?

I hope I have explained it correctly. I think it could be a good starting point.

So I think you have definitely touched on part of this, and to be fair I have not looked at the interpreter framework to really understand what's built in right now, but I was thinking this LLM HLI would support:

  1. Generic LLM interaction (like the ChatGPT binding, so "What's the capital of France?", "Top 10 John Hughes movies?")
  2. Tool calling (or the equivalent functionality with whatever framework we use)
    1. Item control (with semantic support, synonyms, room location awareness, etc.)
    2. Item reporting ("What's the temperature in the Kitchen?")
    3. Rule control, execution and even the ability to create rules and maybe one-time rules (like "turn the lights off in 10 mins")
  3. Memory, history and awareness of previous interactions
  4. LLM agnostic: can use OpenAI, Anthropic, Gemini... or even better, local models through Ollama and other services that support the common API many local inference servers use.
  5. Probably other things I am not remembering right now...

I need to spend some time to really think about it, but again, you solved the hardest part in my mind. While chat is neat and a great way to test (and I could see hooking it up to SMS or Slack for chatting with your home while away), voice is really the killer application for this.

@GiviMAD (Member, Author) commented Dec 2, 2025

So I think you have definitely touched on part of this, and to be fair I have not looked at the interpreter framework to really understand what's built in right now, but I was thinking this LLM HLI would support:

  1. Generic LLM interaction (like the ChatGPT binding, so "What's the capital of France?", "Top 10 John Hughes movies?")
  2. Tool calling (or the equivalent functionality with whatever framework we use)
    1. Item control (with semantic support, synonyms, room location awareness, etc.)
    2. Item reporting ("What's the temperature in the Kitchen?")
    3. Rule control, execution and even the ability to create rules and maybe one-time rules (like "turn the lights off in 10 mins")
  3. Memory, history and awareness of previous interactions
  4. LLM agnostic: can use OpenAI, Anthropic, Gemini... or even better, local models through Ollama and other services that support the common API many local inference servers use.
  5. Probably other things I am not remembering right now...

I have a similar vision, but I think not exactly the same. What I think is that most of those capabilities should be encapsulated in the core (history, tools, memory, ...) and the HLI interface should remain a black box that allows connecting to whatever implementation you want (OpenAI, Ollama, Gemini... or the current built-in interpreter). I don't have much experience with this, but I was looking around and that seems to be a good way to go, and I think it's not hard to go from the current state to that.

I think at a minimum, if the first thing the user does is click on the mic, it should connect and then actually activate the mic. Right now I think you have to click twice if it's the first thing interacted with (once to activate, then again to start the stream)?

Right now, once you enable it, a click anywhere on the page loads the Web Worker and connects to the WebSocket (a small sketch of this pattern follows below).
About triggering it on the first click: I see it can be useful depending on how you use it, but sometimes I want to launch it without using it at that moment; I just want the sink and source to be registered on the server. And if a keyword spotter is added (which I hope to do in the future), it makes sense to have the setup and the trigger in different stages.

I think I will add those two things as options and move the PR to ready. Thank you for the feedback.
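For illustration, the "set up on the first user event" behavior described above can be reduced to a one-shot listener; a minimal sketch, where setupVoiceDialog is a made-up name standing in for loading the worker and opening the WebSocket:

```js
// Minimal sketch, not the PR's actual code: browsers only allow AudioContext /
// getUserMedia after a user gesture, so defer setup to the first click anywhere
// on the page and run it once.
function armVoiceDialogSetup (setupVoiceDialog) {
  document.addEventListener('click', () => setupVoiceDialog(), { once: true })
}
```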

@florian-h05 (Contributor) commented:

Can't wait to try this out, but unfortunately I am quite busy at the moment ...

Just wanted to share a few issues/PRs with you wrt HLI:

  • Add a intelligent smart home chat bot to the UI #2995 (https://github.com/openhab/openhab-webui/issues/2995)
  • Add HLI bundle with EnhancedHLIInterpreter and shared Card/Component model openhab-core#5016 (https://github.com/openhab/openhab-core/pull/5016)
  • [ChatGPT] enhanced HLI service openhab-addons#19267 (https://github.com/openhab/openhab-addons/pull/19267)

@GiviMAD (Member, Author) commented Dec 2, 2025

Can't wait to try this out, but unfortunately I am quite busy at the moment ...

Just wanted to share a few issues/PRs with you wrt HLI:

  • Add a intelligent smart home chat bot to the UI #2995 (https://github.com/openhab/openhab-webui/issues/2995)
  • Add HLI bundle with EnhancedHLIInterpreter and shared Card/Component model openhab-core#5016 (https://github.com/openhab/openhab-core/pull/5016)
  • [ChatGPT] enhanced HLI service openhab-addons#19267 (https://github.com/openhab/openhab-addons/pull/19267)

Thank you for the links, @florian-h05.

I have taken a quick look, and I don't think changing the interpreter's response is the best choice.

I think that if you want to allow the LLM to display cards in the UI chat, you should expose that as a tool to the model instead of making it part of the model's response. For example, if you want to implement the capability of generating images from the chat, you would make a new tool available to the model, and when it calls that tool the image would be sent to the chat; it doesn't need to be integrated into the interpreter response. Also, changing the interface would make those responses unusable by audio.

@GiviMAD force-pushed the feature/voice_dialog branch from 8fdee10 to 56567a8 on December 3, 2025 00:28
@florian-h05 (Contributor) commented:

I have taken a quick look, and I don't think changing the interpreter's response is the best choice.

My proposal would be to extend the response with tool calls, so a model can return both text and tool calls. That’s what you suggest as well.
I have to admit I haven’t looked at the PRs I linked yet (lack of time).

@GiviMAD (Member, Author) commented Dec 3, 2025

I have taken a quick look, and I don't think changing the interpreter's response is the best choice.

My proposal would be to extend the response with tool calls, so a model can return both text and tool calls. That’s what you suggest as well. I have to admit I haven’t looked at the PRs I linked yet (lack of time).

I think that adding a "Conversation" the interpreter can interact with in real time, without waiting for the interpretation to end, makes more sense. There you can have a conversation between different roles: the basic ones like the user and openHAB, but also thinking and tool calling. The chat can then receive changes to the conversation in real time, make the user aware of the tool usage and thinking, and also implement the nice text streaming that is common in AI chats.

I think it solves most of the problems without requiring any breaking change. I'll try to write a draft PR to see if it makes sense or whether I'm missing something.

@digitaldan (Contributor) commented:

I have a similar vision, but I think not exactly the same. What I think is that most of those capabilities should be encapsulated in the core (history, tools, memory, ...) and the HLI interface should remain a black box that allows connecting to whatever implementation you want (OpenAI, Ollama, Gemini... or the current built-in interpreter). I don't have much experience with this, but I was looking around and that seems to be a good way to go, and I think it's not hard to go from the current state to that.

Yes, that sounds like a solid plan; I like the idea of breaking out the functionality into separate features or even bundles.

Just wanted to share a few issues/PRs with you wrt HLI:

Well, it's always strange when you start reading a thread and stumble upon your own posts which you totally don't remember writing ;-) ... so thanks for the reminder. Like Groundhog Day.

I think it solves most of the problems without requiring any breaking change. I'll try to write a draft PR to see if it makes sense or whether I'm missing something.

That would be great. I would love to see a high-level design of how this would all fit together end-to-end; I think that would really speed things up. Right now it's a bit ambiguous to me (hoping to dive into the PRs mentioned as well as the HLI code this weekend, as I am totally not up to speed).

@GiviMAD changed the title from [WIP] [UI] Add voice dialog to [UI] Add voice dialog on Dec 3, 2025