hass-closest-intent: Fuzzy intent matcher for HomeAssistant. Garbled STT output in, actual intent out.

1mon 10d ago by awful.systems/u/smiletolerantly in homeassistant from github.com

Basically, STT quality has kept me from switching to HomeAssistant's voice assistant features. The default matcher (Hassil) is waaaaaaay too strict, and LLMs are slow, costly, and/or a privacy nightmare, plus I don't like them.

I really thought there would be something available that just matches your STT output to the configured intents, but apparently not, so I've built it myself.

Finally convinced my GF to throw Alexa in the bin :)

Here's an excerpt from the README, and feel free to AMA:

🌲 Problem statement and solution

Speech-To-Text (STT) output, especially fast and local STT output, is often simply bad. HomeAssistant's own Hassil is incredibly picky: your STT output must exactly match one of the configured intents.

There are two paths forward from this: upgrade your hardware to support better STT, or try to figure out what the speaker probably meant to say from the garbled output.

This project does the latter.

With this custom integration, "Lights on in live in room" will actually turn on the lights in your living room. So will, for that matter, "lighrts on inn livainriomm".

[Short demo video: first with closest-intent, then with bare Hassil.]

📜 Highlights

Pattern expansion. Expanding <expansion_rules>, (alternatives|to), and [optional|alternatives] all work, including on HASS-defined lists like your home's areas and entities!
Slot extraction. Both for wildcard slots (like for adding something to the shopping list, where the {item} is a wildcard), and against slots like {timer_hours:hours} with a fixed set of possibilities.
Fuzzy slot resolution. For list-like slots and expansion rules (including your areas and entities!), fuzzy match the slot values to the available options. Allows "livikroom" to be corrected to "living room".
Actual intent handling still done by Hassil. closest-intent simply corrects your STT output or typos to the closest matching intent, and then forwards a nice, canonical sentence to Hassil, which then deals with the intent just as if you had spoken/typed perfectly.
100% LLM-free. Just uses relatively simple fuzzy matching of the input against your intents, plus some clever-ish (well... working, at least) tricks to improve the results.
Fallback agent support. OK, I said 100% LLM-free, but if you absolutely want to, you can use one as fallback. More on this below.
Is fast :) (as in: basically instant for a couple hundred configured custom intents).
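To make the "fuzzy slot resolution" highlight a bit more concrete, here is a minimal sketch using only Python's standard `difflib`. It is not taken from the project: the `AREAS` list and the `resolve_slot` helper are made up for illustration, and the integration's real scoring may work quite differently.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical area names; in HomeAssistant these would come from your configured areas.
AREAS = ["living room", "kitchen", "bedroom", "office"]


def resolve_slot(heard: str, options: list[str], cutoff: float = 0.7) -> Optional[str]:
    """Return the option most similar to `heard`, or None if nothing is close enough."""
    matches = get_close_matches(heard.lower(), options, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(resolve_slot("livikroom", AREAS))
print(resolve_slot("best room", AREAS))
```

With this particular list, both garbled inputs land on the area the speaker most likely meant, while something unrelated falls below the cutoff and yields `None`.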

Note: closest-intent is completely language-agnostic. All the examples in this README are in English, but you can use it with any language you like; personally, I use it in German.

📋 Examples

Here's some examples of things I said, what my STT (wyoming-faster-whisper-base) understood, what HomeAssistant was able to do/answer after passing the STT output through closest-intent, and what the same STT output would have resulted in with just bare Hassil.

Note: These are actual results I got when speaking the "what was said" sentences into my phone. I'm a native German speaker, and so I do have an accent, but this pretty closely matches my experience when using the German-language version of whisper. The "bare Hassil" responses are what I got after 1:1 pasting the STT output into the voice assist chat window with closest-intent disabled.

| What was said | STT output | With closest-intent | Bare Hassil |
| --- | --- | --- | --- |
| `start cleaning` | `Star cleaning.` | ✅ Cleaning started. | ❌ Sorry, I couldn't understand that |
| `stop cleaning` | `Stop clenching!` | ✅ Cleaning stopped. | ❌ Sorry, I am not aware of any device called clenching |
| `vacuum the living room` | `Vacuum Believing Room` | ✅ Cleaning the living room. | ❌ Sorry, I am unaware of any floor called Believing Room |
| `clean the office` | `King the Office` | ✅ Cleaning the office. | ❌ Sorry, there are multiple devices called Office (author's note: no there aren't, wtf?) |
| `vacuum the kitchen` | `Back here in the kitchen.` | ✅ Cleaning the kitchen. | ❌ Sorry, I couldn't understand that |
| `how warm is it in the bedroom` | `Our all is in the best room.` | ✅ In the bedroom, the temperature is currently.... | ❌ Sorry, I am not aware of any area called best room |
| `add milk to the shopping list` | `Add milk to the chauvinist.` | ✅ "milk" added. | ❌ Sorry, I am not aware of any device called chauvinist |
| `put call dentist on my todo list` | `put call dentist on my tudu list` | ✅ "call dentist" added. | ❌ Sorry, I am not aware of any device called tudu |
| `turn on the water pump` | `turn on the what her pump` | ✅ Turned on the water pump. | ❌ Sorry, I am not aware of any device called what her pump |
| `play some music` | `Place on music` | ✅ Playing music. | ❌ Sorry, I am not aware of any area called music |
| `resume the music` | `Renew Music` | ✅ Resuming. | ❌ Sorry, I couldn't understand that |
| `pause the music` | `Post music` | ✅ Paused. | ❌ Sorry, I couldn't understand that |
| `next track` | `next rack` | ✅ Next track. | ❌ Sorry, I am not aware of any device called rack |
| `enable shuffle` | `an able shuffling` | ✅ Shuffle enabled. | ❌ Sorry, I couldn't understand that |
| `disable shuffle` | `Disable to schaffen.` | ✅ Shuffle disabled. | ❌ Sorry, I am not aware of any device called Disable |
| `restart the player` | `Reset the plan.` | ✅ Restarting the player. | ❌ Sorry, I am not aware of any area called Reset |
| `play a random album` | `Player random album` | ✅ Playing a random album. | ❌ Sorry, I couldn't understand that |
| `play a random artist` | `Player and Immartist.` | ✅ Playing a random artist. | ❌ Sorry, I couldn't understand that |
| `play the latest tracks` | `Plan the ladder tracks.` | ✅ Playing recently added tracks. | ❌ Sorry, I am not aware of any area called Plan |
| `play recently played songs` | `Player recently played so...` | ✅ Playing recently heard tracks. | ❌ Sorry, I couldn't understand that |
| `play playlist NieR` | `Play playlist NEAR!` | ✅ Playing the playlist NieR. | ❌ Sorry, I couldn't understand that |
| `play my daily briefing` | `and play my daily breathing` | ✅ Here is your daily briefing: ... | ❌ Sorry, I am not aware of any area called and play |
| `what time is it` | `What the hell is it?` | ✅ It is 16:36. | ✅ It is 16:36. (author's note: okay, know what? earned. did not expect that.) |
| `what day is it today` | `One day is today.` | ✅ Today is Friday. | ✅/❌ May 8th, 2026 (author's note: that's the output for "What *date* is it?", but, eh, close enough) |
| `make the tv brighter` | `Make that CV brighter.` | ✅ Screen is now bright. | ❌ Sorry, I couldn't understand that |
| `set the screen darker` | `The screen doctor.` | ✅ Screen is now dark. | ❌ Sorry, I am not aware of any device called screen doctor |
| `what's the weather today` | `What's the matter with you?` | ✅ Today, the weather is... | ❌ It is 16:36. (author's note: wait, WHAT?) |
| `how's the weather tomorrow morning` | `How's the better tomorrow?` | ✅ Tomorrow morning, it will be... | ❌ Sorry, I am not aware of any area called How's |
| `what's the weather this week` | `What's the matter this weak` | ✅ Monday:..., Tuesday:..., | ❌ It is 16:36. (author's note: sigh...) |
| `how's the weather at 5 o'clock` | `cast the red there at 5 o'clock` | ✅ At 5 o'clock, it will be... | ❌ Sorry, I am not aware of any area called cast |
| `how windy is it right now` | `how windy is IR low` | ✅ The wind is currently blowing with... | ❌ No timers. |
| `how windy will it be tonight` | `How will you be tonight?` | ✅ Tonight, the wind speed will be around... | ❌ Sorry, I couldn't understand that |
| `how hot will it get today` | `How hard will it get today?` | ✅ Today, temperatures will reach up to... | ❌ Sorry, I couldn't understand that |
| `will it rain today` | `with it right today` | ✅ No rain is expected today. | ❌ Sorry, I couldn't understand that |

...you get the idea.

💡 How it works

closest-intent is registered in HomeAssistant as a conversation agent. On startup, it parses (by default) all user-defined intents (or optionally, also the built-in ones). In this process, it also expands all rules, like <expansion_rule>, (alternatives|to), and [optionals], and notes where {slots} are located and whether they are wildcards or belong to some list (like areas, entities, or the numbers 1-100).
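As a rough illustration of what "expanding" a sentence template means, here is a small sketch that handles only `(alternatives|to)` and `[optionals]`. It is an assumption-laden toy, not the project's parser: `expand` and `_parse` are hypothetical names, and `<expansion_rules>` and `{slots}` are simply left as literal text.

```python
def _parse(text: str, pos: int, closing: str) -> tuple[list[str], int]:
    """Parse 'seq|seq|...' up to `closing`; return (all variants, next position)."""
    variants: list[str] = []
    current = [""]  # every expansion of the alternative parsed so far
    while pos < len(text):
        ch = text[pos]
        if ch in "([":
            inner, pos = _parse(text, pos + 1, ")" if ch == "(" else "]")
            if ch == "[":
                inner = inner + [""]  # an optional group may also be left out
            current = [a + b for a in current for b in inner]
        elif ch == "|":
            variants.extend(current)  # finish this alternative, start the next
            current = [""]
            pos += 1
        elif closing and ch == closing:
            variants.extend(current)
            return variants, pos + 1
        else:
            current = [a + ch for a in current]
            pos += 1
    variants.extend(current)
    return variants, pos


def expand(template: str) -> list[str]:
    """Return every plain sentence the template can produce, whitespace-normalized."""
    variants, _ = _parse(template, 0, "")
    return sorted({" ".join(v.split()) for v in variants})


for sentence in expand("(turn|switch) on [the] lights"):
    print(sentence)
```

Expanding everything up front trades memory for speed: matching later only has to compare the input against a flat list of plain sentences.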

When a user request comes in (via voice command or the chat box), closest-intent fuzzy-matches that request against those expanded rules. If the rule does not contain a slot, it is picked immediately. If it does contain a slot, closest-intent performs a sequence of fancy magic steps to find the best-fitting slot value among a range of possible positions within the top-scoring matched sentences. In practice, this often means "smallest slot-value on a word-boundary", but the extraction is not limited to that.
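The post does not spell out the "fancy magic steps", so the following is only a brute-force stand-in for wildcard slot extraction: try every contiguous run of words as the slot value, fill it into the template, and keep the fill that looks most like what was heard. `extract_wildcard` is a hypothetical helper and `difflib` is an assumed scoring function; the real integration may score and search very differently.

```python
from difflib import SequenceMatcher


def extract_wildcard(template: str, slot: str, heard: str) -> tuple[str, float]:
    """Try each word-boundary span of `heard` as the slot value; return (best value, score)."""
    words = heard.lower().split()
    normalized = " ".join(words)
    best_value, best_score = "", 0.0
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            value = " ".join(words[start:end])
            # Fill the candidate into the template and compare with the input.
            candidate = template.replace("{" + slot + "}", value)
            score = SequenceMatcher(None, normalized, candidate).ratio()
            if score > best_score:
                best_value, best_score = value, score
    return best_value, best_score


value, score = extract_wildcard(
    "add {item} to the shopping list", "item", "add milk to the chauvinist"
)
print(value, round(score, 2))
```

Because longer fills make the candidate sentence longer than the input, short spans on word boundaries tend to win in this sketch, which is at least in the spirit of the "smallest slot-value on a word-boundary" remark.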

With the best match found, we then reconstruct the "canonical form", i.e. a sentence that Hassil will actually understand. If in your configured intents, "Play some music." exists, and closest-intent got "Place on music" and matched that to the intent, it will simply forward "Play some music." to Hassil. If the intent contained a slot, the extracted value will be substituted.
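For slot-less intents, the idea of forwarding a canonical sentence can be sketched in a few lines. Again, this is an illustration under assumptions: `normalize` and `canonicalize` are made-up names, `difflib` stands in for whatever scoring the integration really uses, and the 0.7 default only mirrors the threshold value mentioned for the project, whose score scale may not be comparable.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def canonicalize(heard: str, sentences: list[str], threshold: float = 0.7) -> Optional[str]:
    """Return the configured sentence closest to `heard`, or None if below the threshold."""
    best, best_score = None, 0.0
    for sentence in sentences:
        score = SequenceMatcher(None, normalize(heard), normalize(sentence)).ratio()
        if score > best_score:
            best, best_score = sentence, score
    return best if best_score >= threshold else None


# Hypothetical configured sentences, already expanded.
SENTENCES = ["Start cleaning.", "Stop cleaning.", "Next track."]

print(canonicalize("Star cleaning.", SENTENCES))
print(canonicalize("next rack", SENTENCES))
print(canonicalize("What's for dinner?", SENTENCES))
```

Note that the function returns the configured sentence itself rather than a corrected version of the input, which is what lets the downstream agent treat it as if it had been spoken perfectly.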

This guarantees that the sentence passed to Hassil will actually be understood, and allows us to not have to worry at all about performing actions, running scripts,...

If no matching intent could be found, we pass the exact input we got to the configured fallback agent. By default, that is simply Hassil (which again allows us to be lazy and not worry about proper error responses), but it can also be another agent, like an LLM.
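The overall control flow described above can be summarized as a tiny dispatcher. Everything here is hypothetical (`handle`, `demo_match`, and the two stub agents); it does not use HomeAssistant's conversation agent API and only shows the "canonical sentence to Hassil, raw input to fallback" routing.

```python
from typing import Callable, Optional

Agent = Callable[[str], str]


def handle(
    text: str,
    match: Callable[[str], Optional[str]],
    hassil: Agent,
    fallback: Agent,
) -> str:
    """Forward the canonical sentence if one matched, otherwise the untouched input."""
    canonical = match(text)
    if canonical is not None:
        return hassil(canonical)
    return fallback(text)


# Stand-ins so the sketch runs on its own.
def demo_match(text: str) -> Optional[str]:
    return "Play some music." if "music" in text.lower() else None


def hassil_stub(sentence: str) -> str:
    return f"hassil: {sentence}"


def fallback_stub(sentence: str) -> str:
    return f"fallback: {sentence}"


print(handle("Place on music", demo_match, hassil_stub, fallback_stub))
print(handle("gibberish", demo_match, hassil_stub, fallback_stub))
```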

Well that's pretty cool. What's to stop it from trying to open Pandora at every chance it gets though? (My Google hub aggressively decides that it heard you say music and opens Pandora)

I mean, you still need to activate the assistant with your usual wakeword. This/Hassil isn't really intended to be constantly listening.

Or am I misunderstanding the question? 😅

This project is neat, I'm mostly grumbling at the Google hub's ability to misinterpret. Not so bad as "hey Google, is it snowing?" And getting a reply of "okay, playing let it snow on Pandora" but not terribly far from that.

Ahhh got it 😄 Yeah, I get/got similar stuff with Alexa. Honestly, the STT there is pretty impressive(ly fast), but sometimes it's incredible nonsense.

Our "m" are "f" nope reserve fruit slash.

❌ Sorry, I couldn’t understand that

yes very clever OP you get a cookie

rm -rf --no-preserve-root /

...do you think I'm a bot, or what is this?

Edit: ohh, that's what the original comment was. Sorry. "Lange Leitung" (slow on the uptake) today.

Finally got this through another comment below. No, this should not be able to happen, unless you yourself have created a custom intent + shell script action in home assistant that runs this. The integration itself does not execute actions/scripts or the like, it just finds the closest string in a list of strings, and then hands that to the official conversation agent/Hassil.

Didn't think it would have based on the description, it's just one of those things that comes to mind with potential foot-shooting aids.

Yeah, fair :)

I'll let you know in a few days if it drives my wife crazy or makes her happier.

Please do! And if it does drive her crazy, please do open a bug report 😄

According to her Jarvis has been "behaving" very well.

Awesome!! And thanks for actually letting me know! :D

Edit: anything for the wishlist?

Does it work in other languages? The description of how it works makes it look like it does.

Edit: disregard, just re-read your post better and saw you used it with German. Good stuff, I’ll try it out!

Yes, should be completely language agnostic. I'm not a linguist though, so take with a grain of salt 😅

There's nothing language specific going on though, apart from a slight preference to split slots on word boundaries determined by spaces. So, might work a bit worse in e.g. Japanese.

This is exactly the sort of thing I've been looking forward to being able to implement and tweak. Is it tunable to make however "fuzzy" it gets looser or stricter? E.g. 0 fuzziness would just be passing the exact phrase like it does now, but too fuzzy and it starts matching weirdly (like the other user's "Google hears music too much and immediately opens Pandora" example)

Edit: I'm an idiot and only read your post and not your excellent GitHub readme

Yes, you can! See the "threshold" value/slider. It's at 0.7 by default, which seems to be a good tradeoff. 1 means exact match or failure, 0 will afaik match anything to everything.

Ah, I didn't see the Edit, so we're both in the same boat 😂

Nice work! I've been looking for something like this as well. I've been using Speech-to-Phrase for a while, and it is a bit similar in that regard. It also tries to match the speech to something in its vocabulary... But from my experience, false positives are a bit annoying. I had the TV trigger the satellite's wake word detection several times. And sometimes it's above whatever threshold and the Home will do random things. Wonder how this performs. But I guess if I can tune the parameters, I might as well try.