Talk to Your Datastar Chat: Adding Voice Input with the Web Speech API
Typing every prompt into the chat widget got old fast. I'd built the thing to stream tokens back at me in real time — the whole point was that it felt immediate — and then I was still hunting and pecking to ask it anything. So I added a mic button. No new backend, no transcription service, no line item on a bill. Just the browser's own SpeechRecognition API, feeding into the same signal the text input already uses.
The catch, up front
SpeechRecognition isn't universal, and I'd rather say that in paragraph two than have you find out after wiring it all up. Chrome, Edge, and Safari support it out of the box, generally under the webkit prefix (webkitSpeechRecognition). Firefox implemented it years ago and still ships it disabled by default, behind the media.webspeech.recognition.enable flag in about:config. Practically, that means most Firefox visitors won't have it, full stop.
That's fine — this is an enhancement, not a requirement. Feature-detect it, and if it's missing, the mic button just doesn't render. The text input, which was already there, doesn't care either way.
const SpeechRecognition = window.SpeechRecognition || (window as any).webkitSpeechRecognition;
const speechSupported = typeof SpeechRecognition !== "undefined";
Wiring it into the signal that already exists
The chat widget's text input binds to a Datastar signal — call it $message — with data-bind. Everything downstream (the submit handler, the streaming request) already reads from that signal. So the mic button doesn't need its own pipeline. It just needs to write into $message the same way typing does.
<input type="text" data-bind="message" placeholder="Ask something..." />
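<!-- Toggles recognition in the browser; assumes the recognition instance is reachable as a global. -->
<button
  type="button"
  data-show="$speechSupported"
  data-on:click="$listening = !$listening; $listening ? recognition.start() : recognition.stop()"
  data-attr:aria-pressed="$listening"
>
  🎤
</button>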
That data-show="$speechSupported" is doing the graceful-degradation work from the section above: the signal gets set once on load based on the feature check, and the button stays hidden for anyone without support. (data-show toggles the element's display rather than removing it from the DOM, which is all this needs.)
The recognition instance itself lives in a small script. It starts on click and pushes transcripts back toward the signal by dispatching a custom speech-result event, so that event ends up doing the same job a data-on:input would on the text field:
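const recognition = new SpeechRecognition();
recognition.continuous = true; // keep listening across pauses
recognition.interimResults = true; // emit best-guess text before it is final

recognition.onresult = (event) => {
  let finalTranscript = "";
  let interimTranscript = "";
  // Walk the whole result list, not just the entries from event.resultIndex on,
  // so utterances finalized earlier in a continuous session are kept.
  for (let i = 0; i < event.results.length; i++) {
    const transcript = event.results[i][0].transcript;
    if (event.results[i].isFinal) {
      finalTranscript += transcript;
    } else {
      interimTranscript += transcript;
    }
  }
  // Dispatch on window: a non-bubbling event fired on document never reaches
  // a listener attached to window.
  window.dispatchEvent(
    new CustomEvent("speech-result", {
      detail: { text: finalTranscript + interimTranscript },
    }),
  );
};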
And on the markup side, a listener that patches the signal whenever that event fires:
<div data-on:speech-result__window="$message = evt.detail.text"></div>
Everything after that — the submit button, the streaming fetch to the AI SDK route — is untouched. It doesn't know or care whether $message got there by keyboard or by voice.
Making interim results feel alive
The continuous and interimResults flags are what turn this from "record, then transcribe" into something that feels like it's listening while you talk. continuous = true keeps the recognizer running instead of stopping after one utterance; interimResults = true fires onresult with best-guess text before it's finalized, not just at the end.
The loop above collects both kinds every time the event fires, so writing the resulting text into $message is enough to get words appearing in the input as you speak, then settling once isFinal flips true. If you want to visually distinguish the two states, that's a $listening or $interim signal and a CSS class, not a change to the recognition logic.
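If you do want that visual distinction, one way to get there is to keep the settled text and the in-flight guess apart instead of merging them. This is a minimal sketch, not what the widget ships: splitTranscript and the *Like types are hypothetical names, and the types are structural stand-ins for the browser's result list so the function also runs outside a browser.

```typescript
// Structural stand-ins for the shape of SpeechRecognitionEvent.results.
// These names are assumptions for this sketch, not browser types.
interface AlternativeLike {
  transcript: string;
}
interface ResultLike {
  isFinal: boolean;
  0: AlternativeLike; // the top-ranked alternative
}
interface ResultListLike {
  length: number;
  [index: number]: ResultLike;
}

interface TranscriptParts {
  finalText: string;
  interimText: string;
}

// Hypothetical helper: split settled text from the current best guess.
function splitTranscript(results: ResultListLike): TranscriptParts {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < results.length; i++) {
    const transcript = results[i][0].transcript;
    if (results[i].isFinal) {
      finalText += transcript;
    } else {
      interimText += transcript;
    }
  }
  return { finalText, interimText };
}
```

With the two parts separate, the event detail could carry both: the input shows finalText + interimText, and a non-empty interimText is the cue for an $interim signal and its CSS class.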
The edges that'll actually bite you
A few things I hit building this that aren't obvious from the API docs:
- Permissions. The first click prompts for microphone access, same as any getUserMedia call. If the user denies it, onerror fires with not-allowed. Handle it by falling back silently to typing rather than showing a dead mic icon.
- No-speech timeouts. Recognition sessions can end on their own after a period of silence, even with continuous: true. Listen for onend and decide whether to restart automatically or just flip $listening back to false.
- Stop it when you submit. If the user hits enter mid-sentence, call recognition.stop() so it's not still capturing audio into a message that's already been sent.
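Those three edges fit in one small wiring function. Treat this as a sketch under assumptions: wireMicLifecycle, RecognizerLike, and MicHooks are names I made up for illustration, and the hooks stand in for however your page writes $listening and hides the button. The interface covers only the members used here, so in the browser the real recognition instance may need a cast or a thin adapter.

```typescript
// Stand-in for the slice of SpeechRecognition this sketch touches.
interface RecognizerLike {
  onerror: ((event: { error: string }) => void) | null;
  onend: (() => void) | null;
  stop(): void;
}

// Hypothetical callbacks supplied by the page.
interface MicHooks {
  setListening(value: boolean): void; // e.g. writes the $listening signal
  hideMic(): void; // e.g. flips $speechSupported to false
}

// Attaches the handlers and returns a function to call from the submit handler.
function wireMicLifecycle(
  recognition: RecognizerLike,
  hooks: MicHooks,
): () => void {
  recognition.onerror = (event) => {
    // "not-allowed" is the error code for a denied microphone permission.
    if (event.error === "not-allowed") {
      hooks.hideMic();
    }
  };
  // onend fires whenever the session ends, including silence timeouts,
  // so this keeps the toggle state honest without restarting.
  recognition.onend = () => {
    hooks.setListening(false);
  };
  return () => recognition.stop();
}
```

This version chooses "flip the signal" over "restart automatically" in onend. Restarting is a one-line change, but it needs a guard so a deliberate stop doesn't immediately start a new session.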
I went with toggle instead of push-to-talk — click once to start listening, click again to stop. Typing doesn't require holding anything down either, so a toggle felt like the closer match to how the rest of the widget already behaves.
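The toggle itself is a couple of lines of state. Here is a sketch assuming a hypothetical createMicToggle helper; ToggleTarget is a stand-in for the start and stop methods of the recognition instance, which keeps the example runnable outside a browser.

```typescript
// Stand-in for the two SpeechRecognition methods the toggle needs.
interface ToggleTarget {
  start(): void;
  stop(): void;
}

// Hypothetical helper: click once to start listening, click again to stop.
function createMicToggle(recognition: ToggleTarget) {
  let listening = false;
  return {
    // Click handler. Returns the new listening state.
    toggle(): boolean {
      if (listening) {
        recognition.stop();
      } else {
        recognition.start();
      }
      listening = !listening;
      return listening;
    },
    // Call from onend so a session that ended on its own resets the toggle.
    markStopped(): void {
      listening = false;
    },
  };
}
```

Tracking the state here also avoids calling start() on a session that is already running, which the real API rejects with an error.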
Why bother
None of this required a new dependency, a new API key, or a line on an invoice. It's the same trade I keep making with this site: reach for what the browser already gives you before reaching for a service that bills per request. The chat widget streams tokens back because that felt like the honest way to show an LLM thinking. Now it can listen the same way: no new infrastructure, just the same signal being written to from a different direction.
It's turned into something I actually reach for when I'm thinking out loud about a prompt rather than composing it carefully — for quick, exploratory questions it's faster than typing. For anything that needs precise wording, I still default back to the keyboard.