Dictating to AI on Linux, No Internet Required

I've been dictating to AI for a long time. I've had my own transcription script running on Linux for years, built by reusing some of the code I wrote back in the day for audiotranscripciones.com, an audio-transcription SaaS I ended up shutting down. That project didn't work out, but it left me with a codebase worth salvaging. The script does the obvious thing: you press a keyboard shortcut, you speak, you press it again, and the text appears wherever your cursor is. The repo is still public at linux-voice-to-text-ai, in case anyone wants to poke around.
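The whole flow is just a toggle wrapped around a recorder and a transcriber. A minimal Python sketch of the idea, where the class and the `recorder`/`transcriber`/`paste` names are illustrative stand-ins rather than the actual script's API:

```python
class DictationToggle:
    """First hotkey press starts recording; second press stops,
    transcribes, and pastes the text at the cursor.

    `recorder`, `transcriber` and `paste` are placeholders for whatever
    audio capture, speech-to-text backend and typing tool you wire in.
    """

    def __init__(self, recorder, transcriber, paste):
        self.recorder = recorder
        self.transcriber = transcriber
        self.paste = paste
        self.recording = False

    def on_hotkey(self):
        if not self.recording:
            # First press: start capturing audio.
            self.recorder.start()
            self.recording = True
            return None
        # Second press: stop, transcribe, and type the result.
        audio = self.recorder.stop()
        self.recording = False
        text = self.transcriber(audio)
        self.paste(text)
        return text
```

Everything interesting (capture, model, typing into the focused window) hides behind those three injected pieces, which is also what makes the flow easy to swap between backends.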

I've iterated on that script a fair bit over the years. Some versions called my home AI server, the one that lives under my desk and not on my laptop. Others hit Whisper from OpenAI. Others hit Deepgram. The part I never managed to fix was always the same: whatever I tried, the laptop needed an internet connection, full stop. And the connection drops every now and then for the usual reasons. When it drops mid-transcription, you're left with half a sentence, or with nothing, or with some weirdly truncated text and you don't even notice where it got cut off until you read the result. And, worst of all, you lose your flow.

A few days ago someone in a group chat I'm in dropped a link to Handy. I've been testing it for a couple of days now and it works really well. Everything runs locally, no internet, no external APIs, no remote server. This article is about why I recommend using this kind of application, and this one specifically, especially if you're on Linux and especially if you work with AI.

Dictating to AI isn't about convenience, it's about context

A lot of people think dictating to your computer is just about not wanting to type, or being able to work in a more relaxed posture. It isn't, or at least not entirely. The real difference is in how much context you hand over to the model.

When you have to type out a task for Claude Code, OpenCode or any other assistant, there's an implicit cost to drafting the prompt. And that cost makes you cut corners without even realizing it. You send a paragraph, two at most, with the bare minimum. When you dictate, that cost collapses. You send twenty paragraphs. A multi-minute ramble where you've laid out everything you'd take into account yourself to do the task properly from the start.

It's the difference between firing a one-liner at a junior over Slack and sitting down with them for five minutes to walk them through the problem with all its nuance: what you want, what you don't want, the things you've already tried, the details you know are going to bite later, the files worth checking and the ones to skip. Like you're sending a mini podcast to the AI. And today's models eat it up. They don't care if you repeat yourself, change your mind mid-sentence or stumble over a word. They process all of it and put it to use. Output quality goes up dramatically when context goes up dramatically.

At DrupalCamp Spain 2025, one of my slides showed that I'd already been working this way for over a year. Back then it was still a bit unusual. Today it's practically the standard among people using AI assistants intensively. Anyone still typing prompts by hand is just being less productive than they could be.

Why Handy fills a gap that's been open for years

I tried fully local solutions a couple of years ago. They were slow and the quality was bad. They took long enough that going back to typing was faster. I wrote them off and stuck with the API dependency, assuming that was the price to pay for decent quality and speed. That's why Handy caught me off guard: two years ago, doing this locally and on Linux wasn't viable.

Handy is a free, open-source desktop app, written in Rust on top of Tauri. It does the same thing my script does: press a shortcut, speak, release, and the text shows up wherever your cursor is. But fully offline. Under the hood it uses whisper.cpp via whisper-rs, or Parakeet V3 via transcribe-rs, depending on which model you pick. GPU acceleration if you have one. Silence filtering with Silero VAD. Global shortcuts through rdev. A clean architecture, built by someone who knows what they're doing and hasn't tried to reinvent the parts that are already solved.

The second thing that matters is that it works on Linux. I tried several similar tools back in the day and they all had issues: either no Wayland support, or an assumed desktop environment, or shortcuts that worked only halfway. That's why I ended up rolling my own. On Linux, Handy asks you to install xdotool (X11) or wtype/dotool (Wayland) so it can paste the text properly, and that's pretty much it.
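That "paste it where the cursor is" step boils down to shelling out to one of those tools, picked by session type. A hedged sketch of what that dispatch might look like (the function is mine for illustration; Handy's actual logic may differ, and on Wayland it also supports dotool, which is omitted here):

```python
def typing_command(text, session_type):
    """Build the command that types `text` at the cursor position.

    session_type is typically read from the XDG_SESSION_TYPE
    environment variable ("x11" or "wayland").
    """
    if session_type == "wayland":
        return ["wtype", text]
    # --clearmodifiers avoids the hotkey's modifier keys leaking
    # into the typed text.
    return ["xdotool", "type", "--clearmodifiers", text]
```

You'd then hand the result to `subprocess.run(...)`; the point is simply that the "magic" typing step is one external command away.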

How I've got it set up

I'm running Whisper Turbo, which weighs in at about 1.6 GB and asks for a machine with some muscle. On my laptop, which has a decent GPU, the response is essentially instant and the transcription quality is noticeably better than what I used to get from OpenAI's Whisper API back when I tested it. I wasn't expecting that. If your machine can't handle Turbo, the smaller models (Small, Medium, or Parakeet V3, which runs on CPU) are a reasonable alternative, even if you lose a bit of the immediacy.

The second thing I'd recommend setting up is Custom Words. It's a field in the settings where you list the terms the model consistently makes up or mistranscribes. In my case I've added a good chunk of Drupal and programming vocabulary the model, trained on general English, never quite gets right: hook names like hook_form_alter, concepts like entityQuery, Paragraph, BigPipe, S3FS, modules like simplenews, metatag, block_class, the usual stuff. Without that list, you end up correcting half your technical transcriptions by hand. With it, it starts to feel like having a transcriber who actually understands your jargon.
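Handy's Custom Words feed into the model itself, but you can get a feel for why the list matters with a naive post-processing equivalent: a lookup of frequent mistranscriptions mapped to the jargon you actually meant. The wrong/right pairs below are made up for illustration; your own list will depend on what your model keeps getting wrong:

```python
import re

# Illustrative mapping: what the model tends to emit -> what I meant.
JARGON = {
    "hook form alter": "hook_form_alter",
    "entity query": "entityQuery",
    "big pipe": "BigPipe",
    "s3 fs": "S3FS",
    "block class": "block_class",
}

def fix_jargon(text):
    """Replace known mistranscriptions, case-insensitively, longest key first
    so multi-word terms win over their substrings."""
    for wrong in sorted(JARGON, key=len, reverse=True):
        text = re.sub(re.escape(wrong), JARGON[wrong], text, flags=re.IGNORECASE)
    return text
```

The in-model version is strictly better (it fixes the transcription before it exists rather than patching it after), but the effect on day-to-day usability is the same: technical dictation stops needing a manual cleanup pass.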

Who this is for (and who it isn't)

If you work a lot with AI assistants, sending them long tasks several times a day, dictating instead of typing changes how you work. And, as I said earlier, it's not just about speed: it lets you be much more thorough about the context you pass. You go from rationing every interaction to just letting it out. You drop the mental friction of "let me write this properly" and you just talk to it like you'd talk to a colleague. If on top of that your job is programming, having Custom Words tuned up means the output is almost always usable straight away.

Who it isn't for: anyone with a machine that's barely scraping by, especially without a GPU. The smaller models work, but you lose the sense of immediacy and that's where the use case falls apart. If you have to wait 30 seconds every time you release the shortcut, you'll be back on the keyboard within a week. You also won't get much out of it if you only dictate occasionally. If you only use it twice a day, setting up the whole flow takes more effort than it's worth.

And a year from now, who knows

I don't know how long Handy will hold up as a daily driver. In this field, the thing that solves your life today might be irrelevant in six months because something better has shown up, or because the operating system ships it built-in, or because local models take another leap forward. What I do know is that my old script, the one with the APIs and the home server, isn't getting touched again. That chapter's closed.

If you've been thinking about this for a while and, like me, you've already tried a couple of solutions that didn't quite click, give Handy half an hour of setup and see how it fits into your flow. The surprise, in my case, has been a pretty positive one.

Need a Drupal Expert?

Senior Drupal developer, freelance, specialized in what's hardest: migrations, multilingual sites, SaaS platforms and Stripe integration. I leverage AI to cut delivery times and costs, with expert review on every line of code.

No agency, no middlemen. Direct contact with the person who does the work.