My OpenAI Whisper Voice Typing Setup For Faster, More Convenient Text Entry


As an early student of Mavis Beacon's excellent typing courses, and a reasonably fast typist at around 100 WPM, I've long been accustomed to thinking that touch typing is the most efficient way of getting text into a computer.
But even for those who type somewhat or significantly faster than average, voice dictation is usually the quicker way to get text into a computer.
The average person speaks at around 120-150 WPM, so depending on your typing speed, voice typing can be a significantly faster way of capturing text. If you're a “hunt and peck” typist, it will be massively faster. And if you spend most of your day typing, the difference isn't a question of a few words per minute; it can have a transformative impact on your productivity.
For all its shining potential, however, my first dalliances with voice typing weren't all that encouraging.
Even speaking slowly and clearly into a professional studio microphone, tools that you would expect to be highly reliable in this day and age can be surprisingly disappointing (ahem, Google Chrome voice typing). But the speech-to-text market isn't static, and recent years have brought transformative change.
Although it may seem like a distant connection, the same transformer architecture that underpins large language models has also enabled a new crop of speech-to-text tools that are far more accurate than those of yesteryear. OpenAI, Deepgram and Speechmatics are among the companies that have led the charge.
A few select hardware additions and one of these tools can give you everything you need to ditch your conventional keyboard for good.
The Case For Using Whisper By API
Enter the world of speech-to-text dictation software, however, and you'll quickly see many advocating deploying these models locally.
To paraphrase Shakespeare only a little, the big question when using Whisper is “to self-host or not to self-host?” At the time of writing, Whisper has the honor of being the last model which OpenAI open-sourced.
For those looking to experiment with how voice control can disrupt conventional ways of controlling a computer, this is something of a dream come true. Thus a new crop of dictation software, some of it still at a relatively early stage, has emerged as among the first exponents of this new technology.
But, you might be wondering, if you can self-host or deploy these models for free, why pay to use them via APIs?
Locally deployed speech-to-text models attract two main types of user, at least as far as I can tell. The first is the privacy conscious - a staple archetype in the world of self-hosting. These users like the idea of their voice data never leaving their local machine. Secondly - also emblematic of the open source community, but by no means representative of the whole - are those looking to avoid the costs associated with using Whisper by API.
So, let me present the counter-argument: why I think that even when deploying Whisper locally is technically feasible, it probably makes more sense to use it through commercial APIs.
Firstly, speech-to-text APIs, in the broad scheme of things, simply aren't that expensive. It's easy to rack up costs of a few dollars per day by using this technology intensively, but weighed against how much time it saves, that's relatively small change. Furthermore, and thankfully, OpenAI isn't the only game in town. If OpenAI's pricing is too expensive for you, there are cheaper speech-to-text APIs: all the major cloud platforms - Google Cloud Platform, Azure and AWS - offer speech-to-text services with competitive pricing for direct API use.
Secondly, like other AI tools, these models demand well-optimized and generously provisioned hardware in order to perform as effectively as possible. While it's theoretically possible to deploy your own self-hosted Whisper model on a cheap VPS, it's also highly likely that the machine simply won't have the RAM and GPU to deliver optimal performance, and deploying sufficiently provisioned hardware in the cloud would likely cost significantly more than API usage. Quantized versions of the Whisper models do exist (small enough that they can even run on Android devices), but these often rely on the less powerful Whisper variants to achieve acceptable levels of accuracy. No such compromises are necessary when using the API, which by default uses Whisper's most performant variant.
So, speech-to-text is really just a particular case of cloud computing in general. There is a trade-off between convenience and capability.
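For a sense of how little friction the hosted route involves, here is a minimal sketch of a one-off transcription request against OpenAI's audio API (the file path is a placeholder; whisper-1 is the hosted Whisper model):

```
# Send a local audio file to OpenAI's hosted Whisper model and print the transcript.
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F model="whisper-1" \
  -F file="@/path/to/recording.mp3"
```

Dictation tools that use Whisper by API are essentially wrapping a call like this around your microphone input.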
My Whisper Speech To Text “Stack”
As a very long-term Linux user now running OpenSUSE, I'm accustomed to my operating system making it more challenging to find out-of-the-box clients for the things I need. Speech Notes is an interesting app which runs Whisper locally, but it captures the transcribed text into its own notepad, which you would then have to copy into other websites.
My speech-to-text workflow, however, involves using it wherever I can enter text, mostly in a browser. The most popular speech-to-text Chrome extension is Voice In, but I found its performance underwhelming, even in audio environments that were fairly close to ideal. This is a downstream effect of the subpar accuracy of Google's default speech-to-text engine: Voice In is a nice, well-developed tool, but it's only as good as the speech recognition technology powering it.
Fortunately, my search came to a decisive end when, with a little prompting of ChatGPT, I discovered WhisperAI, one of the first Chrome extensions for voice typing that specifically routes dictation through Whisper.
It requires a monthly paid plan to use, and in the future higher tiers may become available.
But for now, I've saved so much time using voice typing that I think it's been an extremely worthwhile investment.
My only complaint is that it's a Chrome extension and not a desktop tool, although if you're using Windows that should change soon as a desktop client is in the works.
Hardware: USB Macro Keys, Foot Pedals, And Getting Them To Work On Linux
After a couple of weeks of full-time use, I decided that voice typing was the way forward and began thinking more broadly about how I could use it for every aspect of computer use. As voice becomes increasingly embedded in AI tools, voice platforms are becoming increasingly ambitious in the kind of functionality they enable. So rather than thinking of voice typing as the be-all and end-all, increasingly it's just one part of the emerging voice stack.
But having access to efficient voice typing alone unleashes many benefits.
Prompting large language models with extensive contextual information becomes vastly more efficient and easy when you can dictate those prompts through speech-to-text.
Although speech-to-text is integrated into ChatGPT itself, this isn't the case for some models and LLM frontends. So having your own speech-to-text setup in the browser means that you have total versatility in where the tool can be used.
Although you can get started with voice typing using your laptop's built-in microphone and a keyboard shortcut, it would be a pity not to do such powerful technology justice by investing in a few specific pieces of hardware.
The classic weird peripheral popular in transcription and dictation workflows is the humble but highly capable USB foot pedal.
For those unfamiliar with what is admittedly one of the more obscure pieces of desktop hardware, these are simply control pedals that connect over USB. Commonly they have three pedals and are used by transcriptionists for playing back recordings and for starting and stopping a dictation flow: one pedal serves as a rewind key, another as fast forward, and the middle one is commonly mapped to start/stop.
I wasn't sure that I'd get the hang of using these however (or get them to work on Linux), so to keep my initial spend light, I picked up a cheap generic USB HID foot pedal from AliExpress.
The good thing about generic USB HID devices, however, is that they're fairly easy to modify on Linux using HWDB entries, which let you configure custom key remappings. This works at quite a low level, essentially taking the default key event delivered by the peripheral and mapping it onto something more convenient.
Of course, if you're able to get the manufacturer drivers working by a more conventional installation method, that is the preferable way to go.
In order to avoid interfering with shortcuts configured in things like web browsers, I decided to use F13 as my custom shortcut key and remapped both my foot pedal and macro key onto it. F13 to F24 are the somewhat forgotten extra function keys, and mapping onto one of them is generally quite a straightforward process.
On Linux, You Can Use Evtest & HWDB To Create Custom Interception Mappings
If you're not a Linux user, you'll hopefully be able to get yours running more easily, or use Windows software to make this modification. But on Linux, the process is fairly straightforward once you get the hang of it.
First, run sudo evtest and identify the input event you want to listen to. You might find that your USB HID device actually exposes several events, and you'll have to try them one by one to identify the one that reports button presses or clicks.
Once you find the right event, note the MSC_SCAN value emitted for the key press, then create a hwdb database entry to remap it. (Although I couldn't get it working on my distro, Input Remapper is a GUI that provides another way to do this.)
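A typical evtest session looks something like this (the device name, IDs and scan value below are illustrative placeholders rather than my pedal's actual output); the MSC_SCAN value is the scan code that the hwdb entry below matches on:

```
$ sudo evtest
No device specified, trying to scan all of /dev/input/event*
Available devices:
/dev/input/event5:      HID 1a2c:0023 Keyboard
Select the device event number [0-5]: 5
...
Event: time 1700000000.000001, type 4 (EV_MSC), code 4 (MSC_SCAN), value 70005
Event: time 1700000000.000001, type 1 (EV_KEY), code 48 (KEY_B), value 1
```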
Hwdb files live under /etc/udev/hwdb.d/ and are given a numeric filename prefix reflecting their priority.
The entry I created for my foot pedal follows the standard evdev keyboard format: a modalias line matching the device, followed by a KEYBOARD_KEY assignment that maps its scan code to f13.
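A minimal sketch of such a file, with placeholder vendor/product IDs and scan code (substitute the values that evtest reports for your own device):

```
# /etc/udev/hwdb.d/70-foot-pedal.hwdb  (filename is just an example)
# Match the pedal by bus/vendor/product ID, then remap its scan code to F13.
# b0003 = USB; v1A2C / p0023 are placeholder IDs; 70005 is the MSC_SCAN value.
evdev:input:b0003v1A2Cp0023*
 KEYBOARD_KEY_70005=f13
```

After saving the file, rebuild the hardware database and re-trigger udev (or simply replug the device):

```
sudo systemd-hwdb update
sudo udevadm trigger
```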
Creating Your Own Push-To-Talk Button Using Cheap USB Macro Keys
But if, like me, you've watched too many war movies with dramatically depicted command centres, you might decide that a push-to-talk peripheral is essential for your dictation workflow.
As a marketplace, AliExpress has its pros and its cons, but if you're into weird and obscure USB computer peripherals, then you're in for a pleasant surprise.
There is a surprisingly diverse and wide-ranging selection of shortcut keys, MIDI controllers and just about every other unusual input peripheral out there.
And in a strange inversion of the usual order of things, the cheapest hardware, which simply presents itself to the operating system as a generic USB HID device, is often actually the easiest to customize.
Thus, although my foot pedal only vaguely promised the ability to work with Linux, thanks to a few AI prompts I was able to create the necessary customization.
While my USB foot pedal was fun to start out with, I picked up a couple of additional macro keys to try to find the one that I would really enjoy using. So far my favorite is this simple single-button programmable key which - though priced at a mere $6 - is an absolute delight to use.
It fits extremely snugly in your hand and if you're using a dictation tool with built-in PTT support (like Whisper AI) you can keep the button held down, much as if you were using a walkie-talkie.
Alternatively, if you'd like to take out some of your energy on your USB peripherals, you can give it a whack to start and stop dictation.
Mapping Your Peripheral To An Unused Shortcut Key
One pro tip, if I can offer it: if you're setting up one or more custom peripherals for dictation workflows, it's a good idea to use a key that is unlikely to interfere with existing shortcuts in the applications you commonly use.
One approach is to use obscure key combinations, but this comes at the price of convenience, because reaching for something like Ctrl + Alt + Break just isn't practical if you're doing it a thousand times a day.
My preferred method is to use one of the obscure function keys (F13 to F24), which are recognized out of the box in most common keyboard configurations but rarely appear as physical keys on standard keyboards. As these function keys are rarely assigned in application shortcuts, it's usually safe to assign one of them to a dictation button with no impact on other systems.
The final part of the approach was assigning F13 as the start/stop dictation shortcut in my Whisper Chrome extension's settings.
The result is that I now have a handheld shortcut button that I can click to start and stop dictation.
The Headset
While buying obscure macro keys, it's also important not to forget about the basics!
Even with the superior performance achieved by powerful automatic speech recognition tools like OpenAI Whisper, it's still important to record audio in a quiet environment and ideally with a microphone that's as close to your mouth as possible.
If you want a boring tech rabbit hole to dive down, look into the somewhat sorry state of input codec support in Bluetooth devices in the year 2025. If you don't, then, although it's admittedly less convenient, you can bypass Bluetooth codec issues altogether by simply buying a wired USB headset.
Jabra and Poly are among the leaders in the field, and if you're on a slightly tighter budget, Yealink makes well-regarded headsets that are a bit cheaper.
I personally picked up a BH-70 for exactly this purpose and I've been very happy with the device.
What Can You “Voice Type”?
You can use voice typing for just about anything that you would type on a keyboard.
I use mine all the time for capturing prompts for large language models, for dictating emails, and for writing documents. For whatever reason, I find voice typing particularly effective for capturing meeting minutes while they're fresh in the mind.
Voice typing and large language models are, in my opinion, like a match made in heaven. You can capture detailed voice memos, then use a large language model to organize them more coherently, and finally use a second model to, for example, create to-do lists out of what you captured. But being able to efficiently capture your train of thought through voice typing greatly improves the workflow.
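As a rough sketch of what that pipeline can look like under the hood (the file name and model names here are just examples), you can chain the transcription API straight into a chat model:

```
# Sketch: turn a rambling voice memo into a to-do list.
# Assumes OPENAI_API_KEY is set and jq is installed; file and model names are examples.

# 1. Transcribe the memo with Whisper.
TRANSCRIPT=$(curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F model="whisper-1" \
  -F file="@voice-memo.mp3" | jq -r '.text')

# 2. Ask a chat model to reorganize the transcript into action items.
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg t "$TRANSCRIPT" '{
        model: "gpt-4o-mini",
        messages: [{role: "user",
                    content: ("Turn this voice memo into a concise to-do list:\n\n" + $t)}]
      }')" | jq -r '.choices[0].message.content'
```

In day-to-day use, of course, the dictation extension handles the first step for you; the point is simply that the transcript is ordinary text you can feed into whatever model or automation you like.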
But even for more direct text entry, the benefits of voice typing can be surprisingly transformative. Whisper, for example, even comes with automatic punctuation support, so it will infer the punctuation missing from the text that you dictate.
The one limitation that takes a bit of getting used to is that it will sometimes capture incorrect words that nevertheless pass a spellcheck.
So there's something of a learning curve involved in working with speech capture to make sure that it doesn't introduce embarrassing or awkward typos. Likewise, learning how to dictate effectively isn't an overnight process and takes a bit of trial and error. But so does learning to type, and I would argue that there's far more to be gained by becoming an effective voice typist.
This hardware and software combination has been transformative for me. I highly recommend it and all the tools mentioned above.