A Comprehensive History of Text-to-Speech (TTS) and Speech-to-Text (STT) Technologies

DJ Leamen

Text-to-Speech (TTS) and Speech-to-Text (STT) are complementary technologies that enable voice-based human-computer interaction. TTS converts written text into synthetic speech, while STT (automatic speech recognition) transcribes spoken language into text. The pursuit of these capabilities spans many decades and disciplines, from early mechanical speaking machines in the 18th century to the sophisticated AI-driven systems of today. This report follows the major developments in both TTS and STT in roughly chronological order, highlighting key technical methods (rule-based synthesis, formant and concatenative techniques, hidden Markov models, neural networks, vocoders, end-to-end and transformer-based architectures) and explaining why each paradigm shift occurred. We also note influential contributors and organizations (e.g. Bell Labs, IBM, academia, and modern AI labs like Google DeepMind and OpenAI) that have pioneered breakthroughs along the way. The history of speech technology reflects a steady move towards data-driven and learning-based approaches, driven by the desire for more natural and accurate voice interfaces.


Early Developments (18th Century–1950s)

Efforts to artificially produce or recognize speech date back centuries. As early as the 18th century, inventors built mechanical devices to mimic the human vocal tract. Notably, in 1779 Christian Gottlieb Kratzenstein created functional models of human throat anatomy that could produce sustained vowel sounds. Building on such ideas, Wolfgang von Kempelen (yes, the one who built The Turk) demonstrated in 1791 an “acoustic-mechanical speech machine”: a bellows-operated apparatus with modelled tongue and lips capable of producing consonant-vowel combinations. These early contraptions, along with others like Joseph Faber’s 1846 Euphonia, showed it was possible to synthesize crude speech sounds by mechanical means. They were precursors to modern TTS in concept, though true electronic speech synthesis had to await the 20th century.

By the 1930s, with the advent of electronic signal processing, speech research gained a new foundation. At Bell Labs (now Nokia Bell Labs), engineer Homer Dudley developed the vocoder (voice coder) in the mid-1930s, which analyzed speech into a pitch signal plus the energies of a small set of frequency bands so that it could be transmitted compactly and resynthesized. Dudley then created the Voder, a keyboard-operated speech synthesizer showcased at the 1939 World’s Fair, which could generate recognizable speech sounds from electrical signals. Early attempts at speech recognition followed. In 1952, researchers at Bell Labs built a system nicknamed Audrey that could recognize spoken digits (0–9) from a single speaker by analyzing the formant frequencies of each utterance. This was the first effective STT device, albeit extremely limited in vocabulary and sensitive to the speaker. In 1962, IBM demonstrated its own early recognizer, the IBM Shoebox, which could understand 16 spoken English words (digits and simple arithmetic commands) at the Seattle World’s Fair.

Another foundational concept was the source–filter model of speech production, articulated by Gunnar Fant in 1960. This theory separated speech into a sound source (vocal cord vibration or noise) and a filter (the shaping by the vocal tract), and it deeply influenced speech synthesis design. Throughout the 1950s and 60s, academic labs pushed TTS forward: for example, Noriko Umeda and colleagues in Japan developed one of the first general English text-to-speech software systems by 1968. At Bell Labs in 1961, John Larry Kelly Jr. used an IBM 704 mainframe to synthesize the song “Daisy Bell,” a landmark in computer speech that so impressed author Arthur C. Clarke that he worked it into his novel 2001: A Space Odyssey (the HAL 9000 computer’s famous singing scene).

On the whole, by the late 1960s, electronic speech synthesis could produce intelligible (if robotic-sounding) speech from text, and speech recognition devices could handle very small vocabularies under constrained conditions. These accomplishments set the stage for more systematic, theory-driven progress in subsequent decades.

Rule-Based Synthesis and Template-Based Recognition (1960s-1970s)

During the 1960s and 1970s, research in TTS and STT followed largely separate tracks but shared common limitations in computing power and linguistic knowledge. Text-to-Speech systems of this era were predominantly rule-based. A prominent approach was formant synthesis, grounded in Fant’s source-filter theory. Engineers manually designed rules to control a formant synthesizer (effectively a highly simplified vocal tract model) by specifying formant frequencies, amplitudes, and noise bursts to simulate phones (speech sounds). A formant synthesizer generates speech by creating a periodic waveform for voiced sounds and filtered noise for unvoiced sounds, with adjustable resonances corresponding to vowel and consonant qualities. Famous formant-based TTS systems (e.g. the MITalk system and later the DECtalk synthesizer by Digital Equipment Corp.) demonstrated that rule-based control could yield highly intelligible speech, though often with a monotonic or robotic quality. The upside was that these systems required minimal memory and could run in real time even on early computers, since they did not rely on large prerecorded datasets. DECtalk, for example, was built on Dennis Klatt’s formant synthesis techniques, and closely related Klatt-based synthesis provided the voice long associated with Stephen Hawking’s communication system.
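To make the formant idea concrete, here is a minimal, illustrative sketch (not MITalk or DECtalk code) of how a vowel can be approximated by driving a few second-order resonators with a pulse train. The formant frequencies and bandwidths are rough textbook values for the vowel /a/, and NumPy/SciPy are assumed to be available.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.io import wavfile

# Toy formant synthesizer: an impulse train (crude glottal source) is passed
# through a cascade of second-order resonators, one per formant.

sr = 16000
f0 = 120                               # fundamental frequency in Hz
n = int(0.5 * sr)                      # half a second of audio
source = np.zeros(n)
source[::sr // f0] = 1.0               # impulse train as the voiced excitation

def formant_filter(x, freq, bandwidth, sr):
    # Second-order IIR resonator centred on one formant frequency.
    r = np.exp(-np.pi * bandwidth / sr)
    theta = 2 * np.pi * freq / sr
    a = [1.0, -2.0 * r * np.cos(theta), r ** 2]
    return lfilter([1.0 - r], a, x)

speech = source
for freq, bw in [(730, 90), (1090, 110), (2440, 170)]:   # approximate F1-F3 for /a/
    speech = formant_filter(speech, freq, bw, sr)

speech /= np.max(np.abs(speech))
wavfile.write("vowel_a.wav", sr, (speech * 32767).astype(np.int16))
```

Real rule-based systems went much further, updating dozens of such parameters every few milliseconds according to hand-written rules, but the cascade-of-resonators principle is the same.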

Researchers also explored articulatory synthesis (directly modelling the physics of the vocal tract) but found it computationally intractable and not yet able to produce natural-sounding fluent speech (early articulatory models could generate recognizable sustained vowels, but most consonants and transitions sounded unnatural). Thus, formant synthesizers with carefully tuned rules remained the workhorse for TTS through the 1970s, achieving understandable output albeit with limited naturalness.

In speech recognition during the same period, systems were initially dominated by template-based and rule-based methods. Early STT often required users to speak isolated words with pauses, so that each word could be matched against a stored template. One common technique was Dynamic Time Warping (DTW), invented by Soviet researchers in the late 1960s to align and compare speech patterns regardless of speed differences. A speech recognizer could record a prototype template (e.g. a recording of a word) and then use DTW to find the best match between an unknown input utterance and the stored templates. These methods worked for small vocabularies (dozens of words) but scaled poorly as the vocabulary grew, since every new word required its own template. Moreover, early systems relied on hand-crafted acoustic-phonetic rules or grammars. For instance, Raj Reddy, then at Stanford (and later at Carnegie Mellon), built some of the first continuous speech systems in the late 1960s, enabling voice control of a chess game, but even his system had to constrain the problem (users spoke commands in a limited domain).
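As a rough illustration of the template-matching idea, the sketch below implements a basic DTW distance in NumPy. The feature dimensions and sequences are invented for demonstration; real 1970s systems used hand-tuned distance measures, path constraints, and pruning.

```python
import numpy as np

# Minimal dynamic time warping (DTW) distance between two feature sequences.
# Each row of a sequence is one frame of acoustic features (e.g. filter-bank energies).

def dtw_distance(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])      # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],                 # skip a template frame
                                 cost[i, j - 1],                 # skip an input frame
                                 cost[i - 1, j - 1])             # match frames
    return cost[n, m]

template = np.random.randn(40, 13)     # stored word template: 40 frames of 13-dim features
utterance = np.random.randn(55, 13)    # unknown word spoken a bit more slowly
print(dtw_distance(template, utterance))
# A recognizer would compute this against every stored word template and pick the smallest.
```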

Notably, in 1969 Bell Labs executive John Pierce wrote an influential letter casting doubt on speech recognition, which cut off Bell Labs funding for the area for several years. This skepticism stemmed from the immense difficulty of the task using the pattern-matching and knowledge-based approaches then available. It wasn’t until the early 1970s, when the U.S. Defense Advanced Research Projects Agency (DARPA) launched its Speech Understanding Research program, that STT research regained momentum. DARPA’s five-year project (1971–1976) invested in several teams (Carnegie Mellon, IBM, Stanford Research Institute, and others) to push vocabulary sizes to 1,000 words and beyond. The outcome included CMU’s Harpy system (1976), which could handle a 1,000-word vocabulary using finite-state grammars and heuristic search, setting a new benchmark. Still, through the 1970s, most recognition systems were speaker-dependent (trained on a specific speaker’s voice) and struggled with natural, continuous speech. Progress was steady but limited by the need for expert-defined templates and rules and by computational costs.

A parallel development relevant to both TTS and STT was the advancement of speech coding and vocoders. In 1966, Fumitada Itakura and Shuzo Saito in Japan pioneered Linear Predictive Coding (LPC) as a method to compactly represent speech signals. LPC models the speech waveform through a linear predictive filter (approximating the vocal tract) driven by a simple excitation signal, essentially a computerized descendant of the vocoder concept. Throughout the 1970s, Bell Labs researchers (notably Bishnu Atal and Manfred Schroeder) refined LPC for efficient encoding of speech. LPC became extremely important in low-bitrate speech transmission and also found its way into speech synthesis: for example, the first widely available electronic toy that “spoke,” the Texas Instruments Speak & Spell (1978), used an LPC synthesizer chip to reproduce words that had been LPC-encoded and stored in the toy’s memory. This showed that even consumer devices could produce rudimentary speech output with a limited vocabulary using a vocoder-like approach.
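The following is a bare-bones sketch of the LPC idea, not the Speak & Spell’s actual firmware: fit an all-pole filter to one frame of audio with the autocorrelation method, then resynthesize by exciting that filter with an impulse train. The frame here is random noise standing in for real speech, and SciPy is assumed.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(frame, order=12):
    # Autocorrelation method: solve the Toeplitz normal equations R a = r.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))            # all-pole filter denominator [1, -a1, ..., -ap]

sr = 8000
frame = np.random.randn(240)                       # stand-in for a 30 ms voiced speech frame
A = lpc_coefficients(frame)

excitation = np.zeros(240)
excitation[::80] = 1.0                             # impulse train at 100 Hz (sr / 80)
synthetic = lfilter([1.0], A, excitation)          # excite the "vocal tract" filter
```

Analysis reduces each frame to a dozen or so coefficients plus pitch and gain, which is exactly why LPC was so attractive for both low-bitrate transmission and cheap playback hardware.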

Overall, by 1980 we had a mix of techniques: rule-based formant synthesizers and LPC vocoders for TTS, and template/DTW or simple probabilistic models for STT. The stage was set for more data-driven statistical methods that would emerge in the 1980s as both computing hardware and theoretical tools improved.

Statistical Models and Data-Driven Approaches (1980s-1990s)

The 1980s marked a major transition in speech technology, particularly in speech recognition, with the adoption of statistical modelling. Researchers moved from manually designed rules to probabilistic models trained on data. Central to this transition was the introduction of Hidden Markov Models (HMMs) for acoustic modelling. In the mid-1970s, James Baker and Janet Baker pioneered the use of HMMs, probabilistic state machines capable of learning sound sequences and their variability from example data. HMMs unified acoustic patterns, pronunciation variations, and basic grammar into one mathematical framework. Rather than explicitly programming pronunciation rules or storing templates, researchers now estimated sound sequence probabilities from real speech data. By the mid-1980s, HMM-based recognizers had decisively overtaken older methods such as dynamic time warping and template matching.
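At the heart of HMM-based recognition is computing how likely a model is to have produced an observed sequence, which is done with the forward algorithm. The sketch below shows it for a toy discrete HMM; the transition, emission, and initial probabilities are invented purely for illustration.

```python
import numpy as np

# Forward algorithm for a tiny 2-state discrete HMM: computes P(observations | model).

A = np.array([[0.7, 0.3],        # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # emission probabilities: P(symbol | state)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution

obs = [0, 2, 1]                  # an observed symbol sequence

def forward(obs, A, B, pi):
    alpha = pi * B[:, obs[0]]               # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]       # propagate through transitions, weight by emission
    return alpha.sum()                      # total likelihood of the sequence

print(forward(obs, A, B, pi))
```

A recognizer trains one such model per word (or per phone), then picks the model that assigns the observed audio the highest likelihood; the parameters themselves are estimated from data rather than written by hand.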

IBM, under Frederick Jelinek’s leadership, significantly advanced statistical speech recognition. Jelinek famously remarked, “Every time a linguist leaves the group, the recognition rate goes up,” underscoring his team’s focus on data-driven methods over hand-crafted linguistic rules. By 1985, IBM developed Tangora, capable of recognizing a vocabulary of 20,000 words using HMM acoustics and statistical n-gram language models. This marked a dramatic increase from the 1,000-word systems of the previous decade. Dragon Systems, founded by the Bakers in 1982, also produced competitive HMM-based speech recognizers for personal computers. By the late 1980s, most research groups adopted HMM/Gaussian-mixture models, steadily improving accuracy.

A major milestone came from CMU’s Sphinx line of recognizers. Kai-Fu Lee’s original Sphinx (1988) was the first system to demonstrate speaker-independent, large-vocabulary continuous speech recognition, and Sphinx-II (1992), developed by Xuedong Huang under Raj Reddy, substantially improved its accuracy, transcribing fluent, uninterrupted speech from any speaker across thousands of words. Huang joined Microsoft in 1993, marking the entry of software giants into speech recognition technology. (Huang later served as CTO for Azure AI and, subsequently, Zoom.)

Parallel advancements occurred in text-to-speech (TTS) synthesis, though at a slower pace initially. Bell Labs’ 1980s TTS systems combined extensive natural language processing with rule-based synthesis. Concurrently, Bell Labs researcher Lawrence Rabiner developed influential digital signal processing algorithms used in speech analysis and synthesis, significantly impacting technologies like Dragon Dictate and AT&T’s voice recognition systems.

A significant leap came in the mid-1990s with unit selection synthesis, demonstrated most prominently by researchers at ATR in Japan using large, carefully annotated speech databases. Unlike earlier diphone concatenation, which stitched together a single stored recording of each phone-to-phone transition, unit selection dynamically chose the best-matching speech segments from an extensive database according to context, intonation, and rhythm, achieving much more natural-sounding results. By the late 1990s and early 2000s, unit selection systems from companies like AT&T (branded “Natural Voices”) and Microsoft approached human-like quality, particularly for limited-domain applications such as weather forecasts or announcements.

In parallel, statistical methods emerged in TTS with HMM-based or statistical parametric speech synthesis. Developed by researchers including Tokuda, Zen, and Black, this approach used trained statistical models to generate speech acoustics. Early HMM-based synthesis, though robotic, offered advantages in memory usage and voice modulation flexibility compared to unit selection. These systems found use in resource-constrained mobile and embedded applications, setting the stage for later neural network-driven innovations.

(See “Statistical Parametric Speech Synthesis” by Alan W. Black, Heiga Zen, and Keiichi Tokuda.)

By the late 1990s, both STT and TTS had fully embraced data-driven approaches: speech recognition was dominated by HMMs combined with n-gram language models, and TTS increasingly relied on large, annotated speech datasets. Improvements in both fields were propelled by growing computational power, larger speech corpora, and advances in statistical signal processing. During this period, the speech technology industry also consolidated. Notably, Lernout & Hauspie (L&H), a Belgian company, acquired competitors such as Kurzweil Applied Intelligence and Dragon Systems and dominated the market until an accounting scandal uncovered in 2000 led to its collapse. L&H’s assets eventually became part of Nuance Communications, a leading provider in the 2000s and 2010s that powered systems like Apple’s Siri.

The Deep Learning Revolution (2010s)

Entering the 21st century, speech technology benefited from steady increases in computing power and the advent of big data. Still, by the early 2010s there was a sense that the traditional architectures had plateaued in performance; for instance, even the best speaker-independent HMM-based speech recognizers were still making significant errors, and unit-selection TTS, while natural for neutral read-out speech, was inflexible. The deep learning revolution of the 2010s dramatically shifted both STT and TTS by leveraging large neural network models trained on massive datasets, enabled by modern GPUs.

In speech recognition, a pivotal breakthrough came around 2010–2012. Although neural networks had been experimented with in speech as far back as the 1980s (e.g. simple multi-layer perceptrons for phoneme classification), they hadn’t surpassed the carefully optimized HMM+Gaussian systems. This changed when researchers including Geoffrey Hinton and colleagues at the University of Toronto, in collaboration with Microsoft Research (Deng, Dahl, Yu, and others), applied deep neural networks (DNNs) to acoustic modelling and achieved a dramatic drop in error rates. In a landmark 2012 paper they showed that replacing the GMM in a speech recognizer with a deep feed-forward neural net (trained on large amounts of speech data with layer-wise pre-training and improved initialization) reduced word error rates by roughly 30% relative, a seismic improvement in a field where 1–2% gains were noteworthy. This prompted a Microsoft executive to call it “the most dramatic change in accuracy since 1979,” underscoring that an entire generation of incremental HMM improvements had been eclipsed by deep learning virtually overnight. Soon, IBM and Google too had adopted DNNs in their production STT systems, leading to significantly more reliable voice input for applications like smartphone voice search and dictation.
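The sketch below is a hypothetical “hybrid” DNN acoustic model in PyTorch, not the actual 2012 system (which mostly used sigmoid units with generative pre-training rather than ReLU): a stack of wide feed-forward layers maps a window of acoustic feature frames to senone posteriors, which then stand in for the GMM likelihoods inside the existing HMM decoder. All dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    # Maps a stacked window of acoustic features to senone (tied HMM state) logits.
    def __init__(self, feat_dim=40, context=11, hidden=2048, num_senones=6000):
        super().__init__()
        layers, in_dim = [], feat_dim * context        # several neighbouring frames stacked together
        for _ in range(5):                             # a few wide hidden layers
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, num_senones))
        self.net = nn.Sequential(*layers)

    def forward(self, x):                              # x: (batch, feat_dim * context)
        return self.net(x)                             # senone logits

model = DNNAcousticModel()
frames = torch.randn(8, 40 * 11)                       # dummy batch of stacked feature frames
log_posteriors = torch.log_softmax(model(frames), dim=-1)
# In a hybrid system these log-posteriors are converted to scaled likelihoods
# and fed into the usual HMM/WFST decoder in place of GMM scores.
```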

Deep neural networks were first used to improve the acoustic model (the part that maps audio features to phonetic probabilities), but researchers quickly began to re-imagine the entire speech recognition pipeline with end-to-end deep learning. Around 2014, Baidu introduced DeepSpeech, a system that leveraged an end-to-end approach: it used a recurrent neural network (RNN) trained with connectionist temporal classification (CTC) to map input audio spectrograms directly to text, without an explicit phoneme dictionary or grammar model. (Mozilla released an open-source DeepSpeech engine based on this approach in 2017.)


This was inspired by earlier academic work from Alex Graves and others on sequence learning. At the same time, at Google, researchers built Listen, Attend and Spell, sequence-to-sequence models with attention for STT, treating speech recognition like a translation task from audio to text. These end-to-end models initially performed on par with, and later surpassed, the traditional HMM systems by mid-decade. A notable example is Google’s deployment of RNN-based models for voice search: by 2015, an LSTM-based acoustic model trained with CTC yielded a 49% relative error reduction in Google’s English speech recognition compared to their previous model. These advances were fuelled by massive datasets (for example, the collection of anonymized voice search queries, or the publicly released LibriSpeech corpus with 1000 hours of transcribed audiobooks) and by improved algorithms for training very deep networks. It also helped that around this time GPUs and distributed computing made it feasible to train networks on tens of thousands of hours of audio. In short, the 2010s saw STT transition to deep neural network dominance — first as hybrid systems (DNN/HMM combinations) and then as purely neural end-to-end systems. By the late 2010s, error rates on benchmarks like Switchboard telephone speech fell to near-human levels (around 5% word error rate), something unimaginable a decade prior. This enabled the explosion of voice assistants (Apple’s Siri, Amazon’s Alexa, Google Assistant, etc.), all of which rely on deep learning-based speech recognition under the hood.
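To illustrate the CTC-style end-to-end recipe behind systems like DeepSpeech (this is a toy sketch, not Baidu’s model), the example below runs dummy spectrograms through a bidirectional LSTM and trains it against character targets with PyTorch’s built-in CTC loss, with no frame-level alignments or phoneme dictionary.

```python
import torch
import torch.nn as nn

vocab_size = 29                      # e.g. 26 letters + space + apostrophe + CTC blank (index 0)
rnn = nn.LSTM(input_size=80, hidden_size=256, num_layers=3,
              bidirectional=True, batch_first=True)
proj = nn.Linear(512, vocab_size)    # 2 x 256 because the LSTM is bidirectional
ctc = nn.CTCLoss(blank=0)

specs = torch.randn(4, 200, 80)                            # (batch, time, mel bins): dummy spectrograms
targets = torch.randint(1, vocab_size, (4, 30))            # dummy character indices
input_lens = torch.full((4,), 200, dtype=torch.long)
target_lens = torch.full((4,), 30, dtype=torch.long)

hidden, _ = rnn(specs)
log_probs = proj(hidden).log_softmax(-1).transpose(0, 1)   # CTCLoss expects (time, batch, vocab)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()                      # gradients flow end-to-end; CTC handles the alignment internally
```

At inference time, a simple greedy or beam-search decode over the per-frame character probabilities produces the transcript, optionally rescored with an external language model.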

Text-to-Speech followed a similar transformation in the 2010s, albeit a few years behind ASR. The high naturalness of unit selection synthesis was hard to beat, but it lacked flexibility and required laborious data preparation. Deep learning opened new possibilities to make TTS both more flexible and eventually more natural. A watershed development was WaveNet, introduced by Google DeepMind in 2016. WaveNet is a deep generative model that produces speech waveforms directly, one audio sample at a time, using a convolutional neural network trained on raw audio. This was a radical departure from previous vocoders and concatenation methods: WaveNet essentially learned to mimic a human voice by modelling the distribution of raw waveform samples. In listening tests, WaveNet-synthesized voices were rated significantly more natural than the best unit-selection systems Google used at the time; for English and Mandarin, WaveNet-generated speech achieved mean opinion scores closer to human recordings than the earlier parametric and concatenative systems. This proved that neural networks could generate highly realistic speech, capturing subtleties like breathing and mouth sounds which previous methods missed. WaveNet-style models (called neural vocoders) were soon adopted as the new gold-standard way to generate high-quality audio from intermediate representations (e.g., from a spectrogram or from linguistic features). The main drawback of WaveNet was efficiency (generating audio sample by sample is slow), so researchers addressed this with optimized architectures and distillation, leading to faster variants like Parallel WaveNet, WaveGlow, and others by 2018.
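A minimal sketch of WaveNet’s central trick, assuming PyTorch: stacking causal 1-D convolutions whose dilation doubles at each layer so the receptive field grows exponentially. The real model adds gated activations, residual and skip connections, and an autoregressive sampling loop, none of which are shown here.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    # Stack of dilated causal convolutions; with kernel size 2 and dilations
    # 1, 2, 4, ..., 512, each output sample can "see" about 1024 past samples.
    def __init__(self, channels=32, num_layers=10):
        super().__init__()
        self.dilations = [2 ** i for i in range(num_layers)]
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d) for d in self.dilations
        )

    def forward(self, x):                        # x: (batch, channels, time)
        for conv, d in zip(self.layers, self.dilations):
            x = nn.functional.pad(x, (d, 0))     # left-pad so the convolution stays causal
            x = torch.tanh(conv(x))              # (the real WaveNet uses gated activations)
        return x

stack = DilatedCausalStack()
audio_features = torch.randn(1, 32, 16000)       # one second of dummy input at 16 kHz
out = stack(audio_features)                      # same length output, large causal receptive field
```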

Researchers also attacked the TTS problem from the text side using deep learning. In 2017, Google presented Tacotron, a fully end-to-end neural TTS model. Tacotron uses an encoder-decoder recurrent network with attention (very similar to sequence-to-sequence models in machine translation) to convert character input sequences directly into a spectrogram representation of speech, essentially learning the entire mapping from text to sound in one model. This spectrogram can then be converted to a waveform using a vocoder (initially the Griffin-Lim algorithm, later a neural vocoder like WaveNet or its descendants for higher quality). Tacotron demonstrated that a single neural network could learn pronunciation, emphasis, and some aspects of intonation directly from example <text, audio> pairs.
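The vocoder step that turns a predicted magnitude spectrogram back into audio can be illustrated with the classical Griffin-Lim algorithm the original Tacotron relied on. The snippet below assumes librosa is installed and uses a placeholder file "speech.wav" whose phase we throw away and re-estimate.

```python
import numpy as np
import librosa

# Griffin-Lim iteratively estimates a plausible phase for a magnitude spectrogram
# so it can be inverted back into a waveform.

y, sr = librosa.load("speech.wav", sr=22050)

S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))    # magnitude spectrogram, phase discarded
y_rec = librosa.griffinlim(S, n_iter=60, n_fft=1024, hop_length=256)

# y_rec is audible but has characteristic "phasey" artifacts, which is why
# neural vocoders soon replaced this step in production systems.
```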

A subsequent improved version, Tacotron 2 (2018), combined the Tacotron spectrogram prediction network with a WaveNet vocoder, yielding naturalness comparable to human speech in some evaluations. (NVIDIA has released an open-source PyTorch implementation of Tacotron 2 on GitHub.)

(See also “Text-to-Speech with Tacotron2” by Yao-Yuan Yang and Moto Hira.)

Around the same time, other companies and labs introduced similar architectures (e.g. Deep Voice by Baidu in 2017, which was a multi-component deep learning TTS pipeline, and Transformer TTS models slightly later). The success of these systems was due to their ability to learn features like pronunciation and prosody from data rather than relying on manual rules. It also became easier to train new voices: one simply needed recordings of a target voice and the neural model could be trained (or fine-tuned) to produce that voice, without manually redesigning a speech unit database. By the late 2010s, it became routine for state-of-the-art TTS systems to use a neural network for the main synthesis (whether end-to-end or as components for prosody and vocoder), thereby surpassing the old unit-selection methods in generality. Neural TTS can generate expressive speech (by training on expressive datasets or using fine-grained controls) and even do things like voice cloning, mimicking a person’s voice with limited samples, which were extremely challenging before. The tradeoff initially was that training these models required a lot of data and compute, and they were harder to deploy on-device due to their size, but those issues have gradually been mitigated with model compression techniques and more efficient architectures.

In summary, the 2010s revolutionized both STT and TTS through deep learning. The transition happened because earlier technologies were hitting accuracy or quality ceilings: HMM-based STT couldn’t easily push past certain error rates, and unit-selection TTS couldn’t generate anything beyond its recorded domain or style. Neural networks offered a way to learn the complexities of speech directly from examples, without as many simplifying assumptions. This allowed modelling subtleties of pronunciation, accents, intonation, and voice qualities that previously had to be ignored or averaged out. The shift was also enabled by big data (e.g., thousands of hours of speech for training) and powerful GPUs. Without these, the complex neural models couldn’t have been trained. By 2019, deep learning had fully permeated commercial speech tech: Google’s voice input and text-to-speech services, Amazon Alexa’s speech engine, Microsoft’s Azure cognitive services (now Azure AI services), and many others were all driven by neural models (often based on research papers from just a few years prior).

Transformers and Self-Supervised Models (late 2010s–2020s)

As deep learning matured, the next wave of innovation integrated transformer architectures and self-supervised learning (SSL), further propelling speech technology to new heights. The transformer model (Vaswani et al. 2017, introduced in the context of NLP) relies on self-attention mechanisms instead of recurrence, and it has proven exceptionally powerful for sequential data. Speech researchers soon began adapting transformers for both STT and TTS tasks, especially as datasets continued to grow.

In speech recognition, transformers enabled larger and more accurate models, but they also came alongside a paradigm shift: self-supervised pre-training. One of the most influential developments was wav2vec from Facebook AI Research (now Meta AI). The initial wav2vec (Schneider et al. 2019) and its successor wav2vec 2.0 (Baevski et al. 2020) introduced a method to pre-train a deep model on unlabeled audio by having it learn to predict parts of the audio from other parts (a sort of audio analogue to language model pre-training). Wav2vec 2.0 in particular used a transformer-based architecture and showed that representations learned from tens of thousands of hours of raw audio can be fine-tuned to achieve state-of-the-art speech recognition with much less labeled data than traditionally required. In fact, wav2vec 2.0 achieved top performance on benchmarks like LibriSpeech, even outperforming carefully engineered hybrid systems, and it sparked a wave of research into leveraging unlabeled data for speech. The significance is that the data collection bottleneck for STT was alleviated: one can use vast amounts of public audio (like podcasts or YouTube videos) without transcriptions to pre-train, and only a smaller supervised set to teach the model to map to text. This approach had become standard in ASR by the early 2020s.
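As a quick example of how such pre-trained models are typically used (assuming the Hugging Face transformers and torchaudio packages and the publicly released facebook/wav2vec2-base-960h checkpoint, and a placeholder mono file "speech.wav"), transcription reduces to a few lines:

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16000)   # the model expects 16 kHz audio

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits    # per-frame character logits from the CTC head

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])   # greedy CTC decoding to text
```

The heavy lifting happened during self-supervised pre-training; the CTC head on top was fine-tuned with a comparatively small amount of labeled speech.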

Another transformative system is OpenAI’s Whisper, released and open-sourced on GitHub in 2022. Whisper is a large-scale encoder-decoder transformer model trained in a weakly supervised fashion on 680,000 hours of multilingual audio and paired transcripts collected from the web. Notably, much of the training data is “noisy” or imperfect (hence weak supervision), but the sheer scale allowed Whisper to generalize extremely well. When OpenAI open-sourced Whisper, it was found to be remarkably robust to accents and background noise, and able to handle roughly 100 languages for transcription and translation. According to OpenAI, Whisper approaches human-level reliability on English tasks and has excellent robustness to varying input conditions. Technically, Whisper uses a transformer encoder to process audio features and a transformer decoder to output text, jointly modelling speech recognition and translation tasks. Its release meant that anyone could use a powerful pre-trained model (from GitHub or via APIs) to achieve high-quality STT without training a model from scratch. This reflects a general trend in AI development: large pre-trained models becoming the foundation for many tasks. By the mid-2020s, the best STT systems are those that combine advanced architectures (like transformers or conformers) with huge-scale training on diverse data. Error rates continue to drop, and capabilities keep increasing (e.g. automatic punctuation, speaker identification, multilingual code-switching) as models like Whisper set new standards.
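Using the open-source whisper package is similarly compact; this sketch assumes the package is installed, "speech.wav" is a placeholder audio file, and "base" is one of the smaller released model sizes.

```python
import whisper

model = whisper.load_model("base")          # downloads the checkpoint on first use
result = model.transcribe("speech.wav")     # handles resampling, language detection, and decoding
print(result["text"])
```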

For text-to-speech, transformers have also played a role, and modern TTS has continued to advance in naturalness and adaptability. After Tacotron and WaveNet, researchers looked to improve efficiency and speed. Models like FastSpeech (2019) from Microsoft used transformer networks to achieve parallel generation of speech (avoiding the autoregressive bottleneck of Tacotron) while maintaining quality, by predicting a duration for each phoneme and then generating all frames in parallel. Another line of work integrated generative models like GANs or normalizing flows: e.g., Glow-TTS and NVIDIA’s Flowtron, which use flow-based models for generating speech features, and HiFi-GAN (2020), a GAN-based vocoder achieving very high-fidelity speech. Many of these models use transformer components or attention mechanisms under the hood for modelling long text sequences or aligning text and speech. The result is that by the early 2020s, TTS systems could synthesize long-form speech with proper intonation and even emotion, often in real time or faster.
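The duration-based trick can be sketched in a few lines: once a duration predictor has estimated how many frames each phoneme should span (the numbers below are invented), a FastSpeech-style “length regulator” simply expands the phoneme encodings so the decoder can generate every frame in parallel.

```python
import torch

phoneme_hidden = torch.randn(6, 256)                    # 6 phonemes, 256-dim encoder outputs
durations = torch.tensor([3, 5, 2, 7, 4, 6])            # predicted frames per phoneme (hypothetical)

frame_hidden = torch.repeat_interleave(phoneme_hidden, durations, dim=0)
print(frame_hidden.shape)                               # torch.Size([27, 256]) -> parallel decoder input
```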

An important contemporary development is the rise of commercial AI voice platforms that leverage these research advances. For instance, ElevenLabs (founded 2022) is known for its highly natural and expressive speech synthesis service. ElevenLabs uses AI models (not fully disclosed, but likely large transformer-based or similar architectures) that can capture vocal emotion and intonation based on context, and even perform voice cloning from a short sample. The system analyzes the input text for emotional cues and adjusts the speech prosody (pitch, pacing, emphasis) to sound more human-like, rather than reading in a flat tone. It supports multiple languages and long-form speech generation, showing how far TTS has come in terms of flexibility — one can generate an audiobook narration, complete with expressive delivery, entirely by AI voice.

Other notable modern TTS systems include Google’s Cloud Text-to-Speech (which by 2020 offered dozens of voices built on WaveNet and Tacotron 2 technology), Amazon Polly, Microsoft’s Azure AI Speech, and research projects like VITS (2021), an end-to-end model that combines a variational autoencoder, normalizing flows, and adversarial training to map text to waveforms in a single system.

It’s also worth noting the convergence of TTS and STT in some respects: technologies like speech-to-speech translation combine both by listening to speech in one language (STT), then generating speech in another language (TTS). The progress in each component (ASR and speech synthesis) makes such applications feasible. We also see shared techniques (for example, the same transformer that can model text sequences in ASR might be used to model spectrogram sequences in TTS). Furthermore, the concept of a vocoder has evolved: historically a vocoder was a hand-designed method to encode and synthesize speech (like LPC or channel vocoders), but today WaveNet and its descendants serve as “neural vocoders” for both TTS and low-bitrate coding. Even STT models sometimes use ideas from vocoders (e.g., learning internal representations of phonetic content akin to a coding of the speech).

By 2025, automatic transcription with near-human accuracy is used everywhere, from voicemail and video captioning to virtual assistants. Synthetic voices are used in gaming, film, education, and accessibility, with quality such that end users sometimes cannot distinguish AI speech from a human recording. Yet each evolutionary step did not make the previous one entirely obsolete. For example, HMM-based methods and unit selection are still taught and occasionally used for specialized cases, but they have been largely supplanted in practice by deep learning approaches that simply achieve better results. The transitions occurred because each new paradigm addressed the shortcomings of the previous one: rule-based systems couldn’t handle variability, statistical HMMs could learn variability but still had limiting assumptions and needed lots of fine-tuning, and neural networks could learn abstract representations given enough data, overcoming many of the earlier limitations in naturalness and robustness.

From mechanical speaking heads and rudimentary “digit recognizers” to AI models that can mimic human speech or transcribe conversations in real time, the evolution of TTS and STT technologies has been driven by the pursuit of natural, accurate communication between humans and machines. Historically, progress came in waves: an early wave of rule-based designs that established the feasibility of speech synthesis and recognition, a second wave of statistical models (HMMs and related methods) that brought these technologies from toy demos to practical applications, and the latest wave of deep learning and end-to-end architectures that has achieved truly human-like performance in many scenarios. Transitions between paradigms were motivated by the need for improvements in quality and capability: for instance, unit selection synthesis emerged once larger memory made it possible to drastically improve TTS naturalness, and neural end-to-end models arose when accumulated data and computation enabled learning speech patterns directly, thereby outperforming carefully layered statistical systems.

Throughout this journey, key organizations and people played pivotal roles. Bell Labs stands out in the early and mid 20th century: from Homer Dudley’s Voder and vocoder work to the Audrey digit recognizer of 1952, Bell Labs fostered many foundational ideas in speech science, and its researchers (notably Bishnu Atal, Manfred Schroeder, and Lawrence Rabiner) contributed breakthroughs in LPC and statistical signal processing, while Frederick Jelinek’s group at IBM drove the statistical modelling revolution. In the late 20th century, academia and government labs (Carnegie Mellon, Stanford, MIT, NTT, etc., often under DARPA projects) pushed the envelope, exemplified by Raj Reddy’s students who pioneered HMM speech recognition and by MIT’s Dennis Klatt who advanced formant TTS. Companies like IBM and Dragon Systems commercialized speech recognition in the 80s and 90s, while Microsoft and Apple invested in the 90s (Apple’s PlainTalk speech recognition and Microsoft’s SAPI engine). The 21st century saw tech giants like Google and Facebook (Meta) assume leadership: Google’s research produced novel architectures (e.g., the sequence-to-sequence model in Tacotron) and leveraged deep learning at massive scale to deploy voice on every Android phone, while Facebook’s contributions like wav2vec and multilingual models expanded speech tech across languages. DeepMind, Google’s AI research arm, blended academic depth with industrial might to create WaveNet and other seminal models. And OpenAI showed with Whisper that even a relative newcomer to speech could open-source a model that sets a new standard for robust STT. Meanwhile, startups like ElevenLabs demonstrate that there is still room for innovation and productization in speech, focusing on hyper-realistic voice cloning and expressiveness.

The evolution of TTS and STT technologies illustrates how increased knowledge of speech, more data, and better algorithms have incrementally (and sometimes in leaps) brought synthetic and recognized speech ever closer to natural human performance. Today’s systems can not only read text aloud nearly indistinguishably from a human voice, but also transcribe spontaneous speech with very high accuracy, even across multiple languages and speakers. These capabilities are the result of decades of interdisciplinary research spanning signal processing, linguistics, and machine learning.

As we stand in the mid-2020s, speech interfaces are ubiquitous, yet challenges remain for the next chapters. Truly understanding context and meaning (beyond just transcribing words), generating fully expressive and emotionally nuanced speech, and doing all this in a privacy-preserving, computationally efficient way are still some way off. The historical trend suggests that further integration with language understanding models (like large language models) and multimodal learning could drive the next set of breakthroughs. The journey of TTS and STT thus far gives confidence that these technologies will continue to advance, making computers ever better listeners and speakers.


DJ Leamen is a Machine Learning and Generative AI Developer and Computer Science student with an interest in emerging technology and ethical development.

Subscribe to my free newsletter!

Special thanks to Tomasz Puzio for suggesting the topic of this week’s article!
