Building a Voice-Controlled Web App with the Web Speech API

Prince Azubuike

Introduction

In recent years, voice-controlled interfaces have become increasingly popular, changing the way we interact with technology. They show up in virtual assistants, smart speakers, and native apps, and large social media applications let users send voice messages from their microphones. Voice commands offer a hands-free and convenient way to engage with applications. With the Web Speech API, developers can bring voice control to web applications, letting users interact with them through speech input and output.

In this article, we will explore voice-controlled web applications using the Web Speech API. We will build a simple page reader web app, and by the end you will have the knowledge to build interactive web applications that can understand and respond to voice commands.

Prerequisites

  • Prior knowledge of HTML, CSS, and JavaScript

  • A code editor installed, e.g. VS Code

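What is the Web Speech API?

The Web Speech API is a browser API that brings speech to web pages. It has two parts: SpeechSynthesis, the controller for the text-to-speech service, which offers methods such as speak(), pause(), resume(), cancel(), and getVoices(); and SpeechRecognition, which converts speech from the microphone into text.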

Browser Compatibility
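
Support for the synthesis and recognition halves of the API is not uniform across browsers, so it can help to check at runtime before using either one. The snippet below is a minimal sketch of such a check; detectSpeechSupport is a hypothetical helper name, not part of the API, and it only tests whether the relevant globals exist.

```javascript
// Hypothetical helper: reports which parts of the Web Speech API exist
// on the given global object (defaults to the current global scope).
function detectSpeechSupport(scope = globalThis) {
  return {
    // Text-to-speech needs both the controller and the utterance constructor
    synthesis:
      "speechSynthesis" in scope && "SpeechSynthesisUtterance" in scope,
    // Speech-to-text may only be exposed under the webkit prefix
    recognition:
      "SpeechRecognition" in scope || "webkitSpeechRecognition" in scope,
  };
}

const support = detectSpeechSupport();
console.log("Speech synthesis available:", support.synthesis);
console.log("Speech recognition available:", support.recognition);
```

A page could use the result to hide a microphone or "Read Page" button when the matching feature is missing, instead of failing when the button is clicked.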

SpeechSynthesis Interface

SpeechSynthesis is the controller interface for the speech service. It’s a text-to-speech module that allows programs to read out their text content.

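Its counterpart, SpeechRecognition, handles speech input. To start using speech recognition, we need to create an instance from its constructor, which some browsers only expose under the webkit prefix.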

// fall back to the webkit-prefixed constructor where the standard one is missing
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
// create a recognition instance
const recognition = new SpeechRecognition();

Once the instance is created, we get a handful of useful properties and event handlers. recognition.lang sets the language the recognizer listens for, e.g. en-US; if it is not set, the language of the HTML document is used. recognition.onstart fires when the service starts listening for speech, recognition.onresult fires when the recognizer returns a result, which is available on the event object passed to the handler, recognition.onend fires when the service disconnects, and recognition.onerror fires when a recognition error occurs.

const recognition = new SpeechRecognition();

recognition.lang = 'en-US';

recognition.onstart = () => {
  console.log('Speech recognition started');
};

recognition.onresult = (event) => {
  const results = event.results;
  for (let i = event.resultIndex; i < results.length; i++) {
    const transcript = results[i][0].transcript;
    console.log('Recognized speech:', transcript);
  }
};

recognition.onend = () => {
  console.log('Speech recognition ended');
};

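recognition.onerror = (event) => {
  console.error('Speech recognition error:', event.error);
};

// None of the handlers fire until recognition is started;
// the browser will ask the user for microphone access
recognition.start();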
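
Since the transcript delivered to onresult is a plain string, reacting to voice commands largely comes down to matching that string against a list of known phrases. The following is a rough sketch of that idea, not a prescribed approach: the COMMANDS table, finalTranscript, and matchCommand are all hypothetical names introduced for illustration, and the matching is a simple exact comparison after normalization.

```javascript
// Hypothetical command table: spoken phrase -> action name.
const COMMANDS = {
  "read page": "read",
  "stop reading": "stop",
  "scroll down": "scrollDown",
};

// Joins the transcripts of all final results in a recognition event,
// skipping interim results that may still change.
function finalTranscript(event) {
  let text = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    if (event.results[i].isFinal) {
      text += event.results[i][0].transcript;
    }
  }
  return text;
}

// Normalizes a transcript (trim, lowercase, drop trailing punctuation)
// and returns the matching action name, or null if nothing matches.
function matchCommand(transcript, commands = COMMANDS) {
  const normalized = transcript
    .trim()
    .toLowerCase()
    .replace(/[.,!?]+$/, "");
  return Object.prototype.hasOwnProperty.call(commands, normalized)
    ? commands[normalized]
    : null;
}

console.log(matchCommand(" Read page. ")); // "read"
console.log(matchCommand("make me a sandwich")); // null
```

Inside an onresult handler, the two helpers could be combined as matchCommand(finalTranscript(event)), with the returned action name deciding which function to run.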

SpeechSynthesisUtterance

The SpeechSynthesisUtterance interface of the Web Speech API represents a speech request. It contains the content the speech service should read and how to read it (e.g. language, pitch, and volume).
An utterance also has to be created from its constructor, which gives us access to its properties.

const utterance = new SpeechSynthesisUtterance();

We can then set the text, pitch, rate, volume, and language.

const utterance = new SpeechSynthesisUtterance();
utterance.text = 'Hello, how are you?';
utterance.lang = 'en-US';
utterance.volume = 1;
utterance.rate = 1;
utterance.pitch = 1;

Here is a breakdown of the SpeechSynthesisUtterance properties.

  1. text: Gets or sets the text content of the utterance. This is the text that will be spoken by the speech synthesis engine.

  2. lang: Gets or sets the language of the utterance. It specifies the language and locale for speech synthesis. For example, setting it to 'en-US' indicates American English.

  3. volume: Gets or sets the volume of the utterance. It ranges from 0 (silent) to 1 (loudest), and 1 is the default.

  4. rate: Gets or sets the rate of the utterance. It determines the speed at which the speech is spoken and can range from 0.1 to 10, where 1 is the default rate.

  5. pitch: Gets or sets the pitch of the utterance. It controls the perceived "highness" or "lowness" of the speech and can range from 0 to 2, where 1 is the default pitch.
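
When these values come from user input, such as sliders on a settings panel, it may be worth clamping them before assigning them to an utterance. The sketch below assumes the ranges given in the Web Speech API specification (volume 0 to 1, rate 0.1 to 10, pitch 0 to 2); clampUtteranceOptions and LIMITS are hypothetical names, not part of the API.

```javascript
// Ranges and defaults assumed from the Web Speech API specification.
const LIMITS = {
  volume: { min: 0, max: 1, fallback: 1 },
  rate: { min: 0.1, max: 10, fallback: 1 },
  pitch: { min: 0, max: 2, fallback: 1 },
};

// Returns a new object with volume, rate, and pitch forced into range.
// Missing or non-numeric values are replaced with the default of 1.
function clampUtteranceOptions(options = {}) {
  const result = {};
  for (const [name, { min, max, fallback }] of Object.entries(LIMITS)) {
    const raw = options[name];
    result[name] = Number.isFinite(raw)
      ? Math.min(max, Math.max(min, raw))
      : fallback;
  }
  return result;
}

console.log(clampUtteranceOptions({ volume: 3, rate: 0, pitch: 0.5 }));
// { volume: 1, rate: 0.1, pitch: 0.5 }
```

In a browser, the result can be copied onto an utterance with Object.assign(utterance, clampUtteranceOptions(userSettings)), where userSettings stands for whatever object holds the user's choices.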

Speaking out to the device

To send speech to the device's audio output, we call the speak() method of the speechSynthesis object (available as window.speechSynthesis) and pass it a SpeechSynthesisUtterance. The text to be spoken is stored in the text property of the utterance.

const utterance = new SpeechSynthesisUtterance();
utterance.text = 'Hello, how are you?';
utterance.lang = 'en-US';
utterance.volume = 1;
utterance.rate = 1;
utterance.pitch = 1;

speechSynthesis.speak(utterance);
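
The utterance is spoken with a default voice unless its voice property is set to one of the voices returned by speechSynthesis.getVoices(). As a minimal sketch, the hypothetical pickVoice helper below chooses a voice for a language tag from such a list. It assumes each entry has the lang and default fields of a SpeechSynthesisVoice, and falls back from an exact match, to the same base language, to the default voice.

```javascript
// Hypothetical helper: picks a voice for a language tag from a list such as
// the one returned by speechSynthesis.getVoices().
function pickVoice(voices, lang) {
  // Normalize tags so "en_US" and "en-us" compare equal to "en-US"
  const normalize = (tag) => tag.replace("_", "-").toLowerCase();
  const wanted = normalize(lang);
  const base = wanted.split("-")[0];

  return (
    // 1. exact language and region
    voices.find((v) => normalize(v.lang) === wanted) ||
    // 2. same base language, any region
    voices.find((v) => normalize(v.lang).split("-")[0] === base) ||
    // 3. whatever the platform marks as default
    voices.find((v) => v.default) ||
    null
  );
}

// In a browser (the list can be empty until the voiceschanged event fires):
// const voice = pickVoice(speechSynthesis.getVoices(), "en-US");
// if (voice) utterance.voice = voice;
```

Because voices can load asynchronously, a real page would likely run this again from a voiceschanged listener on speechSynthesis rather than only once at startup.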

Page Reader App

With SpeechRecognition and SpeechSynthesis covered, we are ready to build a single-page website that reads its text content out through the user's device speakers at the click of a button.
We will only use HTML and JavaScript.

Set up your coding environment and create two files: index.html and app.js.
Add some simple content to your index.html file.

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Web Speech API</title>
  </head>
  <body>
    <h1>Web Speech API - Page Reader Example</h1>

    <button class="readButton">Read Page</button>
    <div class="to-read">
      <h2>Overview</h2>

      <p>
        <strong>Plot!</strong> <br />
        Avengers is a blockbuster superhero film franchise produced by Marvel
        Studios, based on the Marvel Comics superhero team of the same name. The
        movies bring together a diverse group of superheroes, each with their
        own unique powers and personalities, to save the world from various
        threats. The Avengers movies revolve around a team of extraordinary
        individuals who join forces to protect humanity from powerful villains
        and world-ending catastrophes. As the Earth faces imminent danger, the
        Avengers assemble to defend it using their incredible abilities,
        advanced technology, and unwavering courage. Throughout the franchise,
        the Avengers face formidable foes such as Loki, Ultron, and Thanos, each
        posing their own set of challenges and requiring the heroes to come
        together and fight as a unified force. The movies showcase epic battles,
        dramatic character development, and unexpected twists that keep
        audiences on the edge of their seats.
      </p>

      <h2>Memorable Characters</h2>

      <ul>
        <li>Iron Man (Tony Stark)</li>
        <li>Captain America (Steve Rogers)</li>
        <li>Thor (God of Thunder)</li>
        <li>Hulk (Bruce Banner)</li>
        <li>Black Widow (Natasha Romanoff)</li>
        <li>Hawkeye (Clint Barton)</li>
      </ul>

      <h2>Legacy</h2>

      <p>
        The Avengers movies have left an indelible mark on the superhero genre
        and have become a cultural phenomenon. With their engaging storytelling,
        spectacular visual effects, and a perfect blend of action, humor, and
        emotion, these films have garnered immense popularity and critical
        acclaim.
      </p>

      <p>
        The success of the Avengers franchise has also paved the way for the
        Marvel Cinematic Universe (MCU), a vast interconnected universe of
        superhero movies and TV shows that spans multiple storylines and
        characters.
      </p>
    </div>
    <script src="./app.js"></script>
  </body>
</html>

The app.js file is included at the end of the body element, so the script runs after the page content has been parsed. We are now ready to write a function that reads our page content using SpeechSynthesis.

const textToRead = document.querySelector(".to-read");
const readBtn = document.querySelector(".readButton");

function speakPageContent() {
  const content = textToRead.innerText;

  const utterance = new SpeechSynthesisUtterance(content);
  utterance.pitch = 0.5;
  utterance.rate = 1;
  utterance.volume = 1;
  utterance.lang = "en-US";

  window.speechSynthesis.speak(utterance);
}

readBtn.addEventListener("click", speakPageContent);

The code snippet above gets the HTML elements using the document.querySelector method. It also defines a speakPageContent function that gets the text to be spoken from the page and creates a SpeechSynthesisUtterance from it. We then set the properties of the utterance, such as pitch, rate, volume, and language, and finally call the speak() method of window.speechSynthesis.
When a user visits the webpage and wants to listen to the page content instead of reading it, they can simply click the "Read Page" button. The JavaScript code triggers the speech synthesis functionality, converting the text content into audible speech using the Web Speech API.
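
One possible refinement is to let the same button pause and resume the reading instead of queuing the text again on every click. The sketch below is one way this might be done, using the speaking and paused flags and the pause() and resume() methods of speechSynthesis. The names nextReaderAction and toggleReading are hypothetical, as is the makeUtterance callback that builds the utterance to read.

```javascript
// Decides what a button press should do, based on the speaking and paused
// flags exposed by window.speechSynthesis.
function nextReaderAction(synth) {
  if (synth.speaking && synth.paused) return "resume";
  if (synth.speaking) return "pause";
  return "speak";
}

// Applies the decision and returns the action that was taken.
// makeUtterance is a hypothetical callback that builds the utterance,
// e.g. from the text of the .to-read element.
function toggleReading(synth, makeUtterance) {
  const action = nextReaderAction(synth);
  if (action === "resume") {
    synth.resume();
  } else if (action === "pause") {
    synth.pause();
  } else {
    synth.speak(makeUtterance());
  }
  return action;
}

// Possible wiring in app.js, replacing the existing click listener:
// readBtn.addEventListener("click", () =>
//   toggleReading(
//     window.speechSynthesis,
//     () => new SpeechSynthesisUtterance(textToRead.innerText)
//   )
// );
```

Passing the synthesizer in as an argument keeps the decision logic separate from the browser globals, which makes it easier to try out with a stand-in object.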

Conclusion

The Web Speech API opens up a world of possibilities for web development, letting users consume page content more easily and quickly. It also makes it possible to build voice-controlled chat interfaces in web apps that behave like native chat apps and engage customers better.

By harnessing the power of the Web Speech API, developers can integrate speech recognition and synthesis capabilities into their web applications. This technology brings the potential for more inclusive, engaging, and efficient web experiences.

Find this article helpful? Drop a like and comment.

Happy SpeechSynthesis!


Written by

Prince Azubuike

As a versatile front-end developer, I specialize in translating UI/UX wireframes into captivating web applications using JavaScript, frameworks, and libraries. I work collaboratively with backend teams to ensure the creation of stellar finished products. In addition to my development expertise, I am a skilled technical writer. I have a passion for conveying complex concepts in a clear and concise manner, making them accessible to diverse audiences. Whether it's crafting engaging articles, comprehensive tutorials, or precise documentation, my goal is to provide informative and easily understandable content.