Using a Speech Language Model That Can Listen While Speaking

Amos Gyamfi

Traditional Speech Language Models (SLMs) and conversational AI apps can only have one-sided (turn-based) conversations. For example, you speak to Siri and wait for it to respond. SLMs cannot handle the real-time interruptions that make conversations interactive and mirror how humans communicate. However, recent research shows that a language model can listen while speaking, paving the way to more natural human-to-SLM interactions.

This article introduces the listening-while-speaking language model (LSLM), which makes speech conversations with AI feel more dynamic and natural.

In this Article

You will learn what a listening-while-speaking language model (LSLM) is, how it works, how it handles human interruptions and background noise, and its potential use cases in real-life scenarios.

Traditional Turn-Based Conversational Models

SLMs are generally voice-based conversational AI models trained for speech-to-speech interactions. Think of major turn-based conversational models and voice assistants like Siri, Amazon Alexa, and Google Assistant. Although these voice assistants can hold turn-by-turn conversations, they cannot handle real-time interruptions when users are unsatisfied with the answers they provide. SLMs also struggle in noisy surroundings, such as speaking to Siri with construction noise in the background or other people talking close to you. These limitations arise because SLMs are half-duplex systems: they cannot speak and listen to users simultaneously.

The following sections introduce a proposed SLM designed to handle turn-taking and human interruptions, much like the GPT-4o voice assistant.

A Proposed SLM That Can Speak and Listen Simultaneously

The listening-while-speaking language model (LSLM) is designed to speak and listen simultaneously using a communication mechanism known as full-duplex modeling (FDM). FDM allows a voice assistant or conversational model and its users to speak and listen at the same time, accommodating real-time user interruptions. Implementing FDM in a voice assistant like Siri removes the need to always wait for turns during human-AI speech conversations.

The new GPT-4o model demonstrates an excellent integration of this technology in a real-world scenario. It can handle real-time user interruptions originating from vision and sound during conversations. For example, if you are speaking to the model and another person suddenly appears near you, the model can quickly distinguish you from the other person. The model can also be interrupted by sound while it is answering a prompt. For instance, while GPT-4o is describing a scenario, you can interrupt it with your voice and ask it to stop, sing instead, or do something else.

Key Features of LSLMs

The following are the main characteristics of LSLMs that differentiate them from traditional SLMs like Google Assistant.

  • Speak and listen at the same time: Unlike traditional turn-based voice assistants, LSLMs and their users can speak and listen simultaneously. This is possible because of the models' underlying full-duplex mechanism.

  • Interruption detection: Regardless of the modality, whether speech, text, or vision, LSLMs can be interrupted at any point during an ongoing conversation. They can recognize interruptions, such as a request to switch tasks, and respond instantly.

  • Adaptation and noise detection: During an ongoing conversation, these models can detect sudden and continuous noises in the speaker's environment. This ability also helps in other situations, such as distinguishing the main speaker from other people talking in close range.

  • Context awareness: They can track users' preferences in real time and immediately switch an ongoing task or conversation when necessary.

  • Unseen speaker detection: An LSLM can detect and respond to unseen speakers, meaning voices it did not encounter during training, without interpreting them as noise.

LSLM: How It Works

The LSLM uses full-duplex modeling to let the model and its users speak and listen at the same time rather than waiting for the other party to finish. For output, it uses a token-based, decoder-only text-to-speech (TTS) component that generates speech from tokenized data (words broken into smaller units or chunks); the token-based decoder improves the model's response time and output quality. For input, it uses a streaming self-supervised learning (SSL) encoder to process voice continuously. The SSL encoder transforms audio into a format the model can understand, enabling real-time speech interaction.
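
As a rough illustration of this loop, the sketch below pairs a stand-in streaming encoder with a stand-in autoregressive decoder. All class and function names here are hypothetical, and hashing replaces the learned tokenizers; the point is only the shape of the per-chunk listen-then-speak cycle, not the paper's actual implementation.

```python
# Minimal sketch of the LSLM processing loop, assuming invented names.
from dataclasses import dataclass, field


@dataclass
class StreamingSSLEncoder:
    """Turns raw audio chunks into discrete listening tokens, one chunk at a time."""

    def encode(self, audio_chunk: bytes) -> int:
        # A real SSL encoder would map the waveform to a learned token;
        # hashing merely stands in for that here.
        return hash(audio_chunk) % 1024


@dataclass
class DecoderOnlyTTS:
    """Autoregressive token-based decoder that emits one speech token per step."""

    history: list = field(default_factory=list)

    def next_token(self, listening_token: int) -> int:
        # A real decoder conditions on its own past tokens *and* the fused
        # listening token, so what it hears can change what it says next.
        token = (sum(self.history[-4:]) + listening_token) % 1024
        self.history.append(token)
        return token


encoder, tts = StreamingSSLEncoder(), DecoderOnlyTTS()
for chunk in (b"hello", b"hold on", b"stop"):  # simulated microphone chunks
    heard = encoder.encode(chunk)              # listening channel
    spoken = tts.next_token(heard)             # speaking channel
    print(f"heard token {heard:4d} -> spoke token {spoken:4d}")
```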

LSLM: Fusion Mechanisms

The LSLM explores three fusion techniques, early fusion, middle fusion, and late fusion, to integrate its speaking and listening capabilities and ensure optimized data processing. It is also tested under two interaction settings: a command-based FDM, in which the model must respond to a specific spoken command and distinguish that interruption from background noise, and a voice-based FDM, which evaluates the model's interruption handling and how well it manages speech from a variety of speakers.
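
The toy sketch below shows where each fusion point sits relative to the decoder layers: early fusion merges the channels once before the first layer, middle fusion re-injects the listening signal at every layer, and late fusion merges only at the output. The vector arithmetic and names are invented for illustration; the real model fuses learned embeddings inside a Transformer decoder, not plain Python lists.

```python
# Toy illustration of the three fusion points, using list-of-float "embeddings".
def fuse(speak_vec, listen_vec):
    """Element-wise sum stands in for however the model merges the channels."""
    return [s + l for s, l in zip(speak_vec, listen_vec)]

def early_fusion(speak_in, listen_in, layers):
    x = fuse(speak_in, listen_in)          # merge once, before the first layer
    for layer in layers:
        x = layer(x)
    return x

def middle_fusion(speak_in, listen_in, layers):
    x = speak_in
    for layer in layers:
        x = layer(fuse(x, listen_in))      # re-inject the listening signal at every layer
    return x

def late_fusion(speak_in, listen_in, layers):
    x = speak_in
    for layer in layers:
        x = layer(x)
    return fuse(x, listen_in)              # merge only after the last layer

layers = [lambda v: [0.5 * x for x in v]] * 2   # stand-in decoder layers
speak, listen = [1.0, 2.0], [0.1, 0.2]
for name, fn in [("early", early_fusion), ("middle", middle_fusion), ("late", late_fusion)]:
    print(name, fn(speak, listen, layers))
```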

The LSLM operates using two main channels. The speaking channel handles speech generation via the TTS decoder. The listening channel processes input speech and identifies environmental noise and interruptions so the model can switch and adjust context.

LSLM Interruption Handling

When the LSLM is talking and a user interrupts to adjust or switch context, the LSLM can stop speaking immediately, listen, and shift focus using an interruption token. The interruption token helps the model terminate an ongoing utterance quickly when turn-taking occurs, respond to interruptions and previously unseen human voices, and dismiss environmental noise.
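
A minimal sketch of this behavior might look like the loop below. The IRQ token name, the detector, and the token format are assumptions made for illustration; in the actual model, interruption detection is learned rather than rule-based.

```python
# Toy event loop for the interruption behavior described above.
IRQ = "<IRQ>"  # hypothetical special token the model emits when it yields the floor

def detect_interruption(listen_token: str) -> bool:
    # A learned classifier in the real model; here, any token tagged as a
    # human voice counts as an interruption, while noise is dismissed.
    return listen_token.startswith("voice:")

def speaking_loop(planned_speech, mic_stream):
    for speech_token, listen_token in zip(planned_speech, mic_stream):
        if detect_interruption(listen_token):
            yield IRQ          # terminate the utterance immediately...
            return             # ...and hand the turn back to the user
        yield speech_token     # otherwise keep talking

planned = ["The", "weather", "today", "is", "sunny"]
mic = ["noise:hvac", "noise:traffic", "voice:wait", "silence", "silence"]
print(list(speaking_loop(planned, mic)))  # -> ['The', 'weather', '<IRQ>']
```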

Full Duplex Speech Language Models

As noted earlier, LSLMs can speak, listen, and respond to sudden user interruptions because of their full-duplex capabilities. Full-duplex communication resembles a human-to-human conversation: each person can talk and listen at the same time, and the listener can interrupt at any moment. Half-duplex communication, by contrast, is similar to direct chat messaging: while one participant is typing, the other has to wait until the message is sent.
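
To make the contrast concrete, here is a minimal sketch using Python's asyncio that simulates both modes. The audio streams are faked with print statements and timers; a real system would read from a microphone while writing to a speaker.

```python
# Full duplex runs both channels concurrently; half duplex takes strict turns.
import asyncio

async def listen():
    for heard in ("mm-hm", "wait, actually...", "go on"):
        await asyncio.sleep(0.1)
        print(f"[listening] {heard}")

async def speak():
    for said in ("Today's forecast", "calls for rain", "until evening."):
        await asyncio.sleep(0.1)
        print(f"[speaking]  {said}")

async def full_duplex():
    # Both channels run at once: the model hears while it talks.
    await asyncio.gather(listen(), speak())

async def half_duplex():
    # Traditional turn-taking: speaking starts only after listening ends.
    await listen()
    await speak()

print("--- full duplex (interleaved) ---")
asyncio.run(full_duplex())
print("--- half duplex (sequential) ---")
asyncio.run(half_duplex())
```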

GPT-4o is an excellent example of a commercial spoken dialogue system using full-duplex speech technology, although it is a text-centric LLM at its core. Another speech language model that uses a full-duplex speech system is Moshi Chat, an experimental conversational AI that can listen, think, and talk at the same time.

Advantages of LSLM Over SLM

The following highlights the benefits of LSLMs in voice conversational apps.

  • GPT-4o-like interactions: Like GPT-4o, LSLMs implemented in consumer products can directly observe tone, multiple speakers, background noise, laughter, singing, and user emotion.

  • Not limited to turn-based conversations: Although the model still supports turn-taking, you can interrupt it with a new prompt at any point in the conversation. The ability to redirect the LSLM with a new instruction removes the awkward pauses typical of traditional SLM interactions.

  • Ability to distinguish human voices from noise: The LSLM can differentiate speech from high- and low-frequency white noise and traffic noise. Traditional SLMs, by contrast, cannot detect unseen speakers or cope with high-frequency background noise such as phone ringtones or nearby road construction.

  • Integration with other modalities: Like GPT-4o, LSLMs can integrate seamlessly with other modalities, such as vision. In some situations, multi-modal input combining voice and vision is necessary to improve the model's effectiveness and performance.

Limitations of LSLMs

Although LSLMs have several real-world applications and advantages over traditional SLMs, they sometimes fail to interpret malformed user instructions and can misjudge interruptions in high-frequency noisy conditions. The following are some other limitations associated with LSLMs.

  • Universal accessibility: Currently, the model supports only English. Supporting other languages with unique rhythms and structures would make it accessible to more cultures.

  • Performance: Based on the tests conducted in the listening-while-speaking paper, maintaining high performance under noisy conditions can be challenging when the model is used in voice-based applications. High-frequency background noise, for example, can cause a drop in performance.

  • Interruption handling: Handling interruptions in conditions where human voices mix with low- and high-frequency noise can be challenging.

  • Cybersecurity risks: Like any other application or service that processes real-time voice input, an LSLM faces cybersecurity risks. Because the model listens continuously to capture human speech, it can unintentionally record sensitive personal data that attackers could intercept if the data is not handled correctly.

  • Accuracy and reliability: Because the LSLM relies heavily on voice input, it can struggle when verbal commands are unclear. Varying accents and misinterpreted voice instructions can also affect the model's reliability and the accuracy of its results.

  • Limited to a few voice presets: While these AI models, like GPT-4o, excel at real-time voice communication, they are often restricted to a small number of predefined male and female voice presets for output generation and user interactions. This limitation can make interacting with them feel unnatural and less personal in diverse user settings. Users also cannot customize the predefined voices or add personalized ones.

LSLM Use Cases

LSLMs can enhance speech communication in various ways, including instant translation and collaboration, customer service, virtual assistance, and more. The following highlights some of their potential application areas.

  • Healthcare assistance: LSLMs can act as assistants for patients and doctors to ensure improved healthcare. They could listen to patients, interpret symptoms, and help in early diagnosis of illnesses.

  • Real-time collaboration: In a collaborative environment, LSLM-based applications can facilitate seamless collaboration among multiple participants, allowing them to hold real-time discussions during brainstorming sessions.

  • Real-time tutoring: LSLMs can provide personalized tutoring to a wide range of users. For example, they can help kids solve complex math problems by interacting with them step by step in real time.

  • Language learning and translation: In a language learning setting, people can ask the model to pronounce words or translate sentences from a language they are unfamiliar with. For example, you can speak in English and ask the AI model to translate your words into another language.

  • Meeting AI and companionship: When implemented in video meeting apps, LSLMs can answer participants' questions about a meeting, handling tasks like summarization and note-taking. They can also retrieve action points from a previous meeting in real time.

  • Augmenting vision experiences: When these interactive speech language models are combined with other modalities like vision, users can ask a model to describe what it observes in a particular surrounding and explain what is happening.

  • Customer service interactions: Integrating these voice models into a customer service setting can help resolve order issues and troubleshoot problems.

Conclusion

This article introduced you to the exciting research area of interactive speech dialogue systems and speech-based conversational AI. You have learned why traditional speech conversational systems like Alexa or Siri do not support real-time spoken interaction and user interruptions during conversations. Unlike those models, LSLMs support real-time audio/voice communication, allowing both turn-based interactions and mid-conversation interruptions so users can adjust their voice commands whenever they prefer.

We discussed the key features, benefits, and limitations of LSLMs and how they can be helpful in various scenarios and application areas. The accompanying research paper provides more information about LSLMs and tools for evaluating them. The related links in this article also provide helpful information on AI chat and LLMs.
