Implementing KittenML TTS 😻 in a LiveKit Voice AI Agent: Setup & Benchmark

Iqbal Rahadian
5 min read

Voice AI is rapidly becoming immersive, with real-time transcription, reasoning, and speech synthesis all working together. In my latest experiment, we tested KittenML TTS (preview) as the speech output layer in a LiveKit voice agent pipeline using OpenAI Realtime for speech-to-text (STT) and reasoning.

This post walks you through:

  1. How we integrated KittenML TTS into a LiveKit voice agent.

  2. How KittenML TTS performs in our benchmarks.

Scroll to the bottom to find a live TTS conversation recording 👇

Architecture Overview

The system we tested looks like this:

[ User Speech ]
    ↓ (LiveKit Audio Track)
[ LiveKit Realtime Agent ]
    ↓ (STT via OpenAI Realtime)
[ Text Response ]
    ↓ (TTS via KittenML)
[ Audio Output to User ]

This architecture ensures low-latency, real-time interaction between the user and the AI agent:

  • The LiveKit Server handles bi-directional audio streaming, enabling smooth conversation.

  • OpenAI Realtime API converts speech to text and generates intelligent responses.

  • KittenML TTS synthesizes the text into high-quality speech, delivering low-latency audio to the user.

By separating STT, reasoning, and TTS into distinct stages, we can benchmark each stage individually and optimize the pipeline for responsiveness and voice quality.
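Because the stages are separate, the user-perceived response time is roughly the LLM turn duration plus the TTS time-to-first-byte (TTFB): playback can start as soon as the first audio chunk arrives, so the rest of the TTS duration overlaps with playback. A minimal sketch of that back-of-the-envelope calculation (the helper name is mine, and the numbers are illustrative, taken from the kind of per-stage timings logged in the benchmark section):

```python
# Hypothetical helper: estimate user-perceived response latency
# from per-stage timings (in seconds).

def perceived_latency(llm_duration: float, tts_ttfb: float) -> float:
    """Time from end of user speech to first audible agent audio.

    Audio playback starts at the TTS time-to-first-byte, so the
    TTS total duration does not add to perceived latency.
    """
    return llm_duration + tts_ttfb

# Example with numbers similar to one benchmark turn:
# 0.653 s of LLM reasoning + 0.420 s TTS TTFB ≈ 1.07 s to first audio.
print(round(perceived_latency(0.653, 0.420), 3))  # 1.073
```

This is why TTFB, rather than total synthesis duration, is the TTS metric that matters most for conversational responsiveness.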

Why KittenML TTS?

We chose KittenML TTS because it offers:

  • Ultra-lightweight: Model size less than 25MB

  • CPU-optimized: Runs without GPU on any device

  • High-quality voices: Multiple premium voice options available

  • Fast inference: Offers real-time speech synthesis

You can find installation instructions, example code, and voice demos on the GitHub page to get started quickly.

Implementation

Before diving into code, make sure you have your environment set up with Python 3.8+ and all dependencies installed. We’ll walk through three main steps: installing KittenML TTS, creating a custom TTS plugin for LiveKit, and integrating it into an agent session.

1. Setting up KittenML TTS

You’ll first need to install KittenML TTS:

pip install https://github.com/KittenML/KittenTTS/releases/download/0.1/kittentts-0.1.0-py3-none-any.whl

2. Implement custom TTS plugin

from kittentts import KittenTTS
import numpy as np
from livekit.agents import tts
from livekit.agents.types import DEFAULT_API_CONNECT_OPTIONS

class KittenTTSPlugin(tts.TTS):
    def __init__(self, model_name="KittenML/kitten-tts-nano-0.1", voice="expr-voice-2-f", sample_rate=24000):
        # Initialize parent with required parameters (matching OpenAI TTS pattern)
        super().__init__(
            capabilities=tts.TTSCapabilities(streaming=False),
            sample_rate=sample_rate,
            num_channels=1,  # Mono audio
        )

        self.model = KittenTTS(model_name)
        self.voice = voice

        # Verify voice availability
        if voice not in self.model.available_voices:
            raise ValueError(f"Voice {voice} not available. Choose from {self.model.available_voices}")

    def synthesize(self, text: str, *, conn_options=DEFAULT_API_CONNECT_OPTIONS) -> tts.ChunkedStream:
        """
        Synthesize text to speech using KittenTTS.

        Args:
            text: The text to synthesize
            conn_options: Connection options (ignored for KittenTTS)

        Returns:
            ChunkedStream: Stream of synthesized audio chunks
        """
        return KittenTTSChunkedStream(tts=self, input_text=text, conn_options=conn_options)

    @property
    def name(self):
        return f"KittenTTS-{self.voice}"

class KittenTTSChunkedStream(tts.ChunkedStream):
    def __init__(self, *, tts: KittenTTSPlugin, input_text: str, conn_options):
        super().__init__(tts=tts, input_text=input_text, conn_options=conn_options)
        self._tts: KittenTTSPlugin = tts
        self._audio_generated = False

    async def _run(self, output_emitter: tts.AudioEmitter):
        """
        Generate audio using KittenTTS and emit it through the output emitter.
        This is the core method that must be implemented.
        """
        if self._audio_generated:
            return

        try:
            # Add padding to avoid cut-offs
            padded_text = self.input_text.strip() + " ... "

            # Generate audio waveform (numpy float32 array)
            audio = self._tts.model.generate(padded_text, voice=self._tts.voice)

            # Convert float32 to int16 (LiveKit requirement)
            # Clip to [-1, 1] first so out-of-range samples don't wrap,
            # then scale to the int16 range [-32768, 32767]
            audio_int16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)

            # Convert to bytes
            audio_bytes = audio_int16.tobytes()

            # Initialize the output emitter with PCM format
            output_emitter.initialize(
                request_id=str(id(self)),
                sample_rate=self._tts.sample_rate,
                num_channels=1,  # Mono
                mime_type="audio/pcm",  # Use PCM format for raw audio
            )

            # Push the audio data through the emitter
            output_emitter.push(audio_bytes)

            # Flush the emitter to indicate completion
            output_emitter.flush()

            self._audio_generated = True

        except Exception as e:
            # Handle any errors during synthesis, preserving the original traceback
            raise RuntimeError(f"KittenTTS synthesis failed: {e}") from e
        finally:
            # Ensure the emitter is properly closed
            try:
                output_emitter.end_input()
            except Exception:
                pass

3. Use it in your AgentSession

tts = KittenTTSPlugin(voice="expr-voice-4-m")

session = AgentSession(
    ...
    tts=tts,
    ...
)

Performance Benchmark

Here’s the performance benchmark of KittenML TTS in this setup.

Test Environment:

  • LiveKit Server: Localhost

  • TTS Server: KittenML running locally on the same machine

  • STT & Reasoning: OpenAI Realtime API

[Worker 47c2e9b4] Waiting for session to end naturally... 
⚡ Realtime Model - Duration: 0.072s, Input tokens: 0, Output tokens: 0, Total tokens: 0, Tokens/sec: 0.0 
⚡ Realtime Model - Duration: 0.653s, Input tokens: 429, Output tokens: 11, Total tokens: 440, Tokens/sec: 16.8 
🗣️ TTS Stage - Duration: 0.433s, TTFB: 0.420s, Audio: 2.335s, Characters: 34 
⚡ Realtime Model - Duration: 0.947s, Input tokens: 479, Output tokens: 23, Total tokens: 502, Tokens/sec: 24.3 
🗣️ TTS Stage - Duration: 0.476s, TTFB: 0.466s, Audio: 3.485s, Characters: 50 
🗣️ TTS Stage - Duration: 0.450s, TTFB: 0.432s, Audio: 3.235s, Characters: 54 
⚡ Realtime Model - Duration: 0.482s, Input tokens: 537, Output tokens: 14, Total tokens: 551, Tokens/sec: 29.0 
🗣️ TTS Stage - Duration: 0.461s, TTFB: 0.453s, Audio: 3.235s, Characters: 44 
⚡ Realtime Model - Duration: 0.429s, Input tokens: 586, Output tokens: 24, Total tokens: 610, Tokens/sec: 56.0, Cached: 512 
🗣️ TTS Stage - Duration: 0.404s, TTFB: 0.398s, Audio: 2.860s, Characters: 48 
🗣️ TTS Stage - Duration: 0.444s, TTFB: 0.421s, Audio: 3.060s, Characters: 50 
⚡ Realtime Model - Duration: 0.854s, Input tokens: 647, Output tokens: 30, Total tokens: 677, Tokens/sec: 35.1, Cached: 576 
🗣️ TTS Stage - Duration: 0.417s, TTFB: 0.409s, Audio: 2.835s, Characters: 48 
🗣️ TTS Stage - Duration: 0.773s, TTFB: 0.738s, Audio: 6.235s, Characters: 105
⚡ Realtime Model - Duration: 0.445s, Input tokens: 701, Output tokens: 13, Total tokens: 714, Tokens/sec: 29.2, Cached: 512 
🗣️ TTS Stage - Duration: 0.417s, TTFB: 0.413s, Audio: 2.610s, Characters: 46
📊 Session Usage Summary: UsageSummary(llm_prompt_tokens=3379, llm_prompt_cached_tokens=1600, llm_completion_tokens=115, tts_characters_count=479, tts_audio_duration=29.89, stt_audio_duration=0.0)

We ran multiple conversation sessions to measure the responsiveness of both the reasoning (LLM) and TTS stages. Each stage was timed separately to understand where the majority of the latency occurs.

👉 For example, here is a short test conversation demo

Based on this short conversation, we get:

  1. An average LLM reasoning latency of 0.556 seconds

  2. An average TTS latency of 0.531 seconds

This shows KittenML TTS can reliably generate high-quality speech with sub-second latency, even on a CPU.
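Another useful way to read the TTS log lines is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. Values well below 1.0 mean the model generates speech faster than it plays back. A quick calculation using figures from the first TTS line in the log above (the helper name is mine):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means faster-than-real-time synthesis."""
    return synthesis_seconds / audio_seconds

# First TTS turn in the log: 0.433 s to synthesize 2.335 s of audio.
rtf = real_time_factor(0.433, 2.335)
print(round(rtf, 3))  # 0.185
```

An RTF around 0.2 on CPU leaves comfortable headroom for longer responses without the synthesis stage ever falling behind playback.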

Conclusion

Integrating KittenML TTS into a LiveKit voice agent is straightforward and delivers fast and CPU-efficient speech. This makes it an excellent choice for real-time Voice AI applications where GPU resources are limited or a low-latency response is critical.
