Implementing KittenML TTS 😻 in a LiveKit Voice AI Agent: Setup & Benchmark


Voice AI is rapidly becoming immersive, with real-time transcription, reasoning, and speech synthesis all working together. In this experiment, we tested KittenML TTS (preview) as the speech output layer in a LiveKit voice agent pipeline, using OpenAI Realtime for speech-to-text (STT) and reasoning.
This post walks you through:
How we integrated KittenML TTS into a LiveKit voice agent.
The performance benchmarks of KittenML TTS in this setup.
Scroll to the bottom to find a live TTS conversation recording 👇
Architecture Overview
The system we tested looks like this:
[ User Speech ]
↓ (LiveKit Audio Track)
[ LiveKit Realtime Agent ]
↓ (STT via OpenAI Realtime)
[ Text Response ]
↓ (TTS via KittenML)
[ Audio Output to User ]
This architecture ensures low-latency, real-time interaction between the user and the AI agent:
The LiveKit Server handles bi-directional audio streaming, enabling smooth conversation.
OpenAI Realtime API converts speech to text and generates intelligent responses.
KittenML TTS synthesizes the text into high-quality speech, delivering low-latency audio to the user.
By separating STT, reasoning, and TTS into distinct stages, we can benchmark each stage individually and optimize the pipeline for responsiveness and voice quality.
Why KittenML TTS?
We chose KittenML TTS because it offers:
Ultra-lightweight: Model size less than 25MB
CPU-optimized: Runs without GPU on any device
High-quality voices: Multiple premium voice options available
Fast inference: Offers real-time speech synthesis
You can find installation instructions, example code, and voice demos on the GitHub page to get started quickly.
Implementation
Before diving into code, make sure you have your environment set up with Python 3.8+ and all dependencies installed. We’ll walk through three main steps: installing KittenML TTS, creating a custom TTS plugin for LiveKit, and integrating it into an agent session.
1. Setting up KittenML TTS
You’ll first need to install KittenML TTS:
pip install https://github.com/KittenML/KittenTTS/releases/download/0.1/kittentts-0.1.0-py3-none-any.whl
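Once it's installed, it's worth running a quick smoke test outside of LiveKit. Here is a minimal sketch following the usage example in the KittenTTS README (it assumes you also have soundfile installed to write the WAV):

from kittentts import KittenTTS
import soundfile as sf

# Load the nano model (~25MB, downloaded on first use)
m = KittenTTS("KittenML/kitten-tts-nano-0.1")

# Generate a float32 waveform with one of the bundled voices
audio = m.generate("KittenTTS is up and running.", voice="expr-voice-2-f")

# KittenTTS produces 24 kHz mono audio
sf.write("smoke_test.wav", audio, 24000)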
2. Implementing a custom TTS plugin
from kittentts import KittenTTS
import numpy as np
from livekit.agents import tts
from livekit.agents.types import DEFAULT_API_CONNECT_OPTIONS


class KittenTTSPlugin(tts.TTS):
    def __init__(self, model_name="KittenML/kitten-tts-nano-0.1", voice="expr-voice-2-f", sample_rate=24000):
        # Initialize parent with required parameters (matching OpenAI TTS pattern)
        super().__init__(
            capabilities=tts.TTSCapabilities(streaming=False),
            sample_rate=sample_rate,
            num_channels=1,  # Mono audio
        )
        self.model = KittenTTS(model_name)
        self.voice = voice
        # Verify voice availability
        if voice not in self.model.available_voices:
            raise ValueError(f"Voice {voice} not available. Choose from {self.model.available_voices}")

    def synthesize(self, text: str, *, conn_options=DEFAULT_API_CONNECT_OPTIONS) -> tts.ChunkedStream:
        """
        Synthesize text to speech using KittenTTS.

        Args:
            text: The text to synthesize
            conn_options: Connection options (ignored for KittenTTS)

        Returns:
            ChunkedStream: Stream of synthesized audio chunks
        """
        return KittenTTSChunkedStream(tts=self, input_text=text, conn_options=conn_options)

    @property
    def name(self):
        return f"KittenTTS-{self.voice}"


class KittenTTSChunkedStream(tts.ChunkedStream):
    def __init__(self, *, tts: KittenTTSPlugin, input_text: str, conn_options):
        super().__init__(tts=tts, input_text=input_text, conn_options=conn_options)
        self._tts: KittenTTSPlugin = tts
        self._audio_generated = False

    async def _run(self, output_emitter: tts.AudioEmitter):
        """
        Generate audio using KittenTTS and emit it through the output emitter.
        This is the core method that must be implemented.
        """
        if self._audio_generated:
            return
        try:
            # Add padding to avoid cut-offs
            padded_text = self.input_text.strip() + " ... "
            # Generate audio waveform (numpy float32 array)
            audio = self._tts.model.generate(padded_text, voice=self._tts.voice)
            # Convert float32 to int16 (LiveKit requirement),
            # scaling from [-1, 1] to [-32768, 32767]
            audio_int16 = (audio * 32767).astype(np.int16)
            # Convert to bytes
            audio_bytes = audio_int16.tobytes()
            # Initialize the output emitter with PCM format
            output_emitter.initialize(
                request_id=str(id(self)),
                sample_rate=self._tts.sample_rate,
                num_channels=1,  # Mono
                mime_type="audio/pcm",  # Use PCM format for raw audio
            )
            # Push the audio data through the emitter
            output_emitter.push(audio_bytes)
            # Flush the emitter to indicate completion
            output_emitter.flush()
            self._audio_generated = True
        except Exception as e:
            # Surface synthesis errors with context, preserving the original traceback
            raise RuntimeError(f"KittenTTS synthesis failed: {e}") from e
        finally:
            # Ensure the emitter is properly closed
            try:
                output_emitter.end_input()
            except Exception:
                pass
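A note on the design: the plugin declares streaming=False in its capabilities, so LiveKit calls synthesize() once per text segment and KittenTTS generates the whole waveform before playback starts. This is why TTFB and Duration are nearly identical in the benchmark logs below: almost all of the latency is the one-shot generation itself.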
3. Using it in your AgentSession
tts = KittenTTSPlugin(voice="expr-voice-4-m")

session = AgentSession(
    ...
    tts=tts,
    ...
)
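For context, here is one way the full wiring can look. This is a minimal sketch assuming the livekit-agents 1.x entrypoint pattern; the RealtimeModel modalities argument is an assumption (text-only output, so our custom TTS produces the audio):

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        # OpenAI Realtime handles STT + reasoning; modalities=["text"]
        # is an assumption here, keeping audio output with KittenTTS
        llm=openai.realtime.RealtimeModel(modalities=["text"]),
        tts=KittenTTSPlugin(voice="expr-voice-4-m"),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )
    await ctx.connect()


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))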
Performance Benchmark
Here’s the performance benchmark of KittenML TTS in this setup.
Test Environment:
LiveKit Server: Localhost
TTS Server: KittenML running locally on the same machine
STT & Reasoning: OpenAI Realtime API
[Worker 47c2e9b4] Waiting for session to end naturally...
⚡ Realtime Model - Duration: 0.072s, Input tokens: 0, Output tokens: 0, Total tokens: 0, Tokens/sec: 0.0
⚡ Realtime Model - Duration: 0.653s, Input tokens: 429, Output tokens: 11, Total tokens: 440, Tokens/sec: 16.8
🗣️ TTS Stage - Duration: 0.433s, TTFB: 0.420s, Audio: 2.335s, Characters: 34
⚡ Realtime Model - Duration: 0.947s, Input tokens: 479, Output tokens: 23, Total tokens: 502, Tokens/sec: 24.3
🗣️ TTS Stage - Duration: 0.476s, TTFB: 0.466s, Audio: 3.485s, Characters: 50
🗣️ TTS Stage - Duration: 0.450s, TTFB: 0.432s, Audio: 3.235s, Characters: 54
⚡ Realtime Model - Duration: 0.482s, Input tokens: 537, Output tokens: 14, Total tokens: 551, Tokens/sec: 29.0
🗣️ TTS Stage - Duration: 0.461s, TTFB: 0.453s, Audio: 3.235s, Characters: 44
⚡ Realtime Model - Duration: 0.429s, Input tokens: 586, Output tokens: 24, Total tokens: 610, Tokens/sec: 56.0, Cached: 512
🗣️ TTS Stage - Duration: 0.404s, TTFB: 0.398s, Audio: 2.860s, Characters: 48
🗣️ TTS Stage - Duration: 0.444s, TTFB: 0.421s, Audio: 3.060s, Characters: 50
⚡ Realtime Model - Duration: 0.854s, Input tokens: 647, Output tokens: 30, Total tokens: 677, Tokens/sec: 35.1, Cached: 576
🗣️ TTS Stage - Duration: 0.417s, TTFB: 0.409s, Audio: 2.835s, Characters: 48
🗣️ TTS Stage - Duration: 0.773s, TTFB: 0.738s, Audio: 6.235s, Characters: 105
⚡ Realtime Model - Duration: 0.445s, Input tokens: 701, Output tokens: 13, Total tokens: 714, Tokens/sec: 29.2, Cached: 512
🗣️ TTS Stage - Duration: 0.417s, TTFB: 0.413s, Audio: 2.610s, Characters: 46
📊 Session Usage Summary: UsageSummary(llm_prompt_tokens=3379, llm_prompt_cached_tokens=1600, llm_completion_tokens=115, tts_characters_count=479, tts_audio_duration=29.89, stt_audio_duration=0.0)
We ran multiple conversation sessions to measure the responsiveness of both the reasoning (LLM) and TTS stages. Each stage was timed separately to understand where the majority of the latency occurs.
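For reference, the per-stage lines above were produced from LiveKit's built-in metrics events. A minimal sketch of the wiring, based on the metrics hooks in livekit-agents 1.x (the emoji-prefixed formatting in our logs is custom):

import logging

from livekit.agents import MetricsCollectedEvent, metrics

logger = logging.getLogger("benchmark")
usage_collector = metrics.UsageCollector()

# `session` is the AgentSession created in step 3
@session.on("metrics_collected")
def _on_metrics_collected(ev: MetricsCollectedEvent):
    metrics.log_metrics(ev.metrics)      # per-stage durations, TTFB, tokens/sec
    usage_collector.collect(ev.metrics)  # aggregate for the session summary

async def log_usage():
    # Produces the UsageSummary(...) line shown at the end of the log
    logger.info(f"Session Usage Summary: {usage_collector.get_summary()}")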
👉 For example, here is a short test conversation demo
Based on this short conversation, we get:
LLM reasoning stage: 0.556 seconds on average
TTS stage: 0.531 seconds on average
This shows KittenML TTS can reliably generate high-quality speech with sub-second latency, even on a CPU.
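As a quick cross-check, the TTS lines in the log above also give a real-time factor (synthesis time divided by audio produced). The numbers below are copied straight from the session log:

# TTS stage durations (s) and synthesized audio lengths (s) from the log above
durations = [0.433, 0.476, 0.450, 0.461, 0.404, 0.444, 0.417, 0.773, 0.417]
audio_secs = [2.335, 3.485, 3.235, 3.235, 2.860, 3.060, 2.835, 6.235, 2.610]

rtf = sum(durations) / sum(audio_secs)
print(f"TTS real-time factor: {rtf:.2f}")  # ~0.14, i.e. ~7x faster than real time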
Conclusion
Integrating KittenML TTS into a LiveKit voice agent is straightforward and delivers fast, CPU-efficient speech synthesis. This makes it an excellent choice for real-time Voice AI applications where GPU resources are limited or low-latency responses are critical.