Building Venya Part 2 — Choosing the Tech Stack, Validating Speech Recognition & Text-to-Speech

Ayomide Akinola
7 min read

Welcome to the exciting journey of building Venya, a voice-first AI personal assistant designed to listen, learn, and grow with you. This article dives deep into the technical foundation of Venya, exploring the thoughtful choices behind its tech stack, the challenges of integrating speech-to-text and text-to-speech functionalities, and the experiments that paved the way for a performant and scalable AI assistant.

Venya is a fully open-source project built with the vision of running efficiently across multiple devices, including low-end hardware like phones and laptops. The goal is to create a personal assistant that works primarily offline, protecting user data by avoiding constant reliance on cloud servers. Let’s unpack the technical decisions, experiments, and insights that shaped this ambitious project.

The Choice of Rust and Dioxus: A Powerful Tech Stack for Venya

At the core of Venya is Rust, a programming language chosen for its exceptional performance, safety, and reliability. Rust is not exactly new, but it has gained tremendous popularity in domains requiring low-level control and efficiency, such as blockchain, embedded systems, and even core components of Microsoft Windows. These attributes make Rust an ideal candidate for building Venya, where real-time voice processing demands minimal latency and maximum stability.

Rust’s strictness in enforcing safe coding practices results in neat, maintainable, and highly performant code. This is important for a project like Venya, where the software must run smoothly on various devices without sacrificing speed or responsiveness.
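As a small illustration of what that strictness buys in practice, the sketch below shares a buffer between worker threads, the kind of pattern a real-time audio pipeline relies on. The borrow checker refuses to compile this unless the buffer is wrapped in `Arc` and `Mutex`, which rules out data races before the program ever runs. (The function name `collect_chunks` is purely illustrative, not part of Venya's codebase.)

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawn `n` worker threads that each append one entry to a shared
/// buffer. The compiler forces the Arc + Mutex combination here;
/// plain shared mutable access would be rejected at compile time.
fn collect_chunks(n: usize) -> Vec<String> {
    let buf = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let buf = Arc::clone(&buf);
            thread::spawn(move || buf.lock().unwrap().push(format!("chunk {i}")))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // All threads joined, so we can safely unwrap the sole remaining Arc.
    Arc::try_unwrap(buf).unwrap().into_inner().unwrap()
}

fn main() {
    println!("{} chunks recorded", collect_chunks(4).len()); // 4 chunks recorded
}
```

The same guarantees hold whether the payload is a `String` or a buffer of PCM samples, which is why Rust suits concurrent voice processing so well.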

Complementing Rust is the choice of Dioxus as the framework for building Venya’s user interface. Dioxus offers a familiar experience to developers with a web background, drawing parallels to popular frameworks like React and Vue. This familiarity makes it easier to create dynamic, reactive UIs while still writing pure Rust code.

Dioxus leverages a web engine to render views, allowing developers to use Tailwind CSS, a utility-first CSS framework, to style the app efficiently. Tailwind CSS enables rapid UI design using predefined classes, which means Venya’s interface can be both robust and visually appealing without writing verbose custom styles.

By using Rust and Dioxus together, Venya benefits from a single codebase that can run across different platforms, delivering an optimized and consistent experience.

Why Not Other Languages?

While C++ or other languages could have been used, Rust’s combination of performance and safety tipped the scales. Additionally, the ecosystem around Rust is rapidly growing, with increasing support for AI and multimedia processing libraries, which are essential for Venya’s voice-first capabilities.

Validating Speech-to-Text Models: The Heart of Venya’s Listening Ability

Venya’s ability to understand spoken commands hinges on converting speech to text accurately and efficiently. This is a significant technical hurdle, especially when aiming for offline operation on devices with limited resources.

Two prominent open-source speech-to-text models were explored extensively:

  • VOSK AI: An offline speech recognition toolkit that supports multiple languages.

  • Whisper.cpp: A lightweight C/C++ port of OpenAI’s Whisper model, with Rust bindings available.

The key requirement was that the model must work offline, ensuring privacy and reducing dependency on internet connectivity. Venya’s design aims for real-time transcription, where audio is processed in chunks and converted promptly to text, enabling smooth and responsive interaction.

Challenges with VOSK AI Integration

VOSK AI is primarily Python-centric, and integrating it directly into a Rust-based app proved difficult due to the lack of native Rust bindings. An initial workaround involved running a Python environment inside the Rust app to interface with VOSK, but this approach was ultimately impractical. Shipping a full Python runtime would bloat the app size, making it too large for typical users to download and use conveniently.

Despite these challenges, VOSK demonstrated reasonable accuracy and offline capability. However, the necessity for a lightweight, native Rust solution prompted further exploration.

Whisper.cpp: A Better Fit for Venya

Whisper.cpp, with its Rust binding, emerged as a superior option. It is faster, more accurate, and directly usable within the Rust ecosystem without relying on external runtimes. This model handles real-time speech-to-text conversion effectively, supporting Venya’s goal of on-device, low-latency processing.

Testing revealed that Whisper provided more accurate transcriptions than VOSK, especially when using appropriately sized models that balance accuracy against resource consumption. This makes Whisper.cpp the preferred speech-to-text engine for Venya.

How Real-Time Speech-to-Text Works in Venya

Venya records audio in small chunks, which are then processed concurrently by the speech-to-text model. This chunked processing ensures minimal delay between speaking and transcription, creating a near real-time conversational experience.
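The chunked approach can be sketched in plain Rust with a channel feeding a worker thread. This is a simplified stand-in, not Venya's actual code: the `transcribe` function here just summarizes each chunk, whereas the real pipeline would hand the samples to Whisper.cpp. The point is the shape of the flow, where recording and transcription overlap instead of running back to back.

```rust
use std::sync::mpsc;
use std::thread;

// Samples per chunk; real audio would use something like 0.5 s of PCM.
const CHUNK_SIZE: usize = 4;

/// Stand-in for the speech-to-text model: here we just describe the chunk.
fn transcribe(chunk: &[f32]) -> String {
    format!("[{} samples]", chunk.len())
}

fn main() {
    // Pretend this buffer is incoming microphone audio.
    let audio: Vec<f32> = (0..10).map(|i| i as f32).collect();

    let (tx, rx) = mpsc::channel::<Vec<f32>>();

    // Worker thread transcribes chunks as they arrive, concurrently
    // with the "recording" loop below.
    let worker = thread::spawn(move || {
        rx.iter().map(|chunk| transcribe(&chunk)).collect::<Vec<_>>()
    });

    for chunk in audio.chunks(CHUNK_SIZE) {
        tx.send(chunk.to_vec()).unwrap();
    }
    drop(tx); // closing the channel lets the worker drain and finish

    let transcripts = worker.join().unwrap();
    println!("{transcripts:?}"); // ["[4 samples]", "[4 samples]", "[2 samples]"]
}
```

Because each chunk is transcribed while the next one is still being recorded, the user perceives text appearing almost as they speak.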

Text-to-Speech: Giving Venya a Voice That Feels Real

While speech-to-text converts what you say into text, text-to-speech (TTS) makes Venya speak back to you. Achieving natural-sounding, human-like speech is a complex challenge that demands both high performance and sophisticated AI models.

The Complexity Behind Text-to-Speech

Converting text into speech involves multiple stages:

  1. Text Normalization: Converting raw text into a phonetic representation that mimics how humans speak, including handling punctuation and context.

  2. Phoneme Chunking: Breaking down words into smaller sound units (phonemes) with corresponding intonation, pitch, and rhythm.

  3. Spectrogram Generation: Creating a visual representation of sound frequencies and amplitudes over time.

  4. Vocoder Processing: Transforming the spectrogram data into actual audio waveforms that sound like natural speech.

This pipeline requires careful tuning to produce speech that doesn’t sound robotic but instead has natural intonation and accents.
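To make the four stages concrete, here is a deliberately toy version of the pipeline in Rust. Every function is a hypothetical stand-in (real normalization, phonemization, and vocoding are far more involved and model-driven), but the data flow, from text to units to frames to samples, mirrors the stages listed above.

```rust
/// Stage 1 (toy): lowercase the text and expand a digit into words.
fn normalize(text: &str) -> String {
    text.to_lowercase().replace('2', "two")
}

/// Stage 2 (toy): split normalized text into word-level sound units.
fn to_phoneme_units(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

/// Stage 3 (toy): one frame of frequency-bin values per unit.
fn to_spectrogram(units: &[String]) -> Vec<Vec<f32>> {
    units.iter().map(|u| vec![u.len() as f32; 3]).collect()
}

/// Stage 4 (toy): the "vocoder" flattens frames into a waveform.
fn vocode(frames: &[Vec<f32>]) -> Vec<f32> {
    frames.iter().flatten().copied().collect()
}

fn main() {
    let text = "Venya part 2";
    let frames = to_spectrogram(&to_phoneme_units(&normalize(text)));
    let audio = vocode(&frames);
    println!("{} samples synthesized", audio.len()); // 3 units × 3 bins = 9
}
```

In a production TTS system each of these stages is a learned model in its own right, which is why off-the-shelf engines like Kokoro are so valuable.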

Exploring Open-Source TTS Models

Initial experiments involved using a Rust-based TTS project presented at RustNation 2024, which provided valuable insights but was not perfect. Early implementations sounded robotic and lacked variety in voice options.

Further research led to the discovery of Kokoro, an open-source TTS model known for its speed, quality, and multilingual support. Kokoro stands out for several reasons:

  • It is lightweight, trained with around 82 million parameters.

  • It supports multiple voices, both male and female, across several languages and accents, including British English, American English, Spanish, French, and even Chinese.

  • It runs offline efficiently, fitting Venya’s privacy-first vision.

Kokoro’s Rust bindings make it directly accessible within Venya’s codebase, avoiding the overhead of external dependencies.

Demoing Kokoro’s Text-to-Speech

Sample outputs from Kokoro demonstrate remarkably natural-sounding speech that can convey a friendly and realistic assistant voice. For example, the phrase:

“Venya is a voice-first AI assistant that listens, learns, and grows with you. Helping you manage your day, reflect, read books, and spark conversations.”

is rendered with clear intonation and a natural flow that is far from the robotic monotone typical of early TTS systems.

Putting It All Together: The Current State of Venya

The experiments and integrations discussed have laid a solid foundation for Venya’s core functionalities:

  • Using Rust and Dioxus for a performant, cross-platform codebase with a modern UI.

  • Leveraging Whisper.cpp for fast, accurate, and offline speech-to-text.

  • Adopting Kokoro for high-quality, natural-sounding text-to-speech with multilingual and multi-accent support.

This combination ensures Venya can listen and talk efficiently without relying heavily on cloud services, preserving user privacy and enabling offline use.

Remaining Challenges and Next Steps

While the speech recognition and synthesis components are promising, several challenges remain:

  • User Interface Development: Designing and implementing a user-friendly UI/UX is the next focus. Venya will soon feature a visually compelling interface built with Dioxus and styled with Tailwind CSS.

  • Large Language Model (LLM) Integration: Venya aims to incorporate LLMs to enable intelligent conversations. However, offline LLMs are currently limited in power, so cloud-based solutions like Google Gemini or OpenAI’s models might be integrated with privacy safeguards.

  • Optimizing Performance: Balancing real-time responsiveness with resource constraints on low-end devices remains a continuous effort.

Why Venya Matters: The Vision Behind the Project

Venya is more than a tech experiment: it’s a step towards democratizing AI personal assistants that respect user privacy, work offline, and run smoothly on everyday devices. By building Venya as an open-source project with transparent development, it invites collaboration and innovation from the community.

The choice of Rust and Dioxus reflects a commitment to modern, efficient software engineering, while the careful selection of speech-to-text and text-to-speech models ensures Venya can deliver a natural, engaging voice experience.

As AI assistants become ubiquitous, projects like Venya highlight the importance of control, privacy, and accessibility in AI technology.

Explore Venya Yourself

If you’re interested in following along or contributing, Venya’s entire codebase is available on GitHub. You can dive into the Rust code, check out the Dioxus UI framework, and experiment with the speech processing integrations firsthand.

For those passionate about AI, Rust programming, or voice technologies, Venya represents an exciting opportunity to learn and build something innovative from the ground up.

Conclusion

Building a voice-first AI assistant like Venya involves careful choices in technology and persistent experimentation. By selecting Rust for its performance and safety, leveraging Dioxus for a modern UI framework, and validating speech-to-text and text-to-speech models for offline, real-time operation, Venya is well on its way to becoming a powerful personal assistant.

The journey continues with UI development and intelligent language model integration, but the foundation is solid: Venya can listen and speak effectively on a range of devices without sacrificing privacy or speed.

If you want to stay updated on Venya’s progress or get involved, following the project and its community is highly recommended. The future of voice-first AI assistants is bright, and Venya is leading the charge towards a more private, performant, and user-friendly experience.
