Dev Log #1: Introducing Avi - A Rust-Based Voice Assistant Framework

Apoll0_11

Last week, I started a project that I've been conceptualizing for some time now - Avi, an Autonomous Voice Interface. It's the next evolution of my previous work on ALEX, but completely reimagined in Rust for better performance, reliability, and scalability.

Why Rust?

The transition from Python to Rust wasn't just about following trends. Voice assistants require consistent performance and reliable resource management, especially since I'm thinking about running this on everything from desktops to IoT nodes. Rust gives me memory safety without garbage collection, concurrency without data races, and the kind of performance that makes real-time voice processing viable on lower-powered devices.

When I was building ALEX in Python, I kept hitting performance bottlenecks, especially around concurrent operations and resource management. Juggling multiple threads for voice processing, intent recognition, and skill execution often led to unpredictable behavior and memory issues under load. With Rust, I've already seen significant improvements in both performance and stability.

And honestly, I just wanted to learn a new programming language. After spending so much time in Python, I was starting to feel stuck.

Architecture Overview

I've built Avi with modularity in mind. Here's what I've implemented so far:

Intent Recognition System - The core of understanding voice commands

  • Created a flexible intent recognition mechanism that combines literal pattern matching with regex to handle more natural phrasing

  • Built a slot extraction system that pulls structured data from voice commands

  • Implemented contextual awareness to remember previous interactions and maintain conversation state
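
To make those pieces concrete, here's a rough sketch of the shapes involved. These aren't the actual Avi types, just my shorthand for this post:

use std::collections::HashMap;

// Hypothetical simplification of the core recognition types
struct Intent {
    name: String,
    patterns: Vec<String>,        // literal templates like "weather in {location}"
    regex_patterns: Vec<String>,  // raw regular expressions for looser phrasing
    slots: Vec<String>,           // slot names the intent expects to fill
}

struct ExtractedSlots {
    intent: String,                   // which intent matched
    values: HashMap<String, String>,  // slot name -> captured text
}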

AviScript - A domain-specific language for defining assistant behaviors

  • Implemented custom syntax handlers for intuitive skill definition

  • Added support for event-driven programming with on_intent, on_start, and on_end hooks

  • Built in JSON handling for data interchange between components

  • Created a flexible macro system for dynamic internationalization and localization
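
The custom syntax handlers go through Rhai's register_custom_syntax API. Here's a heavily simplified, self-contained sketch of how an on_intent construct can be declared (not the actual Avi code); a real implementation would store the block as a handler, but here it's just evaluated on the spot:

use rhai::{Dynamic, Engine, ImmutableString};

fn main() {
    let mut engine = Engine::new();

    // Teach the engine an `on_intent "Name" { ... }` construct
    engine
        .register_custom_syntax(
            ["on_intent", "$string$", "$block$"],
            false, // the block doesn't declare variables in the outer scope
            |context, inputs| {
                // inputs[0] is the intent name literal, inputs[1] the handler block
                let name = inputs[0].get_literal_value::<ImmutableString>().unwrap();
                println!("declared handler for intent: {name}");
                context.eval_expression_tree(&inputs[1])?;
                Ok(Dynamic::UNIT)
            },
        )
        .unwrap();

    engine
        .run(r#"on_intent "GetWeather" { print("handler body ran"); }"#)
        .unwrap();
}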

Skill Framework - The modular extension system

  • Established a clean structure for skills with metadata, configuration, and implementation

  • Created a runtime environment for loading and executing skills dynamically

  • Developed a hot-reloading system for updating skills without restarting the assistant
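
To give a feel for that structure, a skill might look something like this on disk (the layout and file names here are illustrative, not final):

skills/
└── weather/
    ├── skill.toml    # metadata and configuration
    └── weather.avi   # the AviScript implementation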

What I'm particularly excited about is how the intent recognition system and the skill framework interact. The intent recognizer identifies what the user wants and extracts structured data (slots), which then gets passed to the appropriate skill for handling; internally, the skill runs the AviScript it defines. Here I faced a dilemma. I was trying to make the AviScript engine (which is built on the Rhai engine) switch its source path, because as it stood, it would try to load modules from the root path rather than the skill's path. My workaround was to give every skill object its own engine. (I know it wastes memory, but that seemed like the only solution.) As it stands, a skill object occupies only around 700 bytes, but we'll see later whether I keep this or change it.
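
Here's a minimal sketch of that workaround, using Rhai's FileModuleResolver to root each engine's imports at the skill's own directory (simplified, not the actual Avi code):

use rhai::{module_resolvers::FileModuleResolver, Engine};

// Each skill owns its own engine so that `import` statements in its
// AviScript resolve relative to the skill's directory, not the app root
struct Skill {
    name: String,
    engine: Engine,
}

impl Skill {
    fn new(name: &str, skill_dir: &str) -> Self {
        let mut engine = Engine::new();
        engine.set_module_resolver(FileModuleResolver::new_with_path(skill_dir));
        Self {
            name: name.to_string(),
            engine,
        }
    }
}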

Code Deep Dive: Intent Recognition

The intent recognition system follows a domain-modeling approach where intents represent user intentions and slots capture the parameters. Here's how it works:

// A simplified view of the recognition flow
fn recognize(&self, text: &str) -> Vec<ExtractedSlots> {
    let mut results = Vec::new();

    for intent in &self.intent_manager.intents {
        // Try matching against patterns
        for pat in &intent.patterns {
            if let Some(slots) = self.slot_extractor.extract_from_pattern(
                pat, text, &intent.name, &intent.slots,
            ) {
                results.push(slots);
            }
        }

        // Try matching against regex patterns
        for rx in &intent.regex_patterns {
            if let Some(slots) = self.slot_extractor.extract_from_regex(
                rx, text, &intent.name, &intent.slots,
            ) {
                results.push(slots);
            }
        }
    }

    results
}

This function scans through defined intents and tries to match user input against both literal patterns and regex patterns. When a match is found, it extracts structured data into slots that can be used by skills.
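
The slot extraction itself is the interesting part. Conceptually, pattern-based extraction can work by compiling a literal template into a regex with named capture groups. Here's a self-contained toy version of that idea (my own sketch, not the actual extract_from_pattern):

use regex::Regex;
use std::collections::HashMap;

// Compile a template like "weather in {location}" into a regex with
// one named capture group per slot placeholder
fn pattern_to_regex(pattern: &str) -> Regex {
    let escaped = regex::escape(pattern)
        .replace(r"\{", "(?P<")  // open each slot as a named group
        .replace(r"\}", ">.+)"); // capture one or more characters
    Regex::new(&format!("^{escaped}$")).unwrap()
}

fn extract(pattern: &str, text: &str) -> Option<HashMap<String, String>> {
    let re = pattern_to_regex(pattern);
    let caps = re.captures(text)?;
    Some(
        re.capture_names()
            .flatten()
            .filter_map(|n| caps.name(n).map(|m| (n.to_string(), m.as_str().to_string())))
            .collect(),
    )
}

fn main() {
    let slots = extract("what is the weather in {location}", "what is the weather in Paris");
    println!("{slots:?}"); // Some({"location": "Paris"})
}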

AviScript: Avi's DSL

AviScript is where I've really pushed myself. Creating a domain-specific language isn't easy, but it makes skill development so much more intuitive. Here's a simple example of what AviScript looks like:

on_intent "GetWeather" {
      let location = slot("location");
      let date = slot("date");

      // API call handling
      let weather_data = fetch_weather(location, date);

      if weather_data.success {
        say("In ${location} on ${date}, expect ${weather_data.conditions} with temperatures around ${weather_data.temperature}°");
      } else {
        say("I couldn't find weather information for ${location}");
      }
}

The parser for this syntax was particularly challenging to build; thankfully, I found the Rhai engine, which saved me a lot of time. It's still hard to get some features working, like event-driven scripts and importing, but the DSL makes creating new skills much more accessible, even for developers who aren't familiar with Rust.
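
Exposing host functions like say and slot to scripts is the pleasant part of Rhai. A minimal, self-contained example (simplified; in Avi, say would route to the speech output rather than a println):

use rhai::Engine;

fn main() {
    let mut engine = Engine::new();

    // Register a Rust closure as a function callable from scripts
    engine.register_fn("say", |text: &str| {
        println!("[avi] {text}");
    });

    engine.run(r#"say("Hello from AviScript")"#).unwrap();
}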

What's Next?

There's still a lot to do. Here's what I'm planning for the coming weeks:

  1. Voice Processing Pipeline - Integrate wake word detection and speech-to-text

    • Implement WebRTC-based voice activity detection

    • Create a pluggable STT system with support for multiple engines

    • Optimize wake word detection for low-power devices

  2. Message Bus Implementation - Complete the event system for inter-module communication (a rough sketch of the pub/sub idea follows this list)

    • Implement publish/subscribe patterns for flexible communication

    • Create a serialization layer for cross-device communication

    • Build message routing and priority handling

  3. Interface Development - Start work on CLI and web interfaces

    • Design a reactive web interface using WebAssembly

    • Create a TUI (Text User Interface) for headless systems

    • Develop a voice-first interaction model

  4. Skill Marketplace - Design the framework for sharing and installing skills

    • Create a package format for skills

    • Implement dependency resolution

    • Build a secure sandboxing system for third-party skills
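
None of item 2 is built yet, but to make the publish/subscribe idea concrete, here's one possible shape using Tokio's broadcast channel (purely exploratory, not a committed design):

use tokio::sync::broadcast;

// Hypothetical bus events; every module can publish and subscribe
#[derive(Clone, Debug)]
enum BusMessage {
    IntentRecognized { intent: String },
    SkillFinished { skill: String },
}

#[tokio::main]
async fn main() {
    let (tx, _) = broadcast::channel::<BusMessage>(64);

    // A module subscribes by creating a receiver from the sender
    let mut skills_rx = tx.subscribe();
    let listener = tokio::spawn(async move {
        while let Ok(msg) = skills_rx.recv().await {
            println!("skills module saw: {msg:?}");
        }
    });

    tx.send(BusMessage::IntentRecognized { intent: "GetWeather".into() }).unwrap();
    tx.send(BusMessage::SkillFinished { skill: "weather".into() }).unwrap();
    drop(tx); // closing the channel ends the subscriber loop

    listener.await.unwrap();
}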

I'm also working on a more detailed design for the "Enclosure" concept - the idea that Avi can run on various physical devices with different capabilities, all connecting back to a central system.

The Psychological Layer: Beyond Commands

One aspect of Avi that I haven't talked about yet is the psychological and emotional layer. Inspired by ELIZA, the early chatbot that simulated a psychotherapist, I'm working on a system that can provide basic mental health support.

This system analyzes user input for emotional cues and responds appropriately. It's not meant to replace professional help, of course, but it can provide a comforting presence and basic support.
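
Think of it as keyword-level cue matching with carefully worded responses, in the spirit of ELIZA. A toy illustration of the pattern (my sketch only; the real system will need to be far more careful):

fn respond(input: &str) -> &'static str {
    let lower = input.to_lowercase();
    // Match simple emotional cues and reply with an open-ended prompt
    if lower.contains("stressed") || lower.contains("overwhelmed") {
        "That sounds heavy. What's weighing on you the most right now?"
    } else if lower.contains("sad") || lower.contains("down") {
        "I'm sorry you're feeling that way. Do you want to talk about it?"
    } else {
        "I'm listening. Tell me more."
    }
}

fn main() {
    println!("{}", respond("I've been really stressed this week"));
}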

Challenges So Far

It hasn't all been smooth sailing. Some challenges I've encountered:

  1. Rust's Learning Curve - While Rust is powerful, its ownership system required me to rethink some of my architecture patterns. The borrow checker has been both my greatest teacher and my most persistent adversary.

  2. DSL Implementation - Creating AviScript has been more complex than anticipated, especially getting the syntax right.

  3. Cross-Platform Considerations - Planning ahead for different devices and OS support is tricky. I'm trying to avoid platform-specific code as much as possible, but some functionality (like audio capture) inevitably requires platform-specific implementations.

Community and Collaboration

Although Avi is still in its early stages, I'm excited about the potential for community collaboration. I've designed the architecture with extensibility in mind, making it easy for others to contribute skills, interfaces, or even new enclosure designs.

I'm particularly interested in collaborations around:

  • Natural language processing improvements

  • New language support

  • IoT integrations

  • Novel input/output mechanisms

  • Voice recognition enhancements

If you're interested in contributing, check out the GitHub repository and the contributing guidelines.

Final Thoughts

This first week with Avi has been both challenging and rewarding. I'm excited about building a voice assistant framework that's fast, reliable, and truly extensible. The Rust ecosystem has been fantastic to work with. I'm looking forward to diving deeper into systems programming as this project grows.

What makes this project special to me is that it brings together so many areas I'm passionate about: systems programming, natural language processing, hardware integration, and creating tools that feel almost alive. There's something magical about speaking to a computer and having it understand and respond meaningfully.

If you're interested in voice interfaces, systems programming, or Rust, I hope you'll follow along as I document this journey. Next week, I'll be focusing on the voice processing pipeline and sharing more details about the Avi architecture.

Until then, happy coding!
