How to Write Your Own Virtual Machine in Rust.

Biliqis Onikoyi

Introduction

If you’re a curious programmer like me, always eager to dive deeper into how things work under the hood, building a virtual machine (VM) is an exciting way to explore low-level systems programming. After watching countless YouTube videos, reading books, and scouring articles, I summoned the courage to write a VM in Rust, a language I love for its blend of safety and performance. In this tutorial, we’ll build a simple register-based VM inspired by the LC-3 architecture, which executes basic arithmetic and I/O operations, helping you understand how high-level code transforms into machine instructions. Along the way, we’ll explore Rust’s powerful features, including the extern keyword that caught my attention for its ability to interface with other languages.

Why build a VM? For me, it was about satisfying my curiosity and gaining a visual understanding of how programs move from high-level code (what developers write) to low-level binary executed on the CPU. By building this VM, you’ll see how values are loaded into registers, manipulated, and processed—mimicking a real CPU’s behavior. Plus, you’ll learn Rust’s systems programming capabilities in a fun, hands-on way.

Why Build a Virtual Machine?

People build VMs for all sorts of reasons: learning, creating custom runtimes, or just for the challenge. For me, it was about bridging the gap between high-level code and the CPU. Writing a VM helped me visualize how values are loaded into registers, how operations are executed, and how memory is managed. It’s like peeling back the layers of a computer system to see the magic happen! Plus, Rust’s memory safety and zero-cost abstractions make it a perfect fit for this project.

Some benefits of building a VM include:

  • Understanding Execution: Learn how code is interpreted or compiled into machine instructions.

  • Systems Programming: Gain hands-on experience with low-level concepts like registers and program counters.

  • Rust Mastery: Explore Rust’s enums, pattern matching, and the extern keyword for interfacing with C or other languages.

Prerequisites

To follow this tutorial, you’ll need:

  • Rust Installed: Install Rust using curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh.

  • Basic Rust Knowledge: Familiarity with structs, enums, match expressions, and error handling (Result).

  • Text Editor: Use VS Code, IntelliJ, or any editor with Rust support.

  • Optional: Basic understanding of register-based architectures or assembly (helpful but not required).

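  • Tools: cargo for building and running the project, plus the libc and nix crates for terminal handling and the extern example. The terminal code relies on termios and std::os::unix, so it targets Unix-like systems such as Linux and macOS.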

What Is a Register-Based Virtual Machine?

A virtual machine is a software-based emulation of a computer, executing programs written in bytecode (low-level instructions). Our VM will be register-based, meaning it uses registers (small, fast storage locations) to store operands and results, similar to a real CPU. For example, to compute 5 + 3, the VM loads 5 and 3 into registers, adds them, and stores the result in another register. This contrasts with stack-based VMs, like the JVM or Python’s VM, which use a stack to push and pop values. Register-based VMs, like our LC-3-inspired design, are common in educational architectures for their clarity in mimicking CPU operations.
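To make the contrast concrete, here is a minimal sketch that computes 5 + 3 both ways. The function names are made up for illustration and are not part of the VM built later; the register version mirrors an ADD R0, R1, R2 instruction, while the stack version pushes both operands and pops them to add.

```rust
// Register style: operands and the result live in numbered slots (R0..R7).
fn add_with_registers() -> u16 {
    let mut regs = [0u16; 8];
    regs[1] = 5; // load 5 into R1
    regs[2] = 3; // load 3 into R2
    regs[0] = regs[1].wrapping_add(regs[2]); // ADD R0, R1, R2
    regs[0]
}

// Stack style: operands are pushed, the operation pops them and pushes the result.
fn add_with_stack() -> u16 {
    let mut stack: Vec<u16> = Vec::new();
    stack.push(5);
    stack.push(3);
    let b = stack.pop().expect("stack underflow");
    let a = stack.pop().expect("stack underflow");
    stack.push(a.wrapping_add(b));
    stack.pop().expect("stack underflow")
}

fn main() {
    println!("register result: {}", add_with_registers());
    println!("stack result: {}", add_with_stack());
}
```

Both functions produce the same sum; the difference is that register instructions name their operands explicitly, while stack instructions find them implicitly on top of the stack.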

Our VM will have:

  • Registers: Eight general-purpose registers (R0–R7), a program counter (PC), and a condition flag register (Negative, Zero, Positive).

  • Memory: A 16-bit addressable space (0x0000–0xFFFF, 65,536 locations).

  • Bytecode: 16-bit instructions (an opcode plus operands), stored as u16 words in the VM’s memory array.

  • Execution Loop: Fetches, decodes, and executes instructions.

Designing the Instruction Set

We’ll define a simple instruction set for our LC-3-inspired register-based VM, drawing from the opcodes in my codebase. These opcodes let us perform arithmetic, logical operations, and I/O, mimicking how a CPU processes instructions. Unlike stack-based VMs (like the JVM), which push and pop values on a stack, our VM uses registers for direct manipulation, making it easier to visualize CPU-like behavior. Each instruction is a 16-bit u16, with the top 4 bits as the opcode (e.g., 1 for ADD, 15 for TRAP) and the remaining bits encoding operands like registers or immediate values.

Here’s the instruction set we’ll use:

  • ADD: Add two registers or a register and an immediate value (e.g., 5).

  • AND: Bitwise AND two registers or a register and an immediate.

  • NOT: Bitwise NOT a register.

  • TRAP_OUT: Output a character from R0 to the console.

  • TRAP_GETC: Read a character from the keyboard into R0.

  • TRAP_PUTS: Print the null-terminated string whose address is in R0.

  • TRAP_HALT: Stop execution.

Example: ADD R0, R1, #5

To understand how instructions work, let’s break down ADD R0, R1, #5, which adds 5 to the value in register R1 and stores the result in R0. In our VM, this instruction is encoded as a 16-bit u16: 0x1065 in hex, or 0001 000 001 1 00101 in binary. Here’s how it’s structured:

  • Bits 15–12 (Opcode): 0001 (1 in decimal, for ADD).

  • Bits 11–9 (Destination Register, DR): 000 (R0).

  • Bits 8–6 (Source Register 1, SR1): 001 (R1).

  • Bit 5 (Immediate Flag): 1 (indicating immediate mode, not a second register).

  • Bits 4–0 (Immediate Value, imm5): 00101 (5 in decimal, sign-extended to 16 bits).

When executed, if R1 holds 3, the VM adds 5 to it, stores 8 in R0, and updates the condition flags (e.g., Positive if R0 = 8). This format, inspired by my codebase, supports both register-register (ADD R0, R1, R2) and register-immediate modes (ADD R0, R1, #5), offering flexibility for arithmetic operations.
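The bit layout listed above can be checked with a few shifts and masks. The sketch below is illustrative only: AddFields, encode_add_imm, and decode_add are hypothetical helper names, not functions from the VM itself, and sign_extend follows the same approach as the function in the complete code. Packing the fields for ADD R0, R1, #5 gives 0x1065.

```rust
/// The fields packed into an LC-3 style ADD instruction.
#[derive(Debug, PartialEq)]
struct AddFields {
    opcode: u16,
    dr: u16,
    sr1: u16,
    imm_mode: bool,
    imm5: u16, // sign-extended to 16 bits; only meaningful when imm_mode is true
}

/// Sign-extend the low `bit_count` bits of `value` to 16 bits.
fn sign_extend(value: u16, bit_count: u16) -> u16 {
    if (value >> (bit_count - 1)) & 1 == 1 {
        value | (0xFFFFu16 << bit_count)
    } else {
        value
    }
}

/// Pack an immediate-mode ADD: opcode 0001, DR, SR1, flag bit 1, imm5.
fn encode_add_imm(dr: u16, sr1: u16, imm: i16) -> u16 {
    (1 << 12) | ((dr & 0x7) << 9) | ((sr1 & 0x7) << 6) | (1 << 5) | ((imm as u16) & 0x1F)
}

/// Split a 16-bit ADD instruction back into its fields.
fn decode_add(instr: u16) -> AddFields {
    AddFields {
        opcode: instr >> 12,
        dr: (instr >> 9) & 0x7,
        sr1: (instr >> 6) & 0x7,
        imm_mode: (instr >> 5) & 0x1 == 1,
        imm5: sign_extend(instr & 0x1F, 5),
    }
}

fn main() {
    let instr = encode_add_imm(0, 1, 5);
    println!("{:#06X} = {:016b}", instr, instr);
    println!("{:?}", decode_add(instr));
}
```

A negative immediate such as #-1 is stored as the five bits 11111, and sign extension turns it back into 0xFFFF, which is -1 in 16-bit two’s complement.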

Implementing the Virtual Machine

Let’s build the VM step by step.

Step 1: Define the Opcodes

We’ll use a Rust enum to represent opcodes and traps, leveraging type safety. A TryFrom<u16> implementation will convert the top four instruction bits into an Opcode and return an error for values the VM does not support.

Step 2: Create the VM Structure

The VM struct will store registers, the program counter, memory, condition flags, and running state.

Step 3: Implement the Execution Loop

The run method will:

  1. Fetch the current 16-bit instruction using the PC.

  2. Decode it into an opcode and operands.

  3. Execute the instruction (e.g., add, output, or halt).

  4. Increment the PC or halt.
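The four steps can be sketched in isolation before looking at the full VM. This is a toy version under simplifying assumptions: the program lives in a slice starting at index 0 rather than at 0x3000, only ADD and the HALT trap are handled, and the function name run_program is made up for this example.

```rust
const OP_ADD: u16 = 1;
const OP_TRAP: u16 = 15;
const TRAP_HALT: u16 = 0x25;

fn sign_extend(value: u16, bit_count: u16) -> u16 {
    if (value >> (bit_count - 1)) & 1 == 1 {
        value | (0xFFFFu16 << bit_count)
    } else {
        value
    }
}

/// Run `program` from index 0 and return the final general-purpose registers.
fn run_program(program: &[u16]) -> [u16; 8] {
    let mut regs = [0u16; 8];
    let mut pc: usize = 0;
    let mut running = true;
    while running && pc < program.len() {
        let instr = program[pc]; // 1. fetch
        pc += 1; // 4. advance the PC right after the fetch
        let opcode = instr >> 12; // 2. decode
        match opcode {
            // 3. execute
            OP_ADD => {
                let dr = ((instr >> 9) & 0x7) as usize;
                let sr1 = ((instr >> 6) & 0x7) as usize;
                let operand = if (instr >> 5) & 1 == 1 {
                    sign_extend(instr & 0x1F, 5) // immediate mode
                } else {
                    regs[(instr & 0x7) as usize] // register mode
                };
                regs[dr] = regs[sr1].wrapping_add(operand);
            }
            OP_TRAP if instr & 0xFF == TRAP_HALT => running = false,
            _ => {
                eprintln!("unsupported instruction {:#06X}", instr);
                running = false;
            }
        }
    }
    regs
}

fn main() {
    // ADD R1, R1, #3; ADD R0, R1, #5; HALT
    let regs = run_program(&[0x1263, 0x1065, 0xF025]);
    println!("R0 = {}, R1 = {}", regs[0], regs[1]);
}
```

Incrementing the PC immediately after the fetch, rather than at the end of the loop body, matches what LC-3 style machines do and makes PC-relative instructions easier to add later.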

Step 4: Handle Arithmetic Operations

Arithmetic opcodes (ADD, AND, NOT) will operate on registers, updating condition flags (Negative, Zero, Positive) based on the result.
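As a rough illustration of the flag rule, the sketch below classifies a 16-bit result by treating bit 15 as the sign bit. The Flag enum and flag_for function are hypothetical names for this example; the bit values follow the same 1, 2, 4 pattern the VM uses for Positive, Zero, and Negative.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Flag {
    Positive = 1 << 0,
    Zero = 1 << 1,
    Negative = 1 << 2,
}

/// Pick the condition flag for a 16-bit two's-complement value.
fn flag_for(value: u16) -> Flag {
    if value == 0 {
        Flag::Zero
    } else if value >> 15 == 1 {
        Flag::Negative // the top bit is set, so the value is negative
    } else {
        Flag::Positive
    }
}

fn main() {
    for value in [8u16, 0, 0xFFFF, !0x00FF] {
        println!("{:#06X} -> {:?}", value, flag_for(value));
    }
}
```

Exactly one flag is set after each arithmetic instruction, which is what a later branch instruction would test.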

Step 5: Explore the extern Keyword

Our VM uses extern to call C’s putchar for TRAP_OUT, allowing console output. The extern keyword lets Rust interface with C functions, which fascinated me for its low-level power. We’ll include it to show how it enables I/O.

The Complete Code

Here’s the Rust code for our VM, adapted from my codebase, with a simplified set of opcodes for beginners.

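use crate::input_buffering::{restore_input_buffering, setup};
use std::convert::TryFrom;
use std::fs::File;
use std::io::{self, Read, Write};
// putchar is declared in the extern "C" block near the end of this file, so it is not
// imported from libc here; importing it as well would define the name twice.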

pub mod input_buffering {
    use libc::*;
    use nix::sys::signal::{self, SaFlags, SigAction, SigHandler, SigSet, Signal};
    use std::mem::MaybeUninit;
    use std::os::unix::io::RawFd;
    use std::sync::Once;

    static mut ORIGINAL_TIO: MaybeUninit<termios> = MaybeUninit::uninit();
    static INIT: Once = Once::new();

    extern "C" fn handle_interrupt(_sig: c_int) {
        println!("\nSIGINT received. Restoring terminal settings...");
        restore_input_buffering();
        std::process::exit(0);
    }

    pub fn setup() {
        unsafe {
            let sig_action = SigAction::new(
                SigHandler::Handler(handle_interrupt),
                SaFlags::empty(),
                SigSet::empty(),
            );
            signal::sigaction(Signal::SIGINT, &sig_action).expect("Failed to register SIGINT handler");
        }
        disable_input_buffering();
    }

    pub fn disable_input_buffering() {
        unsafe {
            let fd: RawFd = STDIN_FILENO;
            INIT.call_once(|| {
                tcgetattr(fd, ORIGINAL_TIO.as_mut_ptr());
            });
            let mut new_tio = ORIGINAL_TIO.assume_init();
            new_tio.c_lflag &= !(ICANON | ECHO);
            tcsetattr(fd, TCSANOW, &new_tio);
        }
    }

    pub fn restore_input_buffering() {
        unsafe {
            let fd: RawFd = STDIN_FILENO;
            tcsetattr(fd, TCSANOW, &ORIGINAL_TIO.assume_init());
        }
    }

    pub fn check_key() -> bool {
        unsafe {
            let fd: RawFd = STDIN_FILENO;
            let mut readfds = std::mem::zeroed::<fd_set>();
            FD_ZERO(&mut readfds);
            FD_SET(fd, &mut readfds);
            let mut timeout = timeval { tv_sec: 0, tv_usec: 0 };
            select(
                fd + 1,
                &mut readfds,
                std::ptr::null_mut(),
                std::ptr::null_mut(),
                &mut timeout,
            ) > 0
        }
    }
}

#[derive(Debug, Clone, Copy)]
#[repr(u16)]
pub enum Registers {
    R0 = 0,
    R1,
    R2,
    R3,
    R4,
    R5,
    R6,
    R7,
    PC,
    COND,
    COUNT = 10,
}

const MR_KBSR: u16 = 0xFE00; // Keyboard Status Register
const MR_KBDR: u16 = 0xFE02; // Keyboard Data Register
const MEMORY_SIZE: usize = 1 << 16;

#[derive(Debug, Clone, Copy)]
#[repr(u16)]
enum RCond {
    FL_POS = 1 << 0,
    FL_ZRO = 1 << 1,
    FL_NEG = 1 << 2,
}

#[derive(Debug, Clone, Copy)]
#[repr(u16)]
enum Opcode {
    ADD = 1,
    AND = 5,
    NOT = 9,
    TRAP = 15,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u16)]
enum TrapCode {
    GETC = 0x20,
    OUT = 0x21,
    PUTS = 0x22,
    HALT = 0x25,
}

impl TryFrom<u16> for Opcode {
    type Error = String;

    fn try_from(value: u16) -> Result<Self, Self::Error> {
        match value {
            1 => Ok(Opcode::ADD),
            5 => Ok(Opcode::AND),
            9 => Ok(Opcode::NOT),
            15 => Ok(Opcode::TRAP),
            _ => Err(format!("Invalid opcode: {}", value)),
        }
    }
}

impl TrapCode {
    pub fn from_u16(val: u16) -> Option<Self> {
        match val {
            0x20 => Some(TrapCode::GETC),
            0x21 => Some(TrapCode::OUT),
            0x22 => Some(TrapCode::PUTS),
            0x25 => Some(TrapCode::HALT),
            _ => None,
        }
    }
}

pub fn sign_extend(value: u16, bit_count: u16) -> u16 {
    if (value >> (bit_count - 1)) & 1 == 1 {
        value | (0xFFFF << bit_count)
    } else {
        value
    }
}

#[derive(Debug, Clone)]
pub struct VM {
    pub memory: [u16; MEMORY_SIZE],
    pub registers_storage: [u16; Registers::COUNT as usize],
    running: bool,
}

impl VM {
    pub fn new() -> Self {
        Self {
            memory: [0; MEMORY_SIZE],
            registers_storage: [0; Registers::COUNT as usize],
            running: true,
        }
    }

    pub fn read_image_file(&mut self, mut file: File) -> io::Result<()> {
        let mut origin_buf = [0u8; 2];
        file.read_exact(&mut origin_buf)?;
        let origin = u16::from_be_bytes(origin_buf) as usize;
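        // Never write past the end of memory, however long the file is.
        let max_words = MEMORY_SIZE - origin;
        // A single read() call may return early, so read the whole file.
        let mut buffer = Vec::new();
        file.read_to_end(&mut buffer)?;
        // Image files store each 16-bit word big-endian.
        for (i, pair) in buffer.chunks_exact(2).take(max_words).enumerate() {
            self.memory[origin + i] = u16::from_be_bytes([pair[0], pair[1]]);
        }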
        Ok(())
    }

    pub fn run(&mut self) -> io::Result<()> {
        setup();
        self.registers_storage[Registers::COND as usize] = RCond::FL_ZRO as u16;
        let pc_start: u16 = 0x3000;
        self.registers_storage[Registers::PC as usize] = pc_start;

        while self.running {
            let instr = self.memory_read(self.registers_storage[Registers::PC as usize]);
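            // wrapping_add lets the PC roll over from 0xFFFF to 0 instead of panicking in debug builds.
            self.registers_storage[Registers::PC as usize] =
                self.registers_storage[Registers::PC as usize].wrapping_add(1);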
            let opcode = instr >> 12;

            match Opcode::try_from(opcode) {
                Ok(Opcode::ADD) => self.add(instr),
                Ok(Opcode::AND) => self.and(instr),
                Ok(Opcode::NOT) => self.not(instr),
                Ok(Opcode::TRAP) => {
                    let trap_vector = instr & 0xFF;
                    if let Some(trap) = TrapCode::from_u16(trap_vector) {
                        match trap {
                            TrapCode::GETC => self.trap_getc(None),
                            TrapCode::OUT => self.trap_out(),
                            TrapCode::PUTS => self.trap_puts(),
                            TrapCode::HALT => {
                                println!("Halting the program...");
                                io::stdout().flush()?;
                                self.running = false;
                            }
                        }
                    } else {
                        eprintln!("Invalid TRAP vector: {}", trap_vector);
                    }
                }
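                Err(e) => {
                    // Stop on an unsupported opcode instead of running on through the rest of memory.
                    eprintln!("{}", e);
                    self.running = false;
                }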
            }
        }
        restore_input_buffering();
        Ok(())
    }

    pub fn memory_read(&mut self, address: u16) -> u16 {
        if address == MR_KBSR {
            if input_buffering::check_key() {
                self.memory[MR_KBSR as usize] = 1 << 15;
                self.memory[MR_KBDR as usize] = self.get_char();
            } else {
                self.memory[MR_KBSR as usize] = 0;
            }
        }
        self.memory[address as usize]
    }

    fn get_char(&self) -> u16 {
        let mut buffer = [0; 1];
        io::stdin().read_exact(&mut buffer).unwrap();
        buffer[0] as u16
    }

    fn add(&mut self, instruction: u16) {
        let r0 = (instruction >> 9) & 0x7; // destination register
        let r1 = (instruction >> 6) & 0x7; // first source register
        if (instruction >> 5) & 0x1 == 1 {
            // Immediate mode: the low 5 bits are a signed constant.
            let imm5 = sign_extend(instruction & 0x1F, 5);
            // wrapping_add gives 16-bit two's-complement arithmetic without overflow panics.
            self.registers_storage[r0 as usize] =
                self.registers_storage[r1 as usize].wrapping_add(imm5);
        } else {
            let r2 = instruction & 0x7; // second source register
            self.registers_storage[r0 as usize] = self.registers_storage[r1 as usize]
                .wrapping_add(self.registers_storage[r2 as usize]);
        }
        self.update_flags(r0);
    }

    fn and(&mut self, instruction: u16) {
        let dr = (instruction >> 9) & 0x7;
        let r1 = (instruction >> 6) & 0x7;
        if (instruction >> 5) & 0x1 == 1 {
            let imm5 = instruction & 0x1F;
            let imm5 = sign_extend(imm5, 5);
            self.registers_storage[dr as usize] = self.registers_storage[r1 as usize] & imm5;
        } else {
            let r2 = instruction & 0x7;
            self.registers_storage[dr as usize] =
                self.registers_storage[r1 as usize] & self.registers_storage[r2 as usize];
        }
        self.update_flags(dr);
    }

    fn not(&mut self, instruction: u16) {
        let dr = (instruction >> 9) & 0x7;
        let r1 = (instruction >> 6) & 0x7;
        self.registers_storage[dr as usize] = !self.registers_storage[r1 as usize];
        self.update_flags(dr);
    }

    fn trap_getc(&mut self, input: Option<u16>) {
        if let Some(value) = input {
            self.registers_storage[Registers::R0 as usize] = value;
        } else {
            self.registers_storage[Registers::R0 as usize] = self.get_char();
        }
        self.update_flags(Registers::R0 as u16);
    }

    fn trap_out(&mut self) {
        let c = self.registers_storage[Registers::R0 as usize] as u8;
        unsafe {
            putchar(c as i32);
            // putchar writes to C's stdout buffer, so flush that buffer rather than Rust's.
            libc::fflush(std::ptr::null_mut());
        }
    }

    fn trap_puts(&mut self) {
        let mut address = self.registers_storage[Registers::R0 as usize];
        loop {
            let c = self.memory_read(address);
            if c == 0 {
                break;
            }
            print!("{}", c as u8 as char);
            address = address.wrapping_add(1); // avoid an overflow panic at 0xFFFF
        }
        io::stdout().flush().unwrap();
    }

    fn update_flags(&mut self, r: u16) {
        let value = self.registers_storage[r as usize];
        self.registers_storage[Registers::COND as usize] = if value == 0 {
            RCond::FL_ZRO as u16
        } else if (value >> 15) == 1 {
            RCond::FL_NEG as u16
        } else {
            RCond::FL_POS as u16
        };
    }
}

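// Declare C's putchar so trap_out can call it. The C function returns the
// character written (or EOF on error); this VM ignores that value.
extern "C" {
    fn putchar(c: i32) -> i32;
}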

fn main() -> io::Result<()> {
    let mut vm = VM::new();
    let args: Vec<String> = std::env::args().collect();
    if args.len() < 2 {
        eprintln!("Usage: {} <image-file>", args[0]);
        std::process::exit(2);
    }
    vm.read_image_file(File::open(&args[1])?)?;
    vm.run()
}

Running the VM

  1. Create a new Rust project:

     cargo new vm_tutorial
     cd vm_tutorial
    
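  2. Add dependencies to Cargo.toml. The signal handling sits behind nix’s signal feature, which recent nix releases do not enable by default. The code is written for the 2021 edition, so if cargo new generated edition = "2024", change it to "2021":

     [dependencies]
     libc = "0.2"
     nix = { version = "0.27", features = ["signal"] }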
    
  3. Put the code above in src/main.rs. The input_buffering module is defined inline with pub mod input_buffering { ... }, so no separate src/input_buffering.rs file is needed.

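  4. Create a sample LC-3 program file (program.obj) containing these bytes, with every 16-bit word stored big-endian:

    • Origin: 0x3000 (as [0x30, 0x00] in bytes).

    • Instructions: [0x10, 0x65, 0xF0, 0x21, 0xF0, 0x20, 0xF0, 0x21, 0xF0, 0x22, 0xF0, 0x25] (for ADD R0, R1, #5; OUT; GETC; OUT; PUTS; HALT).

    • String data: [0x00, 0x48, 0x00, 0x69, 0x00, 0x00] (for “Hi\0” at 0x3006, one character per 16-bit word).

  5. Run the program:

     cargo run -- program.obj

  6. Expected behavior: All registers start at zero, so ADD R0, R1, #5 leaves 5 in R0 and the first OUT writes the byte 0x05, a non-printing control character. GETC then reads one key into R0 and the second OUT echoes it. PUTS prints the string whose address is in R0; because R0 still holds the key code rather than 0x3006, it prints nothing, and HALT stops the VM. Printing “Hi” needs R0 = 0x3006, which takes an address-loading instruction such as LC-3’s LEA that this simplified VM does not implement. Likewise, a visible 8 requires R0 to hold 56 (the ASCII code for '8'), not the number 8.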
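Typing raw bytes by hand is error-prone, so one option is a small helper program that writes the image file. The sketch below is an assumption about how you might do this, not part of the original project: image_bytes is a hypothetical function, and the sample program builds 72 (the ASCII code for H) in R0 using only ADD immediates, since registers start at zero and the largest immediate is 15.

```rust
use std::fs;
use std::io;

/// Serialize an origin and a list of 16-bit words as big-endian bytes,
/// the layout read_image_file expects.
fn image_bytes(origin: u16, words: &[u16]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(2 * (words.len() + 1));
    bytes.extend_from_slice(&origin.to_be_bytes());
    for word in words {
        bytes.extend_from_slice(&word.to_be_bytes());
    }
    bytes
}

fn main() -> io::Result<()> {
    let words = [
        0x102F, // ADD R0, R0, #15
        0x102F, // ADD R0, R0, #15
        0x102F, // ADD R0, R0, #15
        0x102F, // ADD R0, R0, #15  (R0 = 60)
        0x102C, // ADD R0, R0, #12  (R0 = 72, ASCII 'H')
        0xF021, // TRAP OUT
        0xF025, // TRAP HALT
    ];
    fs::write("program.obj", image_bytes(0x3000, &words))?;
    Ok(())
}
```

Running this helper once produces a program.obj that the VM can load; its OUT trap should emit the byte for H before the VM halts.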

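Note: The input_buffering module calls termios through the libc crate, so it builds only on Unix-like systems and needs a working C linker (for example, sudo apt-get install build-essential on Debian or Ubuntu). It does not use ncurses. If issues arise, test with a hardcoded program by writing the words from a Vec<u16> straight into vm.memory instead of calling read_image_file.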

Note: The extern example uses putchar, requiring libc. On some systems, you may need to install libc-dev (Linux) or equivalent. For simplicity, you can comment out the putchar call in trap_out and use print! instead if linking issues arise.

How It Works

  • Bytecode: Loaded from a file into memory at the origin given by the file’s first word (0x3000 in our examples), as in read_image_file. Each 16-bit instruction has a 4-bit opcode (e.g., 1=ADD, 15=TRAP).

  • Execution Loop: The run method fetches instructions, decodes opcodes, and executes actions, using setup and restore_input_buffering for terminal settings.

  • Registers: R0–R7 for general use, PC starting at 0x3000, COND for flags (Negative, Zero, Positive).

  • Input Buffering: setup switches the terminal to unbuffered, no-echo mode so keys arrive immediately. check_key and MR_KBSR/MR_KBDR let memory_read poll the keyboard without blocking, while TRAP_GETC reads one byte with a blocking get_char, as in my codebase.

  • Traps: TRAP_GETC reads a character, TRAP_OUT uses putchar, TRAP_PUTS prints a string, and TRAP_HALT stops execution.

  • Error Handling: Invalid opcodes and trap vectors are reported on stderr, and file errors propagate out of main as an io::Result.

Testing the VM

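Try these programs in program.obj (listed as 16-bit words, origin first):

  • ADD and OUT: [0x3000, 0x1065, 0xF021, 0xF025] (R1 starts at 0, so this sets R0=5, writes the byte 0x05, and halts; setting R1=3 first, as the unit test does, gives R0=8).

  • GETC and PUTS: [0x3000, 0xF020, 0xF022, 0xF025, 0x0048, 0x0069, 0x0000] (reads a character, calls PUTS, halts; the string “Hi” sits at 0x3003, but PUTS prints it only if R0 holds 0x3003, and here R0 holds the character just read).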
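To experiment with word lists like these without creating a file, one approach is to copy them straight into a memory array. This is a sketch under assumptions: load_words is a hypothetical helper, and a Vec<u16> stands in for the VM’s memory field.

```rust
const MEMORY_SIZE: usize = 1 << 16;

/// Copy `words` into `memory` starting at `origin`.
/// Returns how many words fit; anything past the end of memory is dropped.
fn load_words(memory: &mut [u16], origin: u16, words: &[u16]) -> usize {
    let start = origin as usize;
    if start >= memory.len() {
        return 0;
    }
    let count = words.len().min(memory.len() - start);
    memory[start..start + count].copy_from_slice(&words[..count]);
    count
}

fn main() {
    let mut memory = vec![0u16; MEMORY_SIZE];
    // GETC; PUTS; HALT; then "Hi" as one character per word, plus a terminator.
    let program = [0xF020, 0xF022, 0xF025, 0x0048, 0x0069, 0x0000];
    let loaded = load_words(&mut memory, 0x3000, &program);
    println!("loaded {} words", loaded);
    for (offset, word) in program.iter().enumerate() {
        println!("{:#06X}: {:#06X}", 0x3000 + offset, word);
    }
}
```

Printing the addresses this way also makes it easy to see where string data lands relative to the instructions.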

Add unit tests in src/main.rs:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_add() {
        let mut vm = VM::new();
        vm.registers_storage[Registers::R1 as usize] = 3;
        vm.memory[0x3000] = 0x1065; // ADD R0, R1, #5
        vm.add(0x1065);
        assert_eq!(vm.registers_storage[Registers::R0 as usize], 8);
        assert_eq!(vm.registers_storage[Registers::COND as usize], RCond::FL_POS as u16);
    }

    #[test]
    fn test_getc() {
        let mut vm = VM::new();
        vm.trap_getc(Some(65)); // Simulate 'A'
        assert_eq!(vm.registers_storage[Registers::R0 as usize], 65);
    }
}

Run tests with:

cargo test

Exploring the extern Keyword

The extern keyword fascinated me because it lets Rust interface with C or other languages. In our VM, an extern "C" block declares C’s putchar, and trap_out calls it to implement TRAP_OUT, printing a character from R0 to the console. This is powerful for:

  • Low-Level I/O: Accessing system calls or hardware interfaces.

  • Interoperability: Integrating with existing C libraries for advanced VM features (e.g., file I/O, networking).

  • Performance: Using optimized C code for specific tasks.

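Be cautious with extern: the Rust compiler cannot check a foreign function’s signature or behavior, so every call has to sit inside an unsafe block, and a wrong declaration becomes your bug rather than a compile error. In a full VM, you might use it for keyboard input or other I/O operations.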
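A common way to contain that risk is to wrap each foreign call in a small safe function. The sketch below assumes a Unix-like target where Rust’s standard library already links the C library, and it uses the unsafe extern form, which is accepted from Rust 1.82 and required on the 2024 edition. The wrapper names c_toupper and c_put_byte are made up for this example.

```rust
use std::os::raw::c_int;

// Declarations for two functions from the C standard library.
unsafe extern "C" {
    fn toupper(c: c_int) -> c_int;
    fn putchar(c: c_int) -> c_int;
}

/// Safe wrapper: toupper is defined for any value that fits in an unsigned char.
fn c_toupper(byte: u8) -> u8 {
    unsafe { toupper(byte as c_int) as u8 }
}

/// Safe wrapper: returns true if C reported the byte as written.
fn c_put_byte(byte: u8) -> bool {
    unsafe { putchar(byte as c_int) == byte as c_int }
}

fn main() {
    for byte in b"hi\n" {
        c_put_byte(c_toupper(*byte));
    }
}
```

Keeping the unsafe block inside the wrappers means the rest of the program only sees ordinary safe functions with Rust types.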

Next Steps

Enhance your VM with:

  • Memory Operations: Add opcodes like LD (load), ST (store), LDI (load indirect), and STI (store indirect) for memory access.

  • Conditionals: Introduce BR (branch) for control flow based on condition flags.

  • Subroutines: Add JSR (jump to subroutine) for function calls.

  • I/O: Implement TRAP_PUTSP for packed string output, building on the trap_puts logic shown above (TRAP_PUTS itself is already implemented).

  • Disassembler: Write a function to print bytecode as human-readable text (e.g., ADD R0, R1, #5).
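As a starting point for the disassembler idea, here is a minimal sketch that covers only the opcodes and traps used in this tutorial. The function names are hypothetical, and unknown words are printed as raw data rather than guessed at.

```rust
/// Interpret the low 5 bits of an instruction as a signed immediate.
fn imm5_signed(instr: u16) -> i16 {
    let raw = (instr & 0x1F) as i16;
    if raw & 0x10 != 0 { raw - 32 } else { raw }
}

/// Format ADD or AND, which share the same operand layout.
fn three_operand(name: &str, instr: u16) -> String {
    let dr = (instr >> 9) & 0x7;
    let sr1 = (instr >> 6) & 0x7;
    if (instr >> 5) & 1 == 1 {
        format!("{} R{}, R{}, #{}", name, dr, sr1, imm5_signed(instr))
    } else {
        format!("{} R{}, R{}, R{}", name, dr, sr1, instr & 0x7)
    }
}

/// Turn one 16-bit instruction into readable text.
fn disassemble(instr: u16) -> String {
    match instr >> 12 {
        1 => three_operand("ADD", instr),
        5 => three_operand("AND", instr),
        9 => format!("NOT R{}, R{}", (instr >> 9) & 0x7, (instr >> 6) & 0x7),
        15 => match instr & 0xFF {
            0x20 => "GETC".to_string(),
            0x21 => "OUT".to_string(),
            0x22 => "PUTS".to_string(),
            0x25 => "HALT".to_string(),
            vector => format!("TRAP x{:02X}", vector),
        },
        _ => format!(".FILL x{:04X}", instr), // not an instruction this VM knows
    }
}

fn main() {
    for instr in [0x1065u16, 0x5020, 0x927F, 0xF021, 0xF025] {
        println!("{:#06X}  {}", instr, disassemble(instr));
    }
}
```

Running a disassembler over a hand-assembled program is also a quick way to catch encoding mistakes before loading it into the VM.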

Explore real-world VMs like the JVM (stack-based) or Lua VM for inspiration, or check out blockchain-based execution models like the Aqua Verifier CLI for advanced ideas.

Conclusion

Building a virtual machine in Rust was a thrilling journey into low-level systems programming. It helped me visualize how high-level code becomes binary, how values move through memory and registers, and how Rust’s safety features shine in such projects. The extern keyword opened my eyes to Rust’s interoperability with C, paving the way for more complex systems. I hope this tutorial sparks your curiosity to build, tweak, and explore VMs further!

For the full project, including additional opcodes like BR, LD, and JSR, check out my GitHub repository. Dive deeper with the Rust Book or Crafting Interpreters for more insights into systems programming and VM design. Share your VM enhancements on GitHub or let me know how it goes!

Written by

Biliqis Onikoyi

Web3 || FrontEnd Dev