The Central Processing Unit (CPU) of the Nintendo Game Boy, commonly referred to as the Sharp LR35902 and often described as a cross between the Intel 8080 and the Zilog Z80, is the core component responsible for actually executing a Game Boy game. It is the engine of the Game Boy, so to speak. An emulator can execute game code without a graphics processor (GPU) or audio processor (APU), even though nothing would be drawn or heard, but it cannot do anything at all without a CPU. Therefore, when building a Game Boy emulator, creating the CPU is the logical first step.

This article will attempt to go over the basics of writing a Game Boy Emulator CPU in Rust (although you can take this advice and apply it to any programming language, not just Rust). In my first article on this blog, I gave some tips on how to go about writing the CPU, but in this article I am going to delve into a lot more detail.

I will first explain the concepts behind opcodes, cover the registers that are used in the CPU, and describe how the fetch-decode-execute cycle works.

What's An Opcode?

An opcode (operation code) identifies a single operation for the CPU to execute. The term is often used interchangeably with "instruction", and a program's instructions taken together are its machine code.

At the lowest level, a processor interprets a series of opcodes in memory and executes them. A processor is only able to understand a limited set of opcodes, and each opcode represents a very simple operation such as addition, subtraction, or loading a value into a register. No matter the complexity of the software you may use today, it is all executed as machine code by your computer.

A computer can only understand binary, which is basically 1s and 0s, but a human would have a very hard time building software with raw binary because it is not human readable. Back in the day, when game developers built your favorite Game Boy games, they largely worked in assembly language, which is a human-readable notation for these low-level CPU instructions.

This collection of instructions is encoded into a ROM (read-only memory) chip, which is built into the cartridges that you would insert into the back of your Game Boy before you play the game. So when we build a Game Boy emulator, we need to handle loading these ROMs into memory, and then interpreting the CPU instructions encoded in these ROMs to be able to play games.
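As a rough illustration of the loading step, here is a minimal sketch that copies ROM bytes into the start of a flat 64 KiB memory array. The `Memory` type and its method names are hypothetical, not taken from my emulator, and the sketch assumes a simple cartridge that fits in the 32 KiB ROM window at 0x0000 to 0x7FFF (larger cartridges rely on bank switching, which is ignored here).

```rust
/// Size of the Game Boy's 16-bit address space.
const ADDRESS_SPACE: usize = 0x10000;
/// Cartridge ROM is visible at addresses 0x0000..=0x7FFF.
const ROM_WINDOW: usize = 0x8000;

/// Hypothetical memory type: one flat byte array covering the whole address space.
pub struct Memory {
    bytes: Vec<u8>,
}

impl Memory {
    pub fn new() -> Self {
        Memory { bytes: vec![0; ADDRESS_SPACE] }
    }

    /// Copy ROM bytes into the start of memory. Anything beyond the
    /// 32 KiB window is ignored in this sketch (no bank switching).
    pub fn load_rom(&mut self, rom: &[u8]) {
        let len = rom.len().min(ROM_WINDOW);
        self.bytes[..len].copy_from_slice(&rom[..len]);
    }

    /// Read a single byte at the given 16-bit address.
    pub fn read_byte(&self, address: u16) -> u8 {
        self.bytes[address as usize]
    }
}
```

On a desktop build, the ROM bytes could come from something like `std::fs::read`, which returns the file contents as a `Vec<u8>`.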

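The CPU has roughly 500 opcodes: of the 256 possible single-byte values, 11 are unused and one (0xCB) is a prefix, and the 0xCB prefix adds another 256 two-byte opcodes. To properly emulate Game Boy games, you need to emulate the behavior of each opcode. Sound intimidating? Actually it’s not as bad as you might think.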

Opcode Groups

Most opcodes in the Game Boy CPU can be thought of as belonging to a group of opcodes with similar behavior (aside from a select few). For example, a huge chunk of the CB opcodes are dedicated to very simple bit operations: setting a bit, resetting a bit, or testing whether a bit is set. The byte being operated on and the bit number both depend on the opcode. Opcode 0xCB 0xF2 sets bit 6 of register D, for instance, while opcode 0xCB 0x8B resets bit 1 of register E.
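To make the grouping concrete, here is a small sketch of how the second byte of a CB opcode could be split into its parts. The top two bits select the group, the middle three bits give the bit number (or the kind of rotate/shift), and the low three bits select the operand in the order B, C, D, E, H, L, (HL), A. The `Target` and `CbOperation` types and the `decode_cb` function are hypothetical names for illustration, not part of my emulator.

```rust
/// Operand selected by the low three bits of a CB opcode.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Target { B, C, D, E, H, L, HlIndirect, A }

/// Operation selected by the upper five bits of a CB opcode.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum CbOperation {
    /// 0x00..=0x3F: rotates, shifts and swap; the payload (0..=7) picks which one.
    RotateOrShift(u8),
    /// 0x40..=0x7F: test the given bit.
    TestBit(u8),
    /// 0x80..=0xBF: reset (clear) the given bit.
    ResetBit(u8),
    /// 0xC0..=0xFF: set the given bit.
    SetBit(u8),
}

/// Split the byte that follows 0xCB into an operation and its operand.
pub fn decode_cb(opcode: u8) -> (CbOperation, Target) {
    let target = match opcode & 0b111 {
        0 => Target::B,
        1 => Target::C,
        2 => Target::D,
        3 => Target::E,
        4 => Target::H,
        5 => Target::L,
        6 => Target::HlIndirect,
        _ => Target::A,
    };
    let index = (opcode >> 3) & 0b111;
    let operation = match opcode >> 6 {
        0 => CbOperation::RotateOrShift(index),
        1 => CbOperation::TestBit(index),
        2 => CbOperation::ResetBit(index),
        _ => CbOperation::SetBit(index),
    };
    (operation, target)
}
```

With a decoder like this, one function per operation kind could cover a whole block of 64 opcodes instead of one match arm each.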

In other words, if you can implement one opcode as a reusable function, you can re-use that function for other opcodes to make life easier. As alluded to above, there are not only opcodes for simple bit operations, but also opcodes for arithmetic operations like adding/subtracting, opcodes for loading data into registers, opcodes for managing an internal stack, and more.

This reference shows a complete list of all opcodes grouped by functionality.

CPU State

Registers are small pieces of memory built into the CPU and used for storing and manipulating data. On the Game Boy, each general-purpose register stores just one byte. They are called A, B, C, D, E, H, L, and F, where F holds the CPU's status flags. For two-byte data manipulation, the CPU has opcodes that work with register pairs (AF, BC, DE, and HL), which are just two registers used side by side, each supplying one byte of a two-byte operand. For example, opcode 0x09 performs a sixteen-bit (two-byte) addition: it adds register pair BC to register pair HL and stores the sum in HL.

There are also utility registers used by the CPU. The program counter is a register that always holds the address of the next opcode in memory. When the CPU is finished executing the current opcode, it uses the address stored in the program counter to read the next one.
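Reading through the program counter could be sketched as follows. The `Cpu` struct and the `fetch_byte` and `fetch_word` names are hypothetical stand-ins for illustration, and memory is assumed to be a flat 64 KiB slice. Two-byte values are stored low byte first (little-endian) on the Game Boy.

```rust
/// Hypothetical CPU state holding only the program counter.
pub struct Cpu {
    pub program_counter: u16,
}

/// Read the byte the program counter points at, then advance it by one.
/// `memory` is assumed to cover the full 64 KiB address space.
pub fn fetch_byte(cpu: &mut Cpu, memory: &[u8]) -> u8 {
    let byte = memory[cpu.program_counter as usize];
    cpu.program_counter = cpu.program_counter.wrapping_add(1);
    byte
}

/// Read a little-endian two-byte value: low byte first, then high byte.
pub fn fetch_word(cpu: &mut Cpu, memory: &[u8]) -> u16 {
    let low = fetch_byte(cpu, memory) as u16;
    let high = fetch_byte(cpu, memory) as u16;
    (high << 8) | low
}
```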

The CPU also manages an internal stack, using a stack pointer register. Like the program counter, the stack pointer stores an address in memory, and that address always refers to the top of the stack. With the stack, game developers can take advantage of LIFO (last in, first out) operations. It is most commonly used when we need to jump to another location in memory and come back later. In these cases, we push the current value of the program counter onto the stack, then update the program counter to point to the new address. Eventually we return by popping the original address off the stack and restoring it to the program counter.
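The call-and-return pattern described above might be sketched like this. The `Cpu` struct and function names are hypothetical, and memory is again assumed to be a flat 64 KiB slice. On the Game Boy the stack grows downward: a push decrements the stack pointer by two, and a pop increments it by two.

```rust
/// Hypothetical CPU state holding only the two utility registers.
pub struct Cpu {
    pub program_counter: u16,
    pub stack_pointer: u16,
}

/// Push a two-byte value: high byte first, so the low byte ends up
/// at the lower address (the new top of the stack).
pub fn push_word(cpu: &mut Cpu, memory: &mut [u8], value: u16) {
    cpu.stack_pointer = cpu.stack_pointer.wrapping_sub(1);
    memory[cpu.stack_pointer as usize] = (value >> 8) as u8;
    cpu.stack_pointer = cpu.stack_pointer.wrapping_sub(1);
    memory[cpu.stack_pointer as usize] = (value & 0xFF) as u8;
}

/// Pop a two-byte value: low byte first, then high byte.
pub fn pop_word(cpu: &mut Cpu, memory: &[u8]) -> u16 {
    let low = memory[cpu.stack_pointer as usize] as u16;
    cpu.stack_pointer = cpu.stack_pointer.wrapping_add(1);
    let high = memory[cpu.stack_pointer as usize] as u16;
    cpu.stack_pointer = cpu.stack_pointer.wrapping_add(1);
    (high << 8) | low
}

/// Save the current program counter on the stack, then jump to `target`.
/// Assumes the program counter already points past the call instruction.
pub fn call(cpu: &mut Cpu, memory: &mut [u8], target: u16) {
    let return_address = cpu.program_counter;
    push_word(cpu, memory, return_address);
    cpu.program_counter = target;
}

/// Restore the program counter from the top of the stack.
pub fn return_from_call(cpu: &mut Cpu, memory: &[u8]) {
    let return_address = pop_word(cpu, memory);
    cpu.program_counter = return_address;
}
```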

At its most basic level, CPU state could be described with the following struct in Rust:

pub struct CpuState {
    a: u8,
    b: u8,
    c: u8,
    d: u8,
    e: u8,
    h: u8,
    l: u8,
    f: u8,
    program_counter: u16,
    stack_pointer: u16
}
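Register pairs can then be layered on top of the one-byte fields. The sketch below uses a cut-down, hypothetical `Registers` struct with only the four registers needed to illustrate opcode 0x09 (add BC to HL); the method names are my own for this example. Flag updates in the F register are deliberately left out to keep it short.

```rust
/// Hypothetical subset of the CPU registers, enough to show pairs BC and HL.
#[derive(Default)]
pub struct Registers {
    pub b: u8,
    pub c: u8,
    pub h: u8,
    pub l: u8,
}

impl Registers {
    /// BC as a 16-bit value: B is the high byte, C is the low byte.
    pub fn bc(&self) -> u16 {
        u16::from_be_bytes([self.b, self.c])
    }

    /// HL as a 16-bit value: H is the high byte, L is the low byte.
    pub fn hl(&self) -> u16 {
        u16::from_be_bytes([self.h, self.l])
    }

    /// Split a 16-bit value back into H (high byte) and L (low byte).
    pub fn set_hl(&mut self, value: u16) {
        let [high, low] = value.to_be_bytes();
        self.h = high;
        self.l = low;
    }

    /// Opcode 0x09 without flag handling: HL = HL + BC, wrapping on overflow.
    pub fn add_bc_to_hl(&mut self) {
        let sum = self.hl().wrapping_add(self.bc());
        self.set_hl(sum);
    }
}
```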

Fetch-Decode-Execute Loop

As mentioned before, from boot-up until shutdown the CPU is constantly fetching the next opcode from memory, decoding it, and executing it. This is done in an infinite loop.

With Rust, the actual decoding and executing of the opcode can be done with a match statement. Here's a simplified example taken from my own emulator:

pub fn step(emulator: &mut Emulator) {
    // Fetch the next instruction in memory.
    let opcode = read_next_instruction_byte(emulator);

    match opcode {
        0x00 =>
            // Opcode 0x00 is a NOP (no operation).
            (),
        0x01 => { 
            // Fetch the next two immediate bytes.
            let word = read_next_instruction_word(emulator);
            // Store the word into register pair BC.
            microops::store_in_register_pair(&mut emulator.cpu, REGISTER_BC, word);
        },
        0x02 => {
            // Read address from register pair BC.
            let address = microops::read_from_register_pair(&mut emulator.cpu, &REGISTER_BC);
            // Load the value of register A into address.
            loads::load_source_register_in_memory(emulator, Register::A, address);
        },
        0x03 =>
            alu::increment_register_pair(&mut emulator.cpu, REGISTER_BC),
        0x04 =>
            alu::increment_register(&mut emulator.cpu, Register::B),

        // ... etc.
        // An implementation is needed for each
        // opcode up to 0xFF (minus illegal opcodes).
    }
}

Given the number of opcodes you need to handle, this will become a very long match statement. I've seen other emulators work around this by using a hash map that maps opcodes to the functions that execute them, but I figured that for my purposes it was fine to just use a match statement, as I felt it was simpler. For an emulator of this scale, neither approach is likely to have performance implications that matter, so it really comes down to which one you find more readable.
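For comparison, the hash map alternative could look roughly like the sketch below. This is not how my emulator works, and all names here (`Cpu`, `Handler`, `build_handlers`, `execute`) are hypothetical. Only three opcodes are registered (0x00 NOP, 0x04 INC B, 0x0C INC C), and flag updates are omitted.

```rust
use std::collections::HashMap;

/// Hypothetical CPU state with just two registers.
pub struct Cpu {
    pub b: u8,
    pub c: u8,
}

/// Every opcode handler shares this signature.
type Handler = fn(&mut Cpu);

fn nop(_cpu: &mut Cpu) {}

/// INC B without flag handling.
fn increment_b(cpu: &mut Cpu) {
    cpu.b = cpu.b.wrapping_add(1);
}

/// INC C without flag handling.
fn increment_c(cpu: &mut Cpu) {
    cpu.c = cpu.c.wrapping_add(1);
}

/// Build the opcode-to-handler map once at startup.
pub fn build_handlers() -> HashMap<u8, Handler> {
    let mut handlers: HashMap<u8, Handler> = HashMap::new();
    handlers.insert(0x00, nop as Handler);
    handlers.insert(0x04, increment_b as Handler);
    handlers.insert(0x0C, increment_c as Handler);
    handlers
}

/// Look up and run the handler for `opcode`.
/// Returns false when no handler is registered for it.
pub fn execute(handlers: &HashMap<u8, Handler>, cpu: &mut Cpu, opcode: u8) -> bool {
    match handlers.get(&opcode) {
        Some(handler) => {
            (*handler)(cpu);
            true
        }
        None => false,
    }
}
```

One trade-off of this design is that a missing opcode only shows up at runtime as a failed lookup, whereas a match statement keeps all of the decoding logic in one place.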

This match statement will handle opcodes up to 0xFF, but you will also need to handle 0xCB opcodes. Since each opcode is just one byte, at most 256 opcodes could be encoded this way. 0xCB opcodes extend this limit: they are two-byte opcodes that always begin with 0xCB, followed by a second byte that selects one of 256 additional opcodes.

When the opcode 0xCB is decoded in the CPU, the CPU should know to fetch the next byte in memory which will represent the 0xCB opcode that needs to be executed.

In the 0xCB opcode handler, this can be expressed by simply calling another function that handles 0xCB opcodes:

// ... other match statement handlers
0xCB =>
    execute_cb_opcode(emulator),
// ... other match statement handlers

Then, your execute_cb_opcode function can do something very similar to the step function shown above and handle the specific 0xCB opcode:

fn execute_cb_opcode(emulator: &mut Emulator) {
    let opcode = read_next_instruction_byte(emulator);
    match opcode {
        0x00 =>
            bitops::rotate_register_left(&mut emulator.cpu, Register::B),
        0x01 =>
            bitops::rotate_register_left(&mut emulator.cpu, Register::C),
        0x02 =>
            bitops::rotate_register_left(&mut emulator.cpu, Register::D),
        0x03 =>
            bitops::rotate_register_left(&mut emulator.cpu, Register::E),
        0x04 =>
            bitops::rotate_register_left(&mut emulator.cpu, Register::H),

        // ... etc.
    }
}

The final missing piece that hasn't been shown yet (but has been summarized in my previous article) is the main loop that repeatedly runs the fetch-decode-execute cycle.

In general, this can be expressed as a simple infinite loop that repeatedly runs the CPU's fetch-decode-execute cycle, while also stepping the GPU and APU incrementally to keep them in sync with the CPU.
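One common way to keep the components in sync, sketched below under simplifying assumptions, is to have the CPU step report how many clock cycles it consumed and feed that number to the GPU. Everything here is a hypothetical stand-in: `cpu_step` pretends every instruction takes 4 cycles, and `gpu_step` only counts cycles until a full frame (154 scanlines of 456 cycles each on the original Game Boy) has elapsed.

```rust
/// Clock cycles in one frame on the original Game Boy:
/// 154 scanlines of 456 cycles each.
const CYCLES_PER_FRAME: u32 = 154 * 456;

/// Hypothetical emulator state for this sketch.
pub struct Emulator {
    pub cycles_this_frame: u32,
    pub instructions_run: u64,
}

/// Stand-in for the CPU step: pretends every instruction takes 4 cycles.
fn cpu_step(emulator: &mut Emulator) -> u32 {
    emulator.instructions_run += 1;
    4
}

/// Stand-in for the GPU step: accumulates cycles and reports `true`
/// once a full frame's worth has elapsed.
fn gpu_step(emulator: &mut Emulator, cycles: u32) -> bool {
    emulator.cycles_this_frame += cycles;
    if emulator.cycles_this_frame >= CYCLES_PER_FRAME {
        emulator.cycles_this_frame -= CYCLES_PER_FRAME;
        true
    } else {
        false
    }
}

/// Run CPU and GPU in lockstep until the GPU signals a completed frame.
pub fn run_frame(emulator: &mut Emulator) {
    loop {
        let cycles = cpu_step(emulator);
        if gpu_step(emulator, cycles) {
            break;
        }
    }
}
```

A desktop emulator could call a function like `run_frame` from its outer loop once per displayed frame.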

My emulator is a little more complex as it is run directly in the browser, and the core Rust codebase is compiled down to WebAssembly in order to accomplish this. I have a function called step_frame which loops until the next frame needs to be rendered on the screen:

#[wasm_bindgen(js_name = stepFrame)]
pub fn step_frame() {
    EMULATOR.with(|emulator_cell| {
        let mut emulator = emulator_cell.borrow_mut();

        let mut frame_rendered = false;

        while !frame_rendered {
            emulator::step(&mut emulator, |buffer: &Vec<u8>| {
                render(buffer.as_slice());
                frame_rendered = true;
            });
        }
    })
}

As explained in my previous article, the emulator's step function simply calls the step function for my CPU (shown above) and GPU (and eventually my APU once I've finished implementing audio support):

pub fn step(emulator: &mut Emulator, render: impl FnMut(&Vec<u8>)) {
    cpu::opcodes::step(emulator);
    gpu::step(emulator, render);
}

The GPU accepts a closure which will be called with the current frame buffer when the GPU determines it's time to render the frame. Since my code is compiled down to WebAssembly, I let the JavaScript portion of my code determine when it's appropriate to call step_frame again.

Conclusion

To summarize, a CPU works with opcodes, each of which represents a very simple operation. The purpose of the CPU is to run a fetch-decode-execute loop: it fetches the next opcode from memory, decodes it to understand what exactly it needs to do, and finally executes it, which mutates the state of the CPU or RAM. This can be emulated with a loop that repeatedly reads the next instruction from memory and executes it. These instructions can be decoded with a match statement that maps opcodes to the correct functionality.

How to Build the CPU for a Game Boy Emulator