Build a Compiler from Scratch, Part 1.1: A Hello World of sorts

Geoffrey Copin

It has become common practice to start with a "Hello World" program when learning a new programming language. This is a simple program that outputs the text "Hello, World!" to the screen. While writing such a program is a trivial task with most programming languages, getting our own language to interface with the OS and write data to the standard output will take a fair amount of work.

So we'll start with something simpler. How much simpler? Well, here is the first Pylite program that our compiler will translate to machine code:

def main():
    return 1

It just sets a return code and exits. This is the simplest program we can write that still has an easily observable effect: once we can produce an executable, running it and then printing the shell's $? variable will show the value 1. It will be our "Hello World" program for now.

Throughout this chapter, we will build the basic components of our compiler, including the parser and the translation passes to the intermediate representation and to machine code. Each of these components will only support the minimum features needed to compile our "Hello World" program, and will be expanded in later chapters to support more complex constructs.

Setting up the project

We will start by creating a new cargo workspace to host our compiler.

$ mkdir pylite
$ cd pylite

For now, this workspace will contain a single compiler crate:

# Cargo.toml
[workspace]
resolver = "2"
members = ["compiler"]

We can now create the compiler crate:

$ cargo new compiler --bin
$ cargo run 2>/dev/null
Hello, world!

Our top-level pylite folder should have the following structure:

.
├── Cargo.toml
└── compiler
    ├── Cargo.toml
    └── src
        └── main.rs

Parsing Pylite

The textual representation of a program does not lend itself well to the kind of analysis and transformation that we need to perform. In order to resolve types, associate identifiers with their definitions and generate our low-level intermediate representation, we need to convert the unstructured sequence of characters that make up the source code into a tree-like structure that represents the program's syntax. This process is called parsing, and the data structure it produces is called an Abstract Syntax Tree (AST). The parsing process is typically split into two parts: lexical analysis (or tokenization) and syntactic analysis (or parsing).

Source code
===========

def main():
    return 1


Token sequence
==============

┌───┐ ┌────┐ ┌─┐ ┌─┐ ┌─┐ ┌──────┐ ┌──────┐ ┌─┐ ┌──────┐
│def│ │main│ │(│ │)│ │:│ │INDENT│ │return│ │1│ │DEDENT│
└───┘ └────┘ └─┘ └─┘ └─┘ └──────┘ └──────┘ └─┘ └──────┘


Abstract Syntax Tree
====================

┌─────────────┐
│ FunctionDef │
│   (main)    │
└─────────────┘
        │
        └── ┌─────────────┐
            │ Return Stmt │
            └─────────────┘
                    │
                    └── ┌─────────┐
                        │ Literal │
                        │    1    │
                        └─────────┘

During lexical analysis, we group individual characters into tokens, which are the smallest meaningful units of the language. For example, the characters d, e and f will be grouped into a single token representing the def keyword. Irrelevant characters, such as non-significant whitespace, are discarded during this process. Since Pylite - like Python - is indentation-sensitive, we need to keep track of the current indentation level. This will be done by inserting INDENT and DEDENT tokens into the token sequence whenever the indentation level changes.

The syntax analysis phase will then take this sequence of tokens and match it against our language's syntax rules (or grammar) to build a tree-like structure called an Abstract Syntax Tree (AST).

Writing the lexer

The first step in writing our lexer is to choose a representation for our tokens. Rust's enums are a great fit for this purpose. We'll also pair each token with a Span struct that will hold the token's start and end byte position in the source code. Keeping track of the token's position will prove useful to display meaningful error messages to the user.

//! compiler/src/ast.rs

#[derive(Debug, Copy, Clone, Eq, PartialEq, Hash)]
pub struct Span {
    pub start: u32,
    pub end: u32,
}

impl Span {
    // Create a new span with the given start and end position.
    pub fn new(start: u32, end: u32) -> Self {
        Self { start, end }
    }

    // Create a new span that covers both the current span and the given span.
    pub fn merge(self, other: Span) -> Self {
        Self {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }
}
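
merge will typically be used to combine the span of a keyword with the span of the node that follows it. A tiny optional test, not part of the original code, could be added at the bottom of ast.rs to show the effect:

#[cfg(test)]
mod span_test {
    use super::*;

    #[test]
    fn merge_covers_both_spans() {
        // e.g. the span of a `def` keyword (bytes 0..3) merged with the
        // span of a function body (bytes 16..24) covers the whole declaration
        assert_eq!(Span::new(0, 3).merge(Span::new(16, 24)), Span::new(0, 24));
    }
}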

//! compiler/src/parse/token.rs

#[derive(Debug, Clone, Eq, PartialEq, Hash)]
pub enum Token {
    Def,
    Return,
    LPar,
    RPar,
    Colon,
    Identifier(String),
    Int(i64),
    Unknown(char),
}

The Unknown variant will be used to represent any unexpected character that we encounter during the tokenization process.

With these definitions in place, we can create a scaffolding for our lexer. We'll model it as an iterator yielding tuples of (Token, Span).

//! compiler/src/parse/lexer.rs

use std::{iter::Peekable, str::Chars};

use crate::ast::Span;

use super::token::Token;

pub struct Lexer<'c> {
    // iterator over the source code's characters
    input: Peekable<Chars<'c>>,
    // current byte position in the source code
    position: u32,
}

impl<'c> Lexer<'c> {
    pub fn new(input: &'c str) -> Self {
        Self {
            input: input.chars().peekable(),
            position: 0,
        }
    }

    fn next_token(&mut self) -> Option<(Token, Span)> {
        unimplemented!() 
    }
}

impl Iterator for Lexer<'_> {
    type Item = (Token, Span);

    fn next(&mut self) -> Option<Self::Item> {
        self.next_token()
    }
}

We'll also add the following helper methods to simplify the implementation of next_token:

  • emit_token takes a start position and a Token and returns a tuple of the form (Token, Span), where the Span starts at start and ends at the current position.
  • next_char consumes the next character in the input (if any), and updates the current position.
  • next_char_if conditionally consumes the next character and updates the current position.

Since we track the byte position of the tokens and utf-8 characters can span multiple bytes, we can't simply increment the position for each character. Instead, we'll use the char::len_utf8 method to get the byte size of the current character and update the position accordingly.
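
For example, 'é' is a single char but takes two bytes in UTF-8; a quick scratch program (not part of the compiler) illustrates the difference:

fn main() {
    assert_eq!('a'.len_utf8(), 1);
    assert_eq!('é'.len_utf8(), 2); // one char, two bytes in UTF-8
    assert_eq!("é".len(), 2); // str::len counts bytes
    assert_eq!("é".chars().count(), 1); // chars() counts characters
}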

//! compiler/src/parse/lexer.rs

impl<'c> Lexer<'c> {
    // [...] 

    fn emit_token(&self, start: u32, token: Token) -> Option<(Token, Span)> {
        Some((
            token,
            Span {
                start,
                end: self.position,
            },
        ))
    }

    fn next_char(&mut self) -> Option<char> {
        self.next_char_if(|_| true)
    }

    fn next_char_if(&mut self, f: impl FnOnce(char) -> bool) -> Option<char> {
        self.input.next_if(|&c| f(c)).inspect(|c| {
            self.position += c.len_utf8() as u32;
        })
    }  
}

Some tokens contain a single character, such as ( or :. We can handle these tokens by comparing the current character to the expected one and returning the corresponding token if they match.

//! compiler/src/parse/lexer.rs

impl<'c> Lexer<'c> {
    fn next_token(&mut self) -> Option<(Token, Span)> {
        let start_pos = self.position;

        match self.next_char()? {
            '(' => self.emit_token(start_pos, Token::LPar),
            ')' => self.emit_token(start_pos, Token::RPar),
            ':' => self.emit_token(start_pos, Token::Colon),
            c => unimplemented!(),
        }
    }
}

The remaining cases are more involved, but follow a similar pattern:

  • if the current character is a digit, we consume the following characters until we reach a non-digit character. We then parse the resulting string as an integer and create an integer token.
  • if the current character is a letter, we consume the following characters until we reach a character that is neither alphanumeric nor an underscore. We then check if the resulting string matches a keyword and create a keyword token if it does, or an identifier token otherwise.
  • if the current character is whitespace, we skip all the following whitespace characters and return the next token.

Finally, if the current character does not match any of the above cases, we create an Unknown token.

//! compiler/src/parse/lexer.rs

impl<'c> Lexer<'c> {
    // [...]
    fn next_token(&mut self) -> Option<(Token, Span)> {
        // [...]
        match self.next_char()? {
            // [...]
            c if c.is_numeric() => {
                let mut int_repr = String::from(c);
                while let Some(c) = self.next_char_if(|c| c.is_numeric()) {
                    int_repr.push(c);
                }
                let value = int_repr.parse::<i64>().expect("int token");
                self.emit_token(start_pos, Token::Int(value))
            }
            c if c.is_alphabetic() => {
                let mut identifier = String::from(c);
                while let Some(c) = self.next_char_if(|c| c.is_alphanumeric() || c == '_') {
                    identifier.push(c);
                }
                match identifier.as_str() {
                    "def" => self.emit_token(start_pos, Token::Def),
                    "return" => self.emit_token(start_pos, Token::Return),
                    _ => self.emit_token(start_pos, Token::Identifier(identifier)),
                }
            }
            c if c.is_whitespace() => {
                while self.next_char_if(|c| c.is_whitespace()).is_some() {}
                self.next_token()
            }
            c => self.emit_token(start_pos, Token::Unknown(c)),
        }
    }
}

With our lexer almost complete, it's time to write our first test! As building and verifying the token spans by hand would be tedious, we'll write a helper test_lexer function that takes as input the source code and a vec of (&str, Token) tuples representing the expected tokens, with the &str being the token's textual representation. If tokenizing the input and rendering the Spans does not yield the same vec, the test will fail.

//! compiler/src/parse/lexer.rs

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn tokenize_return_const() {
        let input = r##"
def main():
    return 34
"##;

        let expected = vec![
            ("def", Token::Def),
            ("main", Token::Identifier("main".to_string())),
            ("(", Token::LPar),
            (")", Token::RPar),
            (":", Token::Colon),
            ("return", Token::Return),
            ("34", Token::Int(34)),
        ];

        test_lexer(input, expected);
    }

    fn test_lexer(input: &str, expected: Vec<(&str, Token)>) {
        let rendered: Vec<(&str, Token)> = Lexer::new(input)
            .map(|(token, span)| {
                let rendered = &input[span.start as usize..span.end as usize];
                (rendered, token)
            })
            .collect();

        assert_eq!(expected, rendered);
    }
}

Let's make sure that our test passes before moving on to the next section:

$ cargo test
running 1 test
test parse::lexer::test::tokenize_return_const ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Handling indentation

One defining aspect of Pylite's syntax is that it is indentation-sensitive. This means that blocks are delimited by their indentation level, rather than by explicit braces or keywords. To simplify the parsing process, we'll insert Indent and Dedent tokens into the token stream whenever we detect a change in the current indentation level.

How do we detect such changes? First, we need to define which character - or sequence of characters - represents a single level of indentation. In Pylite, a single tab, or four consecutive spaces, will represent a single level of indentation. Every time we reach a new line, we'll count the number of tabs and spaces at the beginning of the line and compare the new indentation level to the previous one. If the indentation levels match, there is no extra token to insert. If they don't match, we'll compute delta = abs(new_indentation - previous_indentation) and insert delta Indent or Dedent tokens accordingly. In case of inconsistent indentation, for example if a new line starts with three spaces, we'll insert an InconsistentIndentation token.

Let's start by adding the new variants to our Token enum:

//! compiler/src/parse/token.rs

#[derive(Debug, Clone, Eq, PartialEq, Hash)]
pub enum Token {
    Def,
    Return,
    LPar,
    RPar,
    Colon,
    Identifier(String),
    Int(i64),
    Unknown(char),
+   Indent,
+   Dedent,
+   InconsistentIndentation,
}

// [...]

In some cases, the difference in indentation level will be higher than one. For example, we'll have to insert two Dedent tokens at the end of the following function definition. One to close the if block, and the other to close the function definition.

def main():
    if True:
        return 1

This doesn't fit well with the current design of our lexer, as we return tokens eagerly as soon as they are recognized. We'll need to refactor our lexer to support emitting multiple tokens at once. For this reason, we'll split the work done by our next_token method into two separate steps: first push the recognized tokens into a queue, then return the first token in the queue, if any.

//! compiler/src/parse/lexer.rs

-use std::{iter::Peekable, str::Chars};
+use std::{collections::VecDeque, iter::Peekable, str::Chars};

pub struct Lexer<'c> {
    input: Peekable<Chars<'c>>,
    // queue of tokens to emit, we use a VecDeque to efficiently pop tokens from the front
    // without having to shift the remaining elements
+   token_queue: VecDeque<(Token, Span)>,
    position: u32,
+   indentation: u32, 
}

impl<'c> Lexer<'c> {
    pub fn new(input: &'c str) -> Self {
        Self {
            input: input.chars().peekable(),
+           token_queue: VecDeque::default(),
            position: 0,
+           indentation: 0, 
        }
    }

-   fn next_token(&mut self) -> Option<(Token, Span)> {
+   fn emit_next_tokens(&mut self) {
        // handle indents at the beginning of a file
+       if self.position == 0 {
+           self.emit_indentation_tokens();
+       }
+
        let start_pos = self.position;

+       let Some(first_char) = self.next_char() else {
+           return;
+       };

-       match self.next_char()? {
+       match first_char {
            // [...]
+           '\n' => {
+               self.emit_indentation_tokens(); 
+               self.emit_next_tokens();
+           }
            // [...]
        }
    }

+   fn emit_indentation_tokens(&mut self) {
+       let mut space_count = 0;
+       let mut indentation = 0;

+       while let Some(c) = self.next_char_if(char::is_whitespace) {
+           match c {
                // a tab increases the indentation level by 1
+               '\t' => {
+                   if space_count % 4 != 0 {
+                       space_count = 4;
+                       self.emit_token(self.position, Token::InconsistentIndentation);
+                   }
+                   indentation += 1;
+               }
                // indentation is reset on every new line
+               '\n' => {
+                   space_count = 0;
+                   indentation = 0;
+               }
+               _ => {
+                   space_count += 1;
                    // four spaces increase the indentation level
+                   if space_count % 4 == 0 {
+                       indentation += 1;
+                   }
+               }
+           }
+       }

+       if space_count % 4 != 0 {
+           self.emit_token(self.position, Token::InconsistentIndentation)
+       }

+       if indentation == self.indentation {
+           return;
+       }

        // emit indent/dedent tokens if the new indentation level is
        // different from the previous one
+       for _ in 0..self.indentation.abs_diff(indentation) {
+           let token = if indentation > self.indentation {
+               Token::Indent
+           } else {
+               Token::Dedent
+           };
+           self.emit_token(self.position, token);
+       }

+       self.indentation = indentation;
+   }

-   fn emit_token(&self, start: u32, token: Token) -> Option<(Token, Span)> {
-       Some((
+   fn emit_token(&mut self, start: u32, token: Token) {
+       self.token_queue.push_back((
            token,
            Span {
                start,
                end: self.position,
            },
-       ))
+       ));
    }

    // [...]
}

impl Iterator for Lexer<'_> {
    type Item = (Token, Span);

    fn next(&mut self) -> Option<Self::Item> {
-       self.next_token();
+       self.emit_next_tokens();
+       self.token_queue.pop_front()
    }
}

Now, our lexer will emit Indent and Dedent tokens whenever the indentation level changes. The only thing that is left to do is to update our test:

//! compiler/src/parse/lexer.rs

[...]

    #[test]
    fn tokenize_return_const() {
        let input = r##"
def main():
    return 34
"##;

        let expected = vec![
            ("def", Token::Def),
            ("main", Token::Identifier("main".to_string())),
            ("(", Token::LPar),
            (")", Token::RPar),
            (":", Token::Colon),
+           ("", Token::Indent),
            ("return", Token::Return),
            ("34", Token::Int(34)),
+           ("", Token::Dedent),
        ];

        test_lexer(input, expected);
    }

    [...]
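
If you'd like an extra check of the multiple-dedent behavior described above, a second test with nested blocks can be added alongside the first one. Keep in mind that if and True are not keywords yet, so they come out as plain identifiers; this is a possible sketch, not part of the original test suite:

//! compiler/src/parse/lexer.rs

[...]

    #[test]
    fn tokenize_nested_blocks() {
        let input = r##"
def main():
    if True:
        return 1
"##;

        let expected = vec![
            ("def", Token::Def),
            ("main", Token::Identifier("main".to_string())),
            ("(", Token::LPar),
            (")", Token::RPar),
            (":", Token::Colon),
            ("", Token::Indent),
            ("if", Token::Identifier("if".to_string())),
            ("True", Token::Identifier("True".to_string())),
            (":", Token::Colon),
            ("", Token::Indent),
            ("return", Token::Return),
            ("1", Token::Int(1)),
            ("", Token::Dedent),
            ("", Token::Dedent),
        ];

        test_lexer(input, expected);
    }

    [...]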

Building the parser

Introduction to formal grammars

Throughout this series, we'll try to limit the amount of purely theoretical content. But it's challenging to study compilers without at least a cursory understanding of formal grammars. We'll actually make use of the concepts introduced in this section when writing our parser, as the shape of our parsing functions will closely mimic the structure of our grammar rules.

The kind of grammar that we'll be using is called a context-free grammar (CFG). Context-free grammars are a subset of formal grammars that are widely used in computer science, particularly in the field of programming languages. To learn more about formal grammars, you can check out the Wikipedia entry on the Chomsky hierarchy, which classifies grammars into multiple nested categories.

But first, what is a language's grammar? In the context of programming languages, a grammar is a set of rules that define the syntax of a language. It specifies how tokens can be combined to form valid statements and expressions. In a way, grammars solve the same problem as regular expressions: given a piece of text, determine whether it is "valid," according to a set of rules.

Why not use regular expressions to define the syntax of our language, then? The answer is simple: regular expressions are not powerful enough to express the syntax of most programming languages. Regular expressions can handle simple patterns like matching keywords or numeric literals, but can't describe nested structures and recursive patterns that are fundamental to programming languages. As a motivating example, consider the challenge of matching balanced parentheses—a structure that appears everywhere in programming languages, from function calls to mathematical expressions. You can try to write a regular expression that matches balanced parentheses: inputs like (), (()) and (()()) should all be matched, while inputs like ((), ()) should not.

This is a classic example of something that cannot be expressed with regular expressions, but can be expressed with a context-free grammar. Without further ado, let's see how to write a grammar for our balanced parentheses example.

BalancedParenthesis = { Nested }-;
Nested = "(", { Nested }, ")";

Rules are defined using the = symbol and are terminated by a semicolon. Within a rule, the comma (,) denotes a sequence of elements. Rule names can be surrounded by brackets to express repetition:

  • [] means "0 or 1"
  • {} means "0 or more"
  • {}- means "1 or more"

Translated into English, this grammar states that a BalancedParenthesis is a sequence of one or more Nested, where a Nested is defined as a (, followed by zero or more Nested and then a ).

How can we verify that our grammar is correct? We can use rule substitution to replace the rule names with their definitions, until we reach the expected string. For example, let's try to verify that the string (()()) is indeed a valid BalancedParenthesis:

(()()) = BalancedParenthesis
       = { Nested }- // 1 
       = "(", { Nested }, ")" // 2
       = "(", "(", { Nested }, ")", "(", { Nested }, ")", ")"  // 3
       = "(", "(", ")", "(", ")", ")" // 4

We start by replacing BalancedParenthesis with its definition (step 1), then we substitute Nested with its definition (step 2), and replace the inner { Nested } with two successive Nested (step 3). Finally, we replace the innermost Nested (step 4) and reach the expected string.

Reading and writing grammars can prove to be challenging at first, but with a bit of practice, you'll find that they are a powerful tool to explore and describe the syntax of programming languages. Luckily for us, at this point, the grammar for our language is quite simple, as it does not include repetitions or recursion like the grammar for balanced parentheses.

Program = FunDecl;

FunDecl = "def" Identifier "(" ")" ":" Block;

Block = Indent BlockStatement Dedent;

BlockStatement = ReturnStatement;

ReturnStatement = "return" Expression;

Expression = Int;

You'll notice that some names don't have a matching definition, such as Identifier, Int, Indent and Dedent. This is because they are terminal symbols, which means that they directly represent tokens in the token stream produced by the lexer.

As an exercise, you can try to use successive substitutions starting from the Program rule to verify that the following Pylite program is syntactically valid:

def main():
    return 42

Generating unique identifiers

The data structures built by a compiler naturally tend to contain cycles. For example, a recursive function definition includes the AST nodes representing the function's body, which in turn contain a reference to the function itself (after name-resolution is complete). This is a problem for our compiler, as Rust's borrow checker makes it difficult to create cyclic data structures. To solve this problem, we'll add a level of indirection: each node in the AST - and in general, each entity within the compiler - will be identified by a unique identifier. In the previous example, during name resolution, we'll simply register the association between the function's unique identifier and the unique identifier of the recursive call within the function's body, without introducing any cycle.

Our unique identifiers will be backed by monotonically increasing unsigned integers. To keep track of the last assigned identifier and generate new ones, we'll create a UniqueIdGenerator struct. It will hold a reference-counted pointer (an Rc) to an AtomicU32 representing the next identifier to be assigned, so that cloned generators keep drawing from the same counter. Generating a new identifier is as simple as incrementing the atomic counter and returning its previous value.

//! compiler/src/id.rs

use std::{rc::Rc, sync::atomic::AtomicU32};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct UniqueId(u32);

impl From<u32> for UniqueId {
    fn from(value: u32) -> Self {
        UniqueId(value)
    }
}

#[derive(Debug, Clone, Default)]
pub struct UniqueIdGenerator {
    next_id: Rc<AtomicU32>,
}

impl UniqueIdGenerator {
    pub fn generate(&self) -> UniqueId {
        UniqueId(
            self.next_id
                .fetch_add(1, std::sync::atomic::Ordering::Relaxed),
        )
    }
}
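
Because the counter lives behind an Rc, clones of the generator hand out identifiers from the same sequence. A small optional test, not in the original code, makes this explicit:

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn clones_share_the_counter() {
        let id_gen = UniqueIdGenerator::default();
        let clone = id_gen.clone();

        // both handles increment the same shared AtomicU32
        assert_eq!(id_gen.generate(), UniqueId::from(0));
        assert_eq!(clone.generate(), UniqueId::from(1));
        assert_eq!(id_gen.generate(), UniqueId::from(2));
    }
}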

Parsing the token stream

The next step is to transform the token stream into an Abstract Syntax Tree (AST). We'll start by defining a set of structs and enums to represent the different nodes in our AST.

For some additional type safety, we leverage the nonempty crate. It provides a NonEmpty type that mimics the behavior of Vec, but guarantees that the vector always contains at least one element. We'll use this type to represent function bodies, which always contain at least one statement. Let's install it by running cargo add nonempty in the compiler folder:

//! compiler/src/ast.rs

use std::{path::PathBuf, rc::Rc};

use nonempty::NonEmpty;

use crate::id::UniqueId;

// [...]

// Represents a Pylite module, which is an entire source file.
#[derive(Debug, Clone)]
pub struct Module {
    pub path: Rc<PathBuf>,
}

impl Module {
    pub fn new(path: PathBuf) -> Self {
        Self {
            path: Rc::new(path),
        }
    }
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Statement {
    pub id: UniqueId,
    pub span: Span,
    pub kind: StatementKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum StatementKind {
    Decl(Decl),
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Decl {
    pub id: UniqueId,
    pub span: Span,
    pub kind: DeclKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum DeclKind {
    Function(FunDecl),
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct FunDecl {
    pub name: Identifier,
    pub body: NonEmpty<BlockStatement>,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Identifier {
    pub id: UniqueId,
    pub span: Span,
    pub name: String,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct BlockStatement {
    pub id: UniqueId,
    pub span: Span,
    pub kind: BlockStatementKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum BlockStatementKind {
    Return { value: Expr },
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Expr {
    pub span: Span,
    pub id: UniqueId,
    pub kind: ExprKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum ExprKind {
    IntLit { value: i64 },
}

Enum nodes are split into two parts: the Kind enum, which contains the different variants of the node, and a wrapper struct that contains the fields common to all variants. This is a common pattern, used in rustc among others.
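
To make the pattern concrete, here is how the nodes for return 1 could be assembled by hand (illustration only: the byte offsets are made up, and in practice the parser will build these nodes for us):

#[cfg(test)]
mod ast_example {
    use super::*;
    use crate::id::UniqueIdGenerator;

    #[test]
    fn build_return_one_by_hand() {
        let id_gen = UniqueIdGenerator::default();

        // the wrapper struct holds the shared fields (id, span),
        // the Kind enum holds the variant-specific data
        let value = Expr {
            id: id_gen.generate(),
            span: Span::new(23, 24), // made-up offsets
            kind: ExprKind::IntLit { value: 1 },
        };

        let ret = BlockStatement {
            id: id_gen.generate(),
            span: Span::new(16, 24),
            kind: BlockStatementKind::Return { value },
        };

        assert!(matches!(ret.kind, BlockStatementKind::Return { .. }));
    }
}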

With our AST nodes defined, we can start writing the parser. We have many parsing algorithms to choose from to transform the token stream into an AST, and the one we'll use is called recursive descent parsing. It has the dual advantage of being relatively simple to implement and easy to extend to support extra features, such as error recovery and reporting.

In recursive descent parsing, we define a function for each non-terminal rule in the grammar. Every time a rule references a terminal we consume the corresponding token from the token stream, and every time a rule references a non-terminal we call the corresponding function. This process continues until we reach the end of the input or encounter an error.

Before we start implementing the parser, let's define an error type that will represent all the error scenarios that can occur during parsing. For now, we'll handle three types of errors:

  • UnexpectedToken is used when we encounter a token that does not match what was expected
  • UnexpectedEof is used when we reach the end of the input while still expecting more tokens
  • Io is used when we encounter an I/O error while reading the source code

To create error types more easily, we'll use the thiserror crate. It provides a convenient macro to derive the std::error::Error trait for our error types. It can be installed by running cargo add thiserror in the compiler folder.

//! compiler/src/parse/error.rs

use crate::ast::Span;

use super::token::Token;

#[derive(Debug, thiserror::Error)]
pub enum Error {
    #[error("Unexpected token {0:?}")]
    UnexpectedToken(Token, Span),
    #[error("io error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Unexpected end of file")]
    UnexpectedEof,
}

We can now define our Parser struct, with a few utility methods to help us implement our parsing rules.

//! compiler/src/parse/parser.rs

use std::iter::Peekable;

use nonempty::NonEmpty;

use crate::{ast::*, id::UniqueIdGenerator};

use super::{error::Error, lexer::Lexer, token::Token};

#[derive(Clone)]
pub struct Parser<'c> {
    lexer: Peekable<Lexer<'c>>,
    id_gen: UniqueIdGenerator,
}

impl<'c> Parser<'c> {
    pub fn new(id_gen: UniqueIdGenerator, code: &'c str) -> Self {
        Self {
            id_gen,
            lexer: Lexer::new(code).peekable(),
        }
    }

    fn expect_eq(&mut self, expected: Token) -> Result<Span, Error> {
        let (token, span) = self.next_token()?;
        if token == expected {
            Ok(span)
        } else {
            Err(Error::UnexpectedToken(token, span))
        }
    }

    fn next_token(&mut self) -> Result<(Token, Span), Error> {
        self.lexer.next().ok_or(Error::UnexpectedEof)
    }
}

  • expect_eq compares the next token to the expected one and returns its span if they match, or an UnexpectedToken error if they don't
  • next_token returns the next token and its span, or an UnexpectedEof error if there are no more tokens

Expressions and identifiers

Remember that our current grammar rule for expressions is:

Expression = Int;

The corresponding parsing method is a direct translation of the BNF rule: we consume the next token and check if it is an integer literal. If it is, we create an Expr node where the kind field represents the integer literal. Otherwise, we return an UnexpectedToken error.

//! compiler/src/parse/parser.rs

// [...]
impl <'c> Parser<'c> {
    // [...]

    fn parse_expr(&mut self) -> Result<Expr, Error> {
        match self.next_token()? {
            (Token::Int(n), span) => Ok(Expr {
                span,
                id: self.id_gen.generate(),
                kind: ExprKind::IntLit { value: n },
            }),
            (token, span) => Err(Error::UnexpectedToken(token, span)),
        }
    }
}

Identifiers follow the same pattern, as Identifier AST nodes wrap a single Identifier token:

//! compiler/src/parse/parser.rs

impl <'c> Parser<'c> {
    // [...]

    fn parse_identifier(&mut self) -> Result<Identifier, Error> {
        match self.next_token()? {
            (Token::Identifier(name), span) => Ok(Identifier {
                id: self.id_gen.generate(),
                span,
                name,
            }),
            (token, span) => Err(Error::UnexpectedToken(token, span)),
        }
    }
}

Blocks and block statements

Block statements are statements that appear within an indented block. For now the only block statement we support is the return statement. The grammar rule for blocks and block statements is as follows:

Block = Indent BlockStatement Dedent;
BlockStatement = ReturnStatement;
ReturnStatement = "return" Expression;

Translating these rules into parsing methods is quite straightforward. Terminals are matched with the expect_eq utility, and non-terminals are parsed by delegating to the corresponding parsing method.

//! compiler/src/parse/parser.rs

impl <'c> Parser<'c> {
    // [...]
    fn parse_block(&mut self) -> Result<NonEmpty<BlockStatement>, Error> {
        self.expect_eq(Token::Indent)?;
        let stmts = NonEmpty::new(self.parse_block_stmt()?);
        self.expect_eq(Token::Dedent)?;
        Ok(stmts)
    }

    fn parse_block_stmt(&mut self) -> Result<BlockStatement, Error> {
        self.parse_return()
    }

    fn parse_return(&mut self) -> Result<BlockStatement, Error> {
        let ret_span = self.expect_eq(Token::Return)?;
        let expr = self.parse_expr()?;
        Ok(BlockStatement {
            id: self.id_gen.generate(),
            // merging the return token span with the expression span
            // produces a span that covers the entire return statement
            span: ret_span.merge(expr.span),
            kind: BlockStatementKind::Return { value: expr },
        })
    }
}

Function declarations

Function declarations are a bit more complex than the rules we've seen so far, but the translation from grammar to code follows the exact same pattern. Here is the grammar rule for function declarations:

FunDecl = "def" Identifier "(" ")" ":" Block;

And the corresponding parsing method:

//! compiler/src/parse/parser.rs

impl <'c> Parser<'c> {
    // [...]

    fn parse_fun_decl(&mut self) -> Result<Decl, Error> {
        let span_start = self.expect_eq(Token::Def)?;
        let name = self.parse_identifier()?;
        self.expect_eq(Token::LPar)?;
        self.expect_eq(Token::RPar)?;
        self.expect_eq(Token::Colon)?;
        let body = self.parse_block()?;
        Ok(Decl {
            id: self.id_gen.generate(),
            span: span_start.merge(body.last().span),
            kind: DeclKind::Function(FunDecl { name, body }),
        })
    }
}

With this addition, our parser is nearly complete. We'll define two last parsing methods:

  • parse_decl to parse a Decl node
  • parse_module to parse the top level declarations in a module

Both methods are straightforward, as they only need to delegate to parse_fun_decl.

//! compiler/src/parse/parser.rs

impl <'c> Parser<'c> {
    // [...]

    pub fn parse_module(&mut self) -> Result<Vec<Statement>, Error> {
        let decl = self.parse_decl()?;
        Ok(vec![Statement {
            id: self.id_gen.generate(),
            span: decl.span,
            kind: StatementKind::Decl(decl),
        }])
    }

    fn parse_decl(&mut self) -> Result<Decl, Error> {
        self.parse_fun_decl()
    }
}
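
Before wiring the parser into our main function, we can give it a quick sanity check. The following test is a possible sketch, not part of the original code, that could live at the bottom of parser.rs:

//! compiler/src/parse/parser.rs

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn parse_return_const() {
        let id_gen = UniqueIdGenerator::default();
        let mut parser = Parser::new(id_gen, "def main():\n    return 1\n");
        let stmts = parser.parse_module().expect("failed to parse module");

        assert_eq!(stmts.len(), 1);
        // both enums have a single variant, so the patterns are irrefutable
        let StatementKind::Decl(decl) = &stmts[0].kind;
        let DeclKind::Function(fun) = &decl.kind;

        assert_eq!(fun.name.name, "main");
        assert_eq!(fun.body.len(), 1);
    }
}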

Let's edit our main function to test our parser:

//! compiler/src/main.rs

mod ast;
mod id;
mod parse;

fn main() {
    let input_file = std::env::args().nth(1).expect("missing input file");
    let input_code = std::fs::read_to_string(input_file).expect("failed to read source code");
    let id_gen = id::UniqueIdGenerator::default();
    let mut parser = parse::parser::Parser::new(id_gen, &input_code);
    let stmts = parser.parse_module().expect("failed to parse module");
    println!("{stmts:#?}");
}
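
For the parse module to resolve, its submodules also need to be declared somewhere; assuming a compiler/src/parse.rs file (a parse/mod.rs works just as well), the following is enough:

//! compiler/src/parse.rs

pub mod error;
pub mod lexer;
pub mod parser;
pub mod token;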

Assuming we have saved our small Pylite program in the res/samples/return_const/main.py file, we can test our parser by running the following command:

$ cargo run -- res/samples/return_const/main.py

It should print the following output, confirming that our parser works as expected:

[
    Statement {
        id: UniqueId(
            4,
        ),
        span: Span {
            start: 0,
            end: 24,
        },
        kind: Decl(
            Decl {
                id: UniqueId(
                    3,
                ),
                span: Span {
                    start: 0,
                    end: 24,
                },
                kind: Function(
                    FunDecl {
                        name: Identifier {
                            id: UniqueId(
                                0,
                            ),
                            span: Span {
                                start: 4,
                                end: 8,
                            },
                            name: "main",
                        },
                        body: NonEmpty {
                            head: BlockStatement {
                                id: UniqueId(
                                    2,
                                ),
                                span: Span {
                                    start: 16,
                                    end: 24,
                                },
                                kind: Return {
                                    value: Expr {
                                        span: Span {
                                            start: 23,
                                            end: 24,
                                        },
                                        id: UniqueId(
                                            1,
                                        ),
                                        kind: IntLit {
                                            value: 1,
                                        },
                                    },
                                },
                            },
                            tail: [],
                        },
                    },
                ),
            },
        ),
    },
]

This concludes the first half of our compiler. In the next part, we'll build an intermediate representation for our program, and start generating assembly code!
