Build a Compiler from Scratch, Part 1.1: A Hello World of sorts


It has become common practice to start with a "Hello World" program when learning a new programming language. This is a simple program that outputs the text "Hello, World!" to the screen. While writing such a program is a trivial task with most programming languages, getting our own language to interface with the OS and write data to the standard output will take a fair amount of work.
So we'll start with something simpler. How much simpler? Well, here is the first Pylite
program that our compiler will translate to machine code:
def main():
    return 1
It just sets a return code and exits. This is the simplest program that we can write while having an easily observable effect. It will be our "Hello World" program for now.
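Once our compiler can turn this program into an executable, its effect will be easy to observe from the shell by printing the process's exit status (the executable name below is just a placeholder):

```
$ ./hello   # placeholder name for the program our compiler will eventually produce
$ echo $?
1
```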
Throughout this chapter, we will build the basic building blocks of our compiler, including the parser, and translation passes to the intermediate representation and machine code. Each of these blocks will only support the minimum features needed to compile our "Hello World" program, and will be expanded in later chapters to support more complex constructs.
Setting up the project
We will start by creating a new cargo workspace to host our compiler.
$ mkdir pylite
$ cd pylite
For now, this workspace will contain a single compiler
crate:
# Cargo.toml
[workspace]
resolver = "2"
members = ["compiler"]
We can now create the compiler
crate:
$ cargo new compiler --bin
$ cargo run 2>/dev/null
Hello, world!
Our top-level pylite
folder should have the following structure:
.
├── Cargo.toml
└── compiler
├── Cargo.toml
└── src
└── main.rs
Parsing Pylite
The textual representation of a program does not lend itself well to the kind of analysis and transformation that we need to perform. In order to resolve types, associate identifiers with their definitions, and generate our low-level intermediate representation, we need to convert the unstructured sequence of characters that make up the source code into a tree-like structure that represents the program's syntax. This process is called parsing, and the data structure it produces is called an Abstract Syntax Tree (AST). The parsing process is typically split into two parts: lexical analysis (or tokenization) and syntactic analysis (or parsing).
Source code
===========
def main():
    return 1
Token sequence
==============
┌───┐ ┌────┐ ┌─┐ ┌─┐ ┌─┐ ┌──────┐ ┌──────┐ ┌─┐ ┌──────┐
│def│ │main│ │(│ │)│ │:│ │INDENT│ │return│ │1│ │DEDENT│
└───┘ └────┘ └─┘ └─┘ └─┘ └──────┘ └──────┘ └─┘ └──────┘
Abstract Syntax Tree
====================
┌─────────────┐
│ FunctionDef │
│   (main)    │
└─────────────┘
      │
      └── ┌─────────────┐
          │ Return Stmt │
          └─────────────┘
                 │
                 └── ┌─────────┐
                     │ Literal │
                     │    1    │
                     └─────────┘
During lexical analysis, we group individual characters into tokens, which are the smallest
meaningful units of the language. For example, the characters d
, e
and f
will be grouped
into a single token representing the def
keyword. Irrelevant characters, such as non-significant
whitespace, are discarded during this process.
Since Pylite
- like Python - is indentation-sensitive, we need to keep track of the current indentation level.
This will be done by inserting INDENT
and DEDENT
tokens into the token sequence whenever the indentation level changes.
The syntax analysis phase will then take this sequence of tokens and match it against our language's syntax rules (or grammar) to build a tree-like structure called an Abstract Syntax Tree (AST).
Writing the lexer
The first step in writing our lexer is to choose a representation for our tokens. Rust's enums are a great
fit for this purpose. We'll also pair each token with a Span
struct that will hold the token's
start and end byte position in the source code. Keeping track of the token's position will prove useful to display
meaningful error messages to the user.
//! compiler/src/ast.rs
#[derive(Debug, Copy, Clone, Eq, PartialEq, Hash)]
pub struct Span {
pub start: u32,
pub end: u32,
}
impl Span {
// Create a new span with the given start and end position.
pub fn new(start: u32, end: u32) -> Self {
Self { start, end }
}
// Create a new span that covers both the current span and the given span.
pub fn merge(self, other: Span) -> Self {
Self {
start: self.start.min(other.start),
end: self.end.max(other.end),
}
}
}
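As a quick illustration (a hypothetical snippet, not part of the compiler's code), merging two spans produces a span that covers both, regardless of the order in which they are merged:

```rust
let name = Span::new(4, 8);
let punct = Span::new(10, 11);

// the merged span starts at the smaller start and ends at the larger end
assert_eq!(name.merge(punct), Span::new(4, 11));
assert_eq!(punct.merge(name), Span::new(4, 11));
```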
//! compiler/src/parse/token.rs
#[derive(Debug, Clone, Eq, PartialEq, Hash)]
pub enum Token {
Def,
Return,
LPar,
RPar,
Colon,
Identifier(String),
Int(i64),
Unknown(char),
}
The
Unknown
variant will be used to represent any unexpected character that we encounter during the tokenization process.
With these definitions in place, we can create a scaffolding for our lexer. We'll model it as an iterator
yielding tuples of (Token, Span)
.
//! compiler/src/parse/lexer.rs
use std::{iter::Peekable, str::Chars};
use crate::ast::Span;
use super::token::Token;
pub struct Lexer<'c> {
// iterator over the source code's characters
input: Peekable<Chars<'c>>,
// current byte position in the source code
position: u32,
}
impl<'c> Lexer<'c> {
pub fn new(input: &'c str) -> Self {
Self {
input: input.chars().peekable(),
position: 0,
}
}
fn next_token(&mut self) -> Option<(Token, Span)> {
unimplemented!()
}
}
impl Iterator for Lexer<'_> {
type Item = (Token, Span);
fn next(&mut self) -> Option<Self::Item> {
self.next_token()
}
}
We'll also add the following helper methods to simplify the implementation of `next_token`:
- `emit_token` takes a `start` position and a `Token` and returns a tuple of the form `(Token, Span)`, where the `Span` starts at `start` and ends at the current position.
- `next_char` consumes the next character in the input (if any), and updates the current position.
- `next_char_if` conditionally consumes the next character and updates the current position.
Since we track the byte position of the tokens and UTF-8 characters can span multiple bytes, we can't simply increment the position by one for each character. Instead, we'll use the `char::len_utf8` method to get the byte size of the current character and update the position accordingly.
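For instance, here is a standalone illustration (not part of the lexer) of the byte sizes reported by `char::len_utf8`:

```rust
fn main() {
    assert_eq!('a'.len_utf8(), 1); // ASCII characters fit in a single byte
    assert_eq!('é'.len_utf8(), 2); // accented Latin characters take two bytes
    assert_eq!('🦀'.len_utf8(), 4); // emojis take four bytes
}
```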
//! compiler/src/parse/lexer.rs
impl<'c> Lexer<'c> {
// [...]
fn emit_token(&self, start: u32, token: Token) -> Option<(Token, Span)> {
Some((
token,
Span {
start,
end: self.position,
},
))
}
fn next_char(&mut self) -> Option<char> {
self.next_char_if(|_| true)
}
fn next_char_if(&mut self, f: impl FnOnce(char) -> bool) -> Option<char> {
self.input.next_if(|&c| f(c)).inspect(|c| {
self.position += c.len_utf8() as u32;
})
}
}
Some tokens contain a single character, such as `(` or `:`. We can handle these tokens by comparing the current character to the expected one and returning the corresponding token if they match.
//! compiler/src/parse/lexer.rs
impl<'c> Lexer<'c> {
fn next_token(&mut self) -> Option<(Token, Span)> {
let start_pos = self.position;
match self.next_char()? {
'(' => self.emit_token(start_pos, Token::LPar),
')' => self.emit_token(start_pos, Token::RPar),
':' => self.emit_token(start_pos, Token::Colon),
c => unimplemented!(),
}
}
}
The remaining cases are more involved, but follow a similar pattern:
- If the current character is a digit, we consume the following characters until we reach a non-digit character. We then parse the resulting string as an integer and create an integer token.
- If the current character is a letter, we consume the following characters until we reach a non-alphanumeric character different from `_`. We then check if the resulting string matches a keyword and create a keyword token if it does, or an identifier token otherwise.
- If the current character is whitespace, we skip every following whitespace character and return the next token.

Finally, if the current character does not match any of the above cases, we create an `Unknown` token.
//! compiler/src/parse/lexer.rs
impl<'c> Lexer<'c> {
// [...]
fn next_token(&mut self) -> Option<(Token, Span)> {
// [...]
match self.next_char()? {
// [...]
c if c.is_numeric() => {
let mut int_repr = String::from(c);
while let Some(c) = self.next_char_if(|c| c.is_numeric()) {
int_repr.push(c);
}
let value = int_repr.parse::<i64>().expect("int token");
self.emit_token(start_pos, Token::Int(value))
}
c if c.is_alphabetic() => {
let mut identifier = String::from(c);
while let Some(c) = self.next_char_if(|c| c.is_alphanumeric() || c == '_') {
identifier.push(c);
}
match identifier.as_str() {
"def" => self.emit_token(start_pos, Token::Def),
"return" => self.emit_token(start_pos, Token::Return),
_ => self.emit_token(start_pos, Token::Identifier(identifier)),
}
}
c if c.is_whitespace() => {
while self.next_char_if(|c| c.is_whitespace()).is_some() {}
self.next_token()
}
c => self.emit_token(start_pos, Token::Unknown(c)),
}
}
}
With our lexer almost complete, it's time to write our first test!
As building and verifying the token spans by hand would be tedious, we'll write a helper `test_lexer` function that takes as input the source code and a vec of `(&str, Token)` tuples representing the expected tokens, with the `&str` being the token's textual representation. If tokenizing the input and rendering the `Span`s does not yield the same vec, the test will fail.
//! compiler/src/parse/lexer.rs
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn tokenize_return_const() {
let input = r##"
def main():
    return 34
"##;
let expected = vec![
("def", Token::Def),
("main", Token::Identifier("main".to_string())),
("(", Token::LPar),
(")", Token::RPar),
(":", Token::Colon),
("return", Token::Return),
("34", Token::Int(34)),
];
test_lexer(input, expected);
}
fn test_lexer(input: &str, expected: Vec<(&str, Token)>) {
let rendered: Vec<(&str, Token)> = Lexer::new(input)
.map(|(token, span)| {
let rendered = &input[span.start as usize..span.end as usize];
(rendered, token)
})
.collect();
assert_eq!(expected, rendered);
}
}
Let's make sure that our test passes before moving on to the next section:
$ cargo test
running 1 test
test parse::lexer::tests::tokenize_return_const ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Handling indentation
One defining aspect of Pylite
's syntax is that it is indentation-sensitive. This means that blocks
are delimited by their indentation level, rather than by explicit braces or keywords.
To simplify the parsing process, we'll insert Indent
and Dedent
tokens into the token stream
whenever we detect a change in the current indentation level.
How do we detect such changes? First, we need to define which character - or sequence of characters - represents
a single level of indentation. In Pylite
a single tab, or four consecutive spaces, will represent a single level
of indentation. Every time we reach a new line, we'll count the number of tabs and spaces at the beginning of the line
and compare the new indentation level to the previous one. If the indentation levels match, there is no extra token to
insert. If they don't match we'll compute delta = abs(new_indentation - previous_indentation)
and insert delta
Indent
or Dedent
tokens accordingly.
In case of inconsistent indentation, for example, if a new line starts with three spaces, we'll insert an InconsistentIndentation
token.
Let's start by adding the new variants to our Token
enum:
//! compiler/src/parse/token.rs
#[derive(Debug, Clone, Eq, PartialEq, Hash)]
pub enum Token {
Def,
Return,
LPar,
RPar,
Colon,
Identifier(String),
Int(i64),
Unknown(char),
+ Indent,
+ Dedent,
+ InconsistentIndentation,
}
// [...]
In some cases, the difference in indentation level will be higher than one. For example, we'll have to insert
two Dedent
tokens at the end of the following function definition. One to close the if block, and the
other to close the function definition.
def main():
    if True:
        return 1
This doesn't fit well with the current design of our lexer, as we return tokens eagerly as soon as they are recognized. We'll need to refactor our lexer to support emitting multiple tokens at once. For this reason, we'll split the work currently done in our iterator's next method into two separate steps: first push the recognized tokens into a queue, and then return the first token in the queue, if any.
//! compiler/src/parse/lexer.rs
-use std::{iter::Peekable, str::Chars};
+use std::{collections::VecDeque, iter::Peekable, str::Chars};
pub struct Lexer<'c> {
input: Peekable<Chars<'c>>,
// queue of tokens to emit, we use a VecDeque to efficiently pop tokens from the front
// without having to shift the remaining elements
+ token_queue: VecDeque<(Token, Span)>,
position: u32,
+ indentation: u32,
}
impl<'c> Lexer<'c> {
pub fn new(input: &'c str) -> Self {
Self {
input: input.chars().peekable(),
+ token_queue: VecDeque::default(),
position: 0,
+ indentation: 0,
}
}
- fn next_token(&mut self) -> Option<(Token, Span)> {
+ fn emit_next_tokens(&mut self) {
// handle indents at the beginning of a file
+ if self.position == 0 {
+ self.emit_indentation_tokens();
+ }
+
let start_pos = self.position;
+ let Some(first_char) = self.next_char() else {
+ return;
+ };
- match self.next_char()? {
+ match first_char {
// [...]
+ '\n' => {
+ self.emit_indentation_tokens();
+ self.emit_next_tokens();
}
// [...]
}
}
+ fn emit_indentation_tokens(&mut self) {
+ let mut space_count = 0;
+ let mut indentation = 0;
+ while let Some(c) = self.next_char_if(char::is_whitespace) {
+ match c {
// a tab increases the indentation level by 1
+ '\t' => {
+ if space_count % 4 != 0 {
+ space_count = 4;
+ self.emit_token(self.position, Token::InconsistentIndentation);
+ }
+ indentation += 1;
+ }
// indentation is reset on every new line
+ '\n' => {
+ space_count = 0;
+ indentation = 0;
+ }
+ _ => {
+ space_count += 1;
// four spaces increase the indentation level
+ if space_count % 4 == 0 {
+ indentation += 1;
+ }
+ }
+ }
+ }
+ if space_count % 4 != 0 {
+ self.emit_token(self.position, Token::InconsistentIndentation)
+ }
+ if indentation == self.indentation {
+ return;
+ }
// emit indent/dedent tokens if the new indentation level is
// different from the previous one
+ for _ in 0..self.indentation.abs_diff(indentation) {
+ let token = if indentation > self.indentation {
+ Token::Indent
+ } else {
+ Token::Dedent
+ };
+ self.emit_token(self.position, token);
+ }
+ self.indentation = indentation;
+ }
fn emit_token(&mut self, start: u32, token: Token) {
self.token_queue.push_back((
token,
Span {
start,
end: self.position,
},
));
}
// [...]
}
impl Iterator for Lexer<'_> {
type Item = (Token, Span);
fn next(&mut self) -> Option<Self::Item> {
- self.next_token();
+ self.emit_next_tokens();
+ self.token_queue.pop_front()
}
}
Now, our lexer will emit Indent
and Dedent
tokens whenever the indentation level changes.
The only thing that is left to do is to update our test:
//! compiler/src/parse/lexer.rs
[...]
#[test]
fn tokenize_return_const() {
let input = r##"
def main():
    return 34
"##;
let expected = vec![
("def", Token::Def),
("main", Token::Identifier("main".to_string())),
("(", Token::LPar),
(")", Token::RPar),
(":", Token::Colon),
+ ("", Token::Indent),
("return", Token::Return),
("34", Token::Int(34)),
+ ("", Token::Dedent),
];
test_lexer(input, expected);
}
[...]
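To double-check the dedent logic, here is a hypothetical extra test (not from the article's code) for a nested block similar to the example above. It assumes the recursive call in the whitespace arm was updated along with the rename to `emit_next_tokens`; also note that `if` currently lexes as a plain identifier, since we only recognize the `def` and `return` keywords so far:

```rust
//! compiler/src/parse/lexer.rs
#[test]
fn tokenize_nested_blocks() {
    let input = "\ndef main():\n\tif x:\n\t\treturn 1\n";
    let expected = vec![
        ("def", Token::Def),
        ("main", Token::Identifier("main".to_string())),
        ("(", Token::LPar),
        (")", Token::RPar),
        (":", Token::Colon),
        ("", Token::Indent),
        ("if", Token::Identifier("if".to_string())),
        ("x", Token::Identifier("x".to_string())),
        (":", Token::Colon),
        ("", Token::Indent),
        ("return", Token::Return),
        ("1", Token::Int(1)),
        ("", Token::Dedent),
        ("", Token::Dedent),
    ];
    test_lexer(input, expected);
}
```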
Building the parser
Introduction to formal grammars
Throughout this series, we'll try to limit the amount of purely theoretical content. But it's challenging to study compilers without at least a cursory understanding of formal grammars. We'll actually make use of the concepts introduced in this section when writing our parser, as the shape of our parsing functions will closely mimic the structure of our grammar rules.
The kind of grammar that we'll be using is called a context-free grammar (CFG). Context-free grammars are a subset of formal grammars that are widely used in computer science, particularly in the field of programming languages. To learn more about formal grammars, you can check out the Wikipedia entry on the [Chomsky hierarchy](https://en.wikipedia.org/wiki/Chomsky_hierarchy), which classifies grammars into multiple nested categories.
But first, what is a language's grammar? In the context of programming languages, a grammar is a set of rules that define the syntax of a language. It specifies how tokens can be combined to form valid statements and expressions. In a way, grammars solve the same problem as regular expressions: given a piece of text, determine whether it is valid according to a set of rules.
Why not use regular expressions to define the syntax of our language, then?
The answer is simple: regular expressions are not powerful enough to express the
syntax of most programming languages. Regular expressions can handle simple patterns
like matching keywords or numeric literals, but can't describe nested structures
and recursive patterns that are fundamental to programming languages.
As a motivating example, consider the challenge of matching balanced parentheses—a
structure that appears everywhere in programming languages, from function calls to
mathematical expressions. You can try to write a regular expression that matches
balanced parentheses: inputs like ()
, (())
and (()())
should all be matched,
while inputs like (()
, ())
should not.
This is a classic example of something that cannot be expressed with regular expressions, but can be expressed with a context-free grammar. Without further ado, let's see how to write a grammar for our balanced parentheses example.
BalancedParenthesis = { Nested }-;
Nested = "(", { Nested }, ")";
Rules are defined using the `=` symbol, and are terminated by a semicolon. Within a rule, `,` denotes a sequence of elements. Rule names can be surrounded by the following symbols to express repetition:
- `[]` means "0 or 1"
- `{}` means "0 or more"
- `{}-` means "1 or more"
Translated into English, this grammar states that a `BalancedParenthesis` is a sequence of one or more `Nested`, where a `Nested` is defined as a `(`, followed by zero or more `Nested`, and then a `)`.
How can we verify that our grammar is correct? We can use rule substitution to replace the rule names with their definitions until we reach the expected string. For example, let's try to verify that the string `(()())` is indeed a valid `BalancedParenthesis`:
(()()) = BalancedParenthesis
       = { Nested }-                                           // 1
       = "(", { Nested }, ")"                                  // 2
       = "(", "(", { Nested }, ")", "(", { Nested }, ")", ")"  // 3
       = "(", "(", ")", "(", ")", ")"                          // 4
We start by replacing BalancedParenthesis
with its definition (step 1), then
we substitute Nested
with its definition (step 2), and replace the inner
{ Nested }
with two successive Nested
(step 3). Finally, we replace
the innermost Nested
(step 4) and reach the expected string.
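To make the correspondence between grammar rules and code concrete, here is a small hand-written checker for the balanced-parentheses grammar (an illustration only, not part of our compiler). Each rule becomes a function, and the recursion in the grammar becomes recursion in the code; this is exactly the idea behind the recursive descent parser we'll write later in this chapter.

```rust
// Nested = "(", { Nested }, ")";
// Returns the position right after one `Nested` group starting at `pos`, if any.
fn parse_nested(input: &[u8], mut pos: usize) -> Option<usize> {
    if input.get(pos) != Some(&b'(') {
        return None;
    }
    pos += 1;
    // { Nested }: zero or more nested groups
    while let Some(next) = parse_nested(input, pos) {
        pos = next;
    }
    // the closing ")"
    if input.get(pos) == Some(&b')') {
        Some(pos + 1)
    } else {
        None
    }
}

// BalancedParenthesis = { Nested }-;
fn is_balanced(input: &str) -> bool {
    let bytes = input.as_bytes();
    let mut pos = 0;
    // one or more Nested, covering the entire input
    while let Some(next) = parse_nested(bytes, pos) {
        pos = next;
    }
    pos > 0 && pos == bytes.len()
}

fn main() {
    assert!(is_balanced("()"));
    assert!(is_balanced("(()())"));
    assert!(!is_balanced("(()"));
    assert!(!is_balanced("())"));
}
```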
Reading and writing grammars can prove to be challenging at first, but with a bit of practice, you'll find that they are a powerful tool to explore and describe the syntax of programming languages. Luckily for us, at this point, the grammar for our language is quite simple, as it does not include repetitions or recursion like the grammar for balanced parentheses.
Program = FunDecl;
FunDecl = "def" Identifier "(" ")" ":" Block;
Block = Indent BlockStatement Dedent;
BlockStatement = ReturnStatement;
ReturnStatement = "return" Expression;
Expression = Int;
You'll notice that some rules don't have a matching definition, such as Identifier
, Int
, Indent
and Dedent
. This is because they are terminal rules, which means that they directly represent
tokens in the token stream produced by the lexer.
As an exercise, you can try to use successive substitutions starting from the Program
rule to verify that the following Pylite
program is syntactically valid:
def main():
    return 42
Generating unique identifiers
The data structures built by a compiler naturally tend to contain cycles. For example, a recursive function definition includes the AST nodes representing the function's body, which in turn contain a reference to the function itself (after name-resolution is complete). This is a problem for our compiler, as Rust's borrow checker makes it difficult to create cyclic data structures. To solve this problem, we'll add a level of indirection: each node in the AST - and in general, each entity within the compiler - will be identified by a unique identifier. In the previous example, during name resolution, we'll simply register the association between the function's unique identifier and the unique identifier of the recursive call within the function's body, without introducing any cycle.
Our unique identifiers will be backed by monotonically increasing unsigned integers.
To keep track of the last assigned identifier and generate new ones, we'll create a
UniqueIdGenerator
struct. It will have a pointer to an AtomicU32
representing
the next identifier to be assigned. Generating a new identifier is as simple as
incrementing the atomic counter and returning its previous value.
//! compiler/src/id.rs
use std::{rc::Rc, sync::atomic::AtomicU32};
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct UniqueId(u32);
impl From<u32> for UniqueId {
fn from(value: u32) -> Self {
UniqueId(value)
}
}
#[derive(Debug, Clone, Default)]
pub struct UniqueIdGenerator {
next_id: Rc<AtomicU32>,
}
impl UniqueIdGenerator {
pub fn generate(&self) -> UniqueId {
UniqueId(
self.next_id
.fetch_add(1, std::sync::atomic::Ordering::Relaxed),
)
}
}
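Because the counter lives behind an `Rc`, cloning a generator shares the same counter, so every handle keeps producing fresh identifiers. A quick hypothetical usage example:

```rust
let id_gen = UniqueIdGenerator::default();
let id_gen_clone = id_gen.clone();

// both handles draw from the same atomic counter
assert_eq!(id_gen.generate(), UniqueId::from(0));
assert_eq!(id_gen_clone.generate(), UniqueId::from(1));
assert_eq!(id_gen.generate(), UniqueId::from(2));
```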
Parsing the token stream
The next step is to transform the token stream into an Abstract Syntax Tree (AST).
We'll start by defining a set of struct
and enums
to represent the different
nodes in our AST.
For some additional type safety, we leverage the [nonempty](https://docs.rs/nonempty/latest/nonempty/) crate. It provides a `NonEmpty` type that mimics the behavior of `Vec`, but guarantees that the vector always contains at least one element. We'll use this type to represent function bodies, which always contain at least one statement. Let's install it by running `cargo add nonempty` in the `compiler` folder:

```rust
//! compiler/src/ast.rs
use std::{path::PathBuf, rc::Rc};

use nonempty::NonEmpty;

use crate::id::UniqueId;

// [...]

// Represents a Pylite module, which is an entire source file.
#[derive(Debug, Clone)]
pub struct Module {
    pub path: Rc<PathBuf>,
}

impl Module {
    pub fn new(path: PathBuf) -> Self {
        Self {
            path: Rc::new(path),
        }
    }
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Statement {
    pub id: UniqueId,
    pub span: Span,
    pub kind: StatementKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum StatementKind {
    Decl(Decl),
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Decl {
    pub id: UniqueId,
    pub span: Span,
    pub kind: DeclKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum DeclKind {
    Function(FunDecl),
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct FunDecl {
    pub name: Identifier,
    pub body: NonEmpty<BlockStatement>,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Identifier {
    pub id: UniqueId,
    pub span: Span,
    pub name: String,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct BlockStatement {
    pub id: UniqueId,
    pub span: Span,
    pub kind: BlockStatementKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum BlockStatementKind {
    Return { value: Expr },
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub struct Expr {
    pub span: Span,
    pub id: UniqueId,
    pub kind: ExprKind,
}

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum ExprKind {
    IntLit { value: i64 },
}
```
Enum nodes are split into two parts: the `Kind` enum, which contains the different
variants of the node, and a wrapper struct that contains fields that are common to
all variants. This is a common pattern used in [rustc](https://doc.rust-lang.org/beta/nightly-rustc/rustc_ast/ast/struct.Expr.html),
among others.
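To make the pattern concrete, here is a small hypothetical illustration (assuming an `id_gen: UniqueIdGenerator` is in scope): the fields shared by all variants (`id`, `span`) live on the wrapper struct, while variant-specific data lives in the `Kind` enum.

```rust
// common fields are set on the wrapper...
let expr = Expr {
    id: id_gen.generate(),
    span: Span::new(0, 1),
    kind: ExprKind::IntLit { value: 1 },
};

// ...and are accessible without knowing which variant we have
let _expr_span = expr.span;

// variant-specific data is accessed by matching on the kind
if let ExprKind::IntLit { value } = expr.kind {
    assert_eq!(value, 1);
}
```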
With our AST nodes defined, we can start writing the parser.
We have many parsing algorithms to choose from to transform the token stream into an AST,
and the one we'll use is called recursive descent parsing. It has the dual advantage
of being relatively simple to implement and easy to extend to support extra features, such as error
recovery and reporting.
In recursive descent parsing, we define a function for each non-terminal rule in the grammar.
Every time a rule references a terminal we consume the corresponding token from the
token stream, and every time a rule references a non-terminal we call the corresponding
function. This process continues until we reach the end of the input or encounter an error.
Before we start implementing the parser, let's define an error type that will
represent all the error scenarios that can occur during parsing. For now,
we'll handle three types of errors:
- `UnexpectedToken` is used when we encounter a token that does not match
what was expected
- `UnexpectedEof` is used when we reach the end of the input while still
expecting more tokens
- `Io` is used when we encounter an I/O error while reading the source code
To create error types more easily, we'll use the [thiserror](https://docs.rs/thiserror/latest/thiserror/) crate.
It provides a convenient macro to derive the `std::error::Error` trait for our error types.
It can be installed by running `cargo add thiserror` in the `compiler` folder.
```rust
//! compiler/src/parse/error.rs
use crate::ast::Span;

use super::token::Token;

#[derive(Debug, thiserror::Error)]
pub enum Error {
    #[error("Unexpected token {0:?}")]
    UnexpectedToken(Token, Span),
    #[error("io error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Unexpected end of file")]
    UnexpectedEof,
}
```
We can now define our Parser
struct, with a few utility methods to help us
implement our parsing rules.
//! compiler/src/parse/parser.rs
use std::iter::Peekable;
use nonempty::NonEmpty;
use crate::{ast::*, id::UniqueIdGenerator};
use super::{error::Error, lexer::Lexer, token::Token};
pub struct Parser<'c> {
lexer: Peekable<Lexer<'c>>,
id_gen: UniqueIdGenerator,
}
impl<'c> Parser<'c> {
pub fn new(id_gen: UniqueIdGenerator, code: &'c str) -> Self {
Self {
id_gen,
lexer: Lexer::new(code).peekable(),
}
}
fn expect_eq(&mut self, expected: Token) -> Result<Span, Error> {
let (token, span) = self.next_token()?;
if token == expected {
Ok(span)
} else {
Err(Error::UnexpectedToken(token, span))
}
}
fn next_token(&mut self) -> Result<(Token, Span), Error> {
self.lexer.next().ok_or(Error::UnexpectedEof)
}
}
- `expect_eq` compares the next token to the expected one and returns its span if they match, or an `UnexpectedToken` error if they don't
- `next_token` returns the next token and its span, or an `UnexpectedEof` error if there are no more tokens
Expressions and identifiers
Remember that our current grammar rule for expressions is:
Expression = Int;
The corresponding parsing method is a direct translation of the BNF rule:
we consume the next token and check if it is an integer literal. If it is, we create
an Expr
node where the kind
field represents the integer literal. Otherwise, we
return an UnexpectedToken
error.
//! compiler/src/parse/parser.rs
// [...]
impl <'c> Parser<'c> {
// [...]
fn parse_expr(&mut self) -> Result<Expr, Error> {
match self.next_token()? {
(Token::Int(n), span) => Ok(Expr {
span,
id: self.id_gen.generate(),
kind: ExprKind::IntLit { value: n },
}),
(token, span) => Err(Error::UnexpectedToken(token, span)),
}
}
}
Identifiers follow the same pattern, as `Identifier` AST nodes wrap a single `Identifier` token:
//! compiler/src/parse/parser.rs
impl <'c> Parser<'c> {
// [...]
fn parse_identifier(&mut self) -> Result<Identifier, Error> {
match self.next_token()? {
(Token::Identifier(name), span) => Ok(Identifier {
id: self.id_gen.generate(),
span,
name,
}),
(token, span) => Err(Error::UnexpectedToken(token, span)),
}
}
}
Blocks and block statements
Block statements are statements that appear within an indented block.
For now the only block statement we support is the return
statement.
The grammar rule for blocks and block statements is as follows:
Block = Indent BlockStatement Dedent;
BlockStatement = ReturnStatement;
ReturnStatement = "return" Expression;
Translating these rules into parsing methods is quite straightforward.
Terminals are matched with the expect_eq
utility, and non-terminals
are parsed by delegating to the corresponding parsing method.
//! compiler/src/parse/parser.rs
impl <'c> Parser<'c> {
// [...]
fn parse_block(&mut self) -> Result<NonEmpty<BlockStatement>, Error> {
self.expect_eq(Token::Indent)?;
let stmts = NonEmpty::new(self.parse_block_stmt()?);
self.expect_eq(Token::Dedent)?;
Ok(stmts)
}
fn parse_block_stmt(&mut self) -> Result<BlockStatement, Error> {
self.parse_return()
}
fn parse_return(&mut self) -> Result<BlockStatement, Error> {
let ret_span = self.expect_eq(Token::Return)?;
let expr = self.parse_expr()?;
Ok(BlockStatement {
id: self.id_gen.generate(),
// merging the return token span with the expression span
// produces a span that covers the entire return statement
span: ret_span.merge(expr.span),
kind: BlockStatementKind::Return { value: expr },
})
}
}
Function declarations
Function declarations are a bit more complex than the rules we've seen so far, but the translation from grammar to code follows the exact same pattern. Here is the grammar rule for function declarations:
FunDecl = "def" Identifier "(" ")" ":" Block;
And the corresponding parsing method:
//! compiler/src/parse/parser.rs
impl <'c> Parser<'c> {
// [...]
fn parse_fun_decl(&mut self) -> Result<Decl, Error> {
let span_start = self.expect_eq(Token::Def)?;
let name = self.parse_identifier()?;
self.expect_eq(Token::LPar)?;
self.expect_eq(Token::RPar)?;
self.expect_eq(Token::Colon)?;
let body = self.parse_block()?;
Ok(Decl {
id: self.id_gen.generate(),
span: span_start.merge(body.last().span),
kind: DeclKind::Function(FunDecl { name, body }),
})
}
}
With this addition, our parser is nearly complete. We'll define two last parsing methods:
- `parse_decl` to parse a `Decl` node
- `parse_module` to parse the top-level declarations in a module

Both methods are straightforward, as they only need to delegate to `parse_fun_decl`.
//! compiler/src/parse/parser.rs
impl <'c> Parser<'c> {
// [...]
pub fn parse_module(&mut self) -> Result<Vec<Statement>, Error> {
let decl = self.parse_decl()?;
Ok(vec![Statement {
id: self.id_gen.generate(),
span: decl.span,
kind: StatementKind::Decl(decl),
}])
}
fn parse_decl(&mut self) -> Result<Decl, Error> {
self.parse_fun_decl()
}
}
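Before wiring everything into our main function, we could add a quick smoke test (a hypothetical addition, not part of the article's code) that checks the overall shape of the AST produced for our sample program:

```rust
//! compiler/src/parse/parser.rs
#[cfg(test)]
mod tests {
    use super::Parser;
    use crate::ast::{BlockStatementKind, DeclKind, ExprKind, StatementKind};
    use crate::id::UniqueIdGenerator;

    #[test]
    fn parse_return_const() {
        let code = "def main():\n    return 1\n";
        let mut parser = Parser::new(UniqueIdGenerator::default(), code);
        let stmts = parser.parse_module().expect("failed to parse module");

        // the module contains a single declaration: `def main` with a one-statement body
        assert_eq!(stmts.len(), 1);
        let StatementKind::Decl(decl) = &stmts[0].kind;
        let DeclKind::Function(fun) = &decl.kind;
        assert_eq!(fun.name.name, "main");
        assert!(fun.body.tail.is_empty());

        // and that statement is `return 1`
        let BlockStatementKind::Return { value } = &fun.body.head.kind;
        assert_eq!(value.kind, ExprKind::IntLit { value: 1 });
    }
}
```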
Let's edit our main function to test our parser:
//! compiler/src/main.rs
mod ast;
mod id;
mod parse;
fn main() {
let input_file = std::env::args().nth(1).expect("missing input file");
let input_code = std::fs::read_to_string(input_file).expect("failed to read source code");
let id_gen = id::UniqueIdGenerator::default();
let mut parser = parse::parser::Parser::new(id_gen, &input_code);
let stmts = parser.parse_module().expect("failed to parse module");
println!("{stmts:#?}");
}
Assuming we have saved our small Pylite
program in the res/samples/return_const/main.py
file, we can test our parser by running the following command:
$ cargo run -- res/samples/return_const/main.py
It should print the following output, confirming that our parser works as expected:
[
Statement {
id: UniqueId(
4,
),
span: Span {
start: 0,
end: 24,
},
kind: Decl(
Decl {
id: UniqueId(
3,
),
span: Span {
start: 0,
end: 24,
},
kind: Function(
FunDecl {
name: Identifier {
id: UniqueId(
0,
),
span: Span {
start: 4,
end: 8,
},
name: "main",
},
body: NonEmpty {
head: BlockStatement {
id: UniqueId(
2,
),
span: Span {
start: 16,
end: 24,
},
kind: Return {
value: Expr {
span: Span {
start: 23,
end: 24,
},
id: UniqueId(
1,
),
kind: IntLit {
value: 1,
},
},
},
},
tail: [],
},
},
),
},
),
},
]
This concludes the first half of our compiler. In the next part, we'll build an intermediate representation for our program, and start generating assembly code!