Building a Parser: A Beginner-Friendly Guide to Constructing an AST-Based Parser
In our previous article, we built a lexer that converts code into tokens, breaking down the input into meaningful pieces. Now, the next exciting step is to write a parser, which takes these tokens and organizes them into a meaningful structure called an Abstract Syntax Tree (AST). This guide will walk you through the process of building a parser in a simple, step-by-step manner, making it easy to understand.
The code to this series can be found here
What Exactly is a Parser?
A parser is like a translator. It takes the sequence of tokens produced by the lexer and turns them into an organized structure. This structure, the AST, represents the grammar and rules of our programming language. For example, an assignment statement like let x = 5;
would be broken down into parts (tokens) and then arranged into a tree that shows the relationships between these parts.
Step 1: Defining the AST Structure
Before we start coding the parser, we need to define our AST nodes. These nodes will represent different elements in our programming language. Let's start with a few simple ones:
Number: Represents a numeric value.
BinOp: Represents an operation like
+
or-
.Variable: Represents a variable name.
Assign: Represents an assignment statement like
let x = 5;
.Print: Represents a print statement like
print(x);
.
Here’s how we define these nodes in Python:
class ASTNode:
pass
class Number(ASTNode):
def __init__(self, value):
self.value = value
class BinOp(ASTNode):
def __init__(self, left, op, right):
self.left = left
self.op = op
self.right = right
class Variable(ASTNode):
def __init__(self, name):
self.name = name
class Assign(ASTNode):
def __init__(self, name, value):
self.name = name
self.value = value
class Print(ASTNode):
def __init__(self, expression):
self.expression = expression
These classes will help our parser build the tree, making it easy to visualize and work with the structure of the code.
Step 2: Setting Up the Parser
Now that we have the basic AST structure, let’s set up the Parser
class. This class will take the list of tokens from our lexer and keep track of where it is in that list using a position (pos
).
class Parser:
def __init__(self, tokens):
self.tokens = tokens
self.pos = 0
The tokens
parameter will hold the list of tokens we got from the lexer, and pos
will keep track of which token we’re currently looking at.
Step 3: Moving Through Tokens with the consume
Method
To make our parser work, we need a way to move through the tokens and check if they match what we expect. That’s where the consume
method comes in. It will check if the current token matches what we need, move to the next one if it does, and return its value. If it doesn’t match, it will raise an error.
def consume(self, expected_type):
if self.pos < len(self.tokens) and self.tokens[self.pos][0] == expected_type:
self.pos += 1
return self.tokens[self.pos - 1][1]
else:
raise SyntaxError(f"Expected {expected_type}, got {self.tokens[self.pos][0]}")
This method helps us move through the tokens while making sure everything follows the rules of our language.
Step 4: Parsing the Entire Code with the parse
Method
The main job of the parser is to go through all the tokens and build an AST. We start by defining the parse
method, which creates a list of statements (each represented as an AST node).
def parse(self):
statements = []
while self.pos < len(self.tokens):
if self.tokens[self.pos][0] == 'NEWLINE':
self.pos += 1 # Skip newlines
continue
statements.append(self.statement())
return statements
The parse
method goes through each token until there are none left. It checks if the current token is a newline (which we skip), and otherwise, it tries to parse a statement using the statement
method.
Step 5: Parsing Statements
In our simple language, there are two types of statements: assignments (like let x = 5;
) and print statements (like print(x);
). The statement
method decides which type of statement to parse based on the current token.
def statement(self):
token_type, token_value = self.tokens[self.pos]
if token_type == 'ID' and token_value == 'let':
return self.assignment()
elif token_type == 'ID' and token_value == 'print':
return self.print_statement()
else:
raise SyntaxError("Unknown statement")
If the token is
let
, we know it’s an assignment, so we call theassignment
method.If the token is
print
, we call theprint_statement
method.
Step 6: Parsing Assignments
The assignment
method handles statements like let x = 5;
. It expects the keyword let
, a variable name, an =
sign, an expression (like a number or variable), and a semicolon (;
).
def assignment(self):
self.consume('ID') # consume 'let'
var_name = self.consume('ID')
self.consume('ASSIGN')
expr = self.expression()
self.consume('END')
return Assign(var_name, expr)
This method uses consume
to check if the tokens match the pattern of an assignment.
Step 7: Parsing Print Statements
Next, we define print_statement
to handle statements like print(x);
. It expects print
, followed by an opening parenthesis ((
), an expression, a closing parenthesis ()
), and a semicolon (;
).
def print_statement(self):
self.consume('ID') # consume 'print'
self.consume('LPAREN')
expr = self.expression()
self.consume('RPAREN')
self.consume('END')
return Print(expr)
This method builds a Print
node, making sure the structure is correct.
Step 8: Parsing Expressions
Expressions are the building blocks of our language. They can be numbers, variables, or combinations of operations like 5 + 3
. The expression
method will handle this:
def expression(self):
left = self.term()
while self.pos < len(self.tokens) and self.tokens[self.pos][0] == 'OP':
op = self.consume('OP')
right = self.term()
left = BinOp(left, op, right)
return left
This method looks for terms (numbers, variables, or parentheses) and builds binary operations (BinOp
nodes) when it finds operators (+
, -
, etc.).
Step 9: Parsing the Smallest Parts - Terms
The term
method handles the smallest elements like numbers, variables, or nested expressions within parentheses.
def term(self):
token_type, token_value = self.tokens[self.pos]
if token_type == 'NUMBER':
return Number(self.consume('NUMBER'))
elif token_type == 'ID':
return Variable(self.consume('ID'))
elif token_type == 'LPAREN':
self.consume('LPAREN')
expr = self.expression()
self.consume('RPAREN')
return expr
else:
raise SyntaxError("Invalid syntax")
Numbers: Parsed as
Number
nodes.Variables: Parsed as
Variable
nodes.Parentheses: Allow us to handle expressions like
(3 + 5)
correctly.
Step 10: Testing Our Parser
Let’s test our parser using some sample code:
code = """
let x = 5;
print(x);
"""
tokens = lexer(code)
parser = Parser(tokens)
ast = parser.parse()
for node in ast:
print(node)
If everything is working, the parser will convert the code into an AST, showing the structure of each statement.
Conclusion
Congratulations! You've built a parser that takes tokens and organizes them into an AST. We started with simple nodes and gradually added more complex parsing logic to handle different statements and expressions. The next step is to use this AST in an interpreter that can execute or evaluate it.
Stay tuned for the next article where we’ll explore building an interpreter!
P.S. Got stuck or have questions? Drop a comment below or reach out—I'm all ears!
Subscribe to my newsletter
Read articles from Abou Zuhayr directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
Abou Zuhayr
Abou Zuhayr
Hey, I’m Abou Zuhayr, Android developer by day, wannabe rockstar by night. With 6 years of Android wizardry under my belt and currently working at Gather.ai, I've successfully convinced multiple devices to do my bidding while pretending I'm not secretly just turning them off and on again. I’m also deeply embedded (pun intended) in systems that make you question if they’re alive. When I'm not fixing bugs that aren’t my fault (I swear!), I’m serenading my code with guitar riffs or drumming away the frustration of yet another NullPointerException. Oh, and I write sometimes – mostly about how Android development feels like extreme sport mixed with jazz improvisation. Looking for new challenges in tech that don’t involve my devices plotting against me!