Building a Parser: A Beginner-Friendly Guide to Constructing an AST-Based Parser

Abou ZuhayrAbou Zuhayr
6 min read

In our previous article, we built a lexer that converts code into tokens, breaking down the input into meaningful pieces. Now, the next exciting step is to write a parser, which takes these tokens and organizes them into a meaningful structure called an Abstract Syntax Tree (AST). This guide will walk you through the process of building a parser in a simple, step-by-step manner, making it easy to understand.

The code to this series can be found here

What Exactly is a Parser?

A parser is like a translator. It takes the sequence of tokens produced by the lexer and turns them into an organized structure. This structure, the AST, represents the grammar and rules of our programming language. For example, an assignment statement like let x = 5; would be broken down into parts (tokens) and then arranged into a tree that shows the relationships between these parts.

Step 1: Defining the AST Structure

Before we start coding the parser, we need to define our AST nodes. These nodes will represent different elements in our programming language. Let's start with a few simple ones:

  • Number: Represents a numeric value.

  • BinOp: Represents an operation like + or -.

  • Variable: Represents a variable name.

  • Assign: Represents an assignment statement like let x = 5;.

  • Print: Represents a print statement like print(x);.

Here’s how we define these nodes in Python:

class ASTNode:
    pass

class Number(ASTNode):
    def __init__(self, value):
        self.value = value

class BinOp(ASTNode):
    def __init__(self, left, op, right):
        self.left = left
        self.op = op
        self.right = right

class Variable(ASTNode):
    def __init__(self, name):
        self.name = name

class Assign(ASTNode):
    def __init__(self, name, value):
        self.name = name
        self.value = value

class Print(ASTNode):
    def __init__(self, expression):
        self.expression = expression

These classes will help our parser build the tree, making it easy to visualize and work with the structure of the code.

Step 2: Setting Up the Parser

Now that we have the basic AST structure, let’s set up the Parser class. This class will take the list of tokens from our lexer and keep track of where it is in that list using a position (pos).

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

The tokens parameter will hold the list of tokens we got from the lexer, and pos will keep track of which token we’re currently looking at.

Step 3: Moving Through Tokens with the consume Method

To make our parser work, we need a way to move through the tokens and check if they match what we expect. That’s where the consume method comes in. It will check if the current token matches what we need, move to the next one if it does, and return its value. If it doesn’t match, it will raise an error.

def consume(self, expected_type):
    if self.pos < len(self.tokens) and self.tokens[self.pos][0] == expected_type:
        self.pos += 1
        return self.tokens[self.pos - 1][1]
    else:
        raise SyntaxError(f"Expected {expected_type}, got {self.tokens[self.pos][0]}")

This method helps us move through the tokens while making sure everything follows the rules of our language.

Step 4: Parsing the Entire Code with the parse Method

The main job of the parser is to go through all the tokens and build an AST. We start by defining the parse method, which creates a list of statements (each represented as an AST node).

def parse(self):
    statements = []
    while self.pos < len(self.tokens):
        if self.tokens[self.pos][0] == 'NEWLINE':
            self.pos += 1  # Skip newlines
            continue
        statements.append(self.statement())
    return statements

The parse method goes through each token until there are none left. It checks if the current token is a newline (which we skip), and otherwise, it tries to parse a statement using the statement method.

Step 5: Parsing Statements

In our simple language, there are two types of statements: assignments (like let x = 5;) and print statements (like print(x);). The statement method decides which type of statement to parse based on the current token.

def statement(self):
    token_type, token_value = self.tokens[self.pos]
    if token_type == 'ID' and token_value == 'let':
        return self.assignment()
    elif token_type == 'ID' and token_value == 'print':
        return self.print_statement()
    else:
        raise SyntaxError("Unknown statement")
  • If the token is let, we know it’s an assignment, so we call the assignment method.

  • If the token is print, we call the print_statement method.

Step 6: Parsing Assignments

The assignment method handles statements like let x = 5;. It expects the keyword let, a variable name, an = sign, an expression (like a number or variable), and a semicolon (;).

def assignment(self):
    self.consume('ID')  # consume 'let'
    var_name = self.consume('ID')
    self.consume('ASSIGN')
    expr = self.expression()
    self.consume('END')
    return Assign(var_name, expr)

This method uses consume to check if the tokens match the pattern of an assignment.

Step 7: Parsing Print Statements

Next, we define print_statement to handle statements like print(x);. It expects print, followed by an opening parenthesis ((), an expression, a closing parenthesis ()), and a semicolon (;).

def print_statement(self):
    self.consume('ID')  # consume 'print'
    self.consume('LPAREN')
    expr = self.expression()
    self.consume('RPAREN')
    self.consume('END')
    return Print(expr)

This method builds a Print node, making sure the structure is correct.

Step 8: Parsing Expressions

Expressions are the building blocks of our language. They can be numbers, variables, or combinations of operations like 5 + 3. The expression method will handle this:

def expression(self):
    left = self.term()
    while self.pos < len(self.tokens) and self.tokens[self.pos][0] == 'OP':
        op = self.consume('OP')
        right = self.term()
        left = BinOp(left, op, right)
    return left

This method looks for terms (numbers, variables, or parentheses) and builds binary operations (BinOp nodes) when it finds operators (+, -, etc.).

Step 9: Parsing the Smallest Parts - Terms

The term method handles the smallest elements like numbers, variables, or nested expressions within parentheses.

def term(self):
    token_type, token_value = self.tokens[self.pos]
    if token_type == 'NUMBER':
        return Number(self.consume('NUMBER'))
    elif token_type == 'ID':
        return Variable(self.consume('ID'))
    elif token_type == 'LPAREN':
        self.consume('LPAREN')
        expr = self.expression()
        self.consume('RPAREN')
        return expr
    else:
        raise SyntaxError("Invalid syntax")
  • Numbers: Parsed as Number nodes.

  • Variables: Parsed as Variable nodes.

  • Parentheses: Allow us to handle expressions like (3 + 5) correctly.

Step 10: Testing Our Parser

Let’s test our parser using some sample code:

code = """
let x = 5;
print(x);
"""

tokens = lexer(code)
parser = Parser(tokens)
ast = parser.parse()

for node in ast:
    print(node)

If everything is working, the parser will convert the code into an AST, showing the structure of each statement.

Conclusion

Congratulations! You've built a parser that takes tokens and organizes them into an AST. We started with simple nodes and gradually added more complex parsing logic to handle different statements and expressions. The next step is to use this AST in an interpreter that can execute or evaluate it.

Stay tuned for the next article where we’ll explore building an interpreter!


P.S. Got stuck or have questions? Drop a comment below or reach out—I'm all ears!

0
Subscribe to my newsletter

Read articles from Abou Zuhayr directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Abou Zuhayr
Abou Zuhayr

Hey, I’m Abou Zuhayr, Android developer by day, wannabe rockstar by night. With 6 years of Android wizardry under my belt and currently working at Gather.ai, I've successfully convinced multiple devices to do my bidding while pretending I'm not secretly just turning them off and on again. I’m also deeply embedded (pun intended) in systems that make you question if they’re alive. When I'm not fixing bugs that aren’t my fault (I swear!), I’m serenading my code with guitar riffs or drumming away the frustration of yet another NullPointerException. Oh, and I write sometimes – mostly about how Android development feels like extreme sport mixed with jazz improvisation. Looking for new challenges in tech that don’t involve my devices plotting against me!