Creating a Lexical Analyzer for Theory of Computation: Lexer Development - Part 2

Saul Hernandez

Introduction

In this post, I will share how I built a lexical analyzer using Python.

The purpose of this post is to understand the tasks performed by the lexical analyzer and how they are performed. It describes how I developed the first phase of a compiler, the lexer.

Putting it all together

Once armed with all the tools that I shared in Part 1 (lexical grammar, regular expressions, FSM, and transition table), we can now look at how they all come together in the implementation of a lexer.

TokenType

First, we need to create a TokenType class where we add the different types of tokens as constants. For this class, I will be using an Enum to categorize them. This provides a structured and readable way to represent the tokens, and it also ensures that each token type is unique.

from enum import Enum
import string

class TokenType(Enum):
    """Represents the different types of tokens that can be identified.

    This Enum class is used to categorize and define constants for various 
    token types.

    Attributes:
        Each constant in the Enum represents a unique token type, making it 
        easier to classify and process
    """

    LETTERS = string.ascii_letters
    DIGITS = "0123456789$"
    SYMBOLS = "+-*/=><;"

    # Keywords
    BASIC_KEYWORDS = ["PRINT","INPUT","IF","THEN","GOTO","FOR","TO","NEXT",
        "REM","END",]

    # Special token types
    LINE_NUMBER = "LINE NUMBER"
    KEYWORD = "KEYWORD"
    UNKNOWN = "UNKNOWN"
    COMMENT = "COMMENT"

    # Delimiters
    SEMICOLON = "SEMICOLON"

    # Assignment operators
    EQUALS = "EQUALS"

    # Identifier and Literals
    NUM_IDENTIFIER = "NUMERIC IDENTIFIER"
    ALNUM_IDENTIFIER = "ALPHANUMERIC IDENTIFIER"
    STRING = "STRING"
    NUMBER = "NUMBER"

    # Boolean operators
    OR = "OR"

    # Arithmetic operators
    PLUS = "PLUS"
    MINUS = "MINUS"
    TIMES = "TIMES"
    DIVIDE = "DIVIDE"

    # Comparison operators
    LE = "LESS EQUAL"
    GE = "GREATER EQUAL"
    NE = "NOT EQUAL"
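
As a quick sanity check (a minimal sketch, not part of the lexer itself), here is how these Enum members are looked up by name and how their raw values behave; the lexer below relies on expressions like TokenType["DIGITS"].value:

from TokenType import TokenType

print(TokenType["KEYWORD"].value)                  # KEYWORD
print("IF" in TokenType["BASIC_KEYWORDS"].value)   # True
print("7" in TokenType["DIGITS"].value)            # True
print("?" in TokenType["SYMBOLS"].value)           # False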

The Lexer

In this Python file, I will be adding two classes. The first one is the Token class, which produces a token for each valid lexeme. We have a finite set of operators, keywords, and other elements, so having a token type for each kind of operator or keyword improves clarity. A token has a type (e.g. keyword, identifier, number, or operator), a value, and the line number.

import re

from TokenType import TokenType

class Token:
    """Represents a 'Token' identified during the analysis.

    Attributes:
        type: A string indicating the token type.
        value: A string with the value of the token.
        line: A string with the line number of the token.
    """
    def __init__(self, type: str, value: str, line: str):
        """Initializes a new Token instance with a type and value.

        Args:
            type: Define the 'TokenType' corresponding to the type of the 
                'Token'.
            value: Define the 'String' value of the token. The actual 
                characters of the lexeme.
            line: Define the line number where the token was encountered.
        """
        self.type = type
        self.value = value
        self.line = line
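
For instance, a hypothetical token for the lexeme PRINT found on line 20 would be built like this:

token = Token(TokenType["KEYWORD"].value, "PRINT", "20")
print(token.type, token.value, token.line)  # KEYWORD PRINT 20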

The next class in this file is the Lexer class. This class needs attributes to keep track of the current position in the input, the current character, and the current line.

class Lexer:
    """Represents a tokenizer for source code analysis.

    Attributes:
        input: A string with the source code to tokenize.
        position: Current position in the source code.
        current_char: Current character being processed.
        line: Current line number
    """

    def __init__(self, text: str):
        """Initializes a Lexer instance with the source code provided.

        Args:
            text: A string with the source code.
        """
        self.input = text
        self.position = 0
        self.current_char = self.input[self.position] if self.input else None
        self.line = 0

In this class, we use several functions for the different processing steps. The first function is called advance, and it is used to move through the input. It increments the position attribute and sets current_char to the next character in the input, or to None once we reach the end of the source code.

    def advance(self):
        """Moves the current position in the input string.

        Args:
            None

        Updates:
            position: Increments by 1 to indicate the next character position.
            current_char: The next character in the input string or `None` 
                if the end of the string is reached.
        """
        self.position += 1
        self.current_char = (
            self.input[self.position]
            if self.position < len(self.input)
            else None
        )
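
To see what advance does in isolation, here is a short sketch (assuming the constructor shown above):

lexer = Lexer("10 END")
print(lexer.position, lexer.current_char)  # 0 1
lexer.advance()
print(lexer.position, lexer.current_char)  # 1 0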

The extract_tokens function builds a list with the information of each recognised token (type, value, line number). We keep collecting tokens until current_char is None, which marks the end of the source code. In this function, I use helper functions to delegate the recognition of the lexemes. Any character that is not covered by one of the branches is simply skipped, so the loop always advances.

    def extract_tokens(self):
        """Tokenizes the input text and generates a list of tokens.

        Returns:
            tokens: A list of Token objects representing the parsed tokens.
        """
        is_newline = True
        is_rem = False
        tokens = []

        while self.current_char is not None:
            char = self.current_char

            # Handle newlines
            if char == "\n":
                is_newline = True
                self.advance()
                continue

            # Skip whitespace
            elif char.isspace():
                self.advance()
                continue

            # Line must start with a line number
            if is_newline and self.current_char in TokenType["DIGITS"].value:
                token_value = self.recognize_number()
                tokens.append(
                    Token(
                        TokenType["LINE_NUMBER"].value,
                        token_value,
                        token_value,
                    )
                )
                self.line = token_value
                is_newline = False

            # Handle comments
            elif (
                is_rem
                and self.current_char
                in TokenType["LETTERS"].value + TokenType["DIGITS"].value
            ):
                tokens.append(
                    Token(
                        TokenType["COMMENT"].value,
                        self.recognize_comment(),
                        self.line,
                    )
                )
                is_rem = False

            # Get numbers
            elif self.current_char in TokenType["DIGITS"].value:
                tokens.append(
                    Token(
                        TokenType["NUMBER"].value,
                        self.recognize_number(),
                        self.line,
                    )
                )

            # Get strings
            elif self.current_char == '"':
                tokens.append(
                    Token(
                        TokenType["STRING"].value,
                        self.recognize_string(),
                        self.line,
                    )
                )

            # Get identifiers
            elif self.current_char in TokenType["LETTERS"].value:
                token_value = self.recognize_identifier()
                token_type = self.recognize_token_type(token_value)
                if token_value == "REM":
                    is_rem = True
                tokens.append(Token(token_type, token_value, self.line))

            # Get operators
            elif self.current_char in TokenType["SYMBOLS"].value:
                token_value = self.recognize_operator()
                token_type = self.recognize_token_type(token_value)
                tokens.append(Token(token_type, token_value, self.line))

            # Skip any character not covered above so the loop always advances
            else:
                self.advance()

        return tokens
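
As a rough idea of what this produces (a minimal sketch; the exact tokens depend on the helper methods described below), running the lexer over two short lines of BASIC might look like this:

source = '10 PRINT "HELLO"\n20 END'
for token in Lexer(source).extract_tokens():
    print(token.line, token.type, token.value)
# 10 LINE NUMBER 10
# 10 KEYWORD PRINT
# 10 STRING HELLO
# 20 LINE NUMBER 20
# 20 KEYWORD END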

Let’s take a look at the helper methods.

Recognising numbers

In this function, I only recognise integer numbers in order to keep the lexer simple. The function reads digits sequentially to form complete integers, ensuring they conform to the syntax rules of the language.

    def recognize_number(self):
        number = ""

        while (
            self.current_char
            and self.current_char in TokenType["DIGITS"].value
        ):
            number += self.current_char
            self.advance()

        return number
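
A short example of the behaviour: the helper stops at the first non-digit character and leaves the cursor on it, so the main loop can continue from there.

lexer = Lexer("123+45")
print(lexer.recognize_number())  # 123
print(lexer.current_char)        # +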

Recognising identifiers

This function reads letters and digits sequentially to form a string, until we encounter a character that is not a letter or a digit.

    def recognize_identifier(self):
        identifier = ""
        while (
            self.current_char
            and self.current_char
            in TokenType["LETTERS"].value + TokenType["DIGITS"].value
        ):
            identifier += self.current_char
            self.advance()

        return identifier
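
Note that because the $ sign is part of TokenType["DIGITS"].value, a BASIC string variable such as TOTAL$ comes out of this helper as a single lexeme. For example:

lexer = Lexer("TOTAL$ = 5")
print(lexer.recognize_identifier())  # TOTAL$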

Recognising token type

In this function, we recognise the type of a given token value. It checks whether the value is a keyword, an alphanumeric identifier, a numeric identifier, or one of the specific tokens. I added a mapping for the specific tokens to avoid nested if statements. Additionally, I used regular expressions to validate the two types of identifiers. A regular expression is a rule that describes all the strings that can be built from a set of basic characters/symbols.

    def recognize_token_type(self, token_value: str):
        """Recognizes and returns the type of token value given.

        Args:
            token_value: The token value to classify.

        Returns:
            token_type: A string with the token type corresponding to the
                given token value.
        """
        # Mapping specific token values to their types
        token_type_map = {
            "OR": TokenType["OR"].value,
            ";": TokenType["SEMICOLON"].value,
            "<=": TokenType["LE"].value,
            ">=": TokenType["GE"].value,
            "+": TokenType["PLUS"].value,
            "-": TokenType["MINUS"].value,
            "*": TokenType["TIMES"].value,
            "/": TokenType["DIVIDE"].value,
            "=": TokenType["EQUALS"].value,
            "<>": TokenType["NE"].value,
        }

        # Check for keywords
        if token_value.upper() in TokenType["BASIC_KEYWORDS"].value:
            return TokenType["KEYWORD"].value

        # Check the token_type_map for predefined matches
        if token_value in token_type_map:
            return token_type_map[token_value]

        # Check for alphanumeric identifier
        if re.match(r"[A-Z]+\$$", token_value.upper()):
            return TokenType["ALNUM_IDENTIFIER"].value

        # Check for numeric identifier
        if re.match(r"[A-Z]+$", token_value.upper()):
            return TokenType["NUM_IDENTIFIER"].value

        # Default type
        return TokenType["UNKNOWN"].value
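
A few illustrative calls (the values are made up, just to show how the classification falls through each check):

lexer = Lexer("10")
print(lexer.recognize_token_type("PRINT"))  # KEYWORD
print(lexer.recognize_token_type("N$"))     # ALPHANUMERIC IDENTIFIER
print(lexer.recognize_token_type("count"))  # NUMERIC IDENTIFIER
print(lexer.recognize_token_type("<>"))     # NOT EQUAL
print(lexer.recognize_token_type("R2"))     # UNKNOWN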

Recognising operators

This function reads symbols to form a string representing the operator until we encounter a character that is not a symbol. Once the operator is formed, the recognize_token_type function is responsible for determining which kind of operator it is, arithmetic or comparison.

    def recognize_operator(self):
        symbol = ""
        while (
            self.current_char
            and self.current_char in TokenType["SYMBOLS"].value
        ):
            symbol += self.current_char
            self.advance()

        return symbol
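
Because the loop keeps consuming symbol characters, two-character operators such as <= and <> come back as a single lexeme:

lexer = Lexer("<=10")
print(lexer.recognize_operator())  # <=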

Recognising strings

This function reads any character sequentially to form a string until a double-quote symbol (") is detected, indicating the end of the string. All characters between the quotes are captured, including spaces and special symbols.

    def recognize_string(self):
        string = ""
        self.advance()

        while self.current_char and self.current_char != '"':
            string += self.current_char
            self.advance()

        self.advance()

        return string

Recognising comments

This function reads any character sequentially to form a string until a line break (\n) is detected, indicating the end of the comment. All characters are captured, including spaces and special symbols. This function is called when the keyword REM is detected.

    def recognize_comment(self):
        comment = ""

        while self.current_char and self.current_char != "\n":
            comment += self.current_char
            self.advance()

        return comment

Testing the lexer

In order to know if this class works correctly, I will be using pytest for testing. First, I create a file called test_lexer.py, where I define the test functions. In these functions, I instantiate the Lexer class, passing raw source code to its constructor, and then call the extract_tokens function. The returned list of tokens is transformed into a list of dictionaries for easier comparison with the expected results.

from src.Lexer import Lexer

def test_validate_line_number_token():
    """
    Test the extract_tokens function to ensure it returns line number token.
    """
    tokens = Lexer("10").extract_tokens()

    token_dicts = [
        {"type": token.type, "value": token.value, "line": token.line}
        for token in tokens
    ]

    expected_tokens = [{"type": "LINE NUMBER", "value": "10", "line": "10"}]

    assert token_dicts == expected_tokens


def test_validate_keyword_token():
    """
    Test the extract_tokens function to ensure it returns line number and 
    END token.
    """
    tokens = Lexer("200 END").extract_tokens()

    token_dicts = [
        {"type": token.type, "value": token.value, "line": token.line}
        for token in tokens
    ]

    expected_tokens = [
        {"type": "LINE NUMBER", "value": "200", "line": "200"},
        {"type": "KEYWORD", "value": "END", "line": "200"},
    ]

    assert token_dicts == expected_tokens


def test_validate_identifier_plus_number_tokens():
    """
    Test the extract_tokens function to ensure it returns line number,
    identifier, plus and number token.
    """
    tokens = Lexer("20 r = r + 10").extract_tokens()

    token_dicts = [
        {"type": token.type, "value": token.value, "line": token.line}
        for token in tokens
    ]

    expected_tokens = [
        {"type": "LINE NUMBER", "value": "20", "line": "20"},
        {"type": "NUMERIC IDENTIFIER", "value": "r", "line": "20"},
        {"type": "EQUALS", "value": "=", "line": "20"},
        {"type": "NUMERIC IDENTIFIER", "value": "r", "line": "20"},
        {"type": "PLUS", "value": "+", "line": "20"},
        {"type": "NUMBER", "value": "10", "line": "20"},
    ]

    assert token_dicts == expected_tokens


def test_validate_identifier_string_tokens():
    """
    Test the extract_tokens function to ensure it returns line number,
    keyword and string token.
    """
    raw_code = '20 PRINT "Hello World!"'
    tokens = Lexer(raw_code).extract_tokens()

    token_dicts = [
        {"type": token.type, "value": token.value, "line": token.line}
        for token in tokens
    ]

    expected_tokens = [
        {"type": "LINE NUMBER", "value": "20", "line": "20"},
        {"type": "KEYWORD", "value": "PRINT", "line": "20"},
        {"type": "STRING", "value": "Hello World!", "line": "20"},
    ]

    assert token_dicts == expected_tokens


def test_validate_comment():
    """
    Test the extract_tokens function to ensure it returns line number, 
    keyword and comment token.
    """
    raw_code = "185 REM This is a comment"
    tokens = Lexer(raw_code).extract_tokens()

    token_dicts = [
        {"type": token.type, "value": token.value, "line": token.line}
        for token in tokens
    ]

    expected_tokens = [
        {"type": "LINE NUMBER", "value": "185", "line": "185"},
        {"type": "KEYWORD", "value": "REM", "line": "185"},
        {"type": "COMMENT", "value": "This is a comment", "line": "185"},
    ]

    assert token_dicts == expected_tokens

Once we have this in place, we can run our tests with pytest, and they should all pass.

Conclusion

Now we have completed the Lexer class. This class is able to tokenize the input source code, breaking the raw code down into meaningful tokens.

In the next article, I will share the development process of a Syntax Analyzer. This will include how a finite state machine (FSM) can help us analyse token sequences.
