Creating a Lexical Analyzer for Theory of Computation: Lexer Development - Part 2


Introduction
In this post, I will share how I built a lexical analyzer using Python.
The goal is to understand the tasks a lexical analyzer performs and how it performs them, by walking through the development of the first phase of a compiler: the lexer.
Putting it all together
Armed with the tools I shared in part 1 (lexical grammar, regular expressions, FSM, and transition table), let's look at how they all come together in the implementation of a lexer.
TokenType
First, we need to create a TokenType class where we define the different types of tokens as constants. For this class, I will be using an Enum to categorize them. This provides a structured and readable way to represent the tokens and also ensures that each token type is unique.
from enum import Enum
import string


class TokenType(Enum):
    """Represents the different types of tokens that can be identified.

    This Enum class is used to categorize and define constants for various
    token types.

    Attributes:
        Each constant in the Enum represents a unique token type, making it
        easier to classify and process lexemes.
    """

    LETTERS = string.ascii_letters
    DIGITS = "0123456789$"  # '$' is included so identifiers like A$ are scanned as one lexeme
    SYMBOLS = "+-*/=><;"
    # Keywords
    BASIC_KEYWORDS = ["PRINT", "INPUT", "IF", "THEN", "GOTO", "FOR", "TO",
                      "NEXT", "REM", "END"]
    # Special token types
    LINE_NUMBER = "LINE NUMBER"
    KEYWORD = "KEYWORD"
    UNKNOWN = "UNKNOWN"
    COMMENT = "COMMENT"
    # Delimiters
    SEMICOLON = "SEMICOLON"
    # Assignment operators
    EQUALS = "EQUALS"
    # Identifiers and literals
    NUM_IDENTIFIER = "NUMERIC IDENTIFIER"
    ALNUM_IDENTIFIER = "ALPHANUMERIC IDENTIFIER"
    STRING = "STRING"
    NUMBER = "NUMBER"
    # Boolean operators
    OR = "OR"
    # Arithmetic operators
    PLUS = "PLUS"
    MINUS = "MINUS"
    TIMES = "TIMES"
    DIVIDE = "DIVIDE"
    # Comparison operators
    LE = "LESS EQUAL"
    GE = "GREATER EQUAL"
    NE = "NOT EQUAL"
The Lexer
In this Python file, I will be adding two classes. The first one is the Token class, which represents a token generated for each valid lexeme. We have a finite set of operators, keywords, and other elements, so creating a token for each of them improves clarity. A token has a type (e.g. keyword, identifier, number, or operator), a value, and a line number.
import re

from TokenType import TokenType


class Token:
    """Represents a 'Token' identified during the analysis.

    Attributes:
        type: A string indicating the token type.
        value: A string with the value of the token.
        line: A string with the line number of the token.
    """

    def __init__(self, type: str, value: str, line: str):
        """Initializes a new Token instance with a type, value and line.

        Args:
            type: Defines the 'TokenType' corresponding to the type of the
                'Token'.
            value: Defines the 'String' value of the token. The actual
                characters of the lexeme.
            line: Defines the line number where the token was encountered.
        """
        self.type = type
        self.value = value
        self.line = line
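As a small illustration (assuming Token is importable from the lexer module, the same layout used by the tests further below), a token can be created and inspected like this:

from src.Lexer import Token

token = Token("KEYWORD", "PRINT", "20")
print(token.type, token.value, token.line)  # KEYWORD PRINT 20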
The next class in this file is the Lexer class. It needs attributes to keep track of the current position in the input, as well as the current character and the current line.
class Lexer:
    """Represents a tokenizer for source code analysis.

    Attributes:
        input: A string with the source code to tokenize.
        position: Current position in the source code.
        current_char: Current character being processed.
        line: Current line number.
    """

    def __init__(self, text: str):
        """Initializes a Lexer instance with the source code provided.

        Args:
            text: A string with the source code.
        """
        self.input = text
        self.position = 0
        self.current_char = self.input[self.position] if self.input else None
        self.line = 0
In this class, we use several methods for the different processing steps. The first one is advance, which keeps track of the current character. It increments the position attribute and sets current_char to the next character in the input; if we reach the end of the source code, current_char is set to None.
    def advance(self):
        """Moves the current position in the input string.

        Updates:
            position: Increments by 1 to indicate the next character position.
            current_char: The next character in the input string or `None`
                if the end of the string is reached.
        """
        self.position += 1
        self.current_char = (
            self.input[self.position]
            if self.position < len(self.input)
            else None
        )
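To see how advance walks through the input, here is a minimal sketch (illustrative only, using the same import path as the tests further below):

from src.Lexer import Lexer

lexer = Lexer("10 END")
print(repr(lexer.current_char))  # '1'
lexer.advance()
print(repr(lexer.current_char))  # '0'
lexer.advance()
print(repr(lexer.current_char))  # ' '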
The extract_tokens method builds a list with the information of the recognised tokens (type, value, line number). We keep collecting tokens until current_char is None, which marks the end of the source code. In this method, I use helper functions to delegate the recognition of the lexemes.
    def extract_tokens(self):
        """Tokenizes the input text and generates a list of tokens.

        Returns:
            tokens: A list of Token objects representing the parsed tokens.
        """
        is_newline = True
        is_rem = False
        tokens = []
        while self.current_char is not None:
            char = self.current_char
            # Handle newlines
            if char == "\n":
                is_newline = True
                self.advance()
                continue
            # Skip whitespace
            elif char.isspace():
                self.advance()
                continue
            # Line must start with a line number
            if is_newline and self.current_char in TokenType["DIGITS"].value:
                token_value = self.recognize_number()
                tokens.append(
                    Token(
                        TokenType["LINE_NUMBER"].value,
                        token_value,
                        token_value,
                    )
                )
                self.line = token_value
                is_newline = False
            # Handle comments
            elif (
                is_rem
                and self.current_char
                in TokenType["LETTERS"].value + TokenType["DIGITS"].value
            ):
                tokens.append(
                    Token(
                        TokenType["COMMENT"].value,
                        self.recognize_comment(),
                        self.line,
                    )
                )
                is_rem = False
            # Get numbers
            elif self.current_char in TokenType["DIGITS"].value:
                tokens.append(
                    Token(
                        TokenType["NUMBER"].value,
                        self.recognize_number(),
                        self.line,
                    )
                )
            # Get strings
            elif self.current_char == '"':
                tokens.append(
                    Token(
                        TokenType["STRING"].value,
                        self.recognize_string(),
                        self.line,
                    )
                )
            # Get identifiers
            elif self.current_char in TokenType["LETTERS"].value:
                token_value = self.recognize_identifier()
                token_type = self.recognize_token_type(token_value)
                if token_value == "REM":
                    is_rem = True
                tokens.append(Token(token_type, token_value, self.line))
            # Get operators
            elif self.current_char in TokenType["SYMBOLS"].value:
                token_value = self.recognize_operator()
                token_type = self.recognize_token_type(token_value)
                tokens.append(Token(token_type, token_value, self.line))
            else:
                # Skip any character that does not start a known lexeme so
                # the loop cannot get stuck on it.
                self.advance()
        return tokens
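As a quick sanity check (illustrative only), we can feed a single BASIC line to the lexer and print what comes back:

from src.Lexer import Lexer

tokens = Lexer("30 PRINT X").extract_tokens()
for token in tokens:
    print(token.type, repr(token.value), token.line)
# LINE NUMBER '30' 30
# KEYWORD 'PRINT' 30
# NUMERIC IDENTIFIER 'X' 30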
Let’s take a look at the helper methods.
Recognising numbers
In this function, I only recognise integer numbers in order to keep the lexer simple. The function reads digits sequentially to form complete integers, ensuring they conform to the syntax rules of the language.
    def recognize_number(self):
        number = ""
        while (
            self.current_char
            and self.current_char in TokenType["DIGITS"].value
        ):
            number += self.current_char
            self.advance()
        return number
Recognising identifiers
This function reads letters and digits sequentially to form a string until it encounters a character that is not a letter or a digit.
    def recognize_identifier(self):
        identifier = ""
        while (
            self.current_char
            and self.current_char
            in TokenType["LETTERS"].value + TokenType["DIGITS"].value
        ):
            identifier += self.current_char
            self.advance()
        return identifier
Recognising token type
In this function, we recognise the type of a given token value. It checks whether the value is a keyword, an alphanumeric identifier, a numeric identifier, or one of the specific tokens. I added a mapping for the specific token values in order to avoid nested if statements. Additionally, I used regular expressions to validate the two types of identifiers. A regular expression is a rule that describes all the strings that can be built from a set of basic characters/symbols.
    def recognize_token_type(self, token_value: str):
        """Recognizes and returns the type of the given token value.

        Args:
            token_value: The token value to classify.

        Returns:
            A string with the token type corresponding to the given
            token value.
        """
        # Mapping specific token values to their types
        token_type_map = {
            "OR": TokenType["OR"].value,
            ";": TokenType["SEMICOLON"].value,
            "<=": TokenType["LE"].value,
            ">=": TokenType["GE"].value,
            "+": TokenType["PLUS"].value,
            "-": TokenType["MINUS"].value,
            "*": TokenType["TIMES"].value,
            "/": TokenType["DIVIDE"].value,
            "=": TokenType["EQUALS"].value,
            "<>": TokenType["NE"].value,
        }
        # Check for keywords
        if token_value.upper() in TokenType["BASIC_KEYWORDS"].value:
            return TokenType["KEYWORD"].value
        # Check the token_type_map for predefined matches
        if token_value in token_type_map:
            return token_type_map[token_value]
        # Check for alphanumeric identifier (letters followed by '$')
        if re.match(r"[A-Z][A-Z]*\$$", token_value.upper()):
            return TokenType["ALNUM_IDENTIFIER"].value
        # Check for numeric identifier (letters only)
        if re.match(r"[A-Z][A-Z]*$", token_value.upper()):
            return TokenType["NUM_IDENTIFIER"].value
        # Default type
        return TokenType["UNKNOWN"].value
Recognising operators
This function reads symbols sequentially to form the operator until it encounters a character that is not a symbol. Once the operator is formed, the recognize_token_type function is responsible for determining whether it is an arithmetic or a comparison operator.
    def recognize_operator(self):
        symbol = ""
        while (
            self.current_char
            and self.current_char in TokenType["SYMBOLS"].value
        ):
            symbol += self.current_char
            self.advance()
        return symbol
Recognising strings
This function reads characters sequentially to form a string until a closing double-quote symbol (") is detected, indicating the end of the string. All characters between the quotes are captured, including spaces and special symbols.
    def recognize_string(self):
        string = ""
        self.advance()  # Skip the opening quote
        while self.current_char and self.current_char != '"':
            string += self.current_char
            self.advance()
        self.advance()  # Skip the closing quote
        return string
Recognising comments
This function reads characters sequentially to form a string until a line break (\n) is detected, indicating the end of the comment. All characters are captured, including spaces and special symbols. This function is called after the keyword REM is detected.
    def recognize_comment(self):
        comment = ""
        while self.current_char and self.current_char != "\n":
            comment += self.current_char
            self.advance()
        return comment
Testing the lexer
To verify that this class works correctly, I will be using pytest for testing. First, I create a file called test_lexer.py, where I define the test functions. In each of them, I instantiate the Lexer class, passing raw source code to its constructor, and then call the extract_tokens function. The returned list of tokens is transformed into a list of dictionaries for easier comparison with the expected results.
from src.Lexer import Lexer


def test_validate_line_number_token():
    """
    Test the extract_tokens function to ensure it returns line number token.
    """
    tokens = Lexer("10").extract_tokens()
    token_dicts = [
        {"type": token.type, "value": token.value, "line": token.line}
        for token in tokens
    ]
    expected_tokens = [{"type": "LINE NUMBER", "value": "10", "line": "10"}]
    assert token_dicts == expected_tokens


def test_validate_keyword_token():
    """
    Test the extract_tokens function to ensure it returns line number and
    END token.
    """
    tokens = Lexer("200 END").extract_tokens()
    token_dicts = [
        {"type": token.type, "value": token.value, "line": token.line}
        for token in tokens
    ]
    expected_tokens = [
        {"type": "LINE NUMBER", "value": "200", "line": "200"},
        {"type": "KEYWORD", "value": "END", "line": "200"},
    ]
    assert token_dicts == expected_tokens


def test_validate_identifier_plus_number_tokens():
    """
    Test the extract_tokens function to ensure it returns line number,
    identifier, plus and number token.
    """
    tokens = Lexer("20 r = r + 10").extract_tokens()
    token_dicts = [
        {"type": token.type, "value": token.value, "line": token.line}
        for token in tokens
    ]
    expected_tokens = [
        {"type": "LINE NUMBER", "value": "20", "line": "20"},
        {"type": "NUMERIC IDENTIFIER", "value": "r", "line": "20"},
        {"type": "EQUALS", "value": "=", "line": "20"},
        {"type": "NUMERIC IDENTIFIER", "value": "r", "line": "20"},
        {"type": "PLUS", "value": "+", "line": "20"},
        {"type": "NUMBER", "value": "10", "line": "20"},
    ]
    assert token_dicts == expected_tokens


def test_validate_identifier_string_tokens():
    """
    Test the extract_tokens function to ensure it returns line number,
    keyword and string token.
    """
    raw_code = '20 PRINT "Hello World!"'
    tokens = Lexer(raw_code).extract_tokens()
    token_dicts = [
        {"type": token.type, "value": token.value, "line": token.line}
        for token in tokens
    ]
    expected_tokens = [
        {"type": "LINE NUMBER", "value": "20", "line": "20"},
        {"type": "KEYWORD", "value": "PRINT", "line": "20"},
        {"type": "STRING", "value": "Hello World!", "line": "20"},
    ]
    assert token_dicts == expected_tokens


def test_validate_comment():
    """
    Test the extract_tokens function to ensure it returns line number,
    keyword and comment token.
    """
    raw_code = "185 REM This is a comment"
    tokens = Lexer(raw_code).extract_tokens()
    token_dicts = [
        {"type": token.type, "value": token.value, "line": token.line}
        for token in tokens
    ]
    expected_tokens = [
        {"type": "LINE NUMBER", "value": "185", "line": "185"},
        {"type": "KEYWORD", "value": "REM", "line": "185"},
        {"type": "COMMENT", "value": "This is a comment", "line": "185"},
    ]
    assert token_dicts == expected_tokens
Once we have this in place, we can run the test suite with pytest, and all the tests should pass.
Conclusion
Now we have completed the Lexer class. It is able to tokenize the input source code, ensuring that the raw source code is broken down into meaningful tokens.
In the next article, I will share the development process of a syntax analyzer and show how a finite state machine (FSM) can help us analyse token sequences.