Quark's Outlines: Python Lexical Analysis

Mike Vincent
Apr 02, 2025 · 7 min read

Synopsis of Python Lexical Analysis

What is lexical analysis?

If you want to understand how Python reads your code, start with lexical analysis. Lexical analysis is the first step Python takes when it reads and runs your code: it turns raw text into smaller parts called tokens. A token is one word, name, symbol, or number that has meaning in Python. These tokens are passed to the parser, which checks whether the tokens follow the rules of the language.

Python 3 reads source code as Unicode text, and source files use the UTF-8 encoding by default. That means string values, comments, and even identifiers can contain characters beyond plain ASCII. Inside strings, you can also write unusual characters with escape codes like \xhh, \ooo, or \uxxxx so they look the same on every system.

How does Python turn code into tokens?

Python reads the program from left to right and breaks it into a stream of tokens. A token can be an identifier, keyword, literal, operator, or delimiter.

An identifier is a name like price or total. A keyword is a word used by Python, such as if, def, or return. A literal is a fixed value like 7, 3.14, or 'hello'. An operator is a symbol like +, -, *, /, ==, !=, <, >, <=, >=. A delimiter is a symbol like (, ), [, ], {, }, :, ,, ., or ;.

Python also uses three layout tokens: NEWLINE, INDENT, and DEDENT. These help mark the shape of the code. Whitespace such as spaces, tabs, and formfeeds is used to separate tokens but is not itself a token.

When Python reads the code, it always forms the longest possible valid token. For example, it reads == as one token meaning equal to. It does not read it as two = tokens. Whitespace is ignored unless it changes how tokens are grouped.
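
You can watch this happen with the tokenize module from the standard library. The short sketch below feeds a two-line snippet to tokenize.generate_tokens and prints each token's name and text. In the output, == appears as a single OP token, and the NEWLINE, INDENT, and DEDENT layout tokens mark where the line ends and where the indented block begins and ends.

import io
import tokenize

# A small sketch: list the tokens Python finds in a two-line snippet.
source = "if x == 10:\n    print(x)\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))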

Why learn about lexical analysis in Python?

Lexical analysis shows how Python sees your code before it runs. It explains how Python finds tokens, builds logical lines from physical lines, and groups code using indentation. A logical line is a full line of code as Python understands it. A physical line is one line of text in the file.
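
For example, one logical line can stretch across several physical lines. Inside parentheses, brackets, or braces, Python joins the physical lines together, and a backslash at the end of a physical line does the same. The small sketch below shows both forms with example values for price and tax.

price = 100
tax = 8

# One logical line written across two physical lines.
# Inside the parentheses, Python joins the physical lines for you.
total = (price +
         tax)

# A trailing backslash also continues the logical line.
total = price + \
        tax

print(total)  # 108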

Knowing how lexical analysis works helps you avoid common errors. You learn why Python is strict about spacing and layout, and why some code fails even if it looks correct. Lexical rules form the base of Python syntax.

Timeline of Python Lexical Analysis

Where does lexical analysis come from?

Lexical analysis began as a way to separate meaningful parts of language. In programming, it became the first step in turning code into structure. Behind every parser is a stream of tokens, and behind every grammar is a theory of how rules can be formed. This timeline lists the key people, tools, and writings that shaped how Python and other languages break code into tokens.

1956 – Chomsky Hierarchy. Noam Chomsky published a system of formal grammars that describe how valid strings are built from symbols.

1957 – FORTRAN Compiler. IBM’s compiler for FORTRAN turned algebra-like input into tokens the machine could use.

1960 – ALGOL 60 Report. Peter Naur edited the ALGOL 60 grammar, showing how language rules could be written in formal notation.

1965 – Syntax-Directed Translation. Donald Knuth introduced the idea that parse trees can carry meaning using grammar-based rules.

1975 – Lex Tool. Mike Lesk and Eric Schmidt created lex, which builds lexical analyzers from pattern-matching rules.

1977 – Dragon Book. Alfred Aho and Jeffrey Ullman published Principles of Compiler Design, introducing compiler stages and token systems.

1986 – Red Dragon Book. Aho, Sethi, and Ullman expanded their text with stronger methods for scanning, parsing, and code generation.

1991 – Python 0.9. Guido van Rossum released the first version of Python with indentation-based syntax, placing lexical meaning in leading whitespace.

2001 – PLY Library. David Beazley created PLY, a pure Python version of lex and yacc used to build parsers from rules and tokens.

2003 – Python PEP 263. Python 2.3 added encoding declarations to help the lexer read non-ASCII source files using comments like # coding: utf-8.

2008 – Python PEP 3131. Python 3.0 allowed non-ASCII letters in identifiers, and the companion PEP 3120 made UTF-8 the default source encoding.

2019 – Python Walrus Operator. Python 3.8 introduced := as a new token for assignment in expressions, expanding the lexer’s symbol set.

2021 – Python Match Keywords. Python 3.10 added match and case as soft keywords that act as tokens only in pattern matching.

2023 – Hugging Face Tokenizers. LLM tools used fast tokenizers to split text into prediction units, adapting lexical scanning to natural language.

2024 – Python 3.13 Token Errors. Python improved its error messages when tokenizing fails due to bad characters or wrong indentation.

Problems & Solutions with Python Lexical Analysis

Python's lexical analysis helps the computer break code into parts it can understand. These rules tell Python where one word ends and the next begins, which symbols matter, and which ones are not allowed. The examples below show common problems and how Python solves them using lexical rules. Each problem starts with a real-world situation and ends with a solution that uses Python's token system.

Problem 1: How to keep words separate using a Python token rule

Problem: You write a message to a friend but do not use spaces. You write Mixeggsflourandmilk with no breaks. Your friend cannot tell where one word ends or what the message means.

Solution: Python solves this by requiring tokens to be clearly separated. A token is one name, number, or symbol. You must use spacing or punctuation so Python can tell one token from the next. For example, a = 2 + b has five tokens: a, =, 2, +, b. If you write ab, Python sees that as one token. If you write a b, Python sees two names. Lexical analysis makes sure tokens do not merge. In the line x = 5 * (y + 3), Python finds these tokens: x, =, 5, *, (, y, +, 3, ).

x = 5 * (y + 3)
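
You can confirm that list with the tokenize module from the standard library. The sketch below prints the text of every token Python finds in that line, skipping the layout tokens at the end.

import io
import tokenize

# Print the text of each token in the line above.
line = "x = 5 * (y + 3)\n"
for tok in tokenize.generate_tokens(io.StringIO(line).readline):
    if tok.string.strip():  # skip the NEWLINE and ENDMARKER tokens
        print(tok.string)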

Problem 2: How to read compound symbols using a Python operator token

Problem: You see a sign that says ==. You do not know the symbol and try to read each = by itself. But the meaning only makes sense when you read both together.

Solution: Python uses the longest match that forms a legal token. It reads == as one token, not two. This is the equality operator. If you wrote = and =, it would not mean the same thing. Python also treats !=, <=, >=, <<, and >> as full tokens. These compound symbols are never split up. In the line if x == 10:, Python reads == as one unit that checks for equality.

if x == 10:
    print(x)
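
A quick way to see the difference is to use both symbols in one snippet. In the small sketch below, = is a single assignment token, while ==, !=, and >= are each read as one comparison token rather than as separate symbols.

x = 10           # "=" is one token: assignment
print(x == 10)   # "==" is one token: equality test, prints True
print(x != 3)    # "!=" is one token, prints True
print(x >= 10)   # ">=" is one token, prints True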

Problem 3: How to avoid invalid symbols using a Python character rule

Problem: You write a sign using English letters, but then add a rare symbol like $ to make it stand out. Some readers do not know what it means or think it is a mistake.

In Python, if you try to name something using an invalid symbol, the code will not run:

total$ = 10

Solution: Python only allows certain characters outside of strings and comments. Names can use letters, digits, underscores, and many Unicode letters. Symbols like $ and ? are not allowed unless they appear inside a string or comment. If you type something like total$ = 10, Python gives a syntax error. But total_cost = 10 works because it uses valid characters.

total_cost = price + tax
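
One way to check a line before running it is the built-in compile() function, which tokenizes and parses text the same way Python does for a file. In the sketch below, the line with $ is rejected with a SyntaxError, while the line with an underscore compiles cleanly.

# The "$" character is rejected when Python tokenizes the source.
try:
    compile("total$ = 10", "<example>", "exec")
except SyntaxError as err:
    print("rejected:", err.msg)

# Valid characters only, so this line compiles without an error.
compile("total_cost = 10", "<example>", "exec")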

Like, Comment, and Subscribe

Did you find this helpful? Let me know by clicking the like button below. I'd love to hear your thoughts in the comments, too! If you want to see more content like this, don't forget to subscribe to my channel. Thanks for reading!


Mike Vincent is an American software engineer and writer based in Los Angeles. Mike writes about technology leadership and holds degrees in Linguistics and Industrial Automation. More about Mike Vincent
