Shunting Yard and Thompson's Algorithms

Khushal AgrawalKhushal Agrawal
3 min read

When most people think of “regex engines,” they imagine complicated parsers or some mystical black box that magically matches strings.
Underneath, the best of them use two brilliantly simple ideas — one to turn your messy regex into clean instructions, and one to turn those instructions into a machine that eats strings for breakfast.

The Shunting Yard Algorithm — Order Without Parentheses

Imagine you have the regex:

a(b|c)*d

It’s not obvious at first glance which operators happen before others.
Humans can parse it visually, but computers? They need explicit, ordered instructions.

That’s where Edsger Dijkstra’s Shunting Yard Algorithm comes in.
Its whole purpose: take infix notation (human-friendly) and turn it into postfix notation (machine-friendly) without losing meaning.

It does this by juggling two stacks:

  • One for the output (postfix tokens)

  • One for operators, carefully applying precedence rules:

      + ? *   → highest (quantifiers)
      concat  → middle
      |       → lowest
    

The name comes from the way it moves tokens around like train cars in a railway yard — always ensuring operators “fall” into place at the right time.

Output of our example:

a b c | * concat d concat

Now the order is crystal clear.

Thompson’s Construction — From Postfix to an NFA

Postfix form is like having an IKEA instruction sheet — now you just need to build.

Ken Thompson’s algorithm takes each token and builds tiny NFAs (Non-deterministic Finite Automata), then glues them together:

  • a → a two-state NFA with an a transition.

  • * → wraps an NFA in loops via epsilon transitions (ε-moves).

  • | → forks the path, letting the machine explore alternatives in parallel.

Because it’s all done with epsilon transitions (free moves that don’t consume input), Thompson’s method is guaranteed to:

  1. Be linear in regex size — no catastrophic backtracking.

  2. Match exactly the language described, no surprises.

  3. Produce a structure you can simulate easily.

In the end, your regex turns into a little network of states — a machine that tries all possibilities in parallel without guessing wrong.


Why This Is Elegant

  • Shunting Yard: Instead of writing a massive parser for regex precedence rules, you use a tiny general-purpose algorithm invented for math expressions.

  • Thompson’s: Instead of writing a hand-tuned state machine for each regex, you just snap together a few small building blocks.

The beauty?
The two algorithms fit together perfectly — Shunting Yard linearizes the expression, and Thompson’s treats that sequence like Lego instructions.


Bonus Tricks

  • Concat token: Humans omit it, machines need it — adding explicit concat operators before Shunting Yard makes life easy.

  • Operator precedence: Handle right-associative vs left-associative correctly — quantifiers like + and * bind tightly and don’t swap order.

  • Character classes: Treat [abc] as a single literal token in Shunting Yard so Thompson’s can build an NFA that matches any of them.


The Takeaway

Shunting Yard + Thompson’s isn’t the most fashionable approach in modern regex engines (many use DFAs or hybrid models), but it’s an example of engineering minimalism:

  • Lean on a small, battle-tested algorithm instead of overcomplicating parsing.

  • Build from small, composable NFA fragments instead of hardcoding everything.

  • End up with something easy to debug, extend, and understand.

It’s regex without the black magic.

10
Subscribe to my newsletter

Read articles from Khushal Agrawal directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Khushal Agrawal
Khushal Agrawal