LSM tree

If you’re diving into the world of database internals or exploring how modern NoSQL databases like Cassandra, RocksDB, or LevelDB handle massive amounts of writes, you’ll eventually run into the term LSM Tree — short for Log-Structured Merge Tree.

In this blog post, I’ll walk you through the components and mechanics of LSM Trees, how writes and deletes are handled, and why this design is optimized for speed and reliability.

🧠 What is an LSM Tree?

At a high level, an LSM Tree is a data structure designed to optimize write-heavy workloads. Unlike traditional B-Trees that update data in place, LSM Trees write new data sequentially and defer organizing it until later. This makes them incredibly efficient for high-throughput systems where data is constantly being ingested.

Think of it like this:

Rather than scribbling changes directly into a book (like B-Trees do), LSM Trees take quick notes on a sticky pad (in memory), and once they accumulate enough, they rewrite the book cleanly (on disk).

🧩 Core Components of LSM Tree-Based Systems

Write-Ahead Log (WAL) – The Black Box Recorder

When you write data, the first stop is the WAL, a sequential log on disk. It guarantees durability — if the system crashes, this log can be replayed to recover lost writes. You can think of it like a flight recorder: it logs everything before changes are made.

MemTable – In-Memory, Sorted Buffer

Next, the same data goes to the MemTable, an in-memory structure that holds writes temporarily. It’s fast and sorted, making lookups efficient. Once it fills up, it’s flushed to disk.

SSTable (Sorted String Table) – Immutable On-Disk Storage

When a MemTable is full, it’s flushed to disk as an SSTable — a sorted, immutable file. These are written sequentially to disk, which is much faster than random writes.

⚙️ Write Path: How Data Moves Through the System

Client writes data.
The write is first appended to the WAL.
Then it’s written into the MemTable.
When the MemTable is full, it is flushed to disk as a new SSTable.
The corresponding WAL entries can now be safely discarded.

This flow gives us speed (thanks to RAM and sequential writes) and durability (thanks to the WAL).

🔄 Compaction: Cleaning Up the Mess

As time goes on, multiple SSTables accumulate — sometimes containing old versions of the same key. A background process called compaction merges these files, removes duplicates, and applies tombstones (more on this shortly).

Compaction is key to keeping reads efficient and storage clean.

💀 What Are Tombstones?

In systems like Cassandra, deletes aren’t immediate. Instead, a special marker called a tombstone is written to indicate that a value has been deleted.

Why not just delete immediately?

In distributed systems, not all replicas may be online.
Tombstones ensure that the delete is propagated and respected when those nodes come back.
During compaction, data with associated tombstones is finally purged.

This ensures eventual consistency and correctness across the cluster.

🧭 What Happens If a Node Goes Down?

If a node crashes before the MemTable is flushed, it can recover by replaying the WAL. But how does the database know which WAL entries have already been flushed?

That’s where markers or headers in the WAL come in. These indicate safe checkpoints. On recovery, the system:

Ignores entries before the marker.
Re-applies entries after it.

This mechanism avoids double writes or missed data.

🚄 Why Not Just Write Directly to Disk?

Good question. Writing directly to disk — especially in-place — can be:

Slow, because disk I/O is expensive.
Risky, because crashes can corrupt partial writes.

LSM Trees embrace sequential writes — which are faster, simpler, and easier to recover from. By buffering writes in RAM and writing sorted files to disk, performance is boosted significantly.

✨ Why LSM Trees Shine

✅ High write throughput
📦 Sequential, efficient disk usage
🧹 Background compaction keeps data fresh
🔒 Safe crash recovery with WAL
🧽 Great for systems that handle lots of inserts/updates

Popular databases like Cassandra, LevelDB, RocksDB, and HBase are all powered by LSM Trees under the hood.

🧵 TL;DR

LSM Trees help databases handle high-speed writes by:

Buffering data in memory (MemTable),
Logging it to disk for durability (WAL),
Flushing it to sorted files (SSTables), and
Periodically compacting and cleaning up.

It’s a clever design pattern that trades off some read performance in exchange for massive write efficiency and fault tolerance.

🎯 Wrap-up

If you’re getting started with database design, understanding how LSM Trees work is a fantastic entry point. It gives insight into:

Storage architecture
Durability vs performance trade-offs
How modern NoSQL systems ensure eventual consistency

In future posts, I’ll dive deeper into topics like B-Trees vs LSM Trees, tuning compaction, and Bloom filters for optimizing read paths.

Unleashing the Power of LSM Trees: The Backbone of High-Performance Databases

🧠 What is an LSM Tree?

🧩 Core Components of LSM Tree-Based Systems

Write-Ahead Log (WAL) – The Black Box Recorder

MemTable – In-Memory, Sorted Buffer

SSTable (Sorted String Table) – Immutable On-Disk Storage