Inside the Engine Room – The Lifecycle of an Apache Iceberg Query

Sriram Krishnan

In our last chapters, we explored the "what" and "who" of modern data platforms. We established why this new architecture is so powerful. Now, we must look under the hood and understand how it actually works.

What gives a Lakehouse built on Apache Iceberg its power? The answer lies in the elegant and robust design of its query lifecycle. Understanding this process—what happens every time you read or write data—is the key to unlocking its full potential for both performance and reliability.

This chapter is a journey into the engine room. We will follow a single piece of data as it's written, updated, and read, revealing how Iceberg's layers of metadata work in concert to deliver ACID transactions, high performance, and time travel.


The Three Layers of an Iceberg Table

Before we trace a query, it's crucial to understand Iceberg's three-tiered structure.

  1. The Catalog Layer: The entry point. Its primary job is to keep a pointer to the current metadata file for each table.

  2. The Metadata Layer: The "brain" of the table. It consists of a tree of files:

    • Metadata Files (.json): Contain the table's schema, partition spec, and snapshot history.

    • Manifest Lists (.avro): A list of all manifest files that make up a snapshot.

    • Manifest Files (.avro): A list of individual data files, along with detailed column-level statistics for each file.

  3. The Data Layer: Where your actual data lives, typically as Parquet files in your cloud object storage.
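To make the three layers concrete, here is a minimal sketch that models them as plain Python dictionaries. The table name, paths, and file names are illustrative; in a real table, the metadata lives in JSON and Avro files in object storage, not in memory.

```python
# Catalog layer: one pointer per table to the current metadata file.
catalog = {"db.orders": "s3://bucket/orders/metadata/v2.metadata.json"}

# Metadata layer, top of the tree: schema, partition spec, snapshots.
metadata_file = {
    "schema": ["order_id", "amount", "order_date"],
    "partition_spec": ["order_date"],
    "current_snapshot_id": 2,
    "snapshots": {2: "snap-2-manifest-list.avro"},  # snapshot -> manifest list
}

# Manifest list: every manifest file that makes up snapshot 2.
manifest_list = ["manifest-a.avro"]

# Manifest file: individual data files plus column-level statistics.
manifest = [
    {"path": "data-1.parquet",
     "stats": {"order_id": {"min": 1, "max": 500}}},
]

# Resolving a table always walks this tree top-down, starting at the catalog.
pointer = catalog["db.orders"]
```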


The Write Lifecycle: Building Reliability on Every Commit

The genius of Iceberg is most apparent during a write operation. What seems like a simple command sets off a carefully orchestrated chain of events to ensure data is never corrupted.

Step 1: The CREATE TABLE and INSERT Operations
When you first create a table and insert data, the process is straightforward. Iceberg writes new data files, creates corresponding manifest files and a manifest list, generates a new metadata file (v2.metadata.json), and then atomically commits the change by updating the catalog's pointer from v1.metadata.json to v2.metadata.json. This atomic "compare-and-swap" operation is what guarantees ACID compliance.
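The compare-and-swap can be sketched in a few lines of Python. This is not the real catalog API, just an illustration of the invariant: the pointer only moves if it still points at the metadata file the writer started from, so a concurrent writer who committed first forces the other to retry.

```python
import threading

class Catalog:
    """Toy catalog holding one table pointer (illustrative, not Iceberg's API)."""
    def __init__(self, pointer):
        self._pointer = pointer
        self._lock = threading.Lock()

    def commit(self, expected, new):
        """Atomically swap the pointer from `expected` to `new`."""
        with self._lock:
            if self._pointer != expected:
                return False  # another writer committed first; caller must retry
            self._pointer = new
            return True

catalog = Catalog("v1.metadata.json")
ok = catalog.commit("v1.metadata.json", "v2.metadata.json")     # succeeds
stale = catalog.commit("v1.metadata.json", "v3.metadata.json")  # fails: pointer moved
```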

Step 2: Handling Updates and Deletes — The COW vs. MOR Trade-off
This is where things get interesting. Iceberg offers two distinct strategies for handling UPDATE, DELETE, and MERGE operations, each with a critical performance trade-off.

Copy-on-Write (COW): The Default Strategy (Optimized for Reads)

By default, Iceberg uses a Copy-on-Write strategy. When you update a single row in a data file, COW rewrites the entire data file.

  • The Process: It copies all the unchanged rows from the original file, adds the updated version of the changed row, and writes this combined data into a new data file. A new snapshot is then created that replaces the old file with the new one.

  • Pros: The read path is extremely fast. When you query the table, every data file is clean and represents the latest state. There is no extra work to do at read time.

  • Cons: The write path is slower and more resource-intensive, as it requires rewriting data.
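A hypothetical sketch of the COW process, with made-up rows: the original file is never modified, a full replacement file is produced, and the new snapshot simply swaps one for the other.

```python
def cow_update(data_file, row_id, new_row):
    """Rewrite the whole file: copy unchanged rows, swap in the updated one."""
    return [new_row if row["id"] == row_id else row for row in data_file]

old_file = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
new_file = cow_update(old_file, 2, {"id": 2, "amount": 25})
# old_file is untouched (data files are immutable); the new snapshot
# references new_file instead of old_file.
```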

Merge-on-Read (MOR): The Alternative Strategy (Optimized for Writes)

Merge-on-Read is designed for write-heavy use cases, like streaming or CDC data, where updates and deletes are frequent. Instead of rewriting files, MOR takes a different approach.

  • The Process:

    • To delete a row, MOR writes a small delete file that simply records which row in which data file is now deleted.

    • To update a row, it performs two actions: it writes a delete file to mark the old row as deleted, and it appends the new version of the row in a new data file.

  • Pros: The write path is much faster and cheaper, as it only appends small files.

  • Cons: The read path is slower. The query engine must now read the original data files, read the delete files, and "merge" them on the fly to figure out the current state before returning a result.
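The MOR write path can be sketched like this, assuming positional deletes (a delete file recording file-and-row-position pairs); the file names and rows are illustrative. Note that both operations only append — no existing file is rewritten.

```python
data_files = {"data-1.parquet": [{"id": 1, "amount": 10},
                                 {"id": 2, "amount": 20}]}
delete_files = []  # positional deletes: (data file, row position)

def mor_delete(file, pos):
    """Record that row `pos` of `file` is deleted — a cheap append."""
    delete_files.append((file, pos))

def mor_update(file, pos, new_row):
    """Mark the old row deleted, append the new version to a new data file."""
    mor_delete(file, pos)
    data_files.setdefault("data-2.parquet", []).append(new_row)

mor_update("data-1.parquet", 1, {"id": 2, "amount": 25})
```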

| Feature | Copy-on-Write (COW) | Merge-on-Read (MOR) |
| --- | --- | --- |
| Strategy | Rewrites data files on update/delete. | Appends separate delete files and data files. |
| Write Performance | Slower and more resource-intensive. | Faster and less resource-intensive. |
| Read Performance | Faster. Data is always in a clean state. | Slower. Requires merging data and delete files at read time. |
| Ideal Use Case | Read-heavy analytical tables (e.g., BI dashboards, reporting). | Write-heavy, streaming tables (e.g., CDC pipelines, high-volume events). |


The Read Lifecycle: Performance Through Pruning

All the meticulous work during the write pays off during the read. Iceberg's metadata allows for aggressive pruning to minimize the amount of data read from object storage.

Let's follow a SELECT query:

  1. Check the Catalog: The engine finds the latest metadata.json file.

  2. Partition Pruning (Manifest List Scan): The engine reads the manifest list and uses the query's WHERE clause to prune entire manifest files based on their partition-level statistics.

  3. Data File Pruning (Manifest Scan): For the remaining manifests, the engine reads them and uses their column-level statistics (min/max values) to prune individual data files.
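Step 3 can be sketched with toy statistics for a query like `WHERE order_id = 950`: a file is skipped whenever the predicate value cannot fall inside its recorded min/max range, so it is never fetched from object storage.

```python
# Illustrative manifest entries: each data file with its column stats.
manifests = [
    {"path": "data-1.parquet", "order_id": {"min": 1,   "max": 500}},
    {"path": "data-2.parquet", "order_id": {"min": 501, "max": 1000}},
]

def prune(manifests, column, value):
    """Keep only files whose [min, max] range could contain `value`."""
    return [m["path"] for m in manifests
            if m[column]["min"] <= value <= m[column]["max"]]

files_to_read = prune(manifests, "order_id", 950)  # only data-2.parquet survives
```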

The Extra Step for Merge-on-Read Tables

After pruning and identifying the relevant data files, the query engine performs one additional, crucial step for MOR tables:

  • It finds all associated delete files for the current snapshot.

  • It then merges the data from the remaining data files with the information in the delete files, filtering out the deleted rows before returning the final result to the user. This is the "merge" in Merge-on-Read.
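The merge itself can be sketched as follows, again assuming positional delete files: any row whose (file, position) pair appears in a delete file is filtered out before the result is returned.

```python
# Illustrative state: two data files and one positional delete.
data = {"data-1.parquet": [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}],
        "data-2.parquet": [{"id": 2, "amount": 25}]}
deletes = {("data-1.parquet", 1)}  # row 1 of data-1 has been deleted

def merge_on_read(data, deletes):
    """Stream all data files, skipping positions marked in delete files."""
    result = []
    for path, rows in data.items():
        for pos, row in enumerate(rows):
            if (path, pos) not in deletes:
                result.append(row)
    return result
```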


The "Magic" Feature: Time Travel

Because every change in Iceberg creates a new, immutable snapshot, you can query any previous state of the table with a simple addition to your SELECT statement. This works identically for both COW and MOR tables.

-- Query the table as it was at a specific time
SELECT * FROM orders
TIMESTAMP AS OF '2025-08-10 12:00:00';

Behind the scenes, the engine simply finds the metadata pointer for that older snapshot and begins the read process from there, using the data and delete files that were valid at that point in time.
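Snapshot resolution can be sketched as a simple scan of the snapshot history (the timestamps and names here are illustrative): the engine picks the newest snapshot committed at or before the requested time and reads from there.

```python
from datetime import datetime

# Illustrative snapshot history, oldest first: (commit time, snapshot id).
snapshots = [
    (datetime(2025, 8, 9, 9, 0),    "snap-1"),
    (datetime(2025, 8, 10, 11, 30), "snap-2"),
    (datetime(2025, 8, 10, 14, 0),  "snap-3"),
]

def snapshot_as_of(snapshots, ts):
    """Return the latest snapshot committed at or before `ts`, if any."""
    chosen = None
    for commit_time, snap in snapshots:
        if commit_time <= ts:
            chosen = snap  # keep the newest qualifying snapshot
    return chosen

snap = snapshot_as_of(snapshots, datetime(2025, 8, 10, 12, 0))  # -> snap-2
```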

Final Thoughts

The lifecycle of an Iceberg query reveals the core principle of the Lakehouse: invest in metadata intelligence on the write path to enable extreme performance and reliability on the read path. The complexity of the write process—from the atomic commit to the strategic choice between Copy-on-Write and Merge-on-Read—is what provides the warehouse-grade features we need.

By understanding this engine room, you can now make informed decisions about how your tables should handle writes, tailoring your architecture to be cost-effective and performant for your specific workload.
