The Brain of the Lakehouse – A Guide to Apache Iceberg Catalogs

In our last chapter, we went deep into the engine room, tracing the lifecycle of a query through Apache Iceberg’s metadata layers. That tour raises a fundamental question: when a query engine like Spark or Snowflake receives a query such as SELECT * FROM db1.orders, how does it find the table in the first place?
The answer is the Iceberg Catalog.
The catalog is the central nervous system of your Lakehouse. It acts as the librarian, responsible for one critical task: mapping a table name to the precise location of its current metadata file. While this sounds simple, the way a catalog performs this task is the single most important factor in determining the reliability, consistency, and performance of your entire data platform in a production environment.
In this chapter, we will explore the different types of Iceberg catalogs, from simple file-based approaches to powerful managed services, and help you choose the right one for your architecture.
What is an Iceberg Catalog and Why Does it Matter?
At its core, an Iceberg catalog is a service that provides a mapping from a table name to the path of its current metadata.json file. However, for a catalog to be used in production, it must fulfill one non-negotiable requirement: it must support atomic operations.
Atomicity is what guarantees data consistency. Without it, concurrent writes from different jobs could overwrite each other, leading to silent data loss. A production-grade catalog prevents this with an atomic compare-and-swap: a commit replaces the table's metadata pointer only if it still points to the version the writer started from; otherwise the commit fails and the writer must retry on top of the new state. This guarantee is the defining feature of a production-ready catalog.
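To make the compare-and-swap idea concrete, here is a toy, in-memory sketch. It is not any real catalog's implementation; the class name SimpleCatalog and the lock-based approach are invented for illustration. A real catalog (Hive, Glue, JDBC, REST) performs the same conditional check inside a database transaction or a conditional API call.

```python
import threading


class SimpleCatalog:
    """Toy in-memory catalog illustrating an atomic compare-and-swap commit."""

    def __init__(self):
        self._pointers = {}            # table name -> current metadata.json path
        self._lock = threading.Lock()  # stands in for the backend's transaction

    def commit(self, table, expected, new):
        """Swap the metadata pointer only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first; caller must retry
            self._pointers[table] = new
            return True


cat = SimpleCatalog()
# First commit: table does not exist yet, so the expected pointer is None.
cat.commit("db1.orders", None, "s3://bucket/orders/metadata/v1.metadata.json")
# A writer that read v1 commits v2; a stale writer still expecting v1 after
# this would get False back instead of silently clobbering the table.
cat.commit("db1.orders",
           "s3://bucket/orders/metadata/v1.metadata.json",
           "s3://bucket/orders/metadata/v2.metadata.json")
```

The Hadoop catalog fails exactly this test on most object stores: there is no primitive to say "replace this file only if it is unchanged," which is why it is unsafe under concurrent writers.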
A Comparison of Common Iceberg Catalogs
Let's explore the most common catalog implementations, from the simple to the sophisticated.
1. The Starter Catalog: Hadoop
The Hadoop catalog is the simplest way to get started. It requires no external systems; it just uses your existing file system (like Amazon S3 or HDFS) to store table information.
How it Works: It finds the current table state by reading a version-hint.text file in the table's metadata directory, falling back to listing the metadata files there.
Pros: Extremely easy to set up for local development or experimentation.
Cons: Not recommended for production. It lacks atomic guarantees on most object stores (including S3), making it vulnerable to data loss from concurrent writes. It also suffers from slow performance when listing many tables.
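For local experimentation, a Hadoop catalog can be registered in Spark with a few configuration properties. This is a minimal sketch; the catalog name local and the warehouse path are placeholders:

```properties
# Register an Iceberg catalog named "local", backed only by the file system.
spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type=hadoop
# All table metadata and data live under this path; no external service involved.
spark.sql.catalog.local.warehouse=s3://my-bucket/warehouse
```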
2. The Traditional Workhorse: Hive Metastore
For years, the Hive Metastore has been the de facto metadata store for the Hadoop ecosystem. It can be used as a reliable, production-ready Iceberg catalog.
How it Works: It stores the path to the current metadata.json file in the metadata_location property of the table entry within its own backing database (e.g., MySQL, Postgres). It supports atomic commits.
Pros: Battle-tested and compatible with a wide variety of tools. It's cloud-agnostic.
Cons: Requires you to self-host and manage the Metastore service. It does not support multi-table transactions.
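Wiring Spark to a Hive Metastore catalog looks like this sketch; the catalog name hive_cat, the Metastore host, and the warehouse path are placeholders:

```properties
spark.sql.catalog.hive_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_cat.type=hive
# Thrift endpoint of the (self-hosted) Hive Metastore service.
spark.sql.catalog.hive_cat.uri=thrift://metastore-host:9083
spark.sql.catalog.hive_cat.warehouse=s3://my-bucket/warehouse
```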
3. The Cloud-Native Choice: AWS Glue Catalog
For teams heavily invested in the AWS ecosystem, the AWS Glue Data Catalog is a natural fit.
How it Works: It functions similarly to Hive, storing the metadata path in a table property (metadata_location) within the managed Glue service.
Pros: It's a fully managed service, reducing operational overhead. It provides strong atomicity and integrates well with other AWS services.
Cons: It's specific to the AWS ecosystem, which can complicate multi-cloud strategies. Like Hive, it does not support multi-table transactions. It manages the metastore, but you are still responsible for table maintenance like compaction.
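A Glue-backed catalog uses a catalog implementation class instead of a type shorthand. A minimal sketch, assuming the standard Iceberg AWS integrations are on the classpath and credentials come from the environment:

```properties
spark.sql.catalog.glue_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue_cat.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
# Use Iceberg's native S3 file IO for reading and writing data files.
spark.sql.catalog.glue_cat.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.glue_cat.warehouse=s3://my-bucket/warehouse
```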
4. The Database-as-a-Catalog: JDBC
The JDBC catalog is a versatile and popular choice that leverages any standard JDBC-compliant database (like PostgreSQL or MySQL) as the backend.
How it Works: It stores the mapping of table names to metadata file locations inside a table within the relational database. The database's own support for atomic transactions ensures data consistency.
Pros: Easy to start if you already have a database. High availability can be inherited from managed database services (like Amazon RDS). It is cloud-agnostic.
Cons: Does not support multi-table transactions. Requires all query engines to have the correct JDBC driver available, which can add dependency complexity.
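A JDBC catalog is configured the same way, pointing at your existing database. A sketch with placeholder connection details (in practice, pass credentials via a secrets mechanism rather than plain properties):

```properties
spark.sql.catalog.jdbc_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.jdbc_cat.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog
# Any JDBC-compliant database works; the driver must be on the classpath.
spark.sql.catalog.jdbc_cat.uri=jdbc:postgresql://db-host:5432/iceberg_catalog
spark.sql.catalog.jdbc_cat.jdbc.user=iceberg
spark.sql.catalog.jdbc_cat.jdbc.password=secret
spark.sql.catalog.jdbc_cat.warehouse=s3://my-bucket/warehouse
```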
The Next Generation: Catalogs for Modern Data Workflows
The latest generation of catalogs offers powerful features beyond simple table management.
5. The "Git for Data" Catalog: Project Nessie
Project Nessie brings Git-like semantics to your data lake, allowing you to branch, tag, and merge your data as if it were code.
How it Works: Nessie is a transactional catalog that provides a Git-like commit history for your entire Lakehouse.
Key Features:
Multi-table Transactions: You can perform atomic commits across multiple tables in a single transaction.
Data Branching: Create isolated branches (dev, feature-branch) to experiment with data transformations without affecting production.
Pros: Unlocks powerful data engineering and data science workflows, ensures reproducibility, and is cloud-agnostic.
Cons: It's a service you need to host and manage (though hosted versions are available).
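Connecting Spark to a Nessie catalog is a similar sketch; the endpoint, branch name, and warehouse path are placeholders for your deployment:

```properties
spark.sql.catalog.nessie_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie_cat.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie_cat.uri=http://nessie-host:19120/api/v2
# Commits land on this branch; point at a feature branch to isolate experiments.
spark.sql.catalog.nessie_cat.ref=main
spark.sql.catalog.nessie_cat.warehouse=s3://my-bucket/warehouse
```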
6. The Ultimate Flexibility: The REST Catalog
The REST catalog uses a standard HTTP service to manage table metadata. It decouples the catalog's functionality from its underlying storage.
How it Works: A query engine sends RESTful API calls to a service endpoint to perform catalog operations.
Pros: Highly flexible, cloud-agnostic, and supports multi-table transactions.
Cons: Requires running a REST service and a backing data store.
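Because the protocol is standardized, a REST catalog needs little more than an endpoint; this sketch uses a placeholder URI, and real deployments typically add authentication properties as well:

```properties
spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest_cat.type=rest
# Any service implementing the Iceberg REST catalog spec can sit behind this URI.
spark.sql.catalog.rest_cat.uri=https://catalog.example.com/api
```

The same engine-agnostic property shape is what makes REST the interoperability story: swap the backing service without touching your tables.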
The Fully Managed Experience: Platform-as-a-Catalog
This is the future of the Lakehouse, where the platform abstracts away not just the catalog but also the difficult parts of table maintenance.
7. Amazon S3 Tables: The Managed Lakehouse on AWS
This is AWS's answer to simplifying the Iceberg Lakehouse. While it builds on the AWS Glue Catalog as its foundation, Amazon S3 Tables is a higher-level service that fully manages the table lifecycle. If the Glue Catalog is the library's card catalog, S3 Tables is the helpful librarian who not only finds the book for you but also automatically tidies the shelves.
It handles the most complex operational burdens for you:
Automatic Performance Optimization: S3 Tables automatically compacts small files and cleans up old snapshots, removing unreferenced data and metadata files. This is a massive operational win.
Simplified Management: Provides a simple, managed experience for creating and governing Iceberg tables on S3.
8. Snowflake Iceberg Tables: The Unified Data Cloud Approach
This is Snowflake's powerful offering in the managed Lakehouse space. It allows you to use Snowflake itself as a first-class Iceberg Catalog for data stored in your own object storage.
Zero Additional Infrastructure: There are no other services to manage; the catalog is part of the Snowflake platform.
Unified Governance: All of Snowflake's native features—data sharing, dynamic masking policies, and row-access policies—work seamlessly on your Iceberg data.
Peak Performance & Transactions: Leverages Snowflake’s query engine and supports multi-table transactions.
Summary and Choosing Your Catalog
| Catalog | Production Ready? | Managed? | Auto Table Maintenance? | Multi-Table Transactions? | Best For... |
| --- | --- | --- | --- | --- | --- |
| Hadoop | No | No | No | No | Local development only. |
| Hive Metastore | Yes | No (Self-hosted) | No | No | On-premise or multi-cloud setups already using Hive. |
| AWS Glue | Yes | Yes (Metastore only) | No | No | Teams on AWS who want a managed metastore but will handle table optimization themselves. |
| JDBC Catalog | Yes | Varies (BYO DB) | No | No | Teams with an existing relational database (e.g., RDS). |
| Project Nessie | Yes | No (Hosted options) | No | Yes | Complex data engineering workflows needing Git-like branching. |
| REST Catalog | Yes | Varies (BYO) | No | Yes | Custom, flexible platforms requiring interoperability. |
| Amazon S3 Tables | Yes | Yes (Fully Managed) | Yes | No | Teams on AWS seeking the simplest, most automated Lakehouse experience. |
| Snowflake Tables | Yes | Yes (Fully Managed) | Yes | Yes | Teams building a Lakehouse within the Snowflake ecosystem. |
Final Thoughts
The Iceberg Catalog is a strategic choice that reflects your organization's scale and cloud strategy. The emergence of fully managed offerings like Amazon S3 Tables and Snowflake Iceberg Tables signals a clear industry trend: simplifying the operational burden of the Lakehouse is the next frontier. The open nature of the ecosystem and lightweight migration tools provide the flexibility to adapt as your needs change.
Written by Sriram Krishnan