Two key components enable the data lakehouse to reach its full potential: the table format and the data catalog. A table format allows collections of files in your data lakehouse to be recognized as database tables, while a catalog facilitates tracking these tables as a discoverable group for data lakehouse tools such as Dremio, Spark, and Flink.

Until recently, the primary open-source technology for cataloging was the Apache Hive metastore, which essentially tracks the directories where tables may exist. Hive can track Apache Iceberg, Delta Lake, and Apache Hudi tables, but it was not designed for the data lakehouse era and lacks the advanced features that modern lakehouses require. Despite the emergence of various proprietary options, only Apache Iceberg has seen significant development of open-source catalogs that can be self-managed. This matters because open formats and open catalogs minimize friction when using different tools, avoid vendor lock-in, and mitigate the risk of being stranded if your preferred catalog service is discontinued or acquired.

The first significant step in the realm of modern open-source catalogs within the Apache Iceberg ecosystem was Nessie. Nessie not only offers a self-managed catalog but also introduces a unique suite of Git-for-data features, allowing you to branch, merge, and tag changes to your catalog tables. This functionality was available long before table-level versioning was incorporated into the Iceberg specification.
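To make the Git-for-data idea concrete, here is a toy in-memory model in Python. It is not Nessie's API or implementation; the `ToyCatalog` class, the table name, and the `s3://` paths are all hypothetical, and the merge is deliberately naive. It only illustrates the idea that branches and tags are named views over a set of table pointers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Toy model (hypothetical): each branch or tag maps table names to metadata pointers."""

    branches: dict = field(default_factory=lambda: {"main": {}})
    tags: dict = field(default_factory=dict)

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's table pointers.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, metadata_pointer):
        # Committing only moves the pointer on the chosen branch.
        self.branches[branch][table] = metadata_pointer

    def merge(self, source, target="main"):
        # Naive merge: the source branch's pointers overwrite the target's.
        # A real catalog would detect conflicting changes instead.
        self.branches[target].update(self.branches[source])

    def tag(self, name, branch="main"):
        # A tag keeps a copy of the branch's pointers as they are right now.
        self.tags[name] = dict(self.branches[branch])


catalog = ToyCatalog()
catalog.commit("main", "sales.orders", "s3://bucket/orders/v1.metadata.json")
catalog.tag("before-etl")
catalog.create_branch("etl")
catalog.commit("etl", "sales.orders", "s3://bucket/orders/v2.metadata.json")

# The change on "etl" is invisible on "main" until the merge.
print(catalog.branches["main"]["sales.orders"])  # still the v1 pointer
catalog.merge("etl")
print(catalog.branches["main"]["sales.orders"])  # now the v2 pointer
print(catalog.tags["before-etl"]["sales.orders"])  # the tag still sees v1
```

The point of the sketch is that work on the `etl` branch stays isolated from readers of `main`, and the tag gives you a stable name to return to.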

The next advancement was the REST catalog specification within the Apache Iceberg project. This specification defines a standard set of endpoints that any catalog can implement, so that catalogs can share the same client libraries across different languages. As a result, proprietary in-house catalogs, proprietary catalog services, and open-source catalogs can all connect seamlessly with tools that support Apache Iceberg. Recently, Nessie announced that it will soon adopt the REST specification in its open-source offering.
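As a rough illustration of what "a standard set of endpoints" means, the sketch below builds the URLs a client would call for a few common read operations. The paths reflect my reading of the Iceberg REST OpenAPI definition (including joining multi-level namespaces with the `0x1F` unit separator), so check them against the published spec before relying on them. The base URI, namespace, table name, and the `rest_catalog_urls` helper are hypothetical, and no network request is made.

```python
from urllib.parse import quote

# Assumption: multi-level namespaces are joined with the unit separator (0x1F).
NAMESPACE_SEPARATOR = "\x1f"


def namespace_path(namespace):
    """Encode a namespace tuple such as ("sales", "emea") as one URL path segment."""
    return quote(NAMESPACE_SEPARATOR.join(namespace), safe="")


def rest_catalog_urls(base_uri, namespace, table, prefix=""):
    """Build URLs for a few read-only operations (hypothetical helper)."""
    root = f"{base_uri.rstrip('/')}/v1"
    # Some catalogs return an optional prefix from the config endpoint.
    scoped = f"{root}/{prefix}" if prefix else root
    ns = namespace_path(namespace)
    return {
        "config": f"{root}/config",
        "list_namespaces": f"{scoped}/namespaces",
        "list_tables": f"{scoped}/namespaces/{ns}/tables",
        "load_table": f"{scoped}/namespaces/{ns}/tables/{quote(table, safe='')}",
    }


# Hypothetical catalog address; any catalog implementing the spec would do.
urls = rest_catalog_urls("http://localhost:8181", ("sales", "emea"), "orders")
for name, url in urls.items():
    print(f"{name}: GET {url}")
```

Because every compliant catalog answers at the same paths, a client only needs a different base URI (and credentials) to switch between catalogs.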

Since then, two more open-source options have emerged in the catalog space: Gravitino and Polaris. Gravitino aims to serve as a metastore for both tables and AI assets, while Polaris, an open-source offering from Snowflake, provides a suite of portable security features. These features ensure that access controls on Apache Iceberg tables are honored by any engine that supports the Polaris catalog, such as Snowflake and Dremio.

The significance of both Nessie and Polaris adopting the REST specification lies in their collaboration with the Iceberg community to advance the specification, making it more extensible to support cutting-edge features. Additionally, AWS has proposed a scan planning addition to the REST spec, which would offload table scan planning from client engines to the catalog. This would allow different catalogs to incorporate planning optimizations unique to their feature sets and even to enable planning for alternative formats like Hudi and Delta. The power of open source points to a promising future.

The most important takeaway is that with open catalog connector specifications and open-source catalog implementations like Nessie, Polaris, Gravitino, and the legacy Apache Hive metastore, there is every reason to build a data lakehouse without vendor lock-in. By embracing the open ecosystem enabled by the Apache Iceberg community, you can achieve greater flexibility and avoid dependency on proprietary solutions.

If you want hands-on exercises for learning to work with Apache Iceberg, here is a good starting place:

Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop

Open Source Table Format + Open Source Catalog = No Vendor Lock-in (Nessie, Polaris, Gravitino)


Alex Merced
