Data Lake Framework ft. Apache Hudi

Raju Mandal
5 min read

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a data management framework that helps efficiently store, update, and query large datasets on storage systems such as Amazon S3, HDFS, or Google Cloud Storage.

Why do we need Apache Hudi?

Traditionally, big data storage (such as Hadoop-based data lakes) is designed for appending new data, but it struggles with:

  1. Updating existing data (e.g., correcting errors in a dataset).

  2. Deleting specific records (e.g., for compliance reasons like GDPR).

  3. Efficient real-time querying (querying recent changes fast).

Hudi solves these issues by adding capabilities like upserts (update + insert), deletes, and incremental processing on top of big data storage.


Key Features of Apache Hudi:

  1. Upserts & Deletes – You can update and delete records instead of just appending new ones.

  2. Incremental Processing – Process only the changed data instead of scanning everything.

  3. Snapshot Isolation – It keeps track of versions of data so queries always get consistent results.

  4. Optimized Querying – Works well with Apache Spark, Presto, Hive, and Trino for fast analytics.

Example Use Case:

Imagine you are storing bank transactions in a data lake. If a customer disputes a transaction, you need to update it.

  • Without Hudi: You might have to rewrite the whole dataset, which is slow.

  • With Hudi: You can upsert just the updated transaction efficiently.
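
A minimal sketch of what that upsert could look like from Spark (the table name, path, and fields below are illustrative, not a real schema):

    # Upsert one corrected transaction into a Hudi table (illustrative names and path).
    from pyspark.sql import SparkSession

    # Assumes the Hudi Spark bundle is on the classpath (e.g. via spark-submit --packages).
    spark = SparkSession.builder.appName("hudi-upsert-example").getOrCreate()

    # The corrected record: "txn_id" identifies the row, "ts" picks the latest version.
    corrected = spark.createDataFrame(
        [("txn-1001", 49.99, "RESOLVED", "2024-01-15 10:30:00")],
        ["txn_id", "amount", "status", "ts"],
    )

    hudi_options = {
        "hoodie.table.name": "bank_transactions",
        "hoodie.datasource.write.recordkey.field": "txn_id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": "upsert",  # update if the key exists, insert otherwise
    }

    # Only the file groups containing txn-1001 are rewritten, not the whole dataset.
    (corrected.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://my-bucket/lake/bank_transactions"))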


How Apache Hudi Stores Data:

Hudi has different storage modes based on your needs:

  1. Copy-on-Write (COW) – Every update creates a new version of the data file (good for analytics).

  2. Merge-on-Read (MOR) – Stores changes separately and merges them later for fast updates (good for real-time queries).


1. Apache Hudi Architecture

Hudi provides three key layers:

  1. Storage Layer – Uses Parquet (for columnar storage) and Avro (for record-based updates).

  2. Indexing Layer – Uses Bloom Filters, Hash Index, or HBase for efficient lookups and updates.

  3. Query Layer – Works with Apache Spark, Presto, Trino, Hive, and Flink for analytics.

Hudi works with batch and streaming data using Apache Spark, Flink, or Hive as the processing engine.
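
As a rough illustration of the query layer, reading a Hudi table back through Spark is a one-liner (continuing the illustrative bank-transactions example from earlier):

    # Snapshot read of a Hudi table through Spark (illustrative path).
    df = spark.read.format("hudi").load("s3://my-bucket/lake/bank_transactions")
    df.createOrReplaceTempView("hudi_table")  # one way to run the SQL queries shown later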


2. Storage Formats in Apache Hudi

Hudi supports two main storage modes:

a) Copy-on-Write (COW)

  • Each update creates a new version of the Parquet file (no separate delta logs).

  • Good for batch processing and analytical queries (since files are optimized for reading).

  • Downside: Slower writes because files must be rewritten.

b) Merge-on-Read (MOR)

  • Updates are stored as Avro delta logs and periodically merged into Parquet files.

  • Good for real-time analytics where fast writes and low-latency queries are required.

  • Downside: Read queries need to merge data from logs, which can add overhead.

| Feature | Copy-on-Write (COW) | Merge-on-Read (MOR) |
| --- | --- | --- |
| Write Speed | Slower (due to rewriting files) | Faster (writes to delta logs) |
| Read Speed | Faster (optimized Parquet files) | Slower (log merging required) |
| Best for | Batch analytics | Streaming & real-time updates |
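
The table type is chosen when the table is first written, via a write option. A minimal sketch, reusing the illustrative options from the earlier upsert example:

    # The table type is fixed at table creation time (illustrative snippet).
    hudi_options = {
        "hoodie.table.name": "bank_transactions",
        "hoodie.datasource.write.recordkey.field": "txn_id",
        "hoodie.datasource.write.precombine.field": "ts",
        # COPY_ON_WRITE favours read performance; MERGE_ON_READ favours write latency.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    }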

3. Indexing in Hudi (How Upserts Work Efficiently)

Since traditional file formats like Parquet have no native support for updating or deleting individual records, Hudi uses indexing to efficiently locate the records that need to change.

Hudi supports different types of indexes:

  • Bloom Filter Index (Default) – Efficiently checks if a record exists in a file.

  • Bucket Index – Hash-based partitioning to speed up lookups.

  • HBase Index – Uses HBase as an external index store.

  • Simple Index – Full table scan (less efficient for large tables).
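
The index is likewise a write-side configuration. A minimal sketch, assuming the same illustrative table as above:

    # "BLOOM" is the default; "SIMPLE", "BUCKET", and "HBASE" are among the alternatives.
    hudi_options = {
        "hoodie.table.name": "bank_transactions",
        "hoodie.datasource.write.recordkey.field": "txn_id",
        "hoodie.index.type": "BLOOM",
    }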

Example Workflow of an Upsert in Hudi:

  1. A new batch of data arrives.

  2. Hudi checks the index to see if records already exist.

  3. If they exist → The records are updated in the Parquet file (COW) or written to delta logs (MOR).

  4. If they don’t exist → The records are inserted as new data.
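
Deletes follow the same index lookup: Hudi locates the file groups holding the incoming keys and removes those records. A minimal sketch, continuing the illustrative bank-transactions example:

    # Delete by record key; only the key (plus precombine/partition fields, depending on
    # the table's configuration) needs to be present in the DataFrame.
    to_delete = spark.createDataFrame(
        [("txn-1001", "2024-01-15 10:30:00")],
        ["txn_id", "ts"],
    )

    (to_delete.write.format("hudi")
        .option("hoodie.table.name", "bank_transactions")
        .option("hoodie.datasource.write.recordkey.field", "txn_id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.operation", "delete")  # removes matching records
        .mode("append")
        .save("s3://my-bucket/lake/bank_transactions"))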


4. Querying Apache Hudi Datasets

Hudi provides different query types (often exposed as separate tables) for reading data efficiently:

a) Snapshot Table (Default)

  • Provides the latest view of the data (including updates and deletes).

  • Used for real-time analytics.

    Example Query:

      SELECT * FROM hudi_table;
    

b) Read-Optimized Table

  • Provides only base columnar Parquet files (no log merges).

  • Faster read performance but lacks real-time updates.

    Example query:

      SELECT * FROM hudi_table_ro;
    

c) Incremental Query Table

  • Fetches only new or changed records since a given time.

  • Used for CDC (Change Data Capture) processing.
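
    Example in Spark (a minimal sketch; the path and begin instant are illustrative placeholders, and the SparkSession from the earlier sketch is assumed):

      # Incremental read: fetch only records committed after a given instant.
      incremental = (spark.read.format("hudi")
          .option("hoodie.datasource.query.type", "incremental")
          .option("hoodie.datasource.read.begin.instanttime", "20240115000000")
          .load("s3://my-bucket/lake/bank_transactions"))

      incremental.show()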


5. Apache Hudi vs Delta Lake vs Iceberg

Hudi competes with Delta Lake (from Databricks) and Apache Iceberg, each with its own strengths:

| Feature | Apache Hudi | Delta Lake | Apache Iceberg |
| --- | --- | --- | --- |
| Best for | Streaming + incremental processing | ACID transactions | Large-scale OLAP queries |
| Write Performance | Fast (MOR) / Medium (COW) | Medium | Fast |
| Read Performance | Medium | Fast | Very fast |
| Query Engine Support | Spark, Presto, Flink, Trino | Spark | Spark, Trino, Flink |
| Indexing | Yes | No | No |
| Change Data Capture (CDC) | Yes | No | No |

6. When to Use Apache Hudi?

✅ Use Apache Hudi if:

  • You need frequent updates and deletes in your data lake.

  • You want incremental data processing (e.g., CDC pipelines).

  • You require real-time or near real-time analytics on changing data.

🚫 Avoid Hudi if:

  • Your dataset is append-only with no updates (use standard Parquet).

  • You need low-latency queries on very large datasets (use Iceberg).

  • You use Databricks (Delta Lake is better integrated).


Written by

Raju Mandal

A digital entrepreneur, actively working as a data platform consultant. A seasoned data engineer/architect with experience in the fintech and telecom industries, a passion for data monetization, and a penchant for navigating the intricate realms of multi-cloud data solutions.