Data Lake Framework ft. Apache Hudi

Raju Mandal
5 min read

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a data management framework that helps efficiently store, update, and query large datasets on storage systems such as Amazon S3, HDFS, or Google Cloud Storage.

Why do we need Apache Hudi?

Traditionally, big data storage (such as Hadoop-based data lakes) is designed for appending new data, but it struggles with:

  1. Updating existing data (e.g., correcting errors in a dataset).

  2. Deleting specific records (e.g., for compliance reasons like GDPR).

  3. Efficient real-time querying (querying recent changes fast).

Hudi solves these issues by adding capabilities like upserts (update + insert), deletes, and incremental processing on top of big data storage.


Key Features of Apache Hudi:

  1. Upserts & Deletes – You can update and delete records instead of just appending new ones.

  2. Incremental Processing – Process only the changed data instead of scanning everything.

  3. Snapshot Isolation – It keeps track of versions of data so queries always get consistent results.

  4. Optimized Querying – Works well with Apache Spark, Presto, Hive, and Trino for fast analytics.

Example Use Case:

Imagine you are storing bank transactions in a data lake. If a customer disputes a transaction, you need to update it.

  • Without Hudi: You might have to rewrite the whole dataset, which is slow.

  • With Hudi: You can upsert just the updated transaction efficiently.
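
A minimal sketch of what that upsert could look like from Spark (the table name, path, and fields below are illustrative, not a real schema):

    # Upsert one corrected transaction into a Hudi table (illustrative names and path).
    from pyspark.sql import SparkSession

    # Assumes the Hudi Spark bundle is on the classpath (e.g. via spark-submit --packages).
    spark = SparkSession.builder.appName("hudi-upsert-example").getOrCreate()

    # The corrected record: "txn_id" identifies the row, "ts" picks the latest version.
    corrected = spark.createDataFrame(
        [("txn-1001", 49.99, "RESOLVED", "2024-01-15 10:30:00")],
        ["txn_id", "amount", "status", "ts"],
    )

    hudi_options = {
        "hoodie.table.name": "bank_transactions",
        "hoodie.datasource.write.recordkey.field": "txn_id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": "upsert",  # update if the key exists, insert otherwise
    }

    # Only the file groups containing txn-1001 are rewritten, not the whole dataset.
    (corrected.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://my-bucket/lake/bank_transactions"))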


How Apache Hudi Stores Data:

Hudi has different storage modes based on your needs:

  1. Copy-on-Write (COW) – Every update creates a new version of the data file (good for analytics).

  2. Merge-on-Read (MOR) – Stores changes separately and merges them later for fast updates (good for real-time queries).


1. Apache Hudi Architecture

Hudi provides three key layers:

  1. Storage Layer – Uses Parquet (for columnar storage) and Avro (for record-based updates).

  2. Indexing Layer – Uses Bloom Filters, Hash Index, or HBase for efficient lookups and updates.

  3. Query Layer – Works with Apache Spark, Presto, Trino, Hive, and Flink for analytics.

Hudi works with batch and streaming data using Apache Spark, Flink, or Hive as the processing engine.
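
As a rough illustration of the query layer, reading a Hudi table back through Spark is a one-liner (continuing the illustrative bank-transactions example from earlier):

    # Snapshot read of a Hudi table through Spark (illustrative path).
    df = spark.read.format("hudi").load("s3://my-bucket/lake/bank_transactions")
    df.createOrReplaceTempView("hudi_table")  # one way to run the SQL queries shown later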


2. Storage Formats in Apache Hudi

Hudi supports two main storage modes:

a) Copy-on-Write (COW)

  • Each update creates a new version of the Parquet file (no separate delta logs).

  • Good for batch processing and analytical queries (since files are optimized for reading).

  • Downside: Slower writes because files must be rewritten.

b) Merge-on-Read (MOR)

  • Updates are stored as Avro delta logs and periodically merged into Parquet files.

  • Good for real-time analytics where fast writes and low-latency queries are required.

  • Downside: Read queries need to merge data from logs, which can add overhead.

| Feature | Copy-on-Write (COW) | Merge-on-Read (MOR) |
| --- | --- | --- |
| Write Speed | Slower (due to rewriting files) | Faster (writes to delta logs) |
| Read Speed | Faster (optimized Parquet files) | Slower (log merging required) |
| Best for | Batch analytics | Streaming & real-time updates |
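
The table type is chosen when the table is first written, via a write option. A minimal sketch, reusing the illustrative options from the earlier upsert example:

    # The table type is fixed at table creation time (illustrative snippet).
    hudi_options = {
        "hoodie.table.name": "bank_transactions",
        "hoodie.datasource.write.recordkey.field": "txn_id",
        "hoodie.datasource.write.precombine.field": "ts",
        # COPY_ON_WRITE favours read performance; MERGE_ON_READ favours write latency.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    }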

3. Indexing in Hudi (How Upserts Work Efficiently)

Since traditional file formats like Parquet have no native support for updating or deleting individual records, Hudi uses indexing to efficiently locate the records that need to change.

Hudi supports different types of indexes:

  • Bloom Filter Index (Default) – Efficiently checks if a record exists in a file.

  • Bucket Index – Hash-based partitioning to speed up lookups.

  • HBase Index – Uses HBase as an external index store.

  • Simple Index – Full table scan (less efficient for large tables).
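
The index is likewise a write-side configuration. A minimal sketch, assuming the same illustrative table as above:

    # "BLOOM" is the default; "SIMPLE", "BUCKET", and "HBASE" are among the alternatives.
    hudi_options = {
        "hoodie.table.name": "bank_transactions",
        "hoodie.datasource.write.recordkey.field": "txn_id",
        "hoodie.index.type": "BLOOM",
    }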

Example Workflow of an Upsert in Hudi:

  1. A new batch of data arrives.

  2. Hudi checks the index to see if records already exist.

  3. If they exist → The records are updated in the Parquet file (COW) or written to delta logs (MOR).

  4. If they don’t exist → The records are inserted as new data.
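
Deletes follow the same index lookup: Hudi locates the file groups holding the incoming keys and removes those records. A minimal sketch, continuing the illustrative bank-transactions example:

    # Delete by record key; only the key (plus precombine/partition fields, depending on
    # the table's configuration) needs to be present in the DataFrame.
    to_delete = spark.createDataFrame(
        [("txn-1001", "2024-01-15 10:30:00")],
        ["txn_id", "ts"],
    )

    (to_delete.write.format("hudi")
        .option("hoodie.table.name", "bank_transactions")
        .option("hoodie.datasource.write.recordkey.field", "txn_id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.operation", "delete")  # removes matching records
        .mode("append")
        .save("s3://my-bucket/lake/bank_transactions"))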


4. Querying Apache Hudi Datasets

Hudi provides different query types (often exposed as separate tables) for reading data efficiently:

a) Snapshot Table (Default)

  • Provides the latest view of the data (including updates and deletes).

  • Used for real-time analytics.

    Example Query:

      SELECT * FROM hudi_table;
    

b) Read-Optimized Table

  • Provides only base columnar Parquet files (no log merges).

  • Faster read performance but lacks real-time updates.

    Example query:

      SELECT * FROM hudi_table_ro;
    

c) Incremental Query Table

  • Fetches only new or changed records since a given time.

  • Used for CDC (Change Data Capture) processing.
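
    Example in Spark (a minimal sketch; the path and begin instant are illustrative placeholders, and the SparkSession from the earlier sketch is assumed):

      # Incremental read: fetch only records committed after a given instant.
      incremental = (spark.read.format("hudi")
          .option("hoodie.datasource.query.type", "incremental")
          .option("hoodie.datasource.read.begin.instanttime", "20240115000000")
          .load("s3://my-bucket/lake/bank_transactions"))

      incremental.show()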


5. Apache Hudi vs Delta Lake vs Iceberg

Hudi competes with Delta Lake (from Databricks) and Apache Iceberg, each with its own strengths:

| Feature | Apache Hudi | Delta Lake | Apache Iceberg |
| --- | --- | --- | --- |
| Best for | Streaming + incremental processing | ACID transactions | Large-scale OLAP queries |
| Write Performance | Fast (MOR) / Medium (COW) | Medium | Fast |
| Read Performance | Medium | Fast | Very fast |
| Query Engine Support | Spark, Presto, Flink, Trino | Spark | Spark, Trino, Flink |
| Indexing | Yes | No | No |
| Change Data Capture (CDC) | Yes | No | No |

6. When to Use Apache Hudi?

✅ Use Apache Hudi if:

  • You need frequent updates and deletes in your data lake.

  • You want incremental data processing (e.g., CDC pipelines).

  • You require real-time or near real-time analytics on changing data.

🚫 Avoid Hudi if:

  • Your dataset is append-only with no updates (use standard Parquet).

  • You need low-latency queries on very large datasets (use Iceberg).

  • You use Databricks (Delta Lake is better integrated).


Written by

Raju Mandal

A digital entrepreneur, actively working as a data platform consultant. A seasoned data engineer/architect with experience in the fintech and telecom industries, a passion for data monetization, and a penchant for navigating the intricate realms of multi-cloud data solutions.