Copy on Write (CoW) vs Merge on Read (MoR)

Raju Mandal

Apache Hudi provides two table types (storage modes) for managing updates and deletes in data lakes:

  1. Copy-on-Write (CoW) – Optimized for read performance, but slower writes.

  2. Merge-on-Read (MoR) – Optimized for write performance, but read queries require merging.

Let’s dive deep into their working mechanisms, advantages, disadvantages, and use cases.


Copy-on-Write (CoW)

Concept:

  • In CoW mode, every time data is updated or deleted, Hudi rewrites the entire Parquet file containing the affected records.

  • The new version of the file replaces the old version.

  • This ensures consistent and optimized read performance because queries access well-structured columnar Parquet files without additional merging.

How It Works (Step-by-Step)

  1. A new batch of data arrives (either new records or updates).

  2. Hudi identifies which records need to be updated or deleted.

  3. It rewrites the Parquet file(s) containing those records with the updated data.

  4. The new file version is saved, while the old file is marked for deletion (Hudi retains past versions if time-travel is enabled).

  5. Queries always fetch the latest Parquet file, ensuring fast reads.
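The rewrite step above can be sketched in plain Python. This is an illustrative model of CoW semantics, not Hudi's actual implementation:

```python
# Illustrative model of a CoW update (not Hudi internals): the whole
# file is rewritten, with updated records replacing old ones by key.
def cow_update(base_file, updates):
    merged = {rec["key"]: rec for rec in base_file}
    for rec in updates:
        merged[rec["key"]] = rec        # updated/new records win
    return list(merged.values())        # an entirely new file version

base = [{"key": 1, "v": "a"}, {"key": 2, "v": "b"}]
new_version = cow_update(base, [{"key": 2, "v": "B"}])
```

Note that even a one-record change rewrites every record in the file, which is exactly why large or frequent updates make CoW ingestion slow.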

Example of CoW File Structure:

```
/hudi_table/
  ├── 20240330_01.parquet  (original file)
  ├── 20240330_02.parquet  (updated version, replaces previous)
  ├── 20240330_03.parquet  (next update)
  ├── .hoodie/
```

Advantages of CoW

  • Faster read queries – Since all data is in optimized Parquet format, analytical queries are efficient.

  • Data consistency – Each file represents a complete, consistent snapshot.

  • Better for batch processing – Ideal for reporting, dashboards, and data warehousing.

Disadvantages of CoW

  • Slow writes – Since Hudi rewrites entire files, large updates can cause high I/O and longer ingestion times.

  • Not ideal for real-time use cases – If you have frequent updates, rewriting files constantly can slow down the pipeline.

Best Use Cases for CoW

  • Batch processing where updates are infrequent.

  • Data warehousing workloads where query speed matters more than write speed.

  • Reporting and dashboards that require fast OLAP queries.

  • Fact tables in a data lake where data is mostly appended with occasional corrections.

Merge-on-Read (MoR)

Concept:

  • In MoR mode, updates and deletes are first written to delta log files instead of rewriting the entire Parquet file immediately.

  • Later, these logs are merged with the base Parquet file when needed (either at query time or during scheduled compaction).

  • This makes writes much faster since only the small log files are updated initially.

How It Works (Step-by-Step)

  1. A new batch of data arrives (either new records or updates).

  2. Hudi checks which records need to be updated or deleted.

  3. Instead of rewriting the entire Parquet file, it writes the changes to a delta log file (Avro format).

  4. At query time:

    • If reading the latest snapshot → The query engine merges logs with the base Parquet file on the fly.

    • If reading an optimized version → A background compaction process merges logs into Parquet periodically.
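The snapshot path above can be sketched the same way (again an illustrative plain-Python model, not Hudi internals): the reader applies each delta log, in commit order, on top of the base file.

```python
# Illustrative model of a MoR snapshot read (not Hudi internals):
# delta logs are merged onto the base file at query time; latest wins.
def mor_snapshot_read(base_file, log_files):
    merged = {rec["key"]: rec for rec in base_file}
    for log in log_files:               # logs applied in commit order
        for rec in log:
            merged[rec["key"]] = rec
    return list(merged.values())

base = [{"key": 1, "v": "a"}, {"key": 2, "v": "b"}]
logs = [[{"key": 2, "v": "B"}], [{"key": 3, "v": "c"}]]
snapshot = mor_snapshot_read(base, logs)
```

This merge work happens on every snapshot query until compaction folds the logs into a new base file, which is where MoR's read overhead comes from.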

Example of MoR File Structure:

```
/hudi_table/
  ├── 20240330_01.parquet  (original base file)
  ├── 20240330_02.log.avro  (log file storing updates)
  ├── 20240330_03.log.avro  (new log file storing more updates)
  ├── .hoodie/
```
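When reading a MoR table with Spark, the two read paths are selected with the `hoodie.datasource.query.type` option. A minimal sketch (the table path and the surrounding SparkSession are assumed):

```python
# Read-path options for a MoR table: "snapshot" merges logs on the fly,
# "read_optimized" scans only the compacted base Parquet files.
snapshot_opts = {"hoodie.datasource.query.type": "snapshot"}
read_optimized_opts = {"hoodie.datasource.query.type": "read_optimized"}

# Usage (requires a SparkSession with the Hudi bundle on the classpath):
# df = spark.read.format("hudi").options(**snapshot_opts) \
#          .load("s3://data-lake/hudi_mor_table")
```

The read-optimized path trades freshness for speed: it skips log merging but only sees data up to the last compaction.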

Advantages of MoR

  • Faster writes – Only small log files are updated instead of rewriting large Parquet files.

  • Efficient for streaming and real-time updates – Suitable for cases where data is constantly changing.

  • Good for mixed workloads – Provides a balance between write and read performance.

Disadvantages of MoR

  • Slower read performance – Queries that need real-time data must merge logs and Parquet files on the fly.

  • Requires periodic compaction – Without compaction, log files keep growing, degrading query performance.
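Compaction can be scheduled inline with the write job; a minimal sketch of the relevant options (the values here are illustrative and should be tuned per workload):

```python
# Inline compaction settings for a MoR table: compact the logs into
# new base files after every 5 delta commits (illustrative values).
compaction_opts = {
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}
# These can be passed alongside the usual write options, e.g.:
# df.write.format("hudi").options(**compaction_opts)...
```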

Best Use Cases for MoR

  • Real-time analytics where updates happen frequently.

  • Streaming ETL pipelines where data is continuously ingested.

  • Change Data Capture (CDC) use cases where only incremental changes need to be processed.

  • Machine Learning pipelines where near real-time updates are needed.


CoW vs MoR: A Side-by-Side Comparison

| Feature | Copy-on-Write (CoW) | Merge-on-Read (MoR) |
| --- | --- | --- |
| Write Performance | Slower (rewrites entire Parquet file) | Faster (writes to log files first) |
| Read Performance | Faster (queries only Parquet files) | Slower (queries merge logs + Parquet) |
| Best for | Batch processing, OLAP, analytics | Streaming, real-time analytics |
| Query Latency | Low (optimized Parquet reads) | High (log merging overhead) |
| Storage Usage | Higher (due to multiple Parquet versions) | Lower (logs are smaller than full rewrites) |
| Complexity | Simple (easy to manage) | Complex (requires compaction) |

Choosing Between CoW and MoR

When to Use Copy-on-Write (CoW)

  • You have batch workloads with occasional updates.

  • You need fast read queries (e.g., dashboards, reports).

  • You don’t mind slower writes in exchange for optimized Parquet files.

  • You want simpler storage management (no need for log merging).

Example Use Case:
A data warehouse storing sales transactions that is updated once per day.


When to Use Merge-on-Read (MoR)

  • You have real-time streaming data that needs frequent updates.

  • You need fast writes and can tolerate slower read queries.

  • You have frequent upserts and deletes and don’t want to rewrite large Parquet files every time.

  • You are working with CDC (Change Data Capture) pipelines.

💡 Example Use Case:
A fraud detection system that needs to update transactions every second.


Example: Writing Data in CoW and MoR using PySpark

Copy-on-Write Example

```python
# Assumes df has an "id" record-key column and a "ts" precombine column.
df.write.format("hudi") \
    .option("hoodie.table.name", "hudi_cow_table") \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "ts") \
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") \
    .mode("append") \
    .save("s3://data-lake/hudi_cow_table")
```

Merge-on-Read Example

```python
# Assumes df has an "id" record-key column and a "ts" precombine column.
df.write.format("hudi") \
    .option("hoodie.table.name", "hudi_mor_table") \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "ts") \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .mode("append") \
    .save("s3://data-lake/hudi_mor_table")
```

Summary

| Aspect | Copy-on-Write (CoW) | Merge-on-Read (MoR) |
| --- | --- | --- |
| Write Speed | Slower (rewrites files) | Faster (appends logs) |
| Read Speed | Faster (optimized Parquet files) | Slower (merging required) |
| Best For | Batch analytics, OLAP | Streaming, real-time updates |
| Complexity | Lower | Higher (requires compaction) |
| Storage Use | Higher (multiple file versions) | Lower (logs are smaller) |

