Copy on Write (CoW) vs Merge on Read (MoR)

Raju Mandal

Apache Hudi provides two table types (storage modes) for managing updates and deletes in data lakes:

  1. Copy-on-Write (CoW) – Optimized for read performance, but slower writes.

  2. Merge-on-Read (MoR) – Optimized for write performance, but read queries require merging.

Let’s dive deep into their working mechanisms, advantages, disadvantages, and use cases.


Copy-on-Write (CoW)

Concept:

  • In CoW mode, every time data is updated or deleted, Hudi rewrites the entire Parquet file containing the affected records.

  • The new version of the file replaces the old version.

  • This ensures consistent and optimized read performance because queries access well-structured columnar Parquet files without additional merging.

How It Works (Step-by-Step)

  1. A new batch of data arrives (either new records or updates).

  2. Hudi identifies which records need to be updated or deleted.

  3. It rewrites the Parquet file(s) containing those records with the updated data.

  4. The new file version is saved, while the old file is marked for deletion (Hudi retains past versions if time-travel is enabled).

  5. Queries always fetch the latest Parquet file, ensuring fast reads.
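The rewrite step above can be sketched in plain Python. This is an illustrative model of CoW semantics, not Hudi's actual implementation:

```python
# Illustrative model of a CoW update (not Hudi internals): the whole
# file is rewritten, with updated records replacing old ones by key.
def cow_update(base_file, updates):
    merged = {rec["key"]: rec for rec in base_file}
    for rec in updates:
        merged[rec["key"]] = rec        # updated/new records win
    return list(merged.values())        # an entirely new file version

base = [{"key": 1, "v": "a"}, {"key": 2, "v": "b"}]
new_version = cow_update(base, [{"key": 2, "v": "B"}])
```

Note that even a one-record change rewrites every record in the file, which is exactly why large or frequent updates make CoW ingestion slow.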

Example of CoW File Structure:

```
/hudi_table/
  ├── 20240330_01.parquet  (original file)
  ├── 20240330_02.parquet  (updated version, replaces previous)
  ├── 20240330_03.parquet  (next update)
  ├── .hoodie/
```

Advantages of CoW

  • Faster read queries – Since all data is in optimized Parquet format, analytical queries are efficient.

  • Data consistency – Each file represents a complete, consistent snapshot.

  • Better for batch processing – Ideal for reporting, dashboards, and data warehousing.

Disadvantages of CoW

  • Slow writes – Since Hudi rewrites entire files, large updates can cause high I/O and longer ingestion times.

  • Not ideal for real-time use cases – If you have frequent updates, rewriting files constantly can slow down the pipeline.

Best Use Cases for CoW

  • Batch processing where updates are infrequent.

  • Data warehousing workloads where query speed matters more than write speed.

  • Reporting and dashboards that require fast OLAP queries.

  • Fact tables in a data lake where data is mostly appended with occasional corrections.

Merge-on-Read (MoR)

Concept:

  • In MoR mode, updates and deletes are first written to delta log files instead of rewriting the entire Parquet file immediately.

  • Later, these logs are merged with the base Parquet file when needed (either at query time or during scheduled compaction).

  • This makes writes much faster since only the small log files are updated initially.

How It Works (Step-by-Step)

  1. A new batch of data arrives (either new records or updates).

  2. Hudi checks which records need to be updated or deleted.

  3. Instead of rewriting the entire Parquet file, it writes the changes to a delta log file (Avro format).

  4. At query time:

    • If reading the latest snapshot → The query engine merges logs with the base Parquet file on the fly.

    • If reading an optimized version → A background compaction process merges logs into Parquet periodically.
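The snapshot path above can be sketched the same way (again an illustrative plain-Python model, not Hudi internals): the reader applies each delta log, in commit order, on top of the base file.

```python
# Illustrative model of a MoR snapshot read (not Hudi internals):
# delta logs are merged onto the base file at query time; latest wins.
def mor_snapshot_read(base_file, log_files):
    merged = {rec["key"]: rec for rec in base_file}
    for log in log_files:               # logs applied in commit order
        for rec in log:
            merged[rec["key"]] = rec
    return list(merged.values())

base = [{"key": 1, "v": "a"}, {"key": 2, "v": "b"}]
logs = [[{"key": 2, "v": "B"}], [{"key": 3, "v": "c"}]]
snapshot = mor_snapshot_read(base, logs)
```

This merge work happens on every snapshot query until compaction folds the logs into a new base file, which is where MoR's read overhead comes from.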

Example of MoR File Structure:

```
/hudi_table/
  ├── 20240330_01.parquet  (original base file)
  ├── 20240330_02.log.avro  (log file storing updates)
  ├── 20240330_03.log.avro  (new log file storing more updates)
  ├── .hoodie/
```
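When reading a MoR table with Spark, the two read paths are selected with the `hoodie.datasource.query.type` option. A minimal sketch (the table path and the surrounding SparkSession are assumed):

```python
# Read-path options for a MoR table: "snapshot" merges logs on the fly,
# "read_optimized" scans only the compacted base Parquet files.
snapshot_opts = {"hoodie.datasource.query.type": "snapshot"}
read_optimized_opts = {"hoodie.datasource.query.type": "read_optimized"}

# Usage (requires a SparkSession with the Hudi bundle on the classpath):
# df = spark.read.format("hudi").options(**snapshot_opts) \
#          .load("s3://data-lake/hudi_mor_table")
```

The read-optimized path trades freshness for speed: it skips log merging but only sees data up to the last compaction.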

Advantages of MoR

  • Faster writes – Only small log files are updated instead of rewriting large Parquet files.

  • Efficient for streaming and real-time updates – Suitable for cases where data is constantly changing.

  • Good for mixed workloads – Provides a balance between write and read performance.

Disadvantages of MoR

  • Slower read performance – Queries that need real-time data must merge logs and Parquet files on the fly.

  • Requires periodic compaction – Without compaction, log files keep growing, degrading query performance.
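Compaction can be scheduled inline with the write job; a minimal sketch of the relevant options (the values here are illustrative and should be tuned per workload):

```python
# Inline compaction settings for a MoR table: compact the logs into
# new base files after every 5 delta commits (illustrative values).
compaction_opts = {
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}
# These can be passed alongside the usual write options, e.g.:
# df.write.format("hudi").options(**compaction_opts)...
```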

Best Use Cases for MoR

  • Real-time analytics where updates happen frequently.

  • Streaming ETL pipelines where data is continuously ingested.

  • Change Data Capture (CDC) use cases where only incremental changes need to be processed.

  • Machine Learning pipelines where near real-time updates are needed.


CoW vs MoR: A Side-by-Side Comparison

| Feature | Copy-on-Write (CoW) | Merge-on-Read (MoR) |
| --- | --- | --- |
| Write Performance | Slower (rewrites entire Parquet file) | Faster (writes to log files first) |
| Read Performance | Faster (queries only Parquet files) | Slower (queries merge logs + Parquet) |
| Best for | Batch processing, OLAP, analytics | Streaming, real-time analytics |
| Query Latency | Low (optimized Parquet reads) | High (log merging overhead) |
| Storage Usage | Higher (due to multiple Parquet versions) | Lower (logs are smaller than full rewrites) |
| Complexity | Simple (easy to manage) | Complex (requires compaction) |

Choosing Between CoW and MoR

When to Use Copy-on-Write (CoW)

  • You have batch workloads with occasional updates.

  • You need fast read queries (e.g., dashboards, reports).

  • You don’t mind slower writes in exchange for optimized Parquet files.

  • You want simpler storage management (no need for log merging).

Example Use Case:
A data warehouse storing sales transactions that is updated once per day.


When to Use Merge-on-Read (MoR)

  • You have real-time streaming data that needs frequent updates.

  • You need fast writes and can tolerate slower read queries.

  • You have frequent upserts and deletes and don’t want to rewrite large Parquet files every time.

  • You are working with CDC (Change Data Capture) pipelines.

💡 Example Use Case:
A fraud detection system that needs to update transactions every second.


Example: Writing Data in CoW and MoR using PySpark

Copy-on-Write Example

```python
# Assumes df has an "id" record-key column and a "ts" precombine column.
df.write.format("hudi") \
    .option("hoodie.table.name", "hudi_cow_table") \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "ts") \
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") \
    .mode("append") \
    .save("s3://data-lake/hudi_cow_table")
```

Merge-on-Read Example

```python
# Assumes df has an "id" record-key column and a "ts" precombine column.
df.write.format("hudi") \
    .option("hoodie.table.name", "hudi_mor_table") \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "ts") \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .mode("append") \
    .save("s3://data-lake/hudi_mor_table")
```

Summary

| Aspect | Copy-on-Write (CoW) | Merge-on-Read (MoR) |
| --- | --- | --- |
| Write Speed | Slower (rewrites files) | Faster (appends logs) |
| Read Speed | Faster (optimized Parquet files) | Slower (merging required) |
| Best For | Batch analytics, OLAP | Streaming, real-time updates |
| Complexity | Lower | Higher (requires compaction) |
| Storage Use | Higher (multiple file versions) | Lower (logs are smaller) |

