Apache Hudi Meta Files & Metadata Management


Apache Hudi maintains metadata files inside the .hoodie/ directory to track changes, manage transactions, and optimize data lake operations. These meta files ensure ACID transactions, incremental processing, and efficient querying.
Where are Meta Files Stored?
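Hudi stores table-level metadata inside the .hoodie/ directory at the table's base path. Each partition directory holds only its data files plus a small .hoodie_partition_metadata marker:
    /hudi_table/
    ├── .hoodie/
    │   ├── hoodie.properties
    │   ├── 20240330_01.commit
    │   ├── 20240330_02.commit
    │   ├── 20240330_01.savepoint
    │   └── metadata/
    └── partition_1/
        ├── .hoodie_partition_metadata
        ├── 20240330_01.parquet
        └── 20240330_02.parquet
Instant times and data file names are shortened here for readability. Real instant times are full timestamps such as 20240330101500000, and the exact layout of the timeline files under .hoodie/ varies between Hudi releases.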
Hudi uses these meta files to track commits, file versions, indexing, and compactions.
Important Meta Files in Apache Hudi
Here’s a breakdown of key meta files and their roles:
a) hoodie.properties
Stores table-level properties (e.g., table name, type, schema, and file format).
Example contents:
    hoodie.table.name=hudi_table
    hoodie.table.type=MERGE_ON_READ
    hoodie.timeline.layout.version=1
Why It Matters? Defines the configuration for the Hudi table.
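To inspect these settings outside of Hudi, the file can be read as plain key=value text. The following is a minimal sketch, assuming the simple format shown above; parse_properties is a hypothetical helper, not part of any Hudi API, and it ignores Java properties features such as escapes and line continuations.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and '#' or '!' comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' separator
            props[key.strip()] = value.strip()
    return props


# Illustrative contents, mirroring the example above.
SAMPLE = """\
# table properties (illustrative)
hoodie.table.name=hudi_table
hoodie.table.type=MERGE_ON_READ
hoodie.timeline.layout.version=1
"""

props = parse_properties(SAMPLE)
print(props["hoodie.table.type"])
```

In practice you would pass in the text of .hoodie/hoodie.properties read from your storage layer.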
b) Commit Files (<instant>.commit)
Tracks successful write operations (insert, update, delete).
On Merge-on-Read tables, regular writes are recorded as <instant>.deltacommit files instead.
Stored in .hoodie/ as files named after the commit's instant time, e.g.:
    20240330_01.commit
    20240330_02.commit
Contains metadata like:
Files created/updated
Number of records inserted, updated, deleted
Execution time
Example JSON content:
    {
      "commitTime": "20240330_01",
      "totalRecordsUpdated": 200,
      "totalRecordsInserted": 500,
      "totalBytesWritten": 10485760
    }
Why It Matters? Enables incremental processing by tracking changes.
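As a rough illustration of how commit metadata can feed monitoring or incremental jobs, the sketch below totals the counters across several commits. It assumes the simplified JSON shape shown above (real commit metadata is more detailed), and summarize_commits is a hypothetical helper.

```python
import json
from typing import Dict, Iterable


def summarize_commits(commit_docs: Iterable[str]) -> Dict[str, int]:
    """Add up the write counters of several simplified commit documents."""
    totals = {"inserted": 0, "updated": 0, "bytes_written": 0}
    for doc in commit_docs:
        meta = json.loads(doc)
        # Missing counters are treated as zero.
        totals["inserted"] += meta.get("totalRecordsInserted", 0)
        totals["updated"] += meta.get("totalRecordsUpdated", 0)
        totals["bytes_written"] += meta.get("totalBytesWritten", 0)
    return totals


# Two illustrative commits; the second one has made-up numbers.
commits = [
    '{"commitTime": "20240330_01", "totalRecordsUpdated": 200, '
    '"totalRecordsInserted": 500, "totalBytesWritten": 10485760}',
    '{"commitTime": "20240330_02", "totalRecordsUpdated": 50, '
    '"totalRecordsInserted": 100, "totalBytesWritten": 2097152}',
]

totals = summarize_commits(commits)
print(totals)
```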
c) Timeline Instant Files (Timeline Metadata)
Tracks the lifecycle of each operation (commits, compactions, cleanups).
Operations are categorized as:
Completed (COMPLETED) → Successful commits.
Requested (REQUESTED) → Scheduled but not executed.
In Progress (INFLIGHT) → Running operations.
Example file list:
    20240330_01.commit
    20240330_02.compaction.requested
    20240330_03.clean.inflight
Completed instants carry no state suffix, while pending ones end in .requested or .inflight.
Why It Matters? Ensures transactional consistency and rollback support.
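The state of an operation can be read straight from its file name. Here is a minimal sketch, assuming names of the form <instant>.<action>[.requested|.inflight] and treating a bare <instant>.inflight as an in-progress commit; parse_instant and the Instant tuple are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Instant(NamedTuple):
    time: str
    action: str
    state: str


def parse_instant(file_name: str) -> Instant:
    """Split '<instant>.<action>[.<state>]' into its three parts."""
    time, _, rest = file_name.partition(".")
    parts = rest.split(".")
    if parts[-1] in ("requested", "inflight"):
        state = parts[-1].upper()
        # Assumption: a bare '<instant>.inflight' is an in-progress commit.
        action = ".".join(parts[:-1]) or "commit"
    else:
        # No state suffix means the operation completed.
        state = "COMPLETED"
        action = rest
    return Instant(time, action, state)


for name in (
    "20240330_01.commit",
    "20240330_02.compaction.requested",
    "20240330_03.clean.inflight",
):
    print(parse_instant(name))
```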
d) Savepoint Files (<instant>.savepoint)
Marks a completed commit as a savepoint so that the cleaner keeps the file versions it references.
Useful for restoring a table to a previous version.
Example content (simplified, shown as JSON for readability):
    {
      "savepointTime": "20240330_01",
      "savepointedBy": "admin_user"
    }
Why It Matters? Protects the file versions behind a known-good commit from cleaning, so the table can be restored to that point.
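e) .hoodie_partition_metadata
A small file written into each partition directory that marks it as a Hudi partition.
Records the instant that created the partition and the partition's depth below the table's base path.
Why It Matters? Lets Hudi recognize partition directories and resolve the table's base path from any partition path.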
f) Clean Files (<instant>.clean)
Logs garbage collection events that remove old or unused files.
Helps manage storage by removing outdated file versions.
Example content (simplified, shown as JSON for readability):
    {
      "cleanTime": "20240330_02",
      "deletedFiles": ["20240329_01.parquet"]
    }
Why It Matters? Prevents storage bloat by cleaning old data.
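The idea behind cleaning can be sketched as "keep the newest N versions of each file group and delete the rest, unless a savepoint protects them". The code below is a simplified model, not Hudi's cleaner: the (file_group_id, instant_time) tuples and the files_to_clean helper are assumptions made for illustration.

```python
from collections import defaultdict
from typing import AbstractSet, Dict, List, Tuple

FileVersion = Tuple[str, str]  # (file_group_id, instant_time)


def files_to_clean(
    versions: List[FileVersion],
    retain: int,
    savepointed: AbstractSet[str] = frozenset(),
) -> List[FileVersion]:
    """Pick versions older than the newest `retain` per file group,
    skipping any version written by a savepointed instant."""
    by_group: Dict[str, List[str]] = defaultdict(list)
    for group, instant in versions:
        by_group[group].append(instant)

    doomed: List[FileVersion] = []
    for group, instants in by_group.items():
        newest_first = sorted(instants, reverse=True)
        for instant in newest_first[retain:]:
            if instant not in savepointed:
                doomed.append((group, instant))
    return sorted(doomed)


versions = [
    ("fg1", "20240328_01"), ("fg1", "20240329_01"), ("fg1", "20240330_01"),
    ("fg2", "20240329_01"), ("fg2", "20240330_01"),
]
print(files_to_clean(versions, retain=1))
print(files_to_clean(versions, retain=1, savepointed={"20240329_01"}))
```

Sorting instant times as strings works here because they share one fixed-width format.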
Metadata Table (.hoodie/metadata/)
Introduced in Hudi 0.7.0 and enabled by default since 0.11.0, the metadata table optimizes file listing and indexing.
It is itself a Hudi table stored under .hoodie/metadata/. Instead of listing millions of files on storage, Hudi reads this table for fast lookups.
Stores information on:
File listings
Bloom filters for indexing
Column statistics
Illustrative query (hudi.metadata and file_id are placeholder names rather than a built-in table or column):
    SELECT * FROM hudi.metadata WHERE file_id = '20240330_01.parquet';
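To see why column statistics help, consider a query with a range filter: any file whose min/max range for that column cannot overlap the filter can be skipped without being opened. The sketch below illustrates that pruning step with made-up statistics; ColumnStats and prune_files are hypothetical names, not Hudi classes.

```python
from typing import Dict, List, NamedTuple


class ColumnStats(NamedTuple):
    min_value: int
    max_value: int


def prune_files(stats: Dict[str, ColumnStats], lo: int, hi: int) -> List[str]:
    """Return files whose [min, max] range overlaps the query range [lo, hi]."""
    return sorted(
        name
        for name, s in stats.items()
        if s.max_value >= lo and s.min_value <= hi
    )


# Made-up per-file statistics for a single numeric column.
stats = {
    "file_a.parquet": ColumnStats(1, 100),
    "file_b.parquet": ColumnStats(101, 200),
    "file_c.parquet": ColumnStats(150, 300),
}

# A filter such as "value BETWEEN 120 AND 160" never needs file_a.
print(prune_files(stats, 120, 160))
```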
How Hudi Uses Metadata for ACID Transactions
Write Operation (Insert/Update/Delete)
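A new batch of data arrives.
Hudi registers a new instant on the timeline as requested, then inflight.
Data is written using Copy-on-Write (CoW) or Merge-on-Read (MoR), along with index and partition metadata updates.
Hudi publishes the completed commit file, which makes the whole write visible atomically.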
Compaction (For MoR Tables)
Merges log files into optimized Parquet files.
The completed compaction is recorded on the timeline as a commit.
Cleaning
Deletes old or unreferenced files.
Records the deletions in a clean instant on the timeline.
Querying
Queries use the metadata table for fast file lookups instead of listing all files on storage.
Readers get a consistent view because they only consider files written by completed instants.
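The consistency rule in the querying step can be sketched as a filter over the timeline: only files written by completed write instants are visible, and an incremental reader asks for the completed instants after its last checkpoint. This is a simplified in-memory model under those assumptions; the tuple layout and function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (instant_time, action, state): a hypothetical in-memory view of the timeline.
Instant = Tuple[str, str, str]

WRITE_ACTIONS = {"commit", "deltacommit"}


def completed_writes(timeline: List[Instant]) -> List[str]:
    """Instant times of completed write operations, oldest first."""
    return sorted(
        time
        for time, action, state in timeline
        if action in WRITE_ACTIONS and state == "COMPLETED"
    )


def visible_files(timeline: List[Instant],
                  files_by_instant: Dict[str, List[str]]) -> List[str]:
    """Files a reader may see: only those written by completed instants."""
    files: List[str] = []
    for time in completed_writes(timeline):
        files.extend(files_by_instant.get(time, []))
    return files


def incremental_instants(timeline: List[Instant], after: str) -> List[str]:
    """Completed write instants strictly newer than the checkpoint `after`."""
    return [time for time in completed_writes(timeline) if time > after]


timeline = [
    ("20240330_01", "commit", "COMPLETED"),
    ("20240330_02", "commit", "COMPLETED"),
    ("20240330_03", "commit", "INFLIGHT"),  # still running, so invisible
]
files_by_instant = {
    "20240330_01": ["a.parquet"],
    "20240330_02": ["b.parquet"],
    "20240330_03": ["c.parquet"],
}

print(visible_files(timeline, files_by_instant))
print(incremental_instants(timeline, after="20240330_01"))
```

Because the in-progress write only becomes visible once its completed instant appears, a failed write leaves no partial data in any reader's view.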
Summary Table: Meta Files & Their Purpose
Meta File | Purpose | Example File |
hoodie.properties | Stores table settings | .hoodie/hoodie.properties |
<instant>.commit | Tracks write transactions | 20240330_01.commit |
Timeline instant files | Manage operation lifecycles | 20240330_02.compaction.requested |
<instant>.savepoint | Marks a commit to preserve for restore | 20240330_01.savepoint |
.hoodie_partition_metadata | Marks a directory as a Hudi partition | partition_1/.hoodie_partition_metadata |
<instant>.clean | Tracks deleted old files | 20240330_02.clean |
.hoodie/metadata/ | Stores optimized file listings and statistics | .hoodie/metadata/ |
Written by

Raju Mandal
A digital entrepreneur working as a data platform consultant. A seasoned data engineer and architect with experience in the fintech and telecom industries, a passion for data monetization, and a penchant for navigating the intricate realms of multi-cloud data solutions.