Introduction

The ORC (Optimized Row Columnar) file format is a standard for high-performance big data processing, especially for analytical workloads. As a columnar file format, it is designed to overcome the performance limitations of traditional row-based storage. By organizing data column by column, ORC enables optimizations that lead to faster queries with less I/O. But how does it achieve this?

In this post, we'll dive into ORC's internal architecture, using a small employee table to show how data is divided into stripes, stored column by column, indexed, and compressed, and how that layout lets a query engine skip data it never needs.

Let's break it down step by step and see why it's so fast. 👇

📊 Step 0: Sample Employee Table (12 Rows)

ID	Name	Age	Dept	City
1	Alice	30	HR	Delhi
2	Bob	35	IT	Mumbai
3	Charlie	28	Finance	Chennai
4	David	40	IT	Bangalore
5	Eve	31	HR	Delhi
6	Frank	29	Finance	Pune
7	Grace	34	IT	Delhi
8	Helen	32	Finance	Mumbai
9	Ian	33	HR	Bangalore
10	John	37	IT	Chennai
11	Kevin	36	HR	Pune
12	Lily	30	Finance	Delhi

🟦 Step 1: Divide Data Into Stripes

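ORC splits data into stripes, each holding a large batch of consecutive rows (typically tens of megabytes in real files). Because stripes are self-contained, different workers can read them in parallel.

➡️ For our tiny table, assume 4 rows per stripe: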

🟦 Stripe 1 → Rows 1–4
🟧 Stripe 2 → Rows 5–8
🟥 Stripe 3 → Rows 9–12

Each stripe acts like an independent block for storage & querying.
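
To make the stripe split concrete, here is a minimal Python sketch. It is only an illustration: real ORC writers cut stripes by size in bytes rather than by a fixed row count, and the helper name split_into_stripes is made up for this post.

```python
from itertools import islice

# Sample rows from the employee table: (ID, Name, Age, Dept, City).
ROWS = [
    (1, "Alice", 30, "HR", "Delhi"),
    (2, "Bob", 35, "IT", "Mumbai"),
    (3, "Charlie", 28, "Finance", "Chennai"),
    (4, "David", 40, "IT", "Bangalore"),
    (5, "Eve", 31, "HR", "Delhi"),
    (6, "Frank", 29, "Finance", "Pune"),
    (7, "Grace", 34, "IT", "Delhi"),
    (8, "Helen", 32, "Finance", "Mumbai"),
    (9, "Ian", 33, "HR", "Bangalore"),
    (10, "John", 37, "IT", "Chennai"),
    (11, "Kevin", 36, "HR", "Pune"),
    (12, "Lily", 30, "Finance", "Delhi"),
]


def split_into_stripes(rows, rows_per_stripe):
    """Yield consecutive chunks of at most rows_per_stripe rows.

    Hypothetical helper: a fixed row count stands in for ORC's
    size-based stripe boundaries.
    """
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, rows_per_stripe))
        if not chunk:
            return
        yield chunk


stripes = list(split_into_stripes(ROWS, 4))
for number, stripe in enumerate(stripes, start=1):
    print(f"Stripe {number}: rows {stripe[0][0]}-{stripe[-1][0]}")
```

Each chunk can then be handed to a different worker, which is the point of making stripes independent.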

📂 Step 2: Column-wise Storage Inside Each Stripe

Unlike row-based formats, ORC stores each column separately inside a stripe. This allows for better compression and faster reads.

🔹 Stripe 1 (Rows 1–4):

Column	Stored Values
ID	[1, 2, 3, 4]
Name	[Alice, Bob, Charlie, David]
Age	[30, 35, 28, 40]
Dept	[HR, IT, Finance, IT]
City	[Delhi, Mumbai, Chennai, Bangalore]

✅ Query asks only for Age and Dept?
→ Skip the rest (💥 Column Pruning).
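
A rough Python sketch of the same idea: pivot Stripe 1 from rows into columns, then read only the columns a query asks for. The function names are assumptions for illustration, and plain Python lists stand in for ORC's encoded column streams.

```python
COLUMNS = ("ID", "Name", "Age", "Dept", "City")

# Stripe 1 of the sample table, as row tuples.
STRIPE_1 = [
    (1, "Alice", 30, "HR", "Delhi"),
    (2, "Bob", 35, "IT", "Mumbai"),
    (3, "Charlie", 28, "Finance", "Chennai"),
    (4, "David", 40, "IT", "Bangalore"),
]


def to_columns(rows, names):
    """Pivot row tuples into a dict of column name -> list of values."""
    return {name: list(values) for name, values in zip(names, zip(*rows))}


def read_columns(stripe_columns, wanted):
    """Return only the requested columns; the others are never touched."""
    return {name: stripe_columns[name] for name in wanted}


stripe_columns = to_columns(STRIPE_1, COLUMNS)
pruned = read_columns(stripe_columns, ["Age", "Dept"])
print(pruned)  # {'Age': [30, 35, 28, 40], 'Dept': ['HR', 'IT', 'Finance', 'IT']}
```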

🔍 Step 3: Stripe Index (Per Column)

Each column in a stripe has its own index. For every row group (10,000 rows by default), it stores:

Min/Max values
Positions for seeking to the start of that row group

This enables predicate pushdown and fast lookups.

📊 Stripe 1 Index Summary:

Column	Min	Max	Sample Index Pointers
ID	1	4	[0 → 1, 2 → 3]
Name	Alice	David	[0 → Alice, 2 → Charlie]
Age	28	40	[0 → 30, 2 → 28]
Dept	Finance	IT	[0 → HR, 2 → Finance]
City	Bangalore	Mumbai	[0 → Delhi, 2 → Chennai]

💡 Example:
Query has WHERE Age > 40
→ Stripe 1 is skipped completely (since Max = 40)
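
Here is a small sketch of how such an index can be used. It assumes a row group of 2 rows so that Stripe 1 has two index entries (real files default to far larger groups), and both function names are hypothetical.

```python
# Age values stored in Stripe 1.
AGE_STRIPE_1 = [30, 35, 28, 40]


def build_row_index(values, stride):
    """Return one (start_row, min, max) entry per group of `stride` rows."""
    index = []
    for start in range(0, len(values), stride):
        group = values[start:start + stride]
        index.append((start, min(group), max(group)))
    return index


def groups_for_greater_than(index, threshold):
    """Start rows of the groups that may hold values > threshold.

    A group is kept only if its max exceeds the threshold; all other
    groups are skipped without reading their data.
    """
    return [start for start, _low, high in index if high > threshold]


age_index = build_row_index(AGE_STRIPE_1, stride=2)
print(age_index)                               # [(0, 30, 35), (2, 28, 40)]
print(groups_for_greater_than(age_index, 35))  # [2]
print(groups_for_greater_than(age_index, 40))  # []
```

The last call mirrors the WHERE Age > 40 example: no group can match, so nothing in Stripe 1 is read.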

📑 Step 4: Stripe Footer

Each stripe ends with a footer that describes:

Encodings
Data types
Offsets to index and data
Compressed sizes

🧾 Stripe 1 Footer Details:

Column	Encoding	Type	Size	Index Offset	Data Offset
ID	Direct Encoding	INT	20 bytes	0x0000	0x0010
Name	Dictionary Encoding	STRING	80 bytes	0x0020	0x0060
Age	Direct Encoding	INT	20 bytes	0x0080	0x0090
Dept	Dictionary Encoding	STRING	50 bytes	0x00A0	0x00D0
City	Dictionary Encoding	STRING	60 bytes	0x00F0	0x0120

📍 These offsets let the engine jump straight to the data, skipping unnecessary reads.
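
The offset trick can be sketched in a few lines of Python. This is a toy layout, not ORC's actual on-disk encoding: JSON stands in for the real column encodings, and write_stripe / read_column are invented names.

```python
import json

# Three columns of Stripe 1 from the sample table.
STRIPE_1 = {
    "ID": [1, 2, 3, 4],
    "Age": [30, 35, 28, 40],
    "Dept": ["HR", "IT", "Finance", "IT"],
}


def write_stripe(columns):
    """Concatenate per-column byte streams.

    Returns (data, directory), where directory maps each column name to
    its (offset, length) inside data, playing the role of the footer.
    """
    data = bytearray()
    directory = {}
    for name, values in columns.items():
        stream = json.dumps(values).encode("utf-8")  # stand-in encoding
        directory[name] = (len(data), len(stream))
        data += stream
    return bytes(data), directory


def read_column(data, directory, name):
    """Slice exactly one column's bytes out of the stripe and decode them."""
    offset, length = directory[name]
    return json.loads(data[offset:offset + length].decode("utf-8"))


data, directory = write_stripe(STRIPE_1)
print(read_column(data, directory, "Age"))  # [30, 35, 28, 40]
```

Only the bytes between offset and offset + length are decoded, so the ID and Dept streams are never parsed.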

📘 Step 5: File Footer — The Big Picture

After all stripes, the File Footer appears at the end of the ORC file. It acts like a table of contents for the whole file.

📂 What It Contains:

Item	Description
📑 Number of Stripes	Total number of stripes in the file
🧬 Column Types	Data types of all columns
🧭 Stripe Offsets	Where each stripe starts (byte offset)
🔢 Row Count	Total number of rows in the file
📊 Column Statistics	File-wide statistics (count, min, max) for each column
🏷 Version	ORC file version info

🔎 Why it’s important:
It helps readers (like Hive, Spark, Presto) know where to find what — without scanning the full file.

📘 Step 6: Metadata — Column-Level Insights

This block stores column-level statistics for every stripe in the file.

📊 Example (Stripe 1):

Column	Min	Max	Null Count	Total Count
Age	28	40	0	4
Dept	Finance	IT	0	4
City	Bangalore	Mumbai	0	4

⚡ Why it matters:
Used for stripe-level filtering. Queries can skip entire stripes whose value ranges can't match the filter!
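
As a quick illustration, the sketch below computes min/max/count for the Age column in each of the three sample stripes and uses them to decide which stripes a filter has to read. The dictionary layout and function names are assumptions, not ORC's real structures.

```python
# Age values per stripe, taken from the 12-row sample table.
AGE_BY_STRIPE = {
    1: [30, 35, 28, 40],
    2: [31, 29, 34, 32],
    3: [33, 37, 36, 30],
}


def stripe_stats(values):
    """Summarize one column of one stripe."""
    return {"min": min(values), "max": max(values), "count": len(values)}


STATS = {stripe: stripe_stats(ages) for stripe, ages in AGE_BY_STRIPE.items()}


def stripes_to_read(stats, threshold):
    """Stripes that may contain rows with Age > threshold."""
    return [stripe for stripe, s in stats.items() if s["max"] > threshold]


print(STATS[2])                    # {'min': 29, 'max': 34, 'count': 4}
print(stripes_to_read(STATS, 35))  # [1, 3]
```

For Age > 35, Stripe 2 is ruled out by its statistics alone, before any of its column data is touched.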

📘 Step 7: PostScript — The Final Pointer

At the very end of the ORC file, a PostScript block is added.

🧾 What It Contains:

Field	Purpose
Footer Length	Size of the file footer
Metadata Length	Size of the metadata block
Compression Type	ZLIB, SNAPPY, NONE
Compression Block Size	Size of each compressed chunk
ORC Magic String	Used to identify it as an ORC file

🔐 Why PostScript is powerful:
It lets the reader jump directly to the footer and metadata without scanning the whole file. The last byte of the file holds the PostScript's length, so the reader starts at the tail and works backwards!
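
To show the backwards walk, here is a toy Python sketch. It is an assumption-heavy simplification: real ORC serializes the footer and PostScript with Protocol Buffers and also has a metadata block, while this sketch uses JSON and a fixed-width integer and leaves the metadata out. Only the "final byte holds the PostScript length" convention and the "ORC" magic are taken from the format itself.

```python
import io
import json
import struct

MAGIC = b"ORC"


def write_toy_file(stripes):
    """Lay out: stripe bytes, footer, postscript, 1-byte postscript length."""
    out = io.BytesIO()
    stripe_info = []
    for stripe in stripes:
        stripe_info.append({"offset": out.tell(), "length": len(stripe)})
        out.write(stripe)
    footer = json.dumps({"stripes": stripe_info}).encode("utf-8")
    out.write(footer)
    # Toy postscript: 4-byte little-endian footer length, then the magic.
    postscript = struct.pack("<I", len(footer)) + MAGIC
    out.write(postscript)
    out.write(bytes([len(postscript)]))
    return out.getvalue()


def read_footer(f):
    """Locate the footer by walking backwards from the end of the file."""
    f.seek(-1, io.SEEK_END)
    ps_len = f.read(1)[0]
    f.seek(-1 - ps_len, io.SEEK_END)
    postscript = f.read(ps_len)
    (footer_len,) = struct.unpack("<I", postscript[:4])
    if postscript[4:] != MAGIC:
        raise ValueError("magic string missing: not a toy ORC file")
    f.seek(-1 - ps_len - footer_len, io.SEEK_END)
    return json.loads(f.read(footer_len).decode("utf-8"))


blob = write_toy_file([b"stripe-one", b"stripe-two!", b"stripe-three"])
footer = read_footer(io.BytesIO(blob))
print(footer["stripes"][1])  # {'offset': 10, 'length': 11}
```

The reader performs three small seeks near the end of the file and never touches the stripe bytes until it knows exactly where they are.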

Key Performance Optimizations

Column Pruning: When you run a query like SELECT Name, Age FROM employees;, the query engine reads the file's metadata and then seeks to and reads only the Name and Age columns in each stripe. It never touches the ID, Dept, and City columns, which drastically reduces I/O.
Predicate Pushdown: For a query like SELECT Name FROM employees WHERE Age > 35;, the reader first checks the stripe-level statistics. Stripe 2 has a maximum Age of 34, so the entire stripe is skipped without reading any of its data.
Advanced Compression: Because each column is stored separately, ORC can pick an encoding that suits its data. For instance, it can use dictionary encoding for the City column (only five distinct values) and run-length encoding for the ID column (a steadily increasing sequence), leading to better compression and smaller files.
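
The two encodings mentioned above can be sketched in plain Python. These are simplified illustrations of the ideas, not ORC's actual byte-level encodings, and both function names are made up for this post.

```python
# City and ID columns for all 12 rows of the sample table.
CITY = ["Delhi", "Mumbai", "Chennai", "Bangalore", "Delhi", "Pune",
        "Delhi", "Mumbai", "Bangalore", "Chennai", "Pune", "Delhi"]
IDS = list(range(1, 13))


def dictionary_encode(values):
    """Return (sorted dictionary, list of indexes into that dictionary)."""
    dictionary = sorted(set(values))
    position = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [position[value] for value in values]


def delta_runs(values):
    """Group integers into (start, delta, count) runs with a constant step."""
    runs = []
    i = 0
    while i < len(values):
        start = values[i]
        if i + 1 == len(values):
            runs.append((start, 0, 1))  # trailing single value
            break
        delta = values[i + 1] - start
        count = 2
        while (i + count < len(values)
               and values[i + count] - values[i + count - 1] == delta):
            count += 1
        runs.append((start, delta, count))
        i += count
    return runs


dictionary, codes = dictionary_encode(CITY)
print(dictionary)        # ['Bangalore', 'Chennai', 'Delhi', 'Mumbai', 'Pune']
print(codes)             # [2, 3, 1, 0, 2, 4, 2, 3, 0, 1, 4, 2]
print(delta_runs(IDS))   # [(1, 1, 12)]
```

Twelve city strings shrink to five dictionary entries plus twelve small integers, and the whole ID column collapses into a single run.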

🚀 Why ORC is a Game-Changer for Big Data

✅ Column Pruning → Minimal I/O
✅ Predicate Pushdown → Skip entire blocks
✅ Row Indexing → Jump to needed rows
✅ Stripe Footers → Smarter navigation
✅ High Compression → Faster & smaller files
✅ Integrates with Hive, Spark, Trino, Presto


ORC File Format – Why It's Efficient & How It Works Internally (With Real Example)