ORC File Format – Why It's Efficient & How It Works Internally (With Real Example)


Introduction
The ORC (Optimized Row Columnar) file format is a standard for high-performance big data processing, especially for analytical workloads. As a columnar file format, it is designed to overcome the performance limitations of traditional row-based storage. By organizing data column by column, ORC enables a set of optimizations (column pruning, predicate pushdown, and type-aware compression) that lead to faster queries with less I/O. But how does it achieve this?
In this post, we'll dive deep into ORC's internal architecture, using a detailed example to show how data is divided into stripes, indexed, compressed, and efficiently stored. You'll see how this design allows for a significant boost in performance.
Let’s break it down with a real-world example to show how ORC stores your data internally — and why it’s so fast. 👇
📊 Step 0: Sample Employee Table (12 Rows)
ID | Name | Age | Dept | City |
1 | Alice | 30 | HR | Delhi |
2 | Bob | 35 | IT | Mumbai |
3 | Charlie | 28 | Finance | Chennai |
4 | David | 40 | IT | Bangalore |
5 | Eve | 31 | HR | Delhi |
6 | Frank | 29 | Finance | Pune |
7 | Grace | 34 | IT | Delhi |
8 | Helen | 32 | Finance | Mumbai |
9 | Ian | 33 | HR | Bangalore |
10 | John | 37 | IT | Chennai |
11 | Kevin | 36 | HR | Pune |
12 | Lily | 30 | Finance | Delhi |
🟦 Step 1: Divide Data Into Stripes
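To enable parallel processing, ORC splits data into stripes: large, self-contained groups of consecutive rows. Real writers cut stripes by size (typically tens of megabytes), so a stripe usually holds many thousands of rows.
➡️ For this small example, assume 4 rows per stripe: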
🟦 Stripe 1 → Rows 1–4
🟧 Stripe 2 → Rows 5–8
🟥 Stripe 3 → Rows 9–12
Each stripe acts like an independent block for storage & querying.
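The split above can be sketched in a few lines of Python. This is only a toy model: the function name `split_into_stripes` and the fixed row count are assumptions made for illustration, and a real ORC writer decides stripe boundaries by size rather than by counting rows.

```python
# Toy model of stripe splitting using the 12-row employee table.
# Assumption: a fixed row count per stripe, purely for illustration.
ROWS = [
    (1, "Alice", 30, "HR", "Delhi"),
    (2, "Bob", 35, "IT", "Mumbai"),
    (3, "Charlie", 28, "Finance", "Chennai"),
    (4, "David", 40, "IT", "Bangalore"),
    (5, "Eve", 31, "HR", "Delhi"),
    (6, "Frank", 29, "Finance", "Pune"),
    (7, "Grace", 34, "IT", "Delhi"),
    (8, "Helen", 32, "Finance", "Mumbai"),
    (9, "Ian", 33, "HR", "Bangalore"),
    (10, "John", 37, "IT", "Chennai"),
    (11, "Kevin", 36, "HR", "Pune"),
    (12, "Lily", 30, "Finance", "Delhi"),
]


def split_into_stripes(rows, rows_per_stripe):
    """Return consecutive slices of rows, each holding at most rows_per_stripe rows."""
    return [rows[i:i + rows_per_stripe] for i in range(0, len(rows), rows_per_stripe)]


stripes = split_into_stripes(ROWS, rows_per_stripe=4)
for number, stripe in enumerate(stripes, start=1):
    ids = [row[0] for row in stripe]
    print(f"Stripe {number}: rows {ids[0]}-{ids[-1]}")
```

Because each slice is independent of the others, separate workers could process different stripes at the same time.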
📂 Step 2: Column-wise Storage Inside Each Stripe
Unlike row-based formats, ORC stores each column separately inside a stripe. This allows for better compression and faster reads.
🔹 Stripe 1 (Rows 1–4):
Column | Stored Values |
ID | [1, 2, 3, 4] |
Name | [Alice, Bob, Charlie, David] |
Age | [30, 35, 28, 40] |
Dept | [HR, IT, Finance, IT] |
City | [Delhi, Mumbai, Chennai, Bangalore] |
✅ Query asks only for Age and Dept? → Skip the rest (💥 Column Pruning).
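To make the column-wise layout concrete, here is a minimal Python sketch that pivots Stripe 1 from rows into columns and then reads only the requested ones. The helper names `to_columnar` and `read_columns` are hypothetical, and an in-memory dictionary stands in for the on-disk column streams.

```python
COLUMNS = ["ID", "Name", "Age", "Dept", "City"]
STRIPE_1_ROWS = [
    (1, "Alice", 30, "HR", "Delhi"),
    (2, "Bob", 35, "IT", "Mumbai"),
    (3, "Charlie", 28, "Finance", "Chennai"),
    (4, "David", 40, "IT", "Bangalore"),
]


def to_columnar(rows, column_names):
    """Pivot row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


def read_columns(columnar_stripe, wanted):
    """Return only the requested columns; the other columns are never touched."""
    return {name: columnar_stripe[name] for name in wanted}


stripe_1 = to_columnar(STRIPE_1_ROWS, COLUMNS)
projected = read_columns(stripe_1, ["Age", "Dept"])
print(projected)
# {'Age': [30, 35, 28, 40], 'Dept': ['HR', 'IT', 'Finance', 'IT']}
```

In a row-based layout the same query would have to read every full row and discard three of the five fields.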
🔍 Step 3: Stripe Index (Per Column)
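Each column has its own index, storing:
Min/Max values for each row group
Row group pointers (positions to seek to inside the column's data)
This enables predicate pushdown and fast lookups. In this example a row group is just 2 rows; ORC's default is 10,000 rows.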
📊 Stripe 1 Index Summary:
Column | Min | Max | Sample Index Pointers |
ID | 1 | 4 | [0 → 1, 2 → 3] |
Name | Alice | David | [0 → Alice, 2 → Charlie] |
Age | 28 | 40 | [0 → 30, 2 → 28] |
Dept | Finance | IT | [0 → HR, 2 → Finance] |
City | Bangalore | Mumbai | [0 → Delhi, 2 → Chennai] |
💡 Example:
Query has WHERE Age > 40
→ Stripe 1 is skipped completely (since Max = 40)
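The index lookup can be sketched as follows for the Age column of Stripe 1. This is a simplified model under stated assumptions: row groups of 2 rows (matching the sample pointers above), plain Python dictionaries instead of ORC's real index streams, and hypothetical function names.

```python
AGE_STRIPE_1 = [30, 35, 28, 40]


def build_row_index(values, rows_per_group):
    """Build one entry per row group: its start position plus min and max."""
    index = []
    for start in range(0, len(values), rows_per_group):
        group = values[start:start + rows_per_group]
        index.append({"start": start, "min": min(group), "max": max(group)})
    return index


def groups_matching_greater_than(index, threshold):
    """Return start positions of row groups that may hold a value > threshold."""
    return [entry["start"] for entry in index if entry["max"] > threshold]


age_index = build_row_index(AGE_STRIPE_1, rows_per_group=2)
print(age_index)
# [{'start': 0, 'min': 30, 'max': 35}, {'start': 2, 'min': 28, 'max': 40}]
print(groups_matching_greater_than(age_index, 35))  # [2]
print(groups_matching_greater_than(age_index, 40))  # []
```

For `Age > 35` only the row group starting at position 2 needs to be read. For `Age > 40` no row group qualifies, so the whole stripe is skipped.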
📘 Step 4: Stripe Footer — The Metadata Engine
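Each stripe ends with a footer that describes:
Encodings (how each column is encoded)
Stream locations (where each column's index and data streams sit inside the stripe)
Compressed sizes (the length of each stream)
The column data types themselves are recorded once in the file footer; they are listed in the table below only to make the encodings easier to read.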
🧾 Stripe 1 Footer Details:
Column | Encoding | Type | Size | Index Offset | Data Offset |
ID | Direct Encoding | INT | 20 bytes | 0x0000 | 0x0010 |
Name | Dictionary Encoding | STRING | 80 bytes | 0x0020 | 0x0060 |
Age | Direct Encoding | INT | 20 bytes | 0x0080 | 0x0090 |
Dept | Dictionary Encoding | STRING | 50 bytes | 0x00A0 | 0x00D0 |
City | Dictionary Encoding | STRING | 60 bytes | 0x00F0 | 0x0120 |
📍 These offsets let the engine jump straight to the data, skipping unnecessary reads.
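A rough Python sketch of the idea: write one byte stream per column, remember where each stream starts and how long it is, and later seek straight to a single column. JSON is only a stand-in for ORC's real encodings (the actual stripe footer is a Protocol Buffers message), and `write_stripe` and `read_column` are hypothetical names.

```python
import io
import json

STRIPE_1 = {
    "ID": [1, 2, 3, 4],
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [30, 35, 28, 40],
    "Dept": ["HR", "IT", "Finance", "IT"],
    "City": ["Delhi", "Mumbai", "Chennai", "Bangalore"],
}


def write_stripe(columns):
    """Concatenate one byte stream per column.

    Returns (data, directory), where directory maps a column name to the
    (offset, length) of its stream inside data.
    """
    buffer = io.BytesIO()
    directory = {}
    for name, values in columns.items():
        payload = json.dumps(values).encode("utf-8")
        directory[name] = (buffer.tell(), len(payload))
        buffer.write(payload)
    return buffer.getvalue(), directory


def read_column(data, directory, name):
    """Seek to one column's stream and decode only those bytes."""
    offset, length = directory[name]
    stream = io.BytesIO(data)
    stream.seek(offset)
    return json.loads(stream.read(length).decode("utf-8"))


data, directory = write_stripe(STRIPE_1)
print(directory["Age"])
print(read_column(data, directory, "Age"))  # [30, 35, 28, 40]
```

Reading Age touches only the bytes recorded for that stream; the bytes for ID, Name, Dept, and City are never decoded.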
📘 Step 5: File Footer — The Big Picture
After all stripes, the File Footer appears at the end of the ORC file. It acts like a table of contents for the whole file.
📂 What It Contains:
Item | Description |
📑 Number of Stripes | Total number of stripes in the file |
🧬 Column Types | Data types of all columns |
🧭 Stripe Offsets | Where each stripe starts (byte offset) |
📦 Compression Info | Type of compression used (ZLIB, SNAPPY, etc.) |
🏷 Version | ORC file version info |
🔎 Why it’s important:
It helps readers (like Hive, Spark, Presto) know where to find what — without scanning the full file.
📘 Step 6: Metadata — Column-Level Insights
This block stores column-level statistics for each stripe in the file (file-wide totals live in the file footer).
📊 Example (Stripe 1):
Column | Min | Max | Null Count | Total Count |
Age | 28 | 40 | 0 | 4 |
Dept | Finance | IT | 0 | 4 |
City | Bangalore | Mumbai | 0 | 4 |
⚡ Why it matters:
Used for stripe-level filtering. Queries can skip entire stripes if the column values don’t match!
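As a small illustration, the sketch below computes these statistics for the Age column of all three stripes (4 rows each, taken from the sample table) and then picks the stripes worth reading for a hypothetical filter `Age > 35`. The function names and the dictionary layout are assumptions, not ORC's actual metadata structures.

```python
AGE_BY_STRIPE = {
    1: [30, 35, 28, 40],
    2: [31, 29, 34, 32],
    3: [33, 37, 36, 30],
}


def column_stats(values):
    """Compute min, max, null count, and total count for one column of a stripe."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "nulls": len(values) - len(present),
        "count": len(values),
    }


def stripes_to_read(stats_by_stripe, threshold):
    """Keep only stripes whose max exceeds threshold, i.e. may match Age > threshold."""
    return [n for n, stats in stats_by_stripe.items() if stats["max"] > threshold]


stripe_stats = {n: column_stats(values) for n, values in AGE_BY_STRIPE.items()}
print(stripe_stats[1])  # {'min': 28, 'max': 40, 'nulls': 0, 'count': 4}
print(stripes_to_read(stripe_stats, 35))  # [1, 3]
```

Stripe 2 has a maximum Age of 34, so it is dropped from the read plan before any of its data is touched.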
📘 Step 7: PostScript — The Final Pointer
At the very end of the ORC file, a PostScript block is added.
🧾 What It Contains:
Field | Purpose |
Footer Length | Size of the file footer |
Metadata Length | Size of the metadata block |
Compression Type | ZLIB, SNAPPY, NONE |
Compression Block Size | Size of each compressed chunk |
ORC Magic String | Used to identify it as an ORC file |
🔐 Why PostScript is powerful:
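The very last byte of the file holds the PostScript's length, so a reader seeks to the end, reads the PostScript, and uses the lengths stored in it to jump directly to the footer and metadata without scanning the whole file.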
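The backwards read can be sketched with a toy file layout in Python. Everything about the byte format here is an assumption made for illustration: the real PostScript and footer are Protocol Buffers messages, whereas this sketch packs two lengths and the magic string with `struct` and stores the footer and metadata as JSON.

```python
import json
import struct

MAGIC = b"ORC"
# Toy PostScript layout (assumption): footer length, metadata length, magic.
POSTSCRIPT = struct.Struct("<II3s")


def write_toy_file(stripe_bytes, metadata, footer):
    """Lay out stripes, metadata, footer, PostScript, and a final length byte."""
    metadata_bytes = json.dumps(metadata).encode("utf-8")
    footer_bytes = json.dumps(footer).encode("utf-8")
    postscript = POSTSCRIPT.pack(len(footer_bytes), len(metadata_bytes), MAGIC)
    # The final byte records how long the PostScript is.
    return stripe_bytes + metadata_bytes + footer_bytes + postscript + bytes([len(postscript)])


def read_tail(file_bytes):
    """Walk backwards from the last byte to locate the footer and metadata."""
    postscript_length = file_bytes[-1]
    postscript_start = len(file_bytes) - 1 - postscript_length
    footer_length, metadata_length, magic = POSTSCRIPT.unpack(
        file_bytes[postscript_start:postscript_start + postscript_length]
    )
    if magic != MAGIC:
        raise ValueError("magic string missing: not a toy ORC file")
    footer_start = postscript_start - footer_length
    metadata_start = footer_start - metadata_length
    footer = json.loads(file_bytes[footer_start:postscript_start])
    metadata = json.loads(file_bytes[metadata_start:footer_start])
    return footer, metadata


toy_file = write_toy_file(
    stripe_bytes=b"\x00" * 64,  # placeholder for the stripe data
    metadata={"stripe_stats": []},
    footer={"stripes": 3, "compression": "ZLIB"},
)
footer, metadata = read_tail(toy_file)
print(footer)  # {'stripes': 3, 'compression': 'ZLIB'}
```

Note that `read_tail` never looks at the 64 placeholder bytes at the start of the file; it needs only the tail.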
Key Performance Optimizations
Column Pruning: When you run a query like SELECT Name, Age FROM employees; the query engine reads the metadata and then seeks to and reads only the data for the Name and Age columns in each stripe. It ignores the ID, Dept, and City columns entirely, which drastically reduces the amount of I/O.
Predicate Pushdown: For a query like SELECT Name FROM employees WHERE Age > 35; ORC first checks the stripe-level statistics stored in the file's metadata. If a stripe's maximum Age is 35 or less (Stripe 2 tops out at 34), the entire stripe is skipped without reading any of its data, which can result in a significant speedup.
Advanced Compression: By storing columns separately, ORC can apply the most effective encoding for each data type. For instance, it can use dictionary encoding for the City column (as there are only a few distinct values) and run-length encoding for the ID column, leading to better compression and smaller file sizes.
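The two encodings mentioned for City and ID can be sketched in simplified form. These are not ORC's actual byte-level encodings: the sorted dictionary and the (first value, step, length) run format are assumptions chosen to keep the example short, and the function names are hypothetical.

```python
CITY = ["Delhi", "Mumbai", "Chennai", "Bangalore", "Delhi", "Pune",
        "Delhi", "Mumbai", "Bangalore", "Chennai", "Pune", "Delhi"]
IDS = list(range(1, 13))


def dictionary_encode(values):
    """Store each distinct value once and replace every value with a small code."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def delta_run_encode(values):
    """Collapse runs with a constant step into (first value, step, length) triples."""
    runs = []
    i = 0
    while i < len(values):
        if i + 1 == len(values):
            # A single trailing value forms a run of length 1.
            runs.append((values[i], 0, 1))
            break
        step = values[i + 1] - values[i]
        j = i + 1
        while j + 1 < len(values) and values[j + 1] - values[j] == step:
            j += 1
        runs.append((values[i], step, j - i + 1))
        i = j + 1
    return runs


dictionary, codes = dictionary_encode(CITY)
print(dictionary)  # ['Bangalore', 'Chennai', 'Delhi', 'Mumbai', 'Pune']
print(codes)       # [2, 3, 1, 0, 2, 4, 2, 3, 0, 1, 4, 2]
print(delta_run_encode(IDS))  # [(1, 1, 12)]
```

Twelve city strings shrink to five dictionary entries plus twelve small integers, and the twelve IDs collapse into a single run.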
🚀 Why ORC is a Game-Changer for Big Data
✅ Column Pruning → Minimal I/O
✅ Predicate Pushdown → Skip entire blocks
✅ Row Indexing → Jump to needed rows
✅ Stripe Footers → Smarter navigation
✅ High Compression → Faster & smaller files
✅ Integrates with Hive, Spark, Trino, Presto
#BigData #DataEngineering #ORC #Spark #Hive #DataStorage #ETL #TechExplained #Performance #StripesNotRows #QueryOptimization
Written by

KOTHA CHANDRA TEJA
Associate software engineer(Data)