ORC File Format – Why It's Efficient & How It Works Internally (With Real Example)

Introduction

The ORC (Optimized Row Columnar) file format is a standard for high-performance big data processing, especially for analytical workloads. As a columnar file format, it is designed to overcome the performance limitations of traditional row-based storage. Organizing data by column enables optimizations such as column pruning, predicate pushdown, and type-specific compression, which lead to faster queries with less I/O. But how does it achieve this?

In this post, we'll dive deep into ORC's internal architecture, using a detailed example to show how data is divided into stripes, indexed, compressed, and efficiently stored. You'll see how this design allows for a significant boost in performance.

Let’s start with a small sample table and follow it through each layer of the file. 👇

📊 Step 0: Sample Employee Table (12 Rows)

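| ID | Name    | Age | Dept    | City      |
|----|---------|-----|---------|-----------|
| 1  | Alice   | 30  | HR      | Delhi     |
| 2  | Bob     | 35  | IT      | Mumbai    |
| 3  | Charlie | 28  | Finance | Chennai   |
| 4  | David   | 40  | IT      | Bangalore |
| 5  | Eve     | 31  | HR      | Delhi     |
| 6  | Frank   | 29  | Finance | Pune      |
| 7  | Grace   | 34  | IT      | Delhi     |
| 8  | Helen   | 32  | Finance | Mumbai    |
| 9  | Ian     | 33  | HR      | Bangalore |
| 10 | John    | 37  | IT      | Chennai   |
| 11 | Kevin   | 36  | HR      | Pune      |
| 12 | Lily    | 30  | Finance | Delhi     |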

🟦 Step 1: Divide Data Into Stripes

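To enable parallel processing, ORC splits data into stripes, each containing a large batch of rows. Real stripes are sized in bytes (64 MB by default in the ORC writer) and usually hold many thousands of rows; the example here uses tiny stripes to keep things readable.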

➡️ For example (4 rows per stripe):

  • 🟦 Stripe 1 → Rows 1–4

  • 🟧 Stripe 2 → Rows 5–8

  • 🟥 Stripe 3 → Rows 9–12

Each stripe acts like an independent block for storage & querying.
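
As a rough illustration, the split can be sketched in a few lines of Python. This is a toy model, not ORC's writer: `split_into_stripes` is a hypothetical helper, and the fixed 4-rows-per-stripe rule only mirrors this post's example (real writers cut stripes by size in bytes).

```python
from typing import List, Tuple

Row = Tuple[int, str, int, str, str]  # (ID, Name, Age, Dept, City)

ROWS: List[Row] = [
    (1, "Alice", 30, "HR", "Delhi"),
    (2, "Bob", 35, "IT", "Mumbai"),
    (3, "Charlie", 28, "Finance", "Chennai"),
    (4, "David", 40, "IT", "Bangalore"),
    (5, "Eve", 31, "HR", "Delhi"),
    (6, "Frank", 29, "Finance", "Pune"),
    (7, "Grace", 34, "IT", "Delhi"),
    (8, "Helen", 32, "Finance", "Mumbai"),
    (9, "Ian", 33, "HR", "Bangalore"),
    (10, "John", 37, "IT", "Chennai"),
    (11, "Kevin", 36, "HR", "Pune"),
    (12, "Lily", 30, "Finance", "Delhi"),
]


def split_into_stripes(rows: List[Row], rows_per_stripe: int) -> List[List[Row]]:
    """Cut the row list into consecutive chunks (hypothetical helper)."""
    return [rows[i:i + rows_per_stripe] for i in range(0, len(rows), rows_per_stripe)]


stripes = split_into_stripes(ROWS, 4)
for number, stripe in enumerate(stripes, start=1):
    # Prints "Stripe 1: IDs 1-4", "Stripe 2: IDs 5-8", "Stripe 3: IDs 9-12"
    print(f"Stripe {number}: IDs {stripe[0][0]}-{stripe[-1][0]}")
```

Because each chunk is self-contained, different workers could process different stripes at the same time.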

📂 Step 2: Column-wise Storage Inside Each Stripe

Unlike row-based formats, ORC stores each column separately inside a stripe. This allows for better compression and faster reads.

🔹 Stripe 1 (Rows 1–4):

| Column | Stored Values                        |
|--------|--------------------------------------|
| ID     | [1, 2, 3, 4]                         |
| Name   | [Alice, Bob, Charlie, David]         |
| Age    | [30, 35, 28, 40]                     |
| Dept   | [HR, IT, Finance, IT]                |
| City   | [Delhi, Mumbai, Chennai, Bangalore]  |

✅ Query asks only for Age and Dept?
→ Skip the rest (💥 Column Pruning).
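
The row-to-column pivot and the pruning step can be sketched as follows. This is a simplified in-memory model using Stripe 1's rows; `to_columnar` and `read_columns` are hypothetical names for illustration, and a real reader would skip bytes on disk rather than dictionary keys.

```python
COLUMNS = ["ID", "Name", "Age", "Dept", "City"]

stripe1_rows = [
    (1, "Alice", 30, "HR", "Delhi"),
    (2, "Bob", 35, "IT", "Mumbai"),
    (3, "Charlie", 28, "Finance", "Chennai"),
    (4, "David", 40, "IT", "Bangalore"),
]


def to_columnar(rows, names):
    """Pivot a list of row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def read_columns(stripe, wanted):
    """Column pruning: touch only the columns the query asks for."""
    return {name: stripe[name] for name in wanted}


stripe1 = to_columnar(stripe1_rows, COLUMNS)
projected = read_columns(stripe1, ["Age", "Dept"])
print(projected)  # {'Age': [30, 35, 28, 40], 'Dept': ['HR', 'IT', 'Finance', 'IT']}
```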

🔍 Step 3: Stripe Index (Per Column)

Each column has its own index, storing:

  • Min/Max values

  • Row group pointers (positions that let a reader seek to each group of rows; 10,000 rows per group by default)

This enables predicate pushdown and fast lookups.

📊 Stripe 1 Index Summary:

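| Column | Min       | Max    | Sample Index Pointers       |
|--------|-----------|--------|-----------------------------|
| ID     | 1         | 4      | [0 → 1, 2 → 3]              |
| Name   | Alice     | David  | [0 → Alice, 2 → Charlie]    |
| Age    | 28        | 40     | [0 → 30, 2 → 28]            |
| Dept   | Finance   | IT     | [0 → HR, 2 → Finance]       |
| City   | Bangalore | Mumbai | [0 → Delhi, 2 → Chennai]    |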

💡 Example:
Query has WHERE Age > 40
→ Stripe 1 is skipped completely (since Max = 40)
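
To make the min/max check concrete, here is a small sketch that computes the statistics for Stripe 1 and tests a `column > threshold` predicate against them. The function names are hypothetical, and real ORC indexes hold more than min/max, so treat this as a model of the idea only.

```python
stripe1 = {
    "ID": [1, 2, 3, 4],
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [30, 35, 28, 40],
    "Dept": ["HR", "IT", "Finance", "IT"],
    "City": ["Delhi", "Mumbai", "Chennai", "Bangalore"],
}


def column_stats(values):
    """Min/max summary for one column (strings compare lexicographically)."""
    return {"min": min(values), "max": max(values)}


def can_skip_greater_than(col_stats, threshold):
    """True when no value in the stripe can satisfy `column > threshold`."""
    return col_stats["max"] <= threshold


stats = {name: column_stats(values) for name, values in stripe1.items()}

print(stats["Age"])                              # {'min': 28, 'max': 40}
print(can_skip_greater_than(stats["Age"], 40))   # True: WHERE Age > 40 skips Stripe 1
print(can_skip_greater_than(stats["Age"], 35))   # False: Stripe 1 must be read
```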

📑 Step 4: Stripe Footer

Each stripe ends with a footer that describes:

  • Encodings

  • Data types

  • Offsets to index and data

  • Compressed sizes

🧾 Stripe 1 Footer Details:

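| Column | Encoding            | Type   | Size     | Index Offset | Data Offset |
|--------|---------------------|--------|----------|--------------|-------------|
| ID     | Direct Encoding     | INT    | 20 bytes | 0x0000       | 0x0010      |
| Name   | Dictionary Encoding | STRING | 80 bytes | 0x0020       | 0x0060      |
| Age    | Direct Encoding     | INT    | 20 bytes | 0x0080       | 0x0090      |
| Dept   | Dictionary Encoding | STRING | 50 bytes | 0x00A0       | 0x00D0      |
| City   | Dictionary Encoding | STRING | 60 bytes | 0x00F0       | 0x0120      |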

📍 These offsets let the engine jump straight to the data, skipping unnecessary reads.
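
The idea of a directory of offsets can be sketched with an in-memory buffer. Everything here is an assumption made for illustration: the byte payloads are placeholders rather than real ORC encodings, and `write_stripe` / `read_stream` are hypothetical helpers that record an (offset, length) pair per column.

```python
import io

# Placeholder payloads standing in for encoded column streams.
streams = [
    ("ID", bytes([1, 2, 3, 4])),
    ("Age", bytes([30, 35, 28, 40])),
    ("Dept", b"HR,IT,Finance,IT"),
]


def write_stripe(column_streams):
    """Write streams back to back and record where each one starts."""
    buffer = io.BytesIO()
    directory = {}
    for name, payload in column_streams:
        directory[name] = (buffer.tell(), len(payload))  # (offset, length)
        buffer.write(payload)
    return buffer.getvalue(), directory


def read_stream(data, directory, name):
    """Seek straight to one column's bytes without touching the others."""
    offset, length = directory[name]
    reader = io.BytesIO(data)
    reader.seek(offset)
    return reader.read(length)


data, directory = write_stripe(streams)
print(directory)
print(list(read_stream(data, directory, "Age")))
```

With the directory in hand, reading `Age` costs one seek and one short read, regardless of how large the neighbouring columns are.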

🗂 Step 5: File Footer

After all stripes, the File Footer appears near the end of the ORC file, just before the PostScript. It acts like a table of contents for the whole file.

📂 What It Contains:

| Item                  | Description                                    |
|-----------------------|------------------------------------------------|
| 📑 Number of Stripes  | Total number of stripes in the file            |
| 🧬 Column Types       | Data types of all columns                      |
| 🧭 Stripe Offsets     | Where each stripe starts (byte offset)         |
| 📦 Compression Info   | Type of compression used (ZLIB, SNAPPY, etc.)  |
| 🏷 Version            | ORC file version info                          |

🔎 Why it’s important:
It helps readers (like Hive, Spark, Presto) know where to find what — without scanning the full file.


📘 Step 6: Metadata — Column-Level Insights

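This block stores column-level statistics for each stripe in the file.

📊 Example (Stripe 1):

| Column | Min       | Max    | Null Count | Total Count |
|--------|-----------|--------|------------|-------------|
| Age    | 28        | 40     | 0          | 4           |
| Dept   | Finance   | IT     | 0          | 4           |
| City   | Bangalore | Mumbai | 0          | 4           |

Why it matters:
Used for stripe-level filtering. Queries can skip entire stripes if the column values can’t match the filter!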
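
A short sketch of how a reader might use these statistics to choose stripes for a range filter. The Age min/max values below are computed from the sample table (4 rows per stripe); `stripes_to_read` is a hypothetical helper, not an ORC API.

```python
# Age statistics per stripe, derived from the 12-row sample table.
stripe_age_stats = [
    {"stripe": 1, "min": 28, "max": 40},
    {"stripe": 2, "min": 29, "max": 34},
    {"stripe": 3, "min": 30, "max": 37},
]


def stripes_to_read(stats, low, high):
    """Keep stripes whose [min, max] range overlaps the filter range [low, high]."""
    return [s["stripe"] for s in stats if s["max"] >= low and s["min"] <= high]


# WHERE Age BETWEEN 38 AND 45
print(stripes_to_read(stripe_age_stats, 38, 45))  # [1]
# WHERE Age BETWEEN 35 AND 36
print(stripes_to_read(stripe_age_stats, 35, 36))  # [1, 3]
```

Note that a surviving stripe is only a candidate: Stripe 3 overlaps the range 35 to 36, so it still has to be read and filtered row by row.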


📘 Step 7: PostScript — The Final Pointer

At the very end of the ORC file, a PostScript block is added.

🧾 What It Contains:

| Field                  | Purpose                             |
|------------------------|-------------------------------------|
| Footer Length          | Size of the file footer             |
| Metadata Length        | Size of the metadata block          |
| Compression Type       | ZLIB, SNAPPY, NONE                  |
| Compression Block Size | Size of each compressed chunk       |
| ORC Magic String       | Used to identify it as an ORC file  |

🔐 Why PostScript is powerful:
It lets the reader jump directly to the footer and metadata without scanning the whole file. The very last byte of the file holds the PostScript's length, so a reader only needs to seek to the final few bytes!
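
The backwards read can be sketched with a toy file layout. This is not the real ORC encoding: actual files serialize the footer and PostScript with Protocol Buffers and also carry a metadata section, while this sketch uses JSON and made-up field names (`footerLength`, `numberOfStripes`, `stripeOffsets`) purely to show the seek-from-the-end pattern, with the last byte holding the PostScript length.

```python
import json


def build_toy_file(stripe_bytes, footer, compression="NONE"):
    """Lay out: stripes | footer | postscript | 1-byte postscript length."""
    footer_bytes = json.dumps(footer).encode("utf-8")
    postscript = json.dumps(
        {"footerLength": len(footer_bytes), "compression": compression, "magic": "ORC"}
    ).encode("utf-8")
    if len(postscript) > 255:
        raise ValueError("postscript must fit in a one-byte length")
    return stripe_bytes + footer_bytes + postscript + bytes([len(postscript)])


def read_tail(data):
    """Walk backwards: last byte -> postscript -> footer."""
    postscript_length = data[-1]
    postscript_start = len(data) - 1 - postscript_length
    postscript = json.loads(data[postscript_start:len(data) - 1])
    footer_start = postscript_start - postscript["footerLength"]
    footer = json.loads(data[footer_start:postscript_start])
    return postscript, footer


toy_file = build_toy_file(
    b"ORC" + b"\x00" * 997,  # placeholder stripe bytes
    {"numberOfStripes": 3, "stripeOffsets": [3, 335, 667]},
)
postscript, footer = read_tail(toy_file)
print(postscript["magic"], footer["numberOfStripes"])  # ORC 3
```

The reader never touches the 1,000 bytes of stripe data; it learns the stripe offsets from the tail alone.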

Key Performance Optimizations

  1. Column Pruning: When you run a query like SELECT Name, Age FROM employees;, the query engine reads the metadata and then only seeks to and reads the data for the Name and Age columns from each stripe. It completely ignores the ID, Dept, and City columns. This drastically reduces the amount of I/O.

  2. Predicate Pushdown: For a query like SELECT Name FROM employees WHERE Age > 35;, ORC first checks the stripe-level statistics in the file's metadata. If a stripe's statistics show that its maximum Age is 35 or less (as for Stripe 2, where the maximum is 34), the entire stripe is skipped without reading any data, which can result in a significant speedup.

  3. Advanced Compression: By storing columns separately, ORC can apply the most effective encoding for each data type. For instance, it can use dictionary encoding for the City column (as there are only a few distinct values) and run-length encoding for the ID column, leading to superior compression and smaller file sizes.
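
The two encodings named in the last item can be sketched on the sample data. These are simplified models, assuming a sorted dictionary and runs stored as (start, step, count) triples; ORC's actual byte-level encodings are more involved, and both function names are hypothetical.

```python
def dictionary_encode(values):
    """Replace each string with a small integer index into a sorted dictionary."""
    dictionary = sorted(set(values))
    position = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [position[value] for value in values]


def delta_runs(values):
    """Collapse stretches with a constant step into (start, step, count) triples."""
    runs = []
    i = 0
    while i < len(values):
        start = values[i]
        if i + 1 == len(values):
            runs.append((start, 0, 1))  # single trailing value
            break
        step = values[i + 1] - values[i]
        j = i + 1
        while j + 1 < len(values) and values[j + 1] - values[j] == step:
            j += 1
        runs.append((start, step, j - i + 1))
        i = j + 1
    return runs


cities = ["Delhi", "Mumbai", "Chennai", "Bangalore", "Delhi", "Pune",
          "Delhi", "Mumbai", "Bangalore", "Chennai", "Pune", "Delhi"]
ids = list(range(1, 13))

dictionary, codes = dictionary_encode(cities)
print(dictionary)       # ['Bangalore', 'Chennai', 'Delhi', 'Mumbai', 'Pune']
print(codes)            # [2, 3, 1, 0, 2, 4, 2, 3, 0, 1, 4, 2]
print(delta_runs(ids))  # [(1, 1, 12)]
```

Twelve city strings shrink to five dictionary entries plus twelve small integers, and the twelve sequential IDs collapse into a single triple.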

🚀 Why ORC is a Game-Changer for Big Data

Column Pruning → Minimal I/O
Predicate Pushdown → Skip entire blocks
Row Indexing → Jump to needed rows
Stripe Footers → Smarter navigation
High Compression → Faster & smaller files
Integrates with Hive, Spark, Trino, Presto

#BigData #DataEngineering #ORC #Spark #Hive #DataStorage #ETL #TechExplained #Performance #StripesNotRows #QueryOptimization

Written by

KOTHA CHANDRA TEJA

Associate software engineer(Data)