Parquet Files: An Introduction

Let's understand Parquet by comparing it to the CSV format, which is quite well known.

Query Performance: CSV vs Parquet

Example Queries and Performance

Query 1: Column Aggregation

```sql
SELECT SUM(price) FROM sales;
```

CSV Performance:

  • Must read entire file (319 bytes)

  • Parse every row and column

  • Extract price column from each row

  • Process: 6 rows × 6 columns = 36 values

Parquet Performance:

  • Read only price column chunk (~24 bytes)

  • Values already typed (no parsing needed)

  • Process: 6 values directly

  • ~13x less I/O (24 vs 319 bytes) and ~6x fewer values to process (see the sketch below)
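
To make this concrete, here is a minimal Python sketch of both paths. It assumes a sales.csv / sales.parquet pair shaped like the example table; the file names and the pandas/pyarrow tooling are assumptions, not part of the comparison above.

```python
import pandas as pd
import pyarrow.compute as pc
import pyarrow.parquet as pq

# CSV path: the whole file is read and every field in every row is
# parsed as text before the price column can be summed.
csv_total = pd.read_csv("sales.csv")["price"].sum()

# Parquet path: only the price column chunk is fetched, and the values
# arrive already typed as numbers, so there is nothing to parse.
prices = pq.read_table("sales.parquet", columns=["price"])
parquet_total = pc.sum(prices["price"]).as_py()

print(csv_total, parquet_total)  # same result, far less I/O for Parquet
```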

Query 2: Filtered Query

```sql
SELECT customer_name FROM sales WHERE product = 'Laptop';
```

CSV Performance:

  • Read entire file

  • Parse every row

  • Filter by product column

  • Return customer_name values

  • Process: 6 rows fully

Parquet Performance:

  • Check product column statistics (min/max)

  • Read only product and customer_name columns

  • Use dictionary encoding for fast filtering

  • Process: 2 columns only

  • ~3x faster with column pruning (2 of 6 columns read; see the sketch below)
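
Here is a sketch of the same query through pyarrow (file name assumed again): the columns argument prunes the projection to two columns, and the filters argument pushes the predicate down into the scan.

```python
import pyarrow.parquet as pq

# Only 2 of the 6 columns are read, and the predicate is pushed down
# into the scan, where the engine can exploit column statistics and
# dictionary encoding instead of comparing raw strings row by row.
table = pq.read_table(
    "sales.parquet",
    columns=["customer_name", "product"],
    filters=[("product", "==", "Laptop")],
)
print(table["customer_name"].to_pylist())
```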

Query 3: Range Query

```sql
SELECT * FROM sales WHERE price BETWEEN 50 AND 300;
```

CSV Performance:

  • Read and parse entire file

  • Convert price strings to numbers

  • Apply range filter

  • Return matching rows

Parquet Performance:

  • Check column statistics: price min=25.50, max=899.99

  • Since the query range [50, 300] overlaps [25.50, 899.99], the price column must still be read

  • Use typed data (no conversion needed)

  • Apply filter with bit vector

  • Read other columns only for matching rows

  • ~2-4x faster with predicate pushdown (see the sketch below)
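
The min/max check above isn't magic: those statistics live in the Parquet footer and can be read without touching any data pages. A sketch of reading them with pyarrow (file name assumed; a 6-row file has a single row group, but large files have many, and any group whose price range misses the query bounds is skipped unread):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")
price_idx = pf.schema_arrow.get_field_index("price")
lo, hi = 50, 300

for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(price_idx).statistics
    if stats is None:
        continue  # no statistics written; this group must be read
    # A whole row group can be skipped when its min/max range cannot
    # overlap the BETWEEN bounds of the query.
    skip = stats.max < lo or stats.min > hi
    print(f"row group {i}: min={stats.min}, max={stats.max}, skip={skip}")
```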

Storage Efficiency

File Size Comparison (6 rows)

  • CSV: 319 bytes (uncompressed)

  • Parquet: ~180-200 bytes of compressed column data (a real 6-row file ends up larger than the CSV once the fixed footer metadata is added)

  • Space Savings: ~40-45% on the data itself; the net win appears once files outgrow the metadata overhead

File Size Comparison (1M rows, extrapolated)

  • CSV: ~53 MB

  • Parquet: ~15-20 MB (with better compression on larger datasets)

  • Space Savings: ~60-70% (the sketch below measures this)
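
Exact sizes depend on the data and the codec, so it's worth measuring yourself. A sketch at the 1M-row scale; the column names follow the article's example, and the synthetic value distributions are assumptions:

```python
import os
import numpy as np
import pandas as pd

rows = 1_000_000
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "order_id": np.arange(rows),
    "customer_name": rng.choice(["Alice", "Bob", "Carol", "Dave"], rows),
    "product": rng.choice(["Laptop", "Mouse", "Keyboard", "Monitor"], rows),
    "price": rng.uniform(10, 900, rows).round(2),
})

df.to_csv("sales.csv", index=False)
df.to_parquet("sales.parquet", compression="snappy")  # needs pyarrow

csv_b, pq_b = os.path.getsize("sales.csv"), os.path.getsize("sales.parquet")
print(f"CSV: {csv_b / 1e6:.1f} MB  Parquet: {pq_b / 1e6:.1f} MB "
      f"({1 - pq_b / csv_b:.0%} smaller)")
```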

When to Use Each Format

Use CSV When:

  • Data size < 100MB

  • Need human readability

  • Simple row-by-row processing

  • Streaming data ingestion

  • Quick data exchange/sharing

  • No complex analytics needed

Use Parquet When:

  • Data size > 100MB

  • Analytical workloads

  • Column-heavy queries

  • Data warehousing

  • Need compression

  • Working with big data tools (Spark, Hive, etc.)

  • Schema evolution requirements
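
If a dataset has crossed into Parquet territory, migration is a one-liner with pandas (file names assumed): pay the CSV parsing cost once, then point every analytical query at the columnar copy.

```python
import pandas as pd

# One-time conversion: parse the CSV once, write a compressed,
# columnar copy, and query the Parquet file from here on.
pd.read_csv("sales.csv").to_parquet("sales.parquet", index=False)
```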
