# Parquet Files: An Introduction


Let's understand Parquet by comparing it to the CSV format, which is quite well known.
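To make the comparison concrete, here is a minimal sketch that writes one small sales table to both formats with pandas. The table, its column names, and the file paths (`sales.csv`, `sales.parquet`) are hypothetical stand-ins I'm using for illustration, not the exact dataset behind the numbers quoted below; writing Parquet this way assumes pyarrow is installed.

```python
import pandas as pd

# Hypothetical 6-row, 6-column sales table used by the sketches below.
sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "customer_name": ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"],
    "product": ["Laptop", "Mouse", "Keyboard", "Monitor", "Laptop", "Headset"],
    "price": [899.99, 25.50, 45.00, 199.99, 749.00, 59.99],
    "quantity": [1, 2, 1, 1, 1, 3],
    "region": ["East", "West", "North", "South", "East", "West"],
})

# CSV: row-oriented text; every value is serialized as characters.
sales.to_csv("sales.csv", index=False)

# Parquet: column-oriented binary; typed columns, compression,
# and per-column min/max statistics stored in the file footer.
sales.to_parquet("sales.parquet", index=False)
```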
## Query Performance: CSV vs Parquet

### Example Queries and Performance

#### Query 1: Column Aggregation
```sql
SELECT SUM(price) FROM sales;
```
**CSV performance:**

- Must read the entire file (319 characters)
- Parse every row and column
- Extract the price column from each row
- Process: 6 rows × 6 columns = 36 values
**Parquet performance:**

- Read only the price column chunk (~24 bytes)
- Values are already typed (no parsing needed)
- Process: 6 values directly
- ~13x faster I/O, ~6x less data to process (see the sketch below)
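A rough sketch of Query 1 against the hypothetical files above: to sum one column, pandas has to read and parse the whole CSV, while pyarrow can project just the price column out of the Parquet file.

```python
import pandas as pd
import pyarrow.parquet as pq

# CSV: the whole file is read and every field is parsed from text first.
csv_total = pd.read_csv("sales.csv")["price"].sum()

# Parquet: only the 'price' column chunk is read, already typed as a float.
price_column = pq.read_table("sales.parquet", columns=["price"])
parquet_total = price_column["price"].to_pandas().sum()

print(csv_total, parquet_total)  # same answer, very different I/O
```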
#### Query 2: Filtered Query

```sql
SELECT customer_name FROM sales WHERE product = 'Laptop';
```
**CSV performance:**

- Read the entire file
- Parse every row
- Filter by the product column
- Return the customer_name values
- Process: all 6 rows in full
**Parquet performance:**

- Check the product column statistics (min/max)
- Read only the product and customer_name columns
- Use dictionary encoding for fast filtering
- Process: 2 columns only
- ~3x faster with column pruning (sketched below)
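A sketch of Query 2 with column pruning, again assuming the files created earlier: the Parquet reader only touches the product and customer_name columns, while the CSV path parses everything before filtering.

```python
import pandas as pd
import pyarrow.parquet as pq

# CSV: every row and every column is read and parsed before filtering.
csv_df = pd.read_csv("sales.csv")
csv_result = csv_df.loc[csv_df["product"] == "Laptop", "customer_name"]

# Parquet: project only the two columns the query touches, then filter.
table = pq.read_table("sales.parquet", columns=["product", "customer_name"])
pq_df = table.to_pandas()
pq_result = pq_df.loc[pq_df["product"] == "Laptop", "customer_name"]

print(list(csv_result), list(pq_result))
```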
#### Query 3: Range Query

```sql
SELECT * FROM sales WHERE price BETWEEN 50 AND 300;
```
**CSV performance:**

- Read and parse the entire file
- Convert price strings to numbers
- Apply the range filter
- Return matching rows
**Parquet performance:**

- Check the column statistics: price min=25.50, max=899.99
- Since the range overlaps, read the price column
- Use typed data (no conversion needed)
- Apply the filter with a bit vector
- Read other columns only for matching rows
- ~2-4x faster with predicate pushdown (see the sketch below)
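A sketch of Query 3 showing the two Parquet features the list above relies on, assuming the same files: the per-column min/max statistics stored in the footer, and pushing the BETWEEN predicate down into the reader via pyarrow's filters argument.

```python
import pyarrow.parquet as pq

# Inspect the min/max statistics stored in the file footer for 'price'.
meta = pq.ParquetFile("sales.parquet").metadata
row_group = meta.row_group(0)
for i in range(row_group.num_columns):
    column = row_group.column(i)
    if column.path_in_schema == "price":
        print("price min/max:", column.statistics.min, column.statistics.max)

# Push the BETWEEN predicate down into the reader; row groups whose
# statistics fall entirely outside the range are skipped.
matching = pq.read_table(
    "sales.parquet",
    filters=[("price", ">=", 50), ("price", "<=", 300)],
)
print(matching.to_pandas())
```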
## Storage Efficiency

**File size comparison (6 rows):**

- CSV: 319 bytes (uncompressed)
- Parquet: ~180-200 bytes of compressed column data (at this tiny scale, fixed footer metadata makes the whole file larger than the CSV; the savings only materialize on realistic volumes)
- Space savings: ~40-45% on the column data alone
**File size comparison (1M rows, extrapolated):**

- CSV: ~53 MB
- Parquet: ~15-20 MB (compression improves on larger datasets)
- Space savings: ~60-70%
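You can check the sizes yourself with a short sketch against the files written earlier. Keep in mind that for a handful of rows Parquet's footer overhead dominates, so meaningful savings only appear once the data grows.

```python
import os

csv_size = os.path.getsize("sales.csv")
parquet_size = os.path.getsize("sales.parquet")

# On tiny files the Parquet footer dominates; savings grow with the data.
print(f"CSV:     {csv_size} bytes")
print(f"Parquet: {parquet_size} bytes")
print(f"Ratio:   {parquet_size / csv_size:.2f}x")
```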
## When to Use Each Format

**Use CSV when:**

- Data size < 100 MB
- You need human readability
- Simple row-by-row processing
- Streaming data ingestion
- Quick data exchange/sharing
- No complex analytics needed
**Use Parquet when:**

- Data size > 100 MB
- Analytical workloads
- Column-heavy queries
- Data warehousing
- You need compression
- Working with big data tools (Spark, Hive, etc.)
- Schema evolution requirements
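If you start with CSV and later outgrow it, switching is cheap. A minimal, hypothetical conversion sketch (the file names and codec are just examples):

```python
import pandas as pd

# Read the existing CSV export (or a much larger one) ...
df = pd.read_csv("sales.csv")

# ... and rewrite it as a compressed Parquet file that Spark, Hive,
# DuckDB, and friends can query with column pruning and predicate pushdown.
df.to_parquet("sales.parquet", compression="snappy", index=False)
```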