Parquet Files: An Introduction

Let's understand Parquet by comparing it to the CSV format, which is quite well known.

Query Performance: CSV vs Parquet

Example Queries and Performance

Query 1: Column Aggregation

```sql
SELECT SUM(price) FROM sales;
```

CSV Performance:

  • Must read entire file (319 bytes)

  • Parse every row and column

  • Extract price column from each row

  • Process: 6 rows × 6 columns = 36 values

Parquet Performance:

  • Read only price column chunk (~24 bytes)

  • Values already typed (no parsing needed)

  • Process: 6 values directly

  • ~13x less I/O (24 vs 319 bytes) and ~6x fewer values to process (see the sketch below)
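
To make this concrete, here is a minimal Python sketch of both paths. It assumes a sales.csv / sales.parquet pair shaped like the example table; the file names and the pandas/pyarrow tooling are assumptions, not part of the comparison above.

```python
import pandas as pd
import pyarrow.compute as pc
import pyarrow.parquet as pq

# CSV path: the whole file is read and every field in every row is
# parsed as text before the price column can be summed.
csv_total = pd.read_csv("sales.csv")["price"].sum()

# Parquet path: only the price column chunk is fetched, and the values
# arrive already typed as numbers, so there is nothing to parse.
prices = pq.read_table("sales.parquet", columns=["price"])
parquet_total = pc.sum(prices["price"]).as_py()

print(csv_total, parquet_total)  # same result, far less I/O for Parquet
```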

Query 2: Filtered Query

```sql
SELECT customer_name FROM sales WHERE product = 'Laptop';
```

CSV Performance:

  • Read entire file

  • Parse every row

  • Filter by product column

  • Return customer_name values

  • Process: 6 rows fully

Parquet Performance:

  • Check product column statistics (min/max)

  • Read only product and customer_name columns

  • Use dictionary encoding for fast filtering

  • Process: 2 columns only

  • ~3x faster with column pruning (2 of 6 columns read; see the sketch below)
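
Here is a sketch of the same query through pyarrow (file name assumed again): the columns argument prunes the projection to two columns, and the filters argument pushes the predicate down into the scan.

```python
import pyarrow.parquet as pq

# Only 2 of the 6 columns are read, and the predicate is pushed down
# into the scan, where the engine can exploit column statistics and
# dictionary encoding instead of comparing raw strings row by row.
table = pq.read_table(
    "sales.parquet",
    columns=["customer_name", "product"],
    filters=[("product", "==", "Laptop")],
)
print(table["customer_name"].to_pylist())
```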

Query 3: Range Query

```sql
SELECT * FROM sales WHERE price BETWEEN 50 AND 300;
```

CSV Performance:

  • Read and parse entire file

  • Convert price strings to numbers

  • Apply range filter

  • Return matching rows

Parquet Performance:

  • Check column statistics: price min=25.50, max=899.99

  • Since the query range [50, 300] overlaps [25.50, 899.99], the price column must still be read

  • Use typed data (no conversion needed)

  • Apply filter with bit vector

  • Read other columns only for matching rows

  • ~2-4x faster with predicate pushdown (see the sketch below)
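
The min/max check above isn't magic: those statistics live in the Parquet footer and can be read without touching any data pages. A sketch of reading them with pyarrow (file name assumed; a 6-row file has a single row group, but large files have many, and any group whose price range misses the query bounds is skipped unread):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")
price_idx = pf.schema_arrow.get_field_index("price")
lo, hi = 50, 300

for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(price_idx).statistics
    if stats is None:
        continue  # no statistics written; this group must be read
    # A whole row group can be skipped when its min/max range cannot
    # overlap the BETWEEN bounds of the query.
    skip = stats.max < lo or stats.min > hi
    print(f"row group {i}: min={stats.min}, max={stats.max}, skip={skip}")
```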

Storage Efficiency

File Size Comparison (6 rows)

  • CSV: 319 bytes (uncompressed)

  • Parquet: ~180-200 bytes of compressed column data (a real 6-row file ends up larger than the CSV once the fixed footer metadata is added)

  • Space Savings: ~40-45% on the data itself; the net win appears once files outgrow the metadata overhead

File Size Comparison (1M rows, extrapolated)

  • CSV: ~53 MB

  • Parquet: ~15-20 MB (with better compression on larger datasets)

  • Space Savings: ~60-70% (the sketch below measures this)
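
Exact sizes depend on the data and the codec, so it's worth measuring yourself. A sketch at the 1M-row scale; the column names follow the article's example, and the synthetic value distributions are assumptions:

```python
import os
import numpy as np
import pandas as pd

rows = 1_000_000
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "order_id": np.arange(rows),
    "customer_name": rng.choice(["Alice", "Bob", "Carol", "Dave"], rows),
    "product": rng.choice(["Laptop", "Mouse", "Keyboard", "Monitor"], rows),
    "price": rng.uniform(10, 900, rows).round(2),
})

df.to_csv("sales.csv", index=False)
df.to_parquet("sales.parquet", compression="snappy")  # needs pyarrow

csv_b, pq_b = os.path.getsize("sales.csv"), os.path.getsize("sales.parquet")
print(f"CSV: {csv_b / 1e6:.1f} MB  Parquet: {pq_b / 1e6:.1f} MB "
      f"({1 - pq_b / csv_b:.0%} smaller)")
```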

When to Use Each Format

Use CSV When:

  • Data size < 100MB

  • Need human readability

  • Simple row-by-row processing

  • Streaming data ingestion

  • Quick data exchange/sharing

  • No complex analytics needed

Use Parquet When:

  • Data size > 100MB

  • Analytical workloads

  • Column-heavy queries

  • Data warehousing

  • Need compression

  • Working with big data tools (Spark, Hive, etc.)

  • Schema evolution requirements
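
If a dataset has crossed into Parquet territory, migration is a one-liner with pandas (file names assumed): pay the CSV parsing cost once, then point every analytical query at the columnar copy.

```python
import pandas as pd

# One-time conversion: parse the CSV once, write a compressed,
# columnar copy, and query the Parquet file from here on.
pd.read_csv("sales.csv").to_parquet("sales.parquet", index=False)
```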
