Serialization Formats: When and Why They Matter

Whether you're sending data over the wire, storing it on disk, or just trying to squeeze the last bit of performance out of your microservices, you've probably had to think about serialization (turning in-memory objects into a format that can be saved or transmitted) and deserialization (the reverse).
What’s Serialization Anyway?
At a high level, serialization turns an object (like a Python dict or a Java object) into a series of bytes that can be:
Sent across a network (think gRPC, Kafka, REST APIs)
Stored on disk (Parquet, Avro files)
Written to a queue or log (Kafka, Pulsar)
You deserialize it on the other side to reconstruct the original object.
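As a minimal illustration, here's the round trip in Python using the standard json module (the record and field names here are just placeholders):

```python
import json

# An in-memory object (a plain Python dict)
user = {"id": 1, "name": "Alice", "age": 30}

# Serialize: object -> bytes that can be sent over a network or written to disk
payload = json.dumps(user).encode("utf-8")

# Deserialize on the other side: bytes -> the original object
restored = json.loads(payload.decode("utf-8"))
assert restored == user
```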
Row Format vs Columnar Format
Row-Oriented Formats (e.g., JSON, Protobuf, Avro)
These store data row-by-row. Imagine a table:
id | name  | age
1  | Alice | 30
2  | Bob   | 32
In a row format, it's stored like:
{id: 1, name: "Alice", age: 30}
{id: 2, name: "Bob", age: 32}
When to use:
OLTP workloads (CRUD-heavy apps)
Streaming (Kafka, Pulsar)
APIs (gRPC, REST)
Log shipping
These are fast for writing and reading full rows but not efficient if you're only reading one or two columns from large datasets.
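To make that trade-off concrete, here's a small sketch of row-oriented storage using JSON Lines (the file name users.jsonl is just an example):

```python
import json

rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 32},
]

# Row-oriented storage: each line holds one complete record
with open("users.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Even if you only want the 'age' column, you still parse every full row
with open("users.jsonl") as f:
    ages = [json.loads(line)["age"] for line in f]
print(ages)  # [30, 32]
```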
Columnar Formats (e.g., ORC, Parquet)
Columnar formats store data column-by-column. That same table above gets stored like:
id: [1, 2]
name: ["Alice", "Bob"]
age: [30, 32]
This layout makes it extremely efficient for analytics, especially when you're reading a subset of columns over millions of rows.
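Here's a rough sketch using the pyarrow library (assuming it's installed; users.parquet is just an example file name) that writes that table and then reads back only two of its columns:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2],
    "name": ["Alice", "Bob"],
    "age": [30, 32],
})

# Write the table in a columnar (Parquet) layout
pq.write_table(table, "users.parquet")

# Read back only the columns you need; the 'name' column is never loaded
subset = pq.read_table("users.parquet", columns=["id", "age"])
print(subset.to_pydict())  # {'id': [1, 2], 'age': [30, 32]}
```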
When to use:
OLAP workloads (data lakes, warehousing)
Spark, Hive, Trino/Presto, Athena
Batch jobs
Any time you're scanning large datasets
Metadata in Columnar Formats
Columnar formats like ORC and Parquet are smart. They store rich metadata along with your data:
What kind of metadata?
Schema: Column names, types, nullability
Statistics: min/max values, counts, distinct values (NDV)
Compression information
Encoding methods
Bloom filters (optional): For quickly ruling out values that aren't in a block
File version & format flags
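You can poke at this metadata yourself. A small sketch with pyarrow, reusing the example users.parquet file from above:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("users.parquet")

# Schema: column names, types, nullability
print(pf.schema_arrow)

# File-level metadata: row groups, format version, created-by
print(pf.metadata)

# Per-column statistics (min/max, null count) for the first row group
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.statistics)
```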
This metadata enables powerful optimizations like:
Predicate pushdown: Skip row groups or stripes whose statistics can't match your WHERE clause
Column pruning: Only read the columns you're querying
Efficient compression: Choose codecs based on data types
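Here's a sketch of what the first two look like in practice with pyarrow's dataset API, again reusing the example users.parquet file:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("users.parquet", format="parquet")

# Column pruning: only the columns needed for the result and the filter
# are read. Predicate pushdown: row groups whose min/max statistics
# can't satisfy age > 30 are skipped before any rows are decoded.
result = dataset.to_table(columns=["name"], filter=ds.field("age") > 30)
print(result.to_pydict())  # {'name': ['Bob']}
```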