Serialization Formats: When and Why They Matter

Whether you're sending data over the wire, storing it on disk, or just trying to squeeze the last bit of performance out of your microservices, you've probably had to think about serialization (turning in-memory objects into a format that can be saved or transmitted) and deserialization (the reverse).
What’s Serialization Anyway?
At a high level, serialization turns an object (like a Python dict or a Java object) into a series of bytes that can be:
Sent across a network (think gRPC, Kafka, REST APIs)
Stored on disk (Parquet, Avro files)
Written to a queue or log (Kafka, Pulsar)
You deserialize it on the other side to reconstruct the original object.
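As a minimal illustration, here's the round trip in Python using the standard json module (the record and field names here are just placeholders):

```python
import json

# An in-memory object (a plain Python dict)
user = {"id": 1, "name": "Alice", "age": 30}

# Serialize: object -> bytes that can be sent over a network or written to disk
payload = json.dumps(user).encode("utf-8")

# Deserialize on the other side: bytes -> the original object
restored = json.loads(payload.decode("utf-8"))
assert restored == user
```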
Row Format vs Columnar Format
Row-Oriented Formats (e.g., JSON, Protobuf, Avro)
These store data row-by-row. Imagine a table:
id | name  | age
1  | Alice | 30
2  | Bob   | 32
In a row format, it's stored like:
{id: 1, name: "Alice", age: 30}
{id: 2, name: "Bob", age: 32}
When to use:
OLTP workloads (CRUD-heavy apps)
Streaming (Kafka, Pulsar)
APIs (gRPC, REST)
Log shipping
These are fast for writing and reading full rows but not efficient if you're only reading one or two columns from large datasets.
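To make that trade-off concrete, here's a small sketch of row-oriented storage using JSON Lines (the file name users.jsonl is just an example):

```python
import json

rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 32},
]

# Row-oriented storage: each line holds one complete record
with open("users.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Even if you only want the 'age' column, you still parse every full row
with open("users.jsonl") as f:
    ages = [json.loads(line)["age"] for line in f]
print(ages)  # [30, 32]
```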
Columnar Formats (e.g., ORC, Parquet)
Columnar formats store data column-by-column. That same table above gets stored like:
id: [1, 2]
name: ["Alice", "Bob"]
age: [30, 32]
This layout makes it extremely efficient for analytics, especially when you're reading a subset of columns over millions of rows.
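Here's a rough sketch using the pyarrow library (assuming it's installed; users.parquet is just an example file name) that writes that table and then reads back only two of its columns:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2],
    "name": ["Alice", "Bob"],
    "age": [30, 32],
})

# Write the table in a columnar (Parquet) layout
pq.write_table(table, "users.parquet")

# Read back only the columns you need; the 'name' column is never loaded
subset = pq.read_table("users.parquet", columns=["id", "age"])
print(subset.to_pydict())  # {'id': [1, 2], 'age': [30, 32]}
```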
When to use:
OLAP workloads (data lakes, warehousing)
Spark, Hive, Trino/Presto, Athena
Batch jobs
Any time you're scanning large datasets
Metadata in Columnar Formats
Columnar formats like ORC and Parquet are smart. They store rich metadata along with your data:
What kind of metadata?
Schema: Column names, types, nullability
Statistics: min/max values, counts, distinct values (NDV)
Compression information
Encoding methods
Bloom filters (optional): For quickly ruling out values that aren't in a block
File version & format flags
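You can poke at this metadata yourself. A small sketch with pyarrow, reusing the example users.parquet file from above:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("users.parquet")

# Schema: column names, types, nullability
print(pf.schema_arrow)

# File-level metadata: row groups, format version, created-by
print(pf.metadata)

# Per-column statistics (min/max, null count) for the first row group
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.statistics)
```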
This metadata enables powerful optimizations like:
Predicate pushdown: Skip row groups or stripes whose statistics can't match your WHERE clause
Column pruning: Only read the columns you're querying
Efficient compression: Choose codecs based on data types
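Here's a sketch of what the first two look like in practice with pyarrow's dataset API, again reusing the example users.parquet file:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("users.parquet", format="parquet")

# Column pruning: only the columns needed for the result and the filter
# are read. Predicate pushdown: row groups whose min/max statistics
# can't satisfy age > 30 are skipped before any rows are decoded.
result = dataset.to_table(columns=["name"], filter=ds.field("age") > 30)
print(result.to_pydict())  # {'name': ['Bob']}
```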