Serialization Formats: When and Why They Matter

Harsha Gelivi

Whether you're sending data over the wire, storing it on disk, or just trying to squeeze the last bit of performance from your microservices, you've probably had to think about serialization: the process of turning in-memory objects into a format that can be saved or transmitted. Deserialization is the reverse: turning that format back into objects.

What’s Serialization Anyway?

At a high level, serialization turns an object (like a Python dict or a Java object) into a series of bytes that can be:

  • Sent across a network (think gRPC, Kafka, REST APIs)

  • Stored on disk (Parquet, Avro files)

  • Written to a queue or log (Kafka, Pulsar)

You deserialize it on the other side to reconstruct the original object.
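As a minimal sketch of that round trip, here is a Python dict serialized to bytes and reconstructed using the standard library's json module. The record and its field names are made up for illustration; any format (Protobuf, Avro, and so on) follows the same serialize-then-deserialize shape.

```python
import json

# Hypothetical in-memory object; the field names are illustrative only.
user = {"id": 1, "name": "Alice", "age": 30}

# Serialize: dict -> JSON text -> UTF-8 bytes, ready for a socket or a file.
payload = json.dumps(user).encode("utf-8")

# Deserialize: bytes -> JSON text -> dict on the receiving side.
restored = json.loads(payload.decode("utf-8"))

print(type(payload).__name__, len(payload))
print(restored == user)
```

The receiver only needs the bytes and knowledge of the format; it never sees the sender's in-memory object.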

Row Format vs Columnar Format

Row-Oriented Formats (e.g., JSON, Protobuf, Avro)

These store data row-by-row. Imagine a table:

id | name  | age
1  | Alice | 30
2  | Bob   | 32

In a row format, it's stored like:

{"id": 1, "name": "Alice", "age": 30}
{"id": 2, "name": "Bob", "age": 32}

When to use:

  • OLTP workloads (CRUD-heavy apps)

  • Streaming (Kafka, Pulsar)

  • APIs (gRPC, REST)

  • Log shipping

These are fast for writing and reading full rows, but inefficient when you only need one or two columns from a large dataset, because every row still has to be read and decoded in full.
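To make that trade-off concrete, here is a small sketch that stores the table as one JSON record per line, a common row-oriented layout. The helper names read_rows and ages are hypothetical, and an in-memory buffer stands in for a file or a message log.

```python
import io
import json

rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 32},
]

# An in-memory buffer stands in for a file or a message log.
buf = io.StringIO()
for row in rows:
    # Each append writes one complete record; earlier rows are never touched.
    buf.write(json.dumps(row) + "\n")


def read_rows(text):
    """Parse every line back into a full row."""
    return [json.loads(line) for line in text.splitlines()]


def ages(text):
    # To get a single field, every whole row still has to be parsed.
    return [row["age"] for row in read_rows(text)]


stored = buf.getvalue()
print(read_rows(stored)[1])
print(ages(stored))
```

Appending a record is cheap and self-contained, which suits streaming and CRUD-style access. Extracting just the ages costs as much as reading the whole table.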

Columnar Formats (e.g., ORC, Parquet)

Columnar formats store data column-by-column. That same table above gets stored like:

id: [1, 2]
name: ["Alice", "Bob"]
age: [30, 32]

This layout makes it extremely efficient for analytics, especially when you're reading a subset of columns over millions of rows.
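The same idea can be sketched in plain Python by pivoting the rows into one list per column. The helpers to_columns and to_rows are hypothetical and assume every row has the same keys; real columnar formats add encoding, compression, and chunking on top of this basic layout.

```python
rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 32},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def to_rows(columns):
    """Inverse pivot: rebuild row dicts from equally long column lists."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
average_age = sum(columns["age"]) / len(columns["age"])

print(columns)
print(average_age)
```

Computing the average reads only the age list; the id and name values are never visited, which is the property analytics engines rely on.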

When to use:

  • OLAP workloads (data lakes, warehousing)

  • Spark, Hive, Trino/Presto, Athena

  • Batch jobs

  • Any time you're scanning large datasets

Metadata in Columnar Formats

Columnar formats like ORC and Parquet are self-describing. They store rich metadata along with your data:

What kind of metadata?

  • Schema: Column names, types, nullability

  • Statistics: min/max values, counts, and, where the writer records them, distinct-value counts (NDV)

  • Compression information

  • Encoding methods

  • Bloom filters (optional): For quick lookups

  • File version & format flags

This metadata enables powerful optimizations like:

  • Predicate pushdown: Skip blocks (Parquet row groups, ORC stripes) whose min/max statistics show that no row can match your WHERE clause

  • Column pruning: Only read the columns you're querying

  • Efficient compression: Values of the same type sit together, so encodings and codecs can be chosen per column
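The first two optimizations can be illustrated with a toy model. This is not the Parquet or ORC API: Chunk, make_chunk, and scan are hypothetical names, the extra rows (Carol, Dan) are made-up sample data, and each chunk simply keeps min/max statistics per column the way a row group or stripe would.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Toy stand-in for a row group or stripe: column data plus min/max stats."""
    columns: dict  # column name -> list of values
    stats: dict    # column name -> (min, max)


def make_chunk(columns):
    # Statistics are computed once, at write time.
    stats = {name: (min(values), max(values)) for name, values in columns.items()}
    return Chunk(columns=columns, stats=stats)


def scan(chunks, select, where_column, minimum):
    """Return `select` values for rows where where_column >= minimum."""
    results = []
    chunks_read = 0
    for chunk in chunks:
        _, high = chunk.stats[where_column]
        if high < minimum:
            continue  # predicate pushdown: no row here can match, skip the data
        chunks_read += 1
        # Column pruning: only the two columns involved are touched.
        for value, key in zip(chunk.columns[select], chunk.columns[where_column]):
            if key >= minimum:
                results.append(value)
    return results, chunks_read


chunks = [
    make_chunk({"id": [1, 2], "name": ["Alice", "Bob"], "age": [30, 32]}),
    make_chunk({"id": [3, 4], "name": ["Carol", "Dan"], "age": [41, 45]}),
]

# Roughly: SELECT name FROM t WHERE age >= 40
names, chunks_read = scan(chunks, select="name", where_column="age", minimum=40)
print(names, chunks_read)
```

The first chunk is rejected from its statistics alone (its maximum age is 32), so only one of the two chunks is actually read, and the id column is never touched.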
