Protobuf: What the heck is it and why is it doing so well?


“Data is the elixir of the digital age.”
This reflection only grows more relevant as technology continues to evolve at an exponential pace. The fashion in which knowledge and intelligence are consumed may change over time, but most computational resources and effort will always revolve around the “data” at the crux of it: its representation, delivery, storage and analysis.
So what exactly is Protobuf?
Simply put, it is a means of data representation and archival.
Much like other data representation formats, such as JSON or YAML, Protobuf, short for “Protocol Buffers”, provides a way to serialize and deserialize data.
To put it aptly:
“Protocol Buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.”
“It’s like JSON, except it’s smaller and faster”
Both these quotes were lifted from the official Protobuf site.
It enables developers to define a structured schema in a .proto file, which is then used to generate source code that can write and read data from different data streams.
Backdrop: Whenever I wish to understand a concept (its working, its need, its intricate details, its shortcomings), the first questions I ask are: what purpose does this fulfill? What problems would we face if this concept were not applied? What if I were the one who had to develop such a thing? What would I consider important, and how would I go about designing it? And so on and so forth.
This kind of attitude helps you comprehend a concept's ins and outs and keep it at your disposal, so that you can make informed decisions while building your software.
When you design a data representation format, the first thing you consider is maintaining a standard, so that distinct, unrelated servers and interfaces can communicate through it. This is where the .proto file comes in.
Example:
syntax = "proto3";
package techblog;
// A message representing a user in our system.
message User {
int32 id = 1;
string username = 2;
string email = 3;
enum AccountType {
UNKNOWN = 0;
FREE = 1;
PRO = 2;
ENTERPRISE = 3;
}
AccountType account_type = 4;
}
Like all formats used to represent data, it has its own syntax, which is understood by protoc.
The .proto
file can be compiled into several programming languages using Protoc, which is the Protobuf compiler. This compiler generates source code in the programming language that the developer specifies. This source code includes classes and methods for writing, reading, and manipulating the message type defined in the .proto
file.
Brief History
Originally developed at Google for efficient data serialization, Protocol Buffers (Protobuf) was first used internally as Proto1 before its public, open-source release in 2008 as Proto2. The initial release supported key languages like C++, Java, and Python and quickly gained traction due to its performance and efficiency.
Protobuf's popularity surged following the 2015 release of gRPC, a Google framework for service-to-service communication that uses Protobuf as its preferred data format. Building on this momentum, Google launched Proto3 in 2016. This major update emphasized simplicity and cross-language uniformity, creating a more compact format by removing features like explicit default values and field presence.
What problems does it solve?
Suppose you want to send some data from one service to another, where speed and low latency are of paramount importance (take trading or gaming software, for example).
Now, if both the sender and the receiver already know the structure of the data, you can send it in whatever format you want, provided the receiver is able to understand it.
So protocol buffers are all about taking data in the required format (as defined in the schema, THE way to ensure that the data can be correctly deserialized at the receiver), converting it to a compact binary encoding that takes very little space, sending it over the network, and having the end node deserialize the binary against the original schema, based on the shared .proto file.
Another thing I thought of: if there are two components in play, i.e. the schema and the compact binary value that represents some data, then that binary value might as well be stored in some data persistence layer, such as a data lake or a data warehouse, and later retrieved as the correct data by deserializing it against the schema. And this is correct:
Protobuf is also widely used for data archival.
How is Protobuf different?
Protobuf prioritizes a structured and high-performance approach. It achieves this through:
- Schema-First Design: You must define your data structures and types in a .proto file before you can use them. This file acts as a formal contract for your data.
- Code Generation: A compilation step reads your .proto file and automatically generates efficient, type-safe code for creating, serializing, and deserializing your data.
- Binary Serialization: Instead of human-readable text, Protobuf uses a compact binary format. This dramatically reduces payload size and speeds up parsing.
- Native RPC Support: It’s designed for microservices, allowing you to define service endpoints and methods directly in the schema.
To sum it up
Feature | Protocol Buffers (Protobuf) | JSON / XML / YAML |
Structure | Schema-based & Strongly-typed (enforced) | Schema-less (flexible but error-prone) |
Format | Compact Binary (machine-optimized) | Human-Readable Text (verbose) |
Process | Requires Compilation (generates code) | No Compilation (interpreted at runtime) |
Best For | High-performance APIs (gRPC), microservices, large-scale systems | Web APIs, config files, data exchange where readability is key |
Implementation
This is a typical example, built around a proto schema on which all the logic will be based. This schema will be the single source of truth.
syntax = "proto3";
package iot.devices;
message SensorReading {
enum Status {
UNKNOWN = 0;
OK = 1;
ERROR = 2;
}
string device_id = 1;
double temperature_c = 2; // in Celsius
int64 timestamp_ms = 3;
Status status = 4;
}
This is the generated code, kept here just for contextual understanding. I’ll be using Python for this example. This file is produced by protoc and is not to be edited by hand, as it is imported by the actual implementation.
# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: sensor.proto
"""Generated protocol buffer code."""
from google.protobuf.internal import builder as _builder
from google.protobuf import descriptor as _descriptor
from google.protobuf import descriptor_pool as _descriptor_pool
from google.protobuf import symbol_database as _symbol_database
# @@protoc_insertion_point(imports)
_sym_db = _symbol_database.Default()
DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x0csensor.proto\"\xa0\x01\n\rSensorReading\x12\x11\n\tdevice_id\x18\x01 \x01(\t\x12\x15\n\rtemperature_c\x18\x02 \x01(\x01\x12\x14\n\x0ctimestamp_ms\x18\x03 \x01(\x03\x12%\n\x06status\x18\x04 \x01(\x0e\x32\x15.SensorReading.Status\"(\n\x06Status\x12\x0b\n\x07UNKNOWN\x10\x00\x12\x06\n\x02OK\x10\x01\x12\t\n\x05\x45RROR\x10\x02\x62\x06proto3')
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'sensor_pb2', globals())
if _descriptor._USE_C_DESCRIPTORS == False:
  DESCRIPTOR._options = None
  _SENSORREADING._serialized_start=17
  _SENSORREADING._serialized_end=177
  _SENSORREADING_STATUS._serialized_start=137
  _SENSORREADING_STATUS._serialized_end=177
# @@protoc_insertion_point(module_scope)
Serialization
Now, let's use our generated Python class to create a sensor reading and serialize it into the compact binary format.
import time

import sensor_pb2

# Create and populate a reading
reading = sensor_pb2.SensorReading()
reading.device_id = "temp_probe-Z24"
reading.temperature_c = 22.5
reading.timestamp_ms = int(time.time() * 1000)  # Current time in milliseconds
reading.status = reading.Status.OK  # Assuming the status is OK

# Print the reading
print(f"Device ID: {reading.device_id}")
print(f"Temperature (C): {reading.temperature_c}")
print(f"Timestamp (ms): {reading.timestamp_ms}")
print(f"Status: {reading.status}")

# Serialize the reading to a binary format
serialized_reading = reading.SerializeToString()

# See the result
print(f"Python Object: {reading.device_id}, {reading.temperature_c}°C")
print("-" * 20)
print(f"Serialized Data ({len(serialized_reading)} bytes): {serialized_reading}")
When you run this Python snippet, the output will look something like this (the timestamp will differ):
Device ID: temp_probe-Z24
Temperature (C): 22.5
Timestamp (ms): 1749709129914
Status: 1
Python Object: temp_probe-Z24, 22.5°C
--------------------
Serialized Data (34 bytes): b'\n\x0etemp_probe-Z24\x11\x00\x00\x00\x00\x00\x806@\x18\xba\x91\xaa\x96\xf62 \x01'
The output is a Python bytes literal. The parts starting with \x are hexadecimal escape codes: Python renders each non-printable byte as a two-digit hex escape (each hex digit encodes 4 bits, so one escape is one byte). The underlying data is plain binary; hex is just how it is displayed.
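To make the “smaller” claim concrete, here is a quick, illustrative extension of the snippet above (it reuses reading and serialized_reading) that encodes the same fields as JSON and compares payload sizes; the exact numbers will vary with the values:
import json

# Encode the same reading as JSON for a size comparison.
as_json = json.dumps({
    "device_id": reading.device_id,
    "temperature_c": reading.temperature_c,
    "timestamp_ms": reading.timestamp_ms,
    "status": int(reading.status),
}).encode("utf-8")

print(f"JSON: {len(as_json)} bytes vs Protobuf: {len(serialized_reading)} bytes")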
Deserialization
import time

import sensor_pb2

# Create, populate and serialize a reading, exactly as before
reading = sensor_pb2.SensorReading()
reading.device_id = "temp_probe-Z24"
reading.temperature_c = 22.5
reading.timestamp_ms = int(time.time() * 1000)  # Current time in milliseconds
reading.status = reading.Status.OK  # Assuming the status is OK

print(f"Device ID: {reading.device_id}")
print(f"Temperature (C): {reading.temperature_c}")
print(f"Timestamp (ms): {reading.timestamp_ms}")
print(f"Status: {reading.status}")

serialized_reading = reading.SerializeToString()

# Parse the binary back into a fresh message object
deserialized_reading = sensor_pb2.SensorReading()
deserialized_reading.ParseFromString(serialized_reading)

# Print the deserialized reading
print("Deserialized Reading:")
print(f"Device ID: {deserialized_reading.device_id}")
print(f"Temperature (C): {deserialized_reading.temperature_c}")
print(f"Timestamp (ms): {deserialized_reading.timestamp_ms}")
When you run this, the output will look something like this:
Device ID: temp_probe-Z24
Temperature (C): 22.5
Timestamp (ms): 1749713180269
Status: 1
Deserialized Reading:
Device ID: temp_probe-Z24
Temperature (C): 22.5
Timestamp (ms): 1749713180269
Archival
We can use Protobuf for data archival. This has a significant advantage, given that Protobuf is backward and forward compatible. An entity schema can be defined in a .proto file, for example:
syntax = "proto3";
package example.db;
message UserProfile {
string user_id = 1; // A unique ID (e.g., a UUID string)
string username = 2;
string email = 3;
int64 created_timestamp = 4;
bool is_active = 5;
// A field that might be added later
string display_name = 6;
}
An equivalent SQL entity declaration would look like this:
CREATE TABLE user_profiles (
    -- Promoted columns for fast querying and indexing
    user_id UUID PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    is_active BOOLEAN NOT NULL,
    -- The column to store the full, serialized Protobuf data
    payload BYTEA NOT NULL
);
Given any DBMS, call it db-A, it will most probably be able to handle binary data in some format, and that is the entry point for storing serialized Protobuf data in db-A.
Here, it is BYTEA for Postgres. With a Mongo client, you can store the payload as a BSON binary value. Redis stores string values in a binary-safe container, so you can simply store and retrieve the serialized string and do with it as you please. The same idea carries over to other data management systems.
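As a minimal sketch of this pattern (assuming psycopg2 is installed, a user_profile_pb2 module was generated from the schema above, and the user_profiles table exists; the connection string and field values are placeholders):
import time
import uuid

import psycopg2
import user_profile_pb2  # generated from the UserProfile schema above

profile = user_profile_pb2.UserProfile(
    user_id=str(uuid.uuid4()),
    username="aatir",
    email="aatir@example.com",
    created_timestamp=int(time.time() * 1000),
    is_active=True,
)

conn = psycopg2.connect("dbname=techblog")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO user_profiles"
        " (user_id, username, email, created_at, is_active, payload)"
        " VALUES (%s, %s, %s, to_timestamp(%s / 1000.0), %s, %s)",
        (
            profile.user_id,
            profile.username,
            profile.email,
            profile.created_timestamp,
            profile.is_active,
            psycopg2.Binary(profile.SerializeToString()),  # raw bytes into BYTEA
        ),
    )
The promoted columns stay queryable, while the payload column preserves the full record for schema-compatible retrieval later.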
Usage as an API Contract
Beyond declaring entities, a .proto file can also define services: the remote procedures (methods) you can call, and the request/response messages for those methods.
This opens up the potential for .proto files to be used as sources of truth in a project of any scale.
An example,
syntax = "proto3";
package warehouse;
// A message representing a product in inventory.
message Product {
string id = 1;
string name = 2;
int32 quantity_on_hand = 3;
}
// The request message for getting a product, containing the ID.
message GetProductRequest {
string id = 1;
}
// The service definition.
// This is the API contract for our gRPC service.
service Warehouse {
// A simple RPC to get a product's details by its ID.
rpc GetProduct(GetProductRequest) returns (Product) {}
}
Its implementation would involve a gRPC server and a client, with every property of the exchange derived from this schema.
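To make that concrete, here is a minimal, illustrative server-side sketch in Python. It assumes grpcio is installed and that warehouse_pb2 / warehouse_pb2_grpc were generated from the schema above with the grpcio-tools plugin; the hard-coded product and port are placeholders:
from concurrent import futures

import grpc
import warehouse_pb2
import warehouse_pb2_grpc

class WarehouseService(warehouse_pb2_grpc.WarehouseServicer):
    def GetProduct(self, request, context):
        # A real service would look the product up in a datastore.
        return warehouse_pb2.Product(id=request.id, name="Widget", quantity_on_hand=42)

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
warehouse_pb2_grpc.add_WarehouseServicer_to_server(WarehouseService(), server)
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()
Note that the handler signature, the request/response types, and the registration function all fall out of the schema; none of it is hand-designed API surface.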
Review
Advantages ✅
- High Performance: The binary format is compact and serializes/deserializes extremely quickly, leading to low CPU usage and low network latency. This is the highlight of the whole concept.
- Strict Schema & Type Safety: Defining data structures in a .proto file acts as a contract, catching data errors at compile time, not runtime.
- Automated Code Generation: The compiler generates all the boilerplate data access code for you, saving time and reducing bugs. This is similar to schema-based code generation for OOP languages like Java.
- Excellent Cross-Language Support: A single .proto file can be used to generate native code for dozens of languages, perfect for polyglot microservice environments.
- Backward & Forward Compatibility: Protobuf makes it easy to evolve your API over time without breaking existing clients or services, as the sketch below illustrates.
Disadvantages ❌
- Lack of Human Readability: The binary format cannot be easily inspected or debugged with standard text-based tools like curl or a browser; you need special tools to decode the payload. This is the primary reason Protobuf is not in the same league as JSON or YAML, usage-wise: much of debugging and development involves looking at data in transit, and data that is human-readable by default makes a big difference there.
- Added Complexity & Tooling: Requires a .proto schema definition and a compilation step, which adds an initial setup hurdle compared to schemaless formats like JSON.
- Less Flexibility: Not ideal for unstructured or rapidly changing data. Every change requires an update to the schema and recompilation.
- Not Native to Web Browsers: While usable on the web via gRPC-Web, it's not as seamless as JSON, which is the native language of web APIs.
There are, however, ways to communicate Protobuf data over REST calls.
Raw Binary Data
# POSTing binary protobuf data to an endpoint
curl --request POST https://api.example.com/users \
  --header "Content-Type: application/protobuf" \
  --header "Accept: application/protobuf" \
  --data-binary "@user.bin"
This is the most compact format.
As you can see, the Content-Type header tells the server to interpret the body as Protobuf. There is another content type, namely application/x-protobuf, the difference being purely semantic metadata. The latter lingers mainly in legacy systems and is considered deprecated. A robust server might interpret both MIME types as valid Protobuf, but don't bet on it: always use application/protobuf when building new software, as and when required, of course.
Encapsulate it in a JSON Object (for strict JSON compatibility)
// The HTTP body is standard JSON
{
  "event_name": "USER_CREATED",
  "payload_format": "protobuf_base64",
  "data": "CgVhbGljZRIKMTAyNy01NQ==" // <-- Base64-encoded Protobuf data
}
The serialized binary has to be converted to Base64 first, which inflates the payload and, frankly, defeats the purpose.
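For illustration, here is a minimal sketch of building such an envelope, reusing the sensor_pb2 module from earlier (the event_name and payload_format keys are just hypothetical conventions, not a standard):
import base64
import json

import sensor_pb2

reading = sensor_pb2.SensorReading(device_id="temp_probe-Z24", temperature_c=22.5)

# Wrap the serialized bytes in a JSON-safe Base64 string.
envelope = {
    "event_name": "SENSOR_READING",
    "payload_format": "protobuf_base64",
    "data": base64.b64encode(reading.SerializeToString()).decode("ascii"),
}
print(json.dumps(envelope))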
General binary data
There is also the option to send Protobuf-serialized data as general binary data, leaving it to the server to figure out whether it is serialized Protobuf or just some other binary format. This comes with the header Content-Type: application/octet-stream.
Think of application/octet-stream as the most generic MIME type possible for binary data. It simply means: "Here is a stream of bytes. I'm not telling you what's inside. It's up to you, the receiver, to figure it out."
It's the digital equivalent of shipping a package with no label on it. The recipient knows they received a box, but they have to open it to see what's inside.
curl --request POST "http://127.0.0.1:5000/api/ingest" \
  --header "Content-Type: application/octet-stream" \
  --data-binary "@user.bin"
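On the receiving side, a hypothetical Flask endpoint matching that curl call might look like this (assuming Flask is installed; the route and response shape are placeholders, and the endpoint must already know, out of band, which message type the bytes represent):
from flask import Flask, request

import sensor_pb2  # the server must know the expected message type in advance

app = Flask(__name__)

@app.route("/api/ingest", methods=["POST"])
def ingest():
    reading = sensor_pb2.SensorReading()
    # application/octet-stream carries no type information, so we simply try to
    # parse the raw body as the message type this endpoint expects.
    reading.ParseFromString(request.get_data())
    return {"device_id": reading.device_id, "status": reading.status}

if __name__ == "__main__":
    app.run(port=5000)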
Future Scope
Protobuf is all about the schema, so, naturally, all the deliberation and avenues of improvement will eventually revolve around it.
Format Evolution
- New Well-Known Types: The potential for adding new standard types to handle common use cases, like a precise Decimal type for financial applications or more sophisticated Timestamp and Duration options.
- Custom Options and Annotations: The ability to add richer metadata to .proto files is expanding. This allows tools to auto-generate not just code but also documentation, validation rules, and even API gateway configurations directly from the schema.
Increasing Acceptance
- The Web (Client-side): This is the biggest frontier. Technologies like gRPC-Web and the newer Connect protocol are making it far easier to use Protobuf directly from a web browser, creating a viable, high-performance alternative to traditional JSON REST APIs.
- IoT and Edge Computing: For resource-constrained devices where every byte and CPU cycle counts, Protobuf's compact binary format and low overhead are a perfect fit.
Evolving Usage Pattern
This is the one I’m most excited about.
- Schema as the Source of Truth: More and more tools are using the .proto file as a central contract to configure an entire system. For example, a service mesh like Istio or Linkerd can use Protobuf definitions to configure routing, retries, and security policies.
Growing Ecosystem
This is all about standardization.
- Schema Registries: Managing thousands of .proto files across hundreds of microservices is a major challenge. Tools like the Buf Schema Registry (BSR) are emerging to provide versioning, linting, and dependency management for schemas, treating them like first-class code artifacts.
So, in simple terms, Protobuf is about encoding some data based on a shared schema. You can send it, store it, version it, analyze it as you see fit. New needs and solutions will arise in due time. Nevertheless, it is solidifying its position as the high-performance backbone for modern distributed systems.