Protobuf - Serialization

Hari Om Bhardwaj

What are protocol buffers? Despite what the name might suggest, they aren't a buffer for different protocols; they're the serialization mechanism that protocols like gRPC use to move structured data around.

Protocol Buffers are language-neutral, platform-neutral extensible mechanisms for serializing structured data. – Protocol Buffers docs

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. In short, a protocol buffer takes your response model/data on the server side (JSON or whatever the service would otherwise return), encodes it into a compact string of bytes according to a schema, sends it to the client/receiver, and the client decodes it back into the original data/response.

  1. Who developed it? (Overview)

    The first version of protocol buffers, called "Proto1," was developed in 2001 at Google. It kept growing and improving over the years as people added new features whenever there was a need and someone was willing to put in the effort to develop them. Google had projects it wanted to open-source that used protobuf, so it needed to open-source Protocol Buffers as well.

    Google released protocol buffers intending to offer public APIs that could accept both protocol buffers and XML. They preferred protocol buffers because they were more efficient. Even if XML was submitted, Google would convert it to protocol buffers on their end.

  2. What's the need for it? There are quite a few reasons to add Protocol Buffers to your project.

    1. Serialization - the word "serialize" usually refers to converting in-memory data into a string or bytes

      • Serialization is a fundamental concept in CS: converting structured data into a format that can be easily stored and transferred.

      • During serialization, we convert structured data such as JSON into an encoded string according to a predefined schema (model). This string can then be stored or transferred over a network. The recipient of this data then decodes it using the same known schema.

    2. Faster Response time A compact serialized string transfers over the network much faster than the equivalent JSON text. Why? (A small size-comparison sketch appears at the end of this section, after the company examples.)

      • An encoded string or byte payload carries only the values, in a layout both sides already agree on, so it is smaller, cheaper to check for errors, and quicker to move across the network than verbose text. It is also easy to work with on either end: ordinary string operations such as splitting, transforming, and concatenating are enough to handle it in a script.

      • JSON, with its more structured textual format and potentially larger, more complex payloads, is harder for the receiving side to handle: field names and other metadata travel with every message, deeply nested structures increase parsing and processing cost (a potential performance bottleneck), and responses often include redundant or unnecessary fields, leading to larger response sizes and increased processing requirements.

    3. An extra layer of Security

      • Serialization ensures that data is structured and encoded in a specific format before transmission. This structured format makes it more difficult for malicious actors to manipulate or tamper with the data while it's being transmitted.

      • Combining encryption techniques with serialization secures data during transmission. By serializing data before encryption, sensitive information is safeguarded from unauthorized access.

      • Serialization enables input validation and sanitization before data transmission, thereby mitigating common security vulnerabilities such as injection attacks. This ensures that only validated and sanitized data is processed, enhancing system security.

      • Serialization can incorporate error-checking mechanisms to detect and handle errors during data transmission. By including error codes or checksums in serialized data, potential issues can be identified and addressed promptly.

    4. Serialization Solutions: Transforming Data for Leading Companies

      • Google: Google uses serialization in various services and products to optimize data processing and sharing between different systems and applications. For instance, in Google Cloud Platform, serialization plays a crucial role in efficiently encoding and decoding structured data for storage and transmission.

      • Signal: Signal utilizes Protocol Buffers (protobuf) for efficient encoding and decoding of structured data within its messaging platform. By integrating protobuf, Signal leverages a compact and high-speed serialization format, optimizing data transmission and storage.

        The use of protobuf in Signal allows for the following key functionalities:

        • Signal leverages Protocol Buffers to serialize and deserialize structured data, ensuring efficient encoding and decoding processes for messages exchanged between users.

        • By using protobuf, Signal can represent data in a compact binary format, optimizing the size of transmitted messages and reducing bandwidth usage.

        • Protocol Buffers facilitate interoperability by providing language-neutral and platform-neutral serialization, enabling Signal to communicate seamlessly across different programming languages and services.

        • Signal's utilization of Protocol Buffers enables efficient handling of large volumes of data transmission, ensuring scalability while maintaining optimal performance and speed.
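    To make the size argument from earlier concrete, here is a small, self-contained Python sketch. It is not protobuf itself (protobuf adds field tags and varint encoding on top of this idea); it simply contrasts the standard json module with a hand-packed binary layout from the struct module for the same record, which is the core reason a serialized payload is smaller and faster to ship.

    import json
    import struct

    # A single pokemon-like record, illustrative only.
    record = {"number": 25, "name": "Pikachu", "hit_points": 35, "attack": 55, "speed": 90}

    # Text encoding: field names and punctuation travel with every message.
    json_bytes = json.dumps(record).encode("utf-8")

    # Binary encoding: both sides agree on the layout up front (which is what a
    # schema gives you), so only the values travel over the wire.
    name = record["name"].encode("utf-8")
    packed = struct.pack(
        f"<I B {len(name)}s H H H",
        record["number"], len(name), name,
        record["hit_points"], record["attack"], record["speed"],
    )

    print(len(json_bytes), "bytes as JSON")   # 78 bytes
    print(len(packed), "bytes packed")        # 18 bytes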

  3. Practical Application of Serialization

    Let us explore some practical examples highlighting the significance of serialization in modern software architectures.

    • Serialization is particularly useful in scenarios where objects need to be saved to disk or transmitted over a network and then reconstructed at a later time. It is also used to implement features such as data backup & restore, data migration, and data exchange between different systems.

    • Consider a scenario where a Client-Server Architecture communicates via a REST API. The response to a request in this API is typically transmitted in JSON format. However, JSON's verbose, textual structure makes parsing and transmission more expensive. Binary serialization offers a solution by encoding the same data far more compactly, resulting in shorter request-response cycles. Additionally, serialization facilitates efficient storage and caching, despite potential challenges in cache validation.

    • Another use case is a decentralized system where data is stored in a distributed manner. When a user requests a certain file or piece of data, it first needs to be gathered from, or transmitted by, multiple places. Serialization is essential here, because unserialized data would be far too complex to stitch back together after transmission. A larger, unserialized payload also takes longer to transmit and is more exposed to transmission errors than the same data sent as a compact serialized string.

  4. Practical implementation The best example of a protobuf implementation that is also easy to understand is a gRPC service or proxy server. What's a gRPC server/service? To understand it, first think of a REST API service. A REST service generally responds with JSON, a serialization format that is human-readable and widely supported. However, JSON's textual nature leads to larger message sizes, because field names and other textual metadata are included in every payload, resulting in higher overhead. Now imagine that same REST API using a binary, string-based serialization format for its responses; the schema language and encoding that turns the JSON-shaped data into that serialized binary format is protobuf, with proto3 being the current syntax version.

    To implement a gRPC service, or simply to serialize data in the proto3 format, we have to define a proto3 schema for each of our response formats.

    I'm gonna use FastAPI to create my REST endpoints, which will call a gRPC service endpoint using protobuf. Our service is a simple REST-gRPC setup: we hit a /pokemon endpoint and it responds with a pokemon after running some pokedex logic. Now think of the JSON data that would be created every time the endpoint gets hit: every pokemon JSON response must have the same keys. So let's make a pokedex.proto file using the general attributes/properties a pokemon response needs.



    //pokedex type settings

    syntax = "proto3"; //Initialize by declaring the syntax version 

    package pokedex;

    message Pokemon {
        uint64 number = 1;  // id 
        string name = 2;    // name 
        string type_one = 3; //type
        string type_two = 4;  // secondary type (empty if none)
        uint64 total = 5;   // total of all base stats
        uint64 hit_points = 6;  //HP
        uint64 attack = 7;      //ATCK
        uint64 defense = 8;     // DEF
        uint64 special_attack = 9;  // SPCL ATCK
        uint64 special_defense = 10;  //  SPCL DEF
        uint64 speed = 11;           // SPEED
        uint64 generation = 12;      // GEN
        bool legendary = 13;         // legendary flag
    }

Once the syntax gets compiled into a Python class, these fields become properties of the generated Pokemon response class; they are used during serialization on the way out and let us decode the string back into the original message on the way in.

Use a command like the one below (or go to the protobuf/gRPC docs and get the right invocation for your setup) to generate the Python classes for the pokedex Pokemon message into a pokemon_pb2.py module, along with a matching _pb2_grpc.py module for the gRPC stubs. The grpc_tools.protoc module comes from the grpcio-tools package.

python -m grpc_tools.protoc -I$INCLUDE --python_out=$OUTPUT --grpc_python_out=$OUTPUT $PROTO_FILES
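Once the classes are generated, here is a quick sketch of the server side, assuming the compiled module imports as pokemon_pb2 (the module name follows whatever your generated file is called): build a Pokemon message and serialize it to bytes, which is what actually travels over the wire.

import pokemon_pb2  # the module generated by the protoc command above

# Populate the message; field names match the .proto definition.
pikachu = pokemon_pb2.Pokemon(
    number=25,
    name="Pikachu",
    type_one="Electric",
    hit_points=35,
    attack=55,
    defense=40,
    special_attack=50,
    special_defense=50,
    speed=90,
    generation=1,
    legendary=False,
)

payload = pikachu.SerializeToString()  # compact bytes, ready to send to the client
print(type(payload), len(payload))     # <class 'bytes'> and only a few dozen bytes

The receiver reverses the process with ParseFromString, which is exactly what the verification step below does.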

Here we verify our pokemon response serialization by first taking request_body as a parameter, which is expected to be a bytes object representing a serialized Protobuf message. A new Pokemon object is created using the pokemon_pb2 module, and its ParseFromString method deserializes the request_body bytes into that pokemon object. With protobuf_to_dict() we get a dictionary out of the proto object, and then a loop iterates over reason_dict, which contains the rules used to verify our proto_dict response.
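A rough sketch of that verification step. It assumes protobuf_to_dict comes from the third-party protobuf3-to-dict package (google.protobuf.json_format.MessageToDict is the equivalent in the standard protobuf runtime), and the reason_dict rules here are hypothetical placeholders for your own pokedex logic.

import pokemon_pb2
from protobuf_to_dict import protobuf_to_dict  # third-party helper: proto message -> dict

# Hypothetical validation rules: field name -> predicate it must satisfy.
reason_dict = {
    "number": lambda v: v > 0,
    "name": lambda v: isinstance(v, str) and v != "",
    "hit_points": lambda v: 0 < v <= 255,
}

def verify_pokemon(request_body: bytes) -> dict:
    """Deserialize a serialized Pokemon message and check it against reason_dict."""
    pokemon = pokemon_pb2.Pokemon()
    pokemon.ParseFromString(request_body)   # bytes -> Pokemon message
    proto_dict = protobuf_to_dict(pokemon)  # Pokemon message -> plain dict

    for field, rule in reason_dict.items():
        if field in proto_dict and not rule(proto_dict[field]):
            raise ValueError(f"field {field!r} failed validation")
    return proto_dict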

Things to take care of during serialization using Protobuf

Precisely, Concisely Document Most Fields and Messages

Chances are good your proto will be inherited and used by people who don’t know what you were thinking when you wrote or modified it. Document each field in terms that will be useful to a new team member or client with little knowledge of your system.

Some guidance from the official protobuf best-practices guide:


  1. Include a Field Read Mask in Read Requests

    %[https://gist.github.com/harry-urek/cc9935e9fbac91dd755cd3654de48098]

  • If you use the recommended google.protobuf.FieldMask, you can use the FieldMaskUtil (Java/C++) libraries to automatically filter a proto.


  • An alternative is to always populate all fields by default, but that becomes costly as the proto grows.

  • The worst failure mode is to have an implicit (undeclared) read mask that varies depending on which method populated the message. This anti-pattern leads to apparent data loss on clients that build a local cache from response protos.
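  Python has no FieldMaskUtil, but the google.protobuf.field_mask_pb2.FieldMask well-known type ships with helper methods that do the same filtering. A minimal sketch, reusing the Pokemon message from earlier (the mask paths are just an example):

    from google.protobuf import field_mask_pb2

    import pokemon_pb2

    def read_pokemon(full: pokemon_pb2.Pokemon, read_mask: field_mask_pb2.FieldMask) -> pokemon_pb2.Pokemon:
        """Return a copy of `full` containing only the fields named in `read_mask`."""
        filtered = pokemon_pb2.Pokemon()
        # MergeMessage copies only the masked fields from source into destination.
        read_mask.MergeMessage(full, filtered)
        return filtered

    # A client asking only for a Pokemon's name and speed:
    mask = field_mask_pb2.FieldMask(paths=["name", "speed"])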

    Include a Version Field to Allow for Consistent Reads

    Clients typically expect to retrieve what they wrote when performing a write followed by a read of the same object, even if this expectation may not be reasonable for the underlying storage system.

    Your server will read the local value, and if the local version_info is less than the expected version_info, it will read from remote replicas to find the latest value. Typically version_info is a proto encoded as a string that includes the datacenter the mutation went to and the timestamp at which it was committed.

    Even systems backed by consistent storage often want a token to trigger the more expensive read-consistent path rather than incurring the cost on every read.
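    A sketch of what that read path can look like on the server. Everything here is hypothetical: the in-memory "replicas", the version tokens (plain integers for simplicity, whereas in practice the text above describes an encoded proto string), and the helper names.

    # Hypothetical consistent-read path driven by a client-supplied version token.
    LOCAL_REPLICA = {25: ({"name": "Pikachu", "speed": 90}, 3)}   # number -> (record, version)
    REMOTE_REPLICA = {25: ({"name": "Pikachu", "speed": 90}, 5)}  # a fresher copy elsewhere

    def read_pokemon(number: int, expected_version: int | None = None) -> dict:
        record, local_version = LOCAL_REPLICA[number]
        if expected_version is None or local_version >= expected_version:
            return record  # the local copy is at least as fresh as the client expects
        # The local replica is stale relative to what the client last wrote,
        # so take the more expensive path and consult a remote replica.
        record, _ = REMOTE_REPLICA[number]
        return record

    # A client that just wrote at version 5 reads its own write back:
    print(read_pokemon(25, expected_version=5))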

    Use Consistent Request Options for RPCs that Return the Same Data Type

    A common failure pattern is a service whose RPCs all return the same data type but where each RPC defines its own request options for details such as maximum comments, the list of supported embed types, and similar parameters.

    Approaching this ad hoc incurs increased complexity on both the client and server sides. Clients must navigate how to fill out each request, while servers must handle the transformation of multiple request options into a common internal format. A not-small number of real-life bugs are traceable to this example.

    Instead, create a single, separate message to hold request options and include that in each of the top-level request messages. Here’s a better-practices example:

    %[https://gist.github.com/harry-urek/b6b5d62cc46911c15d9e3b83451fcf18]

    Don’t Encode Data in a String That You Expect a Client to Construct or Parse

    It’s less efficient over the wire, more work for the consumer of the proto, and confusing for someone reading your documentation. Your clients also have to wonder about the encoding:

    Are lists comma-separated? Did I escape this untrusted data correctly? Are numbers base-10? Better to have clients send an actual message or primitive type. It’s more compact over the wire and clearer for your clients.

    This issue becomes particularly problematic when your service acquires clients in multiple languages. Now each client will have to select the appropriate parser or builder, or even worse, develop one from scratch.

    More generally, choose the right primitive type. See the Scalar Value Types table in the Protocol Buffer Language Guide.

    Returning HTML in a Front-End Proto

    With a JavaScript client, it’s tempting to return HTML or JSON in a field of your API. This is a slippery slope towards tying your API to a specific UI. Here are three concrete dangers:

    • A “scrappy” non-web client will end up parsing your HTML or JSON to get the data they want, leading to fragility if you change formats and vulnerabilities if their parsing is bad.

    • Your web-client is now vulnerable to an XSS exploit if that HTML is ever returned unsanitized.

    • The tags and classes you’re returning expect a particular style-sheet and DOM structure. From release to release, that structure will change, and you risk a version-skew problem where the JavaScript client is older than the server and the HTML the server returns no longer renders properly on old clients. For projects that release often, this is not an edge case.

Other than the initial page load, it’s usually better to return data and use client-side templating to construct HTML on the client.


Written by

Hari Om Bhardwaj

Hi! I'm a Noobgrammer trying to be a programmer. I hope I can help you.