Data Encoding and Evolution

Mahendra
5 min read · Jul 2, 2023


Every part of any system (even the human body) runs on data. The only difference lies in the format the data takes and the way it is transferred. For example, in humans, traits are passed across generations through DNA.
In computer systems, especially in the OSI model, each layer uses different data formats and protocols. In this blog I will focus mainly on the application layer.

The picture below describes how different systems interact with each other.

As you can see, there are different ways to communicate: within a node, across nodes, or across organisations. The APIs involved can be written in the widely used REST or RPC styles (RPC can be preferred for communication within the same organisation). Apart from these, there are a few more styles, such as SOAP and GraphQL.

All of these modes use various data formats according to need: JSON, XML, CSV, and different binary variants (MessagePack, BSON, Thrift, Protocol Buffers, Avro, etc.).

JSON, XML, and CSV are human-readable and widely used. Many APIs use JSON, XML is common in web applications, and you see CSV in almost every report download, whether it is your bank statement or anything else. But all of the formats mentioned above are quite verbose and have their own flaws; the verbosity is by design, so that humans can easily read the data.

But when your system grows, you want faster data movement across the network and a smaller data footprint in your databases or data lakes. That is when we think of reducing size by compacting the data, which gives birth to binary encoded data formats.

The scope of this blog is to walk through the binary encodings commonly used across organisations: MessagePack, Thrift, Protocol Buffers, and Avro. The main focus will be on how the data is encoded and how much compaction we get; for more details you can look each of them up individually.

Let's take one example to understand the data footprint in various formats. I'm borrowing it from the book Designing Data-Intensive Applications by Martin Kleppmann. Thanks, Martin, for this wonderful book.

{
  "username": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}

The above example is written in JSON. Encoded compactly, without extra whitespace, it takes 81 bytes.
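You can verify the byte count yourself with a quick sketch in Python's standard json module (the 81 bytes assume the compact form, with no spaces after separators):

import json

record = {
    "username": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Compact separators drop the spaces json.dumps adds by default.
encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")
print(len(encoded))  # 81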

MessagePack

In this format every data type has its own binary representation: a string is transformed into the bytes of its characters, prefixed with a type-and-length marker, and numbers are stored directly in their binary form.

The marker bytes here are: 0x80 for an object (map), 0xa0 for a string, 0xcd for a 16-bit integer, and 0x90 for an array; the low bits of the map, string, and array markers carry the element count or length.

This encoding takes 66 bytes, which is not such a significant reduction from the 81 bytes of the original JSON, and it comes at the cost of human readability. Let's reduce it more.
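You can reproduce that number with the msgpack Python package (a minimal sketch, assuming the package is installed via pip install msgpack and reusing the record dict from above):

import msgpack  # pip install msgpack

record = {
    "username": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

encoded = msgpack.packb(record)
print(len(encoded))        # 66
print(encoded[:1].hex())   # '83' -- a map (0x80) with 3 entries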

Thrift

Thrift was originally developed at Facebook, which open-sourced it in 2007. To reduce the size of our example it uses numeric field tags in place of field names.

struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}

The figure above describes how the example is compacted to 59 bytes using Thrift's BinaryProtocol. Notice there are no field names any more. But wait: it's just 59 bytes, down from 66. Can we improve it further?

YES, we can, using Thrift's CompactProtocol, which packs the field type and tag number into a single byte and uses variable-length integers. Rather than using a full eight bytes for 1337, it encodes it in just two bytes.

The encoding of integers here is tricky: the signed value is zigzag-encoded first and then written as a variable-length integer. Below I have explained the encodings of 1337, 60, and 9999.
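Here is a minimal sketch of that scheme in Python (my own illustration, not Thrift library code): zigzag maps signed values so that small negatives also stay small, and the varint then emits seven bits per byte, with the high bit set while more bytes follow.

def zigzag(n, bits=64):
    # Interleaves positives and negatives: 0, -1, 1, -2, 2 ...
    return (n << 1) ^ (n >> (bits - 1))

def varint(n):
    # Emit 7 bits at a time, least significant group first;
    # the high bit of a byte is set while more bytes follow.
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

for value in (60, 1337, 9999):
    print(value, varint(zigzag(value)).hex())
# 60   -> '78'      (1 byte)
# 1337 -> 'f214'    (2 bytes)
# 9999 -> '9e9c01'  (3 bytes)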

Protocol Buffers

Protocol Buffers is a binary encoding library developed by Google. The schema for our example looks like this:

message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}

It does the bit packing a little differently from Thrift's CompactProtocol, but in terms of data footprint it is almost the same. The diagram below explains its encoding.
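To make the encoding concrete, here is a hand-rolled sketch of the wire format (my own illustration, not the protobuf library): each field starts with a key byte holding field_number << 3 | wire_type; strings are length-delimited (wire type 2), and favorite_number is a plain varint (wire type 0, no zigzag for int64). For this record it comes to 33 bytes, in line with the "almost the same as Thrift" observation above.

def varint(n):
    # Same 7-bits-per-byte scheme as in the Thrift sketch.
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def string_field(field_number, s):
    data = s.encode("utf-8")
    # Wire type 2 = length-delimited: key, length, payload.
    return varint((field_number << 3) | 2) + varint(len(data)) + data

encoded = (
    string_field(1, "Martin")               # user_name = 1
    + varint((2 << 3) | 0) + varint(1337)   # favorite_number = 2
    + string_field(3, "daydreaming")        # repeated interests = 3
    + string_field(3, "hacking")
)
print(len(encoded))  # 33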

Field tags and schema evolution

We know schemas change over time as new business use cases reach our system; that is called schema evolution. It is explained further in Working of avro schemas.

In the case of Thrift and Protocol Buffers we never use field names in the encoded data; we use tags instead. So a change to a field name has no impact, but a change to a tag creates a new problem.

You can add new fields to the schema, provided that you give each field a new tag; old readers then keep working, as the sketch below shows.
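This sketch (my own illustration, reusing the varint layout from the protobuf example above) shows why that is forward compatible: the key byte carries both the tag and the wire type, so an old reader can tell how many bytes an unknown field occupies and simply skip it.

def read_varint(data, pos):
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode(data, known_tags):
    # Handles only the two wire types used in our example.
    record, pos = {}, 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        tag, wire_type = key >> 3, key & 0x7
        if wire_type == 0:                    # varint value
            value, pos = read_varint(data, pos)
        else:                                 # length-delimited value
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        if tag in known_tags:                 # unknown tags are skipped
            record.setdefault(tag, []).append(value)
    return record

# An old reader that only knows tags {1, 2, 3} still parses data
# containing a hypothetical new field tagged 4.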

Avro

Avro uses a schema to specify the structure of the data being encoded. Our example can be represented in these two forms:

The compaction of the data will look like this:

Interestingly, for our example Avro has compacted the record to 32 bytes from 81.
You can refer to my other blog, Working of avro schemas, which explains Avro and schema evolution in more detail.
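If you want to reproduce the 32 bytes, here is a sketch using the fastavro package (assumed installed via pip install fastavro; the schema mirrors the book's, with favoriteNumber as a nullable long, which is what adds the one-byte union index to the encoding):

import io
from fastavro import parse_schema, schemaless_writer

schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
})

buffer = io.BytesIO()
# schemaless_writer emits just the encoded record, with no file header.
schemaless_writer(buffer, schema, {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
})
print(len(buffer.getvalue()))  # 32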

References

Designing Data-Intensive Applications, by Martin Kleppmann
Working of avro schemas (my earlier blog)
