Avro
Avro is a row-based data format and data serialisation system released by the Hadoop working group in 2009.
For more details please visit the official documentation, and for a comparison with other formats read Demystify Hadoop Data Formats: Avro, ORC, and Parquet.
Mainly I’m focusing on the following points:
- Avro serialisation / deserialisation
- Schema evolution and Avro compatibility
Avro serialisation
Here I’m taking an example of how Avro bytes are encoded.
public class MyRecord {
    int id;
    String data;
    String section;
}
The Avro schema for the above class will be:
{
  "type": "record",
  "name": "MyRecord",
  "namespace": "com.avro.test$",
  "fields": [
    {
      "name": "id",
      "type": "int",
      "default": 0
    },
    {
      "name": "data",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "section",
      "type": ["null", "string"],
      "default": null
    }
  ]
}
Now take an object (assuming MyRecord has an all-args constructor):
MyRecord r1 = new MyRecord(2, "abcdefgh", "ijk");
Now let’s see how the binary encoder works for it. Encoding is a recursive process.
The stack for parsing the schema will contain symbols like:
ROOT : the start symbol of any grammar.
SEQUENCE : represents a field that has multiple fields inside it, or a type that has multiple values.
TERMINAL : fields with primitive types: null, boolean, int, long, float, double.
IMPLICIT_ACTION : non-terminal symbols which can be consumed automatically, e.g. string, bytes.
There are a few more symbols; please check the org.apache.avro.io.parsing.Symbol class.
So encoding is a recursive call; here are the steps involved:
Initially buffer = []
Step 0 :
schema = root schema and datum = MyRecord(2, "abcdefgh", "ijk"). As the schema is the root record, it first encodes its field sequence.
Step 0.1 : Encode id = 2, schema = {"name": "id", "type": "int", "default": 0}
Here int is a TERMINAL symbol, so it is encoded directly.
An integer is encoded with the following (zigzag) formula:
n = (n << 1) ^ (n >> 31) // move sign to low-order bit, and flip others if negative
For n = 2 this value becomes 4.
buffer=[4]
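As a side note, the full int encoding is this zigzag step followed by a variable-length write (7 bits per byte, low bits first). Here is a minimal sketch in plain Java; this is my own helper, not Avro's API, but Avro does the equivalent internally:

// Zigzag-encode a 32-bit int, then write it as a variable-length
// integer, 7 bits per byte with a continuation bit.
static byte[] encodeInt(int n) {
    long z = ((long) ((n << 1) ^ (n >> 31))) & 0xFFFFFFFFL; // zigzag: 2 -> 4, -1 -> 1
    java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
    while ((z & ~0x7FL) != 0) {               // more than 7 bits remain
        out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits plus continuation bit
        z >>>= 7;
    }
    out.write((int) z);
    return out.toByteArray();                 // encodeInt(2) -> [4]
}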
Step 0.2 : Now encode data = "abcdefgh".
The type of data is a union of [null, string], which is a SEQUENCE:
{
  "name": "data",
  "type": ["null", "string"],
  "default": null
}
So first the index of "string" within the union is encoded, which is 1;
for n = 1 the encoded value is 2 (using n = (n << 1) ^ (n >> 31)).
buffer=[4,2]
Step 0.2.1 : Now encode the string itself with IMPLICIT_ACTION.
For a string, its length is encoded first:
length = 8 becomes 16 (using n = (n << 1) ^ (n >> 31))
buffer=[4,2,16]
Then the actual string "abcdefgh" as its UTF-8 bytes (identical to ASCII here), i.e. [97, 98, 99, 100, 101, 102, 103, 104]
buffer = [4, 2, 16, 97, 98, 99, 100, 101, 102, 103, 104]
Step 0.3 : Similarly, for section = "ijk" it will append [2, 6, 105, 106, 107].
So finally the bytes will be:
buffer = [4, 2, 16, 97, 98, 99, 100, 101, 102, 103, 104, 2, 6, 105, 106, 107]
See https://avro.apache.org/docs/1.8.1/spec.html#schema_primitive for more details on the encoding of different data types.
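If you want to reproduce these bytes yourself, here is a minimal sketch using Avro's GenericDatumWriter and BinaryEncoder. The class name and the inlined schema string are my own, and the namespace is omitted for brevity:

import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class EncodeDemo {
    // The record schema shown above, inlined as a string (namespace omitted).
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"MyRecord\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\",\"default\":0},"
      + "{\"name\":\"data\",\"type\":[\"null\",\"string\"],\"default\":null},"
      + "{\"name\":\"section\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord r1 = new GenericData.Record(schema);
        r1.put("id", 2);
        r1.put("data", "abcdefgh");
        r1.put("section", "ijk");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(r1, encoder);
        encoder.flush();

        // Prints [4, 2, 16, 97, 98, 99, 100, 101, 102, 103, 104, 2, 6, 105, 106, 107]
        System.out.println(Arrays.toString(out.toByteArray()));
    }
}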
Avro deserialisation
While decoding, Avro builds a parser from the writer schema you supply and tries to decode the bytes by reversing the encoding logic.
Here are possible deserialisations of bytes = [4, 2, 16, 97, 98, 99, 100, 101, 102, 103, 104, 2, 6, 105, 106, 107].
I’m trying to illustrate how the data will be read if we pass the following schemas while decoding the above bytes.
Case 1 : Decode with the same schema as the writer schema
public class MyRecord { int id; String data; String section; }
id = 2
data = "abcdefgh"
section = "ijk"
So with the original schema there is no problem in reading.
Case 2:
public class MyRecord { int id; String data; }
id = 2
data = "abcdefgh"
Here we are not able to read section, so we are losing it. That is an issue.
Case 3:
public class MyRecord { int id; String section; }
id = 2
section = "abcdefgh"
Here we see a problem: the section field is wrongly decoded with the value of the data field from the original record.
Case 4:
public class Entity { int id; String data; String section; String version; }
While decoding with the above schema it will give an array-out-of-bounds error, which results in an Avro deserialisation exception for the field version.
So it is clear that binary-encoded bytes can only be deserialised using the original (writer) schema.
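To make Case 1 concrete, here is a rough sketch of decoding these bytes with the writer schema using GenericDatumReader; EncodeDemo.SCHEMA_JSON is the same schema string as in the encoding sketch above:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        byte[] bytes = {4, 2, 16, 97, 98, 99, 100, 101, 102, 103, 104, 2, 6, 105, 106, 107};

        // Case 1: decode with the same schema the data was written with.
        Schema writerSchema = new Schema.Parser().parse(EncodeDemo.SCHEMA_JSON);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        GenericRecord record =
            new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
        System.out.println(record); // {"id": 2, "data": "abcdefgh", "section": "ijk"}

        // Decoding with a schema that expects more data than was written (Case 4)
        // would run off the end of the buffer and fail with a deserialisation error.
    }
}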
Now we need to read this deserialised data, but before that an understanding of compatibility in Avro is needed.
How does a read work in Avro?
Now, to read this Avro-encoded data, two steps are needed:
- Decode the above encoded bytes using the original (writer) schema
- Then read the decoded data with whatever (compatible) schema you want to read with
Here one example is illustrated:
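Below is a minimal sketch of both steps with GenericDatumReader. The reader schema here is my own assumption for illustration: the writer schema plus an optional version field (like schema3), and EncodeDemo.SCHEMA_JSON is the writer schema string from the encoding sketch above:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class ReadDemo {
    // Assumed reader schema: the writer schema plus an optional "version" field.
    static final String READER_JSON =
        "{\"type\":\"record\",\"name\":\"MyRecord\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\",\"default\":0},"
      + "{\"name\":\"data\",\"type\":[\"null\",\"string\"],\"default\":null},"
      + "{\"name\":\"section\",\"type\":[\"null\",\"string\"],\"default\":null},"
      + "{\"name\":\"version\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

    public static void main(String[] args) throws Exception {
        byte[] bytes = {4, 2, 16, 97, 98, 99, 100, 101, 102, 103, 104, 2, 6, 105, 106, 107};
        Schema writerSchema = new Schema.Parser().parse(EncodeDemo.SCHEMA_JSON); // schema the bytes were written with
        Schema readerSchema = new Schema.Parser().parse(READER_JSON);            // schema we want to read with

        // Step 1 and step 2 in one go: decode with the writer schema, then resolve the
        // result against the reader schema (missing fields are filled from defaults).
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<>(writerSchema, readerSchema);
        GenericRecord result = reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null));
        System.out.println(result); // version comes out as null (its default)
    }
}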
Avro compatibility
Backward compatible
Every field in the reader (i.e. latest) schema with no default value must be present in the writer (i.e. older) schema. The older schema may have additional fields. In simple words, the new schema must be able to read data written with the older schema.
Allowed operations on schemas are:
→ Delete fields
→ Add optional fields (i.e. fields with a default value)
As explained in figure 1 above, schema2 and schema3 are backward compatible with the original schema1: in schema2 we have deleted one field, data, whereas in schema3 we have added one optional field, version.
In case of backward compatible schema changes, consumers need to upgrade first.
The reason is simple. Say schema1 is our original schema and the producer (e.g. your server) moves to the new schema2, where the data field is no longer present. Your client (e.g. a mobile application) will not be able to read the new data, as it is still expecting the data field.
Forward compatible
It is the opposite of backward compatibility: every non-default field of the older schema must be present in the new schema. In simple words, the older schema must be able to read data written with the new schema.
Allowed operations on schemas are:
→ Add fields
→ Delete optional fields (i.e. fields with a default value)
Here in our case schema4 is forward compatible with schema1, because schema1 can read data written with schema4: while reading, it will simply ignore the version field of schema4.
In case of forward compatible schema changes, producers need to upgrade first.
The reason you can work out easily; I’m leaving it to you.
Full compatible
Backward + forward. In simple words, data written with either schema can be read by the other.
Allowed operations on schemas are:
→ Add optional fields
→ Delete optional fields (i.e. fields with a default value)
In case of fully compatible schema changes, the upgrade order does not matter.
Incompatible
None of the above cases
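If you want to check a schema pair programmatically, recent Avro versions ship a SchemaCompatibility utility. A rough sketch, reusing the schema strings from the sketches above:

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatibilityDemo {
    public static void main(String[] args) {
        Schema olderSchema = new Schema.Parser().parse(EncodeDemo.SCHEMA_JSON); // schema1
        Schema newerSchema = new Schema.Parser().parse(ReadDemo.READER_JSON);   // schema1 + optional version

        // Backward compatibility: can the newer (reader) schema read data
        // written with the older (writer) schema?
        SchemaCompatibility.SchemaPairCompatibility backward =
            SchemaCompatibility.checkReaderWriterCompatibility(newerSchema, olderSchema);

        // Forward compatibility: can the older schema read data written with the newer one?
        SchemaCompatibility.SchemaPairCompatibility forward =
            SchemaCompatibility.checkReaderWriterCompatibility(olderSchema, newerSchema);

        System.out.println(backward.getType()); // COMPATIBLE
        System.out.println(forward.getType());  // COMPATIBLE (the added field is optional)
    }
}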
Conclusions
- While reading Avro data, we always need the writer schema so that we can deserialise the Avro bytes.
- The reader schema must be compatible (backward, forward, or full) with the writer schema to read the data; otherwise it will throw an exception.
- In case of backward compatible changes, consumers always have to upgrade first; otherwise it will break the consumer apps.
- Interesting: sometimes in real life developers confuse forward compatible schema changes with backward compatibility of an application. The reason is simple: backward compatible means your older application must not break. So whenever you make forward compatible schema changes, you are only upgrading your server first, not the consumer applications. So, with respect to your application, whatever changes you made are backward, as they do not break older builds of the consumers.
Keep exploring, keep learning!!