Thursday, 16 April 2009

HUGUK #2

I recently organized the second Hadoop user group UK meetup at Sun's customer briefing center in London. On the off chance that there's a Hadoop user in the UK who reads this blog and didn't attend, shame on you!

For more information on what you missed, have a look at huguk.org.

Sunday, 12 April 2009

Avro - serialization follow up

Since I wrote this post about various serialization formats, a new one called Avro has been released by Doug Cutting of Hadoop fame.

Test results
Here are the results of the same test as in the previous post with the addition of Avro.

Integer dataset

Library                       Avg input record (bytes)  Avg output record (bytes)  Avg output record, LZO compressed (bytes)
Avro                          55                        26                         18
Text                          55                        55                         29
Thrift (old binary format)    55                        78                         23
Thrift (dense binary format)  55                        38                         21
ProtoBuf                      55                        36                         20
Record IO                     55                        27                         19

Weblogs

Library                       Avg input record (bytes)  Avg output record (bytes)  Avg output record, LZO compressed (bytes)
Avro                          301                       266                        105
Text                          301                       295                        107
Thrift (old binary format)    301                       323                        111
Thrift (dense binary format)  301                       275                        107
ProtoBuf                      301                       276                        106
Record IO                     301                       267                        104

Avro
As you can see, Avro does very well, beating all the others in average uncompressed record size on both datasets, which makes it very suitable for storing large data sets. Avro saves space by not storing schema information with every record. It's worth noting that after compression the size difference is reduced significantly: on the weblogs, Record IO even comes out one byte smaller than Avro.
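
To put a number on how much compression narrows the gap, here is a small Python sketch that uses only the figures from the two tables above. The `spread` helper is my own name for the ratio between the largest and smallest average record size in a column.

```python
# Average record sizes in bytes, copied from the two tables above:
# (output, output after LZO compression)
INTEGER = {
    "Avro": (26, 18),
    "Text": (55, 29),
    "Thrift (old binary format)": (78, 23),
    "Thrift (dense binary format)": (38, 21),
    "ProtoBuf": (36, 20),
    "Record IO": (27, 19),
}
WEBLOGS = {
    "Avro": (266, 105),
    "Text": (295, 107),
    "Thrift (old binary format)": (323, 111),
    "Thrift (dense binary format)": (275, 107),
    "ProtoBuf": (276, 106),
    "Record IO": (267, 104),
}


def spread(sizes, column):
    """Ratio of the largest to the smallest average record size in one column."""
    values = [row[column] for row in sizes.values()]
    return max(values) / min(values)


for name, sizes in (("Integer dataset", INTEGER), ("Weblogs", WEBLOGS)):
    print(f"{name}: uncompressed spread {spread(sizes, 0):.2f}x, "
          f"LZO spread {spread(sizes, 1):.2f}x")
```

On the integer dataset the largest format is 3.00x the smallest before compression and 1.61x after. On the weblogs the spread goes from 1.21x to 1.07x.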

Avro specifies the record structure using JSON, see docs. Unlike the other projects, Avro doesn't require code generation from this spec, although I did generate Java code for my tests.
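
To give an idea of what such a JSON spec looks like, here is a minimal sketch of a record schema, loaded with Python's standard json module. The record name and fields (`PageView`, `user_id`, `url`, `status`) are made up for illustration and are not the schema I used in the tests.

```python
import json

# Hypothetical schema for illustration only; not the schema used in the tests.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "url", "type": "string"},
    {"name": "status", "type": "int"}
  ]
}
"""

# Because the schema is plain JSON, any JSON parser can read it.
schema = json.loads(SCHEMA_JSON)
field_names = [field["name"] for field in schema["fields"]]
print(schema["name"], field_names)
```

Since the schema is just data, a program can read it at runtime and decide how to interpret records, which is why code generation is optional.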

Here are more details on how Avro represents data without requiring generated code, from an email by Doug Cutting:


Avro's Java implementation currently includes three different data representations:

- a "generic" representation uses a standard set of datastructures for all datatypes: records are represented as Map<String,Object>, arrays as List<Object>, longs as Long, etc.

- a "reflect" representation uses Java reflection to permit one to read and write existing Java classes with Avro.

- a "specific" representation generates Java classes that are compiled and loaded, much like Thrift and Protocol Buffers.

We don't expect most scripting languages to use more than a single representation. Implementing Avro is quite simple, by design. We have a Python implementation, and hope to add more soon.
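
As a rough Python analogue of the "generic" representation, the sketch below treats a record as a plain dict and writes its fields in schema order, with no per-field names or tags in the output. It assumes the encoding described in the Avro specification as I understand it (zig-zag variable-length integers, length-prefixed UTF-8 strings); the details of the current release may differ. The schema, the record and all function names are hypothetical, and only the int, long and string types are handled.

```python
def zigzag(n):
    """Map a signed 64-bit integer to an unsigned one, small magnitudes first."""
    return (n << 1) ^ (n >> 63)


def encode_long(n):
    """Encode an integer as a zig-zag, variable-length byte sequence."""
    u = zigzag(n)
    out = bytearray()
    while True:
        low_bits = u & 0x7F
        u >>= 7
        if u:
            out.append(low_bits | 0x80)  # more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_string(s):
    """Encode a string as its byte length followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"int": encode_long, "long": encode_long, "string": encode_string}


def encode_record(schema, record):
    """Write field values in schema order; the schema itself is not written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


# Hypothetical schema and record, for illustration only.
schema = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
        {"name": "status", "type": "int"},
    ],
}
record = {"user_id": 42, "url": "/index.html", "status": 200}

encoded = encode_record(schema, record)
print(len(encoded), "bytes")
```

The reader needs the same schema to make sense of the bytes, which is the trade-off for leaving field information out of every record.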


Avro is a very early-stage project, and so far it only supports Java and Python, but I wouldn't be surprised if it picks up other implementations very quickly. Avro just got voted a Hadoop subproject, which ought to help in the marketing dept.

All in all there are a lot of interesting projects out there, and I'm looking forward to seeing how they all do in the next year or so.

Just like last time, the bit missing from these tests is performance. When various Hadoop tickets are resolved, I intend to benchmark real-world Hadoop data processing with Avro, Record IO, Thrift and Protocol Buffers.