Wednesday, 25 February 2009

A compact and cross-language serialization library?

What's all this about then?
It's about picking the right library to serialize lots of data in a way that is still easy to access from different programming languages. Benefits will hopefully include properly structured data and saving disk space at the same time.

Background
A lot of different RPC and/or serialization projects have popped up in the last year or so. You might have heard of Thrift, Protocol Buffers and Etch.
They're all based on roughly the same idea: write a file describing the service and objects you're exposing, run a tool on that file to generate code in the target programming language, and with a few extra lines of code you have a server or client up and running. For more detail, have a look at the project websites.

Why?
Enough about RPC; this is about data. All of the above-mentioned projects can also be used to serialize objects to disk. I work for Last.fm, and we gather a lot of data that we need to store and process, mainly using Hadoop. Most of it is currently in text format, and large parts of it are compressed with LZO. Certain jobs serialize their output using Hadoop's Record IO, which reduces the space needed and adds other nice bits such as raw comparators that speed up Hadoop jobs.
What if we could find a way of storing data that gives us the compactness of Record IO but is also easy to access from many programming languages?

Language support
Let's have a look at which programming languages each of the projects supports. This is not a complete list, but it contains the "important" ones.

Language | Thrift | Protocol Buffers  | Record IO
Java     | Yes    | Yes               | Yes
C++      | Yes    | Yes               | Yes
Python   | Yes    | Yes               | No
Ruby     | Yes    | Yes, unofficially | No
PHP      | Yes    | Yes, unofficially | No
Erlang   | Yes    | Yes, unofficially | No

Record size comparison
I have written a simple program to compare how compact the different formats are and how well they compress. The program reads each input dataset in text format and serializes it back to disk using each of the libraries. Below are the results for two of our datasets. One is made up of roughly 10 integers per record. The other is ordinary Apache web logs.
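To give an idea of the kind of measurement involved, here is a minimal sketch in Python. It is not the program I used for the numbers below: it only compares a tab-separated text encoding with a naive fixed-width binary encoding, it uses zlib from the standard library as a stand-in for LZO, and the function names and sample data are made up for illustration.

```python
import struct
import zlib


def text_record(values):
    """Encode one record as a tab-separated line of decimal integers."""
    return ("\t".join(str(v) for v in values) + "\n").encode("ascii")


def binary_record(values):
    """Encode one record as fixed-width 4-byte big-endian signed integers."""
    return struct.pack(">%di" % len(values), *values)


def average_sizes(records, encode):
    """Return (avg raw bytes, avg zlib-compressed bytes) per record."""
    blob = b"".join(encode(r) for r in records)
    n = len(records)
    return len(blob) / n, len(zlib.compress(blob)) / n


if __name__ == "__main__":
    # Made-up sample data, not one of the datasets measured in this post.
    records = [[i, i * 7, i % 5, 1000 + i] for i in range(10000)]
    for name, encode in (("text", text_record), ("fixed binary", binary_record)):
        raw, packed = average_sizes(records, encode)
        print("%-12s raw=%.1f compressed=%.1f" % (name, raw, packed))
```

Swapping in another encoder only requires another function with the same shape as text_record, which is roughly how the libraries were plugged into the real comparison.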

I used the trunk of Thrift (needed for the compact binary format), v2.0.3 of Protocol Buffers and Record IO from Hadoop 0.18.

I make no claims that this is a perfect or even remotely scientific test, but it should give you a rough idea.

I didn't bother to include Etch in the comparison since it didn't pass my "I want to get this working right NOOOW!"-test. The two Thrift formats are the normal BinaryProtocol (the "old binary format" in the tables) and the compact format (the "dense binary format") recently introduced in THRIFT-110.
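As far as I understand it, a big part of what makes the compact Thrift format and Protocol Buffers smaller than fixed-width binary is variable-length integer encoding, where small numbers take fewer bytes. The sketch below shows the general zigzag plus varint technique in Python. It is an illustration of the idea under that assumption, not the actual code or exact wire format of either library, and the function names are my own.

```python
def zigzag(n):
    """Map a signed 64-bit int to unsigned so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def encode_varint(n):
    """Encode a non-negative int using 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            # More bits follow: set the continuation bit.
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_record(values):
    """Encode a list of signed ints as concatenated zigzag varints."""
    return b"".join(encode_varint(zigzag(v)) for v in values)


if __name__ == "__main__":
    record = [1, -1, 42, 300, 1000, 0, 7, 12, 99, 5]
    # Ten fixed-width 32-bit integers would always take 40 bytes.
    print(len(encode_record(record)), "bytes as varints")
```

A record of ten small integers shrinks considerably this way, while large values can end up taking more bytes than a fixed-width field would.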


Integer dataset

Library                      | Avg input record (bytes) | Avg output record (bytes) | Avg output record, LZO compressed (bytes)
Text                         | 55                       | 55                        | 29
Thrift (old binary format)   | 55                       | 78                        | 23
Thrift (dense binary format) | 55                       | 38                        | 21
ProtoBuf                     | 55                       | 36                        | 20
Record IO                    | 55                       | 27                        | 19

Weblogs

Library                      | Avg input record (bytes) | Avg output record (bytes) | Avg output record, LZO compressed (bytes)
Text                         | 301                      | 295                       | 107
Thrift (old binary format)   | 301                      | 323                       | 111
Thrift (dense binary format) | 301                      | 275                       | 107
ProtoBuf                     | 301                      | 276                       | 106
Record IO                    | 301                      | 267                       | 104

Final thoughts
As you can see in the tables above, Record IO does a very good job of producing compact output: 27 bytes per record on the integer dataset, against 36 for Protocol Buffers and 38 for the compact Thrift format. The obvious downside is the poor language support, where both Thrift and Protocol Buffers do a great job in comparison.
The Thrift project has chosen to include the language implementations in the official distribution and to build a community that supports them. Protocol Buffers takes a different approach: as far as I know, no new language libraries have been added to the official source repository.

We already use Thrift extensively for RPC at Last.fm, making it a convenient choice for us. However, I look forward to seeing what happens with all the projects in this area.

That concludes my first proper blog post. I'm not going to pick a winner or loser; they all have strong and weak points. There's plenty of room for follow-ups and more detailed comparisons of code etc. Given time I might post a speed comparison, for example the number of serializations and deserializations per second for each library.

I'd love to see pros/cons of any of the libraries in the comments.

Wednesday, 18 February 2009

First post, hello world etc

I've just invented this thing called "blogging", I think it's going to be big, all the kids will do it.