It's about picking the right library to serialize lots of data in a way that is still easy to access from different programming languages. Benefits will hopefully include properly structured data and saving disk space at the same time.
Background
A lot of different RPC and/or serialization projects have popped up in the last year or so. You might have heard of Thrift, Protocol Buffers and Etch.
They're all based on roughly the same idea, write a file describing the service and objects you're exposing, use a tool on the file to generate code in the target programming language and with a few extra lines of code you can have a server or client up and running. For more information on the details have a look at one of the project websites linked above.
Why?
Enough about RPC, this is about data. All of the above mentioned projects can also be used to serialize objects to disk. I work for Last.fm and we gather a lot of data that we need to store and process, mainly using Hadoop. Most of it is currently in text format, large parts of it compressed with Lzo. The output of certain jobs serialize the data using Hadoop's RecordIO. This reduces the space needed and adds other nice bits such as raw comparators that speed up Hadoop jobs.
What if we could find a method of storing data that would give us the compactness of Record IO, but in a way that we can easily access it from many programming languages?
Language support
Let's have a look at what programming languages each of the projects support, this is not a complete list but it contains the "important" ones.
| Language | Thrift | Protocol buffers | Record IO |
|---|---|---|---|
| Java | Yes | Yes | Yes |
| C++ | Yes | Yes | Yes |
| Python | Yes | Yes | No |
| Ruby | Yes | Yes, unofficially | No |
| PHP | Yes | Yes, unofficially | No |
| Erlang | Yes | Yes, unofficially | No |
Record size comparison
I have written a simple program to compare how compact the different formats are and how well they compress. The program simply reads each input dataset in text format and serializes it back to disk using the libraries. Below are the results for two of our datasets. One is made up of ~10 or so integers per record. The other is your average apache weblogs.
I used the trunk of Thrift (needed for the compact binary format), v2.0.3 of Protocol Buffers and Record IO from Hadoop 0.18.
I make no claims that this is a perfect or even remotely scientific test, but it should give you a rough idea.
I didn't bother to include Etch in the comparison since it didn't pass my "I want to get this working right NOOOW!"-test. The two different Thrift formats are the normal BinaryProtocol and the compact format recently introduced in: THRIFT-110.
Integer dataset
| Library | Avg input record (bytes) | Avg output record (bytes) | Avg output record, lzo compressed (bytes) |
|---|---|---|---|
| Text | 55 | 55 | 29 |
| Thrift (old binary format) | 55 | 78 | 23 |
| Thrift (dense binary format) | 55 | 38 | 21 |
| ProtoBuf | 55 | 36 | 20 |
| Record IO | 55 | 27 | 19 |
Weblogs
| Library | Avg input record (bytes) | Avg output record (bytes) | Avg output record, lzo compressed (bytes) |
|---|---|---|---|
| Text | 301 | 295 | 107 |
| Thrift (old binary format) | 301 | 323 | 111 |
| Thrift (dense binary format) | 301 | 275 | 107 |
| ProtoBuf | 301 | 276 | 106 |
| Record IO | 301 | 267 | 104 |
Final thoughts
As you can see in the tables above, Record IO does a very good job at producing compact output. The obvious downside is the poor language support, where both Thrift and Protocol Buffers do a great job in comparison.
Thrift have chosen to include the language implementations in the official distribution and to build a community that supports them. This is in contrast to what Protocol Buffers does, they haven't included any new language libraries in the official source repository as far as I know.
We already use Thrift extensively for RPC at Last.fm, making it a convenient choice for us. However, I look forward to see what happens with all the projects in this area.
That concludes my first proper blog post, I'm not going to pick a winner or loser, they all have strong and weak points. There's plenty of room for follow ups and more in detail comparisons of code etc. Given time I might post a speed comparison, for example the number of serializations and deserializations per second from each library.
I'd love to see pros/cons of any of the libraries in the comments.
5 comments:
You could have mentioned typed bytes as well. The simplicity of the typed bytes format might eventually lead to very good language support (although, afaik, there currently only are Java and Python implementations), but it does (intentionally) sacrifice some compactness in order to reach this level of simplicity.
You put it in quotes, so I can't fault you too much, but another important issue with these libraries is support for C and C#, which combined are over 20% on the TIOBE index. C# support is great in both Thrift and (as far as I can tell) Protocol Buffers. No for Record IO. C support, on the other hand, is weak. There have been submissions to the Thrift project, but none with complete implementations. The C library for protobuf uses its own custom compiler and is pretty new as well.
As far as additional libraries, http://eigenclass.org/R2/writings/extprot-extensible-protocols-intro looks interesting... certainly more interesting than Etch.
One more advantage of Record IO that should be mentioned here is that the generated classes contain a static raw comparator method that is used by Hadoop Map-Reduce while sorting/shuffling map outputs and reduce inputs. Thus, for simply comparing keys, the records do not have to be instantiated and deserialized, the serialized versions can be compared directly. This speeds up the shuffle, which is the most time-consuming stage of many map-reduce jobs.
For debugging ease, record I/O supports different serialization methods (in addition to compact binary), such as CSV, and XML.
Supporting additional languages is simple too, since the code generators are modularly supported from the serialization mechanisms.
(But then I may be biased, since I was the primary author of Record IO.)
Michael: Thanks for the info, I drew the line fairly arbitrarily based on my own personal favorites :) I'll update the table when I get a minute.
Can't get the extprot link to work right now, will check later.
Techmilind: I do mention the raw comparators in passing, there's even a link. I certainly appreciate the speedup they give.
I'm also sure that implementing Record IO in new languages would be possible, but there doesn't seem to be a community or momentum to do so right now (correct me if I'm wrong). Perhaps it should be broken out into a separate project? That would give it more attention and a chance to build a community.
Müzik Dinle
Post a Comment