Test results
Here are the results of the same test as in the previous post with the addition of Avro.
Integer dataset
| Library | Avg input record (bytes) | Avg output record (bytes) | Avg output record, lzo compressed (bytes) |
|---|---|---|---|
| Avro | 55 | 26 | 18 |
| Text | 55 | 55 | 29 |
| Thrift (old binary format) | 55 | 78 | 23 |
| Thrift (dense binary format) | 55 | 38 | 21 |
| ProtoBuf | 55 | 36 | 20 |
| Record IO | 55 | 27 | 19 |
Weblogs
| Library | Avg input record (bytes) | Avg output record (bytes) | Avg output record, lzo compressed (bytes) |
|---|---|---|---|
| Avro | 301 | 266 | 105 |
| Text | 301 | 295 | 107 |
| Thrift (old binary format) | 301 | 323 | 111 |
| Thrift (dense binary format) | 301 | 275 | 107 |
| ProtoBuf | 301 | 276 | 106 |
| Record IO | 301 | 267 | 104 |
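To make the two tables easier to compare, here is a small sketch that expresses each output size as a fraction of the input size and picks out the smallest entry per column. The numbers are copied from the tables above; the helper names `ratios` and `smallest` are my own and purely illustrative.

```python
# Average record sizes in bytes, copied from the tables above:
# (input, output, output after LZO compression)
INTEGER = {
    "Avro": (55, 26, 18),
    "Text": (55, 55, 29),
    "Thrift (old binary format)": (55, 78, 23),
    "Thrift (dense binary format)": (55, 38, 21),
    "ProtoBuf": (55, 36, 20),
    "Record IO": (55, 27, 19),
}
WEBLOGS = {
    "Avro": (301, 266, 105),
    "Text": (301, 295, 107),
    "Thrift (old binary format)": (301, 323, 111),
    "Thrift (dense binary format)": (301, 275, 107),
    "ProtoBuf": (301, 276, 106),
    "Record IO": (301, 267, 104),
}


def ratios(table):
    """Return {library: (output/input, compressed/input)}, rounded to 2 places."""
    return {
        name: (round(out / inp, 2), round(lzo / inp, 2))
        for name, (inp, out, lzo) in table.items()
    }


def smallest(table, column):
    """Libraries with the smallest value in a column (1 = raw output, 2 = LZO)."""
    best = min(row[column] for row in table.values())
    return [name for name, row in table.items() if row[column] == best]


if __name__ == "__main__":
    for title, table in (("Integer dataset", INTEGER), ("Weblogs", WEBLOGS)):
        print(title)
        for name, (raw, lzo) in ratios(table).items():
            print(f"  {name:30s} raw {raw:.2f}  lzo {lzo:.2f}")
        print("  smallest raw:", smallest(table, 1))
        print("  smallest lzo:", smallest(table, 2))
```

Running it shows the spread between formats is much wider on the integer dataset than on the weblogs, where long strings dominate the record.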
Avro
As you can see, Avro does very well: it produces the smallest uncompressed records on both datasets, which makes it well suited to storing large data sets. Avro saves space by not storing schema information with every record. It's worth noting that after compression the size difference shrinks significantly; on the weblogs all six formats land between 104 and 111 bytes per record, and Record IO edges out Avro by one byte.
Avro specifies the record structure using JSON (see the docs). Unlike the other projects, Avro doesn't require code generation from this spec, although I did generate Java code for my tests.
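To give a feel for why leaving schema information out of each record saves space, here is a rough sketch in plain Python, not using the Avro library. Avro writes ints and longs as zig-zag variable-length integers and writes a record as its field values back to back in schema order. The "tagged" layout below is a deliberately simplified stand-in (one hypothetical field-id byte per value) and is not the actual wire format of Thrift or Protocol Buffers.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return (n << 1) ^ (n >> 63)


def varint(u: int) -> bytes:
    """Encode an unsigned integer 7 bits at a time, low bits first."""
    out = bytearray()
    while u > 0x7F:
        out.append((u & 0x7F) | 0x80)  # high bit set: more bytes follow
        u >>= 7
    out.append(u)
    return bytes(out)


def encode_long(n: int) -> bytes:
    """Zig-zag varint encoding, as Avro uses for int and long values."""
    return varint(zigzag(n))


def encode_untagged(values) -> bytes:
    """Avro-style record body: field values concatenated in schema order."""
    return b"".join(encode_long(v) for v in values)


def encode_tagged(values) -> bytes:
    """Simplified stand-in for a tagged format: a 1-byte field id before each value."""
    return b"".join(bytes([i + 1]) + encode_long(v) for i, v in enumerate(values))


if __name__ == "__main__":
    record = [1, -2, 64, 1000]  # made-up field values
    print(len(encode_untagged(record)), "bytes untagged")
    print(len(encode_tagged(record)), "bytes tagged")
```

The trade-off is that the untagged bytes are meaningless without the schema, so the reader has to obtain it some other way, for example once per file rather than once per record.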
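For illustration, this is roughly what a JSON schema for a weblog-style record could look like, built and printed from Python using only the standard library. The record name, namespace and fields are made up for this example and are not the schema I used in the tests.

```python
import json

# Hypothetical schema: every name below is invented for illustration.
WEBLOG_SCHEMA = {
    "type": "record",
    "name": "WeblogEntry",
    "namespace": "example.logs",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "host", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "status", "type": "int"},
        {"name": "bytes_sent", "type": "long"},
    ],
}


def field_names(schema):
    """Field names in declaration order, which is also the order values are written in."""
    return [field["name"] for field in schema["fields"]]


# The JSON text is what would be handed to an Avro implementation.
schema_json = json.dumps(WEBLOG_SCHEMA, indent=2)

if __name__ == "__main__":
    print(schema_json)
    print(field_names(WEBLOG_SCHEMA))
```

Because the schema is plain JSON data, a program can load it at runtime and work with records directly, which is what makes code generation optional.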
More details about how Avro solves this problem from an email by Doug Cutting:
Avro's Java implementation currently includes three different data representations:
- a "generic" representation uses a standard set of datastructures for all datatypes: records are represented as Map<String,Object>, arrays as List<Object>, longs as Long, etc.
- a "reflect" representation uses Java reflection to permit one to read and write existing Java classes with Avro.
- a "specific" representation generates Java classes that are compiled and loaded, much like Thrift and Protocol Buffers.
We don't expect most scripting languages to use more than a single representation. Implementing Avro is quite simple, by design. We have a Python implementation, and hope to add more soon.
Avro is a very early-stage project and so far supports only Java and Python, but I wouldn't be surprised if it picks up other implementations very quickly. Avro was just voted a Hadoop subproject, which ought to help in the marketing department.
All in all, there are a lot of interesting projects out there, and I'm looking forward to seeing how they all do in the next year or so.
Just like last time, the bit missing from these tests is performance. Once various Hadoop tickets are resolved, I intend to benchmark real-world Hadoop data processing with Avro, Record IO, Thrift and Protocol Buffers.