Comparison
I did some quick tests using a few popular compression applications on an example weblog file:
| Compressor | Original size | Compressed size | Compression speed | Decompression speed |
|---|---|---|---|---|
| Bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
| Gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
| Lzo --best | 8.3 GB | 2 GB | 4 MB/s | 60.6 MB/s |
| Lzo | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |
As you can see, Lzo doesn't compress as well as the others, but it sure is fast! That fits our needs quite well: we don't want to sacrifice too much speed for disk space.
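To make the tradeoff easier to compare, here is a small Python sketch that turns the table's figures into a compression ratio, the space saved, and a rough time to compress the whole file. It assumes the speeds were measured against the uncompressed input and that 1 GB is 1024 MB; neither is stated with the measurements, so treat the times as ballpark numbers only.

```python
# Figures copied from the table: name -> (compressed size in GB, compression speed in MB/s)
ORIGINAL_GB = 8.3
RESULTS = {
    "Bzip2": (1.1, 2.4),
    "Gzip": (1.8, 17.5),
    "Lzo --best": (2.0, 4.0),
    "Lzo": (2.9, 49.3),
}


def summarize(original_gb, compressed_gb, compress_mb_s):
    """Return (compression ratio, percent space saved, minutes to compress)."""
    ratio = original_gb / compressed_gb
    saved_pct = 100.0 * (1 - compressed_gb / original_gb)
    # Assumption: speed is relative to uncompressed input, and 1 GB = 1024 MB.
    minutes = original_gb * 1024 / compress_mb_s / 60
    return ratio, saved_pct, minutes


for name, (compressed_gb, speed) in RESULTS.items():
    ratio, saved, minutes = summarize(ORIGINAL_GB, compressed_gb, speed)
    print(f"{name:<11} ratio {ratio:4.1f}x  saved {saved:4.1f}%  ~{minutes:5.1f} min to compress")
```

Under those assumptions, plain Lzo gets through the file in a few minutes, while Bzip2 needs roughly an hour for a file about 2.6 times smaller than the Lzo one.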
TextInputFormat limitations
Unfortunately, if we were to compress all our raw data with Lzo, we would run into problems with Hadoop. If you dig around in the source for the TextInputFormat, you'll see why. Hadoop needs to be able to split large input files into sensible chunks, called InputSplits. The splits enable Hadoop to schedule map tasks that each process one part of the file instead of the whole file. After all, the whole point of Hadoop is the ability to process the data in parallel.
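As a rough illustration of the idea, the following Python sketch cuts a file of a given length into fixed-size byte ranges. It is a simplification and not Hadoop's actual code: the real split calculation also takes things like the HDFS block size and the configured minimum split size into account, and the function name `compute_splits` is made up for this example.

```python
from typing import List, Tuple


def compute_splits(file_length: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (start, length) byte ranges that together cover the whole file."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    start = 0
    while start < file_length:
        # The last split may be shorter than split_size.
        length = min(split_size, file_length - start)
        splits.append((start, length))
        start += length
    return splits


# Hypothetical example: a 10-byte file cut into 4-byte splits.
print(compute_splits(10, 4))
```

With plain text this works because a reader can start at any byte offset and scan forward to the next newline to find the beginning of a line. That is exactly the step that breaks down for compressed data.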
File splitting doesn't work so well with compressed files, since the TextInputFormat can't tell where a line starts in the middle of all that compressed data. Hadoop would simply process the whole file in one mapper, making the program very slow. Luckily, it turns out that Lzo files consist of small chunks of compressed data, so we can split on the position of those chunks, hooray!
LzoTextInputFormat
I implemented a drop-in replacement for TextInputFormat called LzoTextInputFormat. It allows us to process the big Lzo files in multiple mappers. First you have to do a one-off indexing of the file to find out where each Lzo block starts, so the InputSplits can be adjusted accordingly. Once that is done, you just process the files as usual, but with the LzoTextInputFormat instead of TextInputFormat.
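The adjustment step can be sketched roughly as follows. This is a minimal Python illustration of the idea, not the actual LzoTextInputFormat code: it assumes the index is simply a sorted list of byte offsets where compressed blocks start, it ignores the Lzo file header, and the names `align_split`, `align_all` and `block_offsets` are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def align_split(start: int, end: int, block_offsets: List[int],
                file_length: int) -> Optional[Tuple[int, int]]:
    """Move both ends of [start, end) forward to the next block start.

    block_offsets is the (assumed) index: sorted offsets of block starts.
    Returns None when no block begins inside the original split.
    """
    i = bisect_left(block_offsets, start)
    if i == len(block_offsets):
        return None  # no block starts at or after `start`
    new_start = block_offsets[i]
    j = bisect_left(block_offsets, end)
    # If no block starts at or after `end`, the split runs to the end of the file.
    new_end = block_offsets[j] if j < len(block_offsets) else file_length
    if new_start >= new_end:
        return None  # the split lies entirely inside a single block
    return new_start, new_end


def align_all(splits: List[Tuple[int, int]], block_offsets: List[int],
              file_length: int) -> List[Tuple[int, int]]:
    """Align every (start, length) split and drop the ones that end up empty."""
    aligned = []
    for start, length in splits:
        result = align_split(start, start + length, block_offsets, file_length)
        if result is not None:
            aligned.append(result)
    return aligned


# Hypothetical index: blocks start at offsets 0, 100, 250 and 400 in a 500-byte file.
print(align_all([(0, 200), (200, 200), (400, 100)], [0, 100, 250, 400], 500))
```

Because every boundary is moved in the same direction, the aligned splits still cover each block exactly once, and each mapper starts reading at a point where decompression can begin.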
Benchmark
A quick and unscientific MapReduce benchmark I did on 272 GB of uncompressed data gave me some interesting results. When the data was compressed with Lzo before running the job, the job ran about 30% faster than when it ran over the raw data. Win-win! We both save space and improve read speed, since less data has to be read from disk.
Licensing issues
By now you may have noticed that the LzoTextInputFormat isn't in Hadoop trunk in the SVN repository. It turned out that the Lzo classes were GPL-encumbered and had to be removed from the Apache 2 licensed Hadoop; they will get a new home, though. There have also been discussions about using a non-GPL library with speed/compression ratio tradeoffs similar to Lzo's, such as FastLZ. That said, I don't think anyone has checked yet whether those files can be made splittable. For further information, have a look at the discussions.
Alternatives
It should be mentioned that one alternative to compressing these files and storing them as-is would be to stick your data in SequenceFiles. They have built-in support for compression and are splittable. The downside is that SequenceFiles are fairly Hadoop-specific and, as far as I know, only have implementations in Java.
There are other efforts under way to make bzip2 and gzip files splittable in Hadoop.
Big thanks to the developers involved in adding Lzo support to Hadoop, Arun C Murthy and Chris Douglas to name a few.