Tuesday, 3 March 2009

Hadoop feat. Lzo - save disk space and speed up your programs

The whole point of Hadoop is to process very large datasets. This implies that you will be using a lot of disk space, all those big files replicated a couple of times add up. Let's look at how we can compress text files, saving disk space without losing performance.

Comparison
I did some quick tests using a few popular compression applications on an example weblog file:







































CompressorOriginal sizeCompressed sizeCompression speedDecompression speed
Bzip28.3 gb1.1 gb2.4 mb/s9.5 mb/s
Gzip8.3 gb1.8 gb17.5 mb/s58 mb/s
Lzo --best8.3 gb2 gb4 mb/s60.6 mb/s
Lzo8.3 gb2.9 gb49.3 mb/s74.6 mb/s






As you can see Lzo doesn't compress all that well compared to the others but it sure is fast! That fits our needs quite well, we don't want to sacrifice too much speed for disk space.

TextInputFormat limitations
Unfortunately if we were to compress all our raw data with Lzo we would run into problems with Hadoop. If you dig around in the source for the TextInputFormat you'll see why. Hadoop needs to be able to split large input files into sensible chunks, called InputSplits. The splits enable Hadoop to schedule map tasks that process one part of the file instead of the whole file. After all the whole point of Hadoop is the ability to process the data in parallel.
File splitting doesn't work so well with compressed files since the TextInputFormat won't know where each line starts in all that compressed data. Hadoop would simply process the whole file in one mapper, making the program very slow. Luckily it turns out that Lzo files consist of small chunks of compressed data, so we can split on the position of those chunks, hooray!

LzoTextInputFormat
I implemented a drop in replacement for TextInputFormat called LzoTextInputFormat. It allows us to process the big Lzo files in multiple mappers. First you have to do a one off indexing of the file, to find out where each Lzo block starts so the InputSplits can be adjusted accordingly. Once that is done you just process the files as usual, but with the LzoTextInputFormat instead of TextInputFormat.

Benchmark
A quick and unscientific MapReduce benchmark I did on 272GB of uncompressed data gave me some interesting results. If we compress the data with Lzo before we run the job it ran about 30% faster compared to when we ran it over the raw data. Win-win! We save both space and improve read speeds by avoiding the extra disk access.

Licensing issues
By now you may have noticed that the LzoTextInputFormat isn't in Hadoop trunk in the SVN repository. It turned out that the Lzo classes were GPL infected and had to be removed from the Apache 2 licensed Hadoop, they will get a new home though. There's also been discussions about using a non GPL library with similar speed/compression ratio tradeoffs as Lzo, such as FastLZ. That said, I don't think anyone checked if the files can be made splittable yet. For further information have a look at the discussions.

Alternatives
It should be mentioned that one alternative to compressing these files and storing them as-is would be to stick your data in SequenceFiles. They have built in support for compression and are splittable. The downside would be that the SequenceFiles are fairly Hadoop specific and only have implementations in Java as far as I know.

There are other efforts under way to make bzip2 and gzip files splittable in Hadoop.

Big thanks to the developers involved in adding Lzo support to Hadoop, Arun C Murthy and Chris Douglas to name a few.

5 comments:

Tolga Demir said...

Müzik Dinle

Oded Rotem said...

When using sequence files with gzip compression, do they remain splittable?

Johan Oskarsson said...

Oded: Yes, they do.

Mahadevan GSS said...

Recently I have ported the minilzo.c to pure java. Initial code is hosted at http://code.google.com/p/java-compress/

Raj said...

For the benchmark statistics, just wanted to make sure if the compression speed (bold in the below line) is actually 4mb/s and not 40mb/s???

Lzo --best 8.3 gb 2 gb 4 mb/s 60.6 mb/s

Post a Comment