Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4, and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. These will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible compression ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
Compression Options in Hadoop - A Tale of Tradeoffs
1. Compression Options in Hadoop – A Tale of Tradeoffs
Govind Kamat, Sumeet Singh
Hadoop Summit (San Jose), June 27, 2013
2. Introduction
2
Sumeet Singh
Director of Products, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
Leads the Hadoop products team at Yahoo!
Responsible for Product Management, Customer Engagements, Evangelism, and Program Management
Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!
Govind Kamat
Technical Yahoo!, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
Member of Technical Staff in the Hadoop Services team at Yahoo!
Focuses on HBase and Hadoop performance
Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications
Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology, and electronic design
3. Agenda
3
1. Data Compression in Hadoop
2. Available Compression Options
3. Understanding and Working with Compression Options
4. Problems Faced at Yahoo! with Large Data Sets
5. Performance Evaluations, Native Bzip2, and IPP Libraries
6. Wrap-up and Future Work
4. Compression Needs and Tradeoffs in Hadoop
4
Storage
Disk I/O
Network bandwidth
CPU Time
Hadoop jobs are data-intensive, and compressing data can speed up I/O operations
MapReduce jobs are almost always I/O-bound
Compressed data can save storage space and speed up data transfers across the network
Capital allocation for hardware can go further
Reduced I/O and network load can bring significant performance improvements
MapReduce jobs can finish faster overall
On the other hand, CPU utilization and processing time increase during compression and decompression
Understanding the tradeoffs is important for the MapReduce pipeline's overall performance
The Compression Tradeoff
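The ratio-versus-CPU tradeoff can be seen even within a single algorithm: DEFLATE (the algorithm behind Hadoop's default zlib codec) exposes compression levels that trade speed for output size. A minimal JDK-only sketch, with an illustrative class name and synthetic sample data:

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;
import java.util.zip.Deflater;

public class TradeoffDemo {
    // Compress with DEFLATE at the given level and return the compressed size.
    static int compressedSize(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.size();
    }

    public static void main(String[] args) {
        // Moderately compressible sample: repeated words with some variation.
        StringBuilder sb = new StringBuilder();
        Random r = new Random(42);
        for (int i = 0; i < 50_000; i++) sb.append("record-").append(r.nextInt(100)).append(' ');
        byte[] data = sb.toString().getBytes();

        int fast = compressedSize(data, Deflater.BEST_SPEED);        // level 1: less CPU, larger output
        int small = compressedSize(data, Deflater.BEST_COMPRESSION); // level 9: more CPU, smaller output
        System.out.printf("level 1: %d bytes, level 9: %d bytes (input: %d bytes)%n",
                fast, small, data.length);
    }
}
```

The same shape of tradeoff holds across codecs: bzip2 sits at one extreme (best ratio, most CPU), LZ4 and Snappy at the other.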
5. Data Compression in Hadoop’s MR Pipeline
5
[Figure: the MapReduce pipeline — input splits feed the Map phase; map output goes to an in-memory buffer with partition and sort, is merged on disk, fetched across the network from other maps during Shuffle & Sort, merged and sorted, and fed to Reduce, which writes the output. Source: Hadoop: The Definitive Guide, Tom White]
Compression and decompression occur at three points in the pipeline:
1. Compressed input is decompressed by the Mapper
2. Mapper output is compressed; the corresponding Reducer input is decompressed during Shuffle & Sort
3. Reducer output is compressed
6. Compression Options in Hadoop (1/2)
6
Format | Algorithm | Strategy | Emphasis | Comments
zlib | DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib | Codec operates on and produces standard gzip files; for data interchange on and off Hadoop
bzip2 | Burrows-Wheeler transform, MTF | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy
7. Compression Options in Hadoop (2/2)
7
Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java/Native
zlib/DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y/Y
gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y/Y
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y/Y
LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N/Y
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N/Y
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N/Y
NOTES:
Splittability – bzip2 is "splittable": it can be decompressed in parallel by multiple MapReduce tasks. The other algorithms require all blocks together, so decompression runs in a single MapReduce task.
LZO – Removed from Hadoop because the LZO libraries are licensed under the GNU GPL. The LZO format is still supported, and the codec can be downloaded separately and enabled manually.
Native bzip2 codec – added by Yahoo! as part of this work, in Hadoop 0.23
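The "data interchange" role of gzip can be illustrated without any Hadoop dependencies: java.util.zip in the JDK emits the same standard gzip stream that Hadoop's GzipCodec reads and writes, so files produced off-cluster can be consumed on it and vice versa. A minimal sketch (class name is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipInterchange {
    // Produce a standard gzip stream, the same on-disk format GzipCodec handles.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Decompress a standard gzip stream back to the original bytes.
    static byte[] gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "records produced off-cluster".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(original);
        // The gzip magic bytes 0x1f 0x8b mark a standard gzip file.
        System.out.println((compressed[0] & 0xff) == 0x1f && (compressed[1] & 0xff) == 0x8b);
        System.out.println(new String(gunzip(compressed), StandardCharsets.UTF_8));
    }
}
```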
8. Space-Time Tradeoff of Compression Options
8
Codec Performance on the Wikipedia Text Corpus

Codec | Space Savings | CPU Time in Sec. (Compress + Decompress)
Bzip2 | 71% | 60.0
Zlib (Deflate, Gzip) | 64% | 32.3
LZO | 47% | 4.8
Snappy | 42% | 4.0
LZ4 | 44% | 2.4

Bzip2 and zlib sit toward the high-compression-ratio end of the chart; LZO, Snappy, and LZ4 sit toward the high-compression-speed end.
Note:
A 266 MB corpus from Wikipedia was used for the performance comparisons.
Space savings is defined as 1 − (Compressed / Uncompressed)
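The space-savings metric above can be sketched with the JDK's built-in DEFLATE implementation (the algorithm behind Hadoop's default zlib codec). The class name and the toy input are illustrative; the chart figures were measured on the Wikipedia corpus, not on data like this:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;

public class SpaceSavings {
    // Compress with DEFLATE at the given level and return the compressed size.
    static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.size();
    }

    public static void main(String[] args) {
        // Highly repetitive 1 MiB sample, so it compresses extremely well.
        byte[] data = new byte[1 << 20];
        Arrays.fill(data, (byte) 'a');
        int compressed = deflatedSize(data, Deflater.DEFAULT_COMPRESSION);
        // Space savings = 1 - (Compressed / Uncompressed)
        double savings = 1.0 - (double) compressed / data.length;
        System.out.printf("compressed to %d bytes, savings = %.1f%%%n", compressed, savings * 100);
    }
}
```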
9. Using Data Compression in Hadoop
9
1. Input data to Map
File extension recognized automatically for decompression; file extensions for supported formats
Note: for SequenceFile, the header carries the compression information [compression (boolean), block compression (boolean), and compression codec]
Codec: one defined in io.compression.codecs

2. Intermediate (Map) Output
mapreduce.map.output.compress – false (default), true
mapreduce.map.output.compress.codec – one defined in io.compression.codecs

3. Final (Reduce) Output
mapreduce.output.fileoutputformat.compress – false (default), true
mapreduce.output.fileoutputformat.compress.codec – one defined in io.compression.codecs
mapreduce.output.fileoutputformat.compress.type – type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK
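As a sketch, the intermediate- and final-output properties above could be set cluster-wide in mapred-site.xml (or per-job with -D flags); the codec choices here are illustrative, not recommendations:

```xml
<!-- Sketch: enable compression of intermediate (map) output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<!-- Sketch: enable compression of final (reduce) output -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
```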
10. When to Use Compression and Which Codec
10
Input data to Map (1):
Compress the input data, if large
Use a splittable algo such as bzip2, or use zlib with SequenceFile format
Use a standard utility such as gzip or bzip2 for data interchange
Intermediate (Map) Output (2):
Always use compression, particularly if spillage is high or network transfers are slow
Use faster codecs such as LZO, LZ4, or Snappy
Final (Reduce) Output (3):
Compress for storage/archival, better write speeds, or chained MR jobs
[Figure: the same MR pipeline as on slide 5, with compression points 1 (input to Map), 2 (Map output, Shuffle & Sort), and 3 (final Reduce output) marked]
11. Compression in the Hadoop Ecosystem
11
Component: Pig
When to use: compressing data between MR jobs; typical in Pig scripts that include joins or other operators that expand your data size
What to use: enable compression and select the codec:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = gzip, lzo

Component: Hive
When to use: intermediate files produced by Hive between multiple map-reduce jobs; when Hive writes output to a table
What to use: enable intermediate or output compression:
hive.exec.compress.intermediate = true
hive.exec.compress.output = true

Component: HBase
When to use: compress data at the CF level (support for LZO, gzip, Snappy, and LZ4)
What to use: list required JNI libraries:
hbase.regionserver.codecs
Enabling compression:
create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
12. Compression in Hadoop at Yahoo!
12
Intermediate (Map) Output – 18M jobs, May 2013: 96% compressed, 4% uncompressed
Codec share: lzo 98.03%, gzip 0.99%, zlib/default 0.70%, bzip2 0.05%

Final (Reduce) Output – 18M jobs, May 2013: 41% compressed, 59% uncompressed
Codec share: lzo 63%, gzip 27%, bzip2 6%, default 4%
(includes intermediate Pig/Hive compression, such as Pig intermediate output)

Input data – 380M files on Jun 16, 2013 (/data, /projects): 98% compressed, 2% uncompressed
Codec share: zlib/default 73%, gzip 22%, bzip2 4%, lzo 1%
13. Compression for Data Storage Efficiency
Need to improve data storage efficiency at Yahoo!
Switch from SequenceFile to RCFile
Considered using bzip2 for improved compression ratio
Alternative library to compensate for compression effort
However, the Hadoop codec is implemented in pure Java
Had to re-implement it to call into the native-code library
HADOOP-8462 [1], available in 0.23.7
Next step was to have the codec load the IPP library
13
[1] Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member
14. IPP Libraries
Integrated Performance Primitives from Intel
Includes both algorithmic and architectural optimizations
Applications remain processor-neutral
Processor-specific variants of each function
Compression: LZ, RLE, BWT, LZO
High-level formats include zlib, gzip, bzip2, and LZO
14
15. Measuring Standalone Performance
Standard utilities (gzip, bzip2) used where available
Driver program written for other cases
32-bit mode
JVM load overhead discounted
Single-threaded
Default compression level
Quad-core Xeon machine
15
16. Data Corpuses Used
Binary files
Generated text from randomtextwriter
Wikipedia corpus
Silesia corpus
16
22. Compression Performance within Hadoop
Daytona performance framework
Used GridMix v1
Loadgen and sort jobs used
Input data in the following compression modes:
Compressed with zlib
Compressed with bzip2
LZO used for intermediate compression
35 datanodes, dual-quad-core machines
22
26. Future Work
Splittability support for native-code bzip2 codec
Enhancing Pig to use common bzip2 codec
Optimizing the JNI interface and buffer copies
Performance evaluation for 64-bit mode
Varying the compression effort parameter
Updating the zlib codec to specify alternative libraries
Other codec combinations, such as zlib for transient data
Other compression algorithms
26
27. Considerations in Selecting Compression Type
Nature of the data set
Frequency of compression vs. decompression
Data-storage efficiency requirements
Requirement for compatibility with a standard data format
Splittability requirements
Machine architecture, whether 32-bit or 64-bit
Size of the intermediate and final data
Alternative implementations of compression libraries
27
Editor's Notes
Time: 1 min
Time: 1 min (Total: 2 min)
Time: 2 min (Total: 4 min). Benefits both the Hadoop user and the ops team by improving cluster utilization.
Time: 2 min (Total: 6 min). Compression is integral to Hadoop in this sense. One other factor to consider with Hadoop is data replication (3x by default), which means lots of data transfers across the network. Compression helps in that regard as well.
Time: 2 min (Total: 8 min). Ask Govind if only gzip has the recursive option for all files in the directory.
Time: 2 min (Total: 10 min). Splittability was not there initially. The SequenceFile format was designed to tackle the splittability issue (it is aware of keys and values); compression cares about byte streams only. Pig had the same problem and added split capability to its compression: it made a copy of the bzip2 code and added split support. Later on, Hadoop added split support and made it work for bzip2. Split capability could be added to block-oriented compression algos such as LZO, Snappy, and LZ4.
Time: 1 min (Total: 11 min)
Time: 1 min (Total: 12 min). Try to add Yahoo! numbers here.
Time: 2 min (Total: 14 min). Input data is large (Govind to provide a rule of thumb). Intermediate: spillage is one factor; network transfers are slow. Final: space, speed of writes, chained MR jobs. For input/output, prefer a codec that gives better space savings / compression ratios, such as zlib or bzip2. For intermediate data: LZO-type compression, i.e., faster codecs.
Time: 2 min (Total: 16 min)
Time: 1 min (Total: 17 min)
Circle around zlib and bzip2. Circle around LZO, Snappy, and LZ4.
Circle around zlib and IPP-zlib. Circle around bzip2 and IPP-bzip2.