Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4, and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. These will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible compression ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
Compression Options in Hadoop - A Tale of Tradeoffs
1. Compression Options in Hadoop – A Tale of Tradeoffs
Govind Kamat, Sumeet Singh
Hadoop Summit (San Jose), June 27, 2013
2. Introduction
2
Sumeet Singh
Director of Products, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
Leads the Hadoop products team at Yahoo!
Responsible for Product Management, Customer Engagements, Evangelism, and Program Management
Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!
Govind Kamat
Technical Yahoo!, Hadoop
Cloud Engineering Group
701 First Avenue
Sunnyvale, CA 94089 USA
Member of Technical Staff in the Hadoop Services team at Yahoo!
Focuses on HBase and Hadoop performance
Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications
Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology, and electronic design
3. Agenda
3
1. Data Compression in Hadoop
2. Available Compression Options
3. Understanding and Working with Compression Options
4. Problems Faced at Yahoo! with Large Data Sets
5. Performance Evaluations, Native Bzip2, and IPP Libraries
6. Wrap-up and Future Work
4. Compression Needs and Tradeoffs in Hadoop
4
Storage
Disk I/O
Network bandwidth
CPU Time
Hadoop jobs are data-intensive, and compressing data can speed up I/O operations
MapReduce jobs are almost always I/O-bound
Compressed data can save storage space and speed up data transfers across the network
Capital allocation for hardware can go further
Reduced I/O and network load can bring significant performance improvements
MapReduce jobs can finish faster overall
On the other hand, CPU utilization and processing time increase during compression and decompression
Understanding the tradeoffs is important for the MapReduce pipeline's overall performance
The Compression Tradeoff
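The ratio-versus-CPU tradeoff can be seen even within a single algorithm: DEFLATE (the algorithm behind Hadoop's default zlib codec) exposes compression levels that trade speed for output size. A minimal JDK-only sketch, with an illustrative class name and synthetic sample data:

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;
import java.util.zip.Deflater;

public class TradeoffDemo {
    // Compress with DEFLATE at the given level and return the compressed size.
    static int compressedSize(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.size();
    }

    public static void main(String[] args) {
        // Moderately compressible sample: repeated words with some variation.
        StringBuilder sb = new StringBuilder();
        Random r = new Random(42);
        for (int i = 0; i < 50_000; i++) sb.append("record-").append(r.nextInt(100)).append(' ');
        byte[] data = sb.toString().getBytes();

        int fast = compressedSize(data, Deflater.BEST_SPEED);        // level 1: less CPU, larger output
        int small = compressedSize(data, Deflater.BEST_COMPRESSION); // level 9: more CPU, smaller output
        System.out.printf("level 1: %d bytes, level 9: %d bytes (input: %d bytes)%n",
                fast, small, data.length);
    }
}
```

The same shape of tradeoff holds across codecs: bzip2 sits at one extreme (best ratio, most CPU), LZ4 and Snappy at the other.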
5. Data Compression in Hadoop’s MR Pipeline
5
[Figure: the MapReduce pipeline — input splits feed the Map phase; map output goes to an in-memory buffer with partition and sort, is merged on disk, fetched across the network from other maps during Shuffle & Sort, merged and sorted, and fed to Reduce, which writes the output. Source: Hadoop: The Definitive Guide, Tom White]
Compression and decompression occur at three points in the pipeline:
1. Compressed input is decompressed by the Mapper
2. Mapper output is compressed; the corresponding Reducer input is decompressed during Shuffle & Sort
3. Reducer output is compressed
6. Compression Options in Hadoop (1/2)
6
Format | Algorithm | Strategy | Emphasis | Comments
zlib | DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib | Codec operates on and produces standard gzip files; for data interchange on and off Hadoop
bzip2 | Burrows-Wheeler transform, MTF | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy
7. Compression Options in Hadoop (2/2)
7
Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java/Native
zlib/DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y/Y
gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y/Y
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y/Y
LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N/Y
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N/Y
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N/Y
NOTES:
Splittability – bzip2 is "splittable": it can be decompressed in parallel by multiple MapReduce tasks. The other algorithms require all blocks together, so decompression runs in a single MapReduce task.
LZO – Removed from Hadoop because the LZO libraries are licensed under the GNU GPL. The LZO format is still supported, and the codec can be downloaded separately and enabled manually.
Native bzip2 codec – added by Yahoo! as part of this work, in Hadoop 0.23
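The "data interchange" role of gzip can be illustrated without any Hadoop dependencies: java.util.zip in the JDK emits the same standard gzip stream that Hadoop's GzipCodec reads and writes, so files produced off-cluster can be consumed on it and vice versa. A minimal sketch (class name is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipInterchange {
    // Produce a standard gzip stream, the same on-disk format GzipCodec handles.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Decompress a standard gzip stream back to the original bytes.
    static byte[] gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "records produced off-cluster".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(original);
        // The gzip magic bytes 0x1f 0x8b mark a standard gzip file.
        System.out.println((compressed[0] & 0xff) == 0x1f && (compressed[1] & 0xff) == 0x8b);
        System.out.println(new String(gunzip(compressed), StandardCharsets.UTF_8));
    }
}
```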
8. Space-Time Tradeoff of Compression Options
8
Codec Performance on the Wikipedia Text Corpus

Codec | Space Savings | CPU Time in Sec. (Compress + Decompress)
Bzip2 | 71% | 60.0
Zlib (Deflate, Gzip) | 64% | 32.3
LZO | 47% | 4.8
Snappy | 42% | 4.0
LZ4 | 44% | 2.4

Bzip2 and zlib sit toward the high-compression-ratio end of the chart; LZO, Snappy, and LZ4 sit toward the high-compression-speed end.
Note:
A 266 MB corpus from Wikipedia was used for the performance comparisons.
Space savings is defined as 1 − (Compressed / Uncompressed)
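The space-savings metric above can be sketched with the JDK's built-in DEFLATE implementation (the algorithm behind Hadoop's default zlib codec). The class name and the toy input are illustrative; the chart figures were measured on the Wikipedia corpus, not on data like this:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;

public class SpaceSavings {
    // Compress with DEFLATE at the given level and return the compressed size.
    static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.size();
    }

    public static void main(String[] args) {
        // Highly repetitive 1 MiB sample, so it compresses extremely well.
        byte[] data = new byte[1 << 20];
        Arrays.fill(data, (byte) 'a');
        int compressed = deflatedSize(data, Deflater.DEFAULT_COMPRESSION);
        // Space savings = 1 - (Compressed / Uncompressed)
        double savings = 1.0 - (double) compressed / data.length;
        System.out.printf("compressed to %d bytes, savings = %.1f%%%n", compressed, savings * 100);
    }
}
```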
9. Using Data Compression in Hadoop
9
1. Input data to Map
File extension recognized automatically for decompression; file extensions for supported formats
Note: for SequenceFile, the header carries the compression information [compression (boolean), block compression (boolean), and compression codec]
Codec: one defined in io.compression.codecs

2. Intermediate (Map) Output
mapreduce.map.output.compress – false (default), true
mapreduce.map.output.compress.codec – one defined in io.compression.codecs

3. Final (Reduce) Output
mapreduce.output.fileoutputformat.compress – false (default), true
mapreduce.output.fileoutputformat.compress.codec – one defined in io.compression.codecs
mapreduce.output.fileoutputformat.compress.type – type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK
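As a sketch, the intermediate- and final-output properties above could be set cluster-wide in mapred-site.xml (or per-job with -D flags); the codec choices here are illustrative, not recommendations:

```xml
<!-- Sketch: enable compression of intermediate (map) output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<!-- Sketch: enable compression of final (reduce) output -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
```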
10. When to Use Compression and Which Codec
10
Input data to Map (1):
Compress the input data, if large
Use a splittable algo such as bzip2, or use zlib with SequenceFile format
Use a standard utility such as gzip or bzip2 for data interchange
Intermediate (Map) Output (2):
Always use compression, particularly if spillage is high or network transfers are slow
Use faster codecs such as LZO, LZ4, or Snappy
Final (Reduce) Output (3):
Compress for storage/archival, better write speeds, or chained MR jobs
[Figure: the same MR pipeline as on slide 5, with compression points 1 (input to Map), 2 (Map output, Shuffle & Sort), and 3 (final Reduce output) marked]
11. Compression in the Hadoop Ecosystem
11
Component: Pig
When to use: compressing data between MR jobs; typical in Pig scripts that include joins or other operators that expand your data size
What to use: enable compression and select the codec:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = gzip, lzo

Component: Hive
When to use: intermediate files produced by Hive between multiple map-reduce jobs; when Hive writes output to a table
What to use: enable intermediate or output compression:
hive.exec.compress.intermediate = true
hive.exec.compress.output = true

Component: HBase
When to use: compress data at the CF level (support for LZO, gzip, Snappy, and LZ4)
What to use: list required JNI libraries:
hbase.regionserver.codecs
Enabling compression:
create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
12. Compression in Hadoop at Yahoo!
12
Intermediate (Map) Output – 18M jobs, May 2013: 96% compressed, 4% uncompressed
Codec share: lzo 98.03%, gzip 0.99%, zlib/default 0.70%, bzip2 0.05%

Final (Reduce) Output – 18M jobs, May 2013: 41% compressed, 59% uncompressed
Codec share: lzo 63%, gzip 27%, bzip2 6%, default 4%
(includes intermediate Pig/Hive compression, such as Pig intermediate output)

Input data – 380M files on Jun 16, 2013 (/data, /projects): 98% compressed, 2% uncompressed
Codec share: zlib/default 73%, gzip 22%, bzip2 4%, lzo 1%
13. Compression for Data Storage Efficiency
Need to improve data storage efficiency at Yahoo!
Switch from SequenceFile to RCFile
Considered using bzip2 for improved compression ratio
Alternative library to compensate for compression effort
However, the Hadoop codec is implemented in pure Java
Had to re-implement it to call into the native-code library
HADOOP-8462 [1], available in 0.23.7
Next step was to have the codec load the IPP library
13
[1] Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member
14. IPP Libraries
Integrated Performance Primitives from Intel
Includes both algorithmic and architectural optimizations
Applications remain processor-neutral
Processor-specific variants of each function
Compression: LZ, RLE, BWT, LZO
High-level formats include zlib, gzip, bzip2, and LZO
14
15. Measuring Standalone Performance
Standard utilities (gzip, bzip2) used where available
Driver program written for other cases
32-bit mode
JVM load overhead discounted
Single-threaded
Default compression level
Quad-core Xeon machine
15
16. Data Corpuses Used
Binary files
Generated text from randomtextwriter
Wikipedia corpus
Silesia corpus
16
22. Compression Performance within Hadoop
Daytona performance framework
Used GridMix v1
Loadgen and sort jobs used
Input data in the following compression modes:
Compressed with zlib
Compressed with bzip2
LZO used for intermediate compression
35 datanodes, dual-quad-core machines
22
26. Future Work
Splittability support for native-code bzip2 codec
Enhancing Pig to use common bzip2 codec
Optimizing the JNI interface and buffer copies
Performance evaluation for 64-bit mode
Varying the compression effort parameter
Updating the zlib codec to specify alternative libraries
Other codec combinations, such as zlib for transient data
Other compression algorithms
26
27. Considerations in Selecting Compression Type
Nature of the data set
Frequency of compression vs. decompression
Data-storage efficiency requirements
Requirement for compatibility with a standard data format
Splittability requirements
Machine architecture, whether 32-bit or 64-bit
Size of the intermediate and final data
Alternative implementations of compression libraries
27
Editor's Notes
Time: 1 min
Time: 1 min (Total: 2 min)
Time: 2 min (Total: 4 min). Benefits both the Hadoop user and the ops team by improving cluster utilization.
Time: 2 min (Total: 6 min). Compression is integral to Hadoop in this sense. One other factor to consider with Hadoop is data replication (3x by default), which means lots of data transfers across the network. Compression helps in that regard as well.
Time: 2 min (Total: 8 min). Ask Govind if only gzip has the recursive option for all files in the directory.
Time: 2 min (Total: 10 min). Splittability was not there initially. The SequenceFile format was designed to tackle the splittability issue (it is aware of keys and values); compression cares about byte streams only. Pig had the same problem and added split capability to its compression: it made a copy of the bzip2 code and added split support. Later on, Hadoop added split support and made it work for bzip2. Split capability could be added to block-oriented compression algos such as LZO, Snappy, and LZ4.
Time: 1 min (Total: 11 min)
Time: 1 min (Total: 12 min). Try to add Yahoo! numbers here.
Time: 2 min (Total: 14 min). Input data is large (Govind to provide a rule of thumb). Intermediate: spillage is one factor; network transfers are slow. Final: space, speed of writes, chained MR jobs. For input/output, prefer a codec that gives better space savings / compression ratios, such as zlib or bzip2. For intermediate data: LZO-type compression, i.e., faster codecs.
Time: 2 min (Total: 16 min)
Time: 1 min (Total: 17 min)
Circle around zlib and bzip2. Circle around LZO, Snappy, and LZ4.
Circle around zlib and IPP-zlib. Circle around bzip2 and IPP-bzip2.