Presented at NoSQL Now! in San Jose
The problem:
Running bare-metal only practical for some organizations
Performance varies significantly across various job types
Utilization of most clusters in production is low
Optimizing Hadoop/MapReduce performance is hard
This deck describes some details of a real-life performance optimization effort.
If you are interested in Cloud Foundry PaaS deployments and/or Hadoop cluster deployment and optimization services, reach out via email, Twitter, or the phone number on the last page of the slide deck.
Best,
Renat
2. About Altoros
Cloud Foundry PaaS Consulting & Integration
Hadoop/NoSQL performance engineering
Cluster Automation & Server Templates on Joyent, AWS, SoftLayer, Rackspace,
CloudStack and OpenStack using Chef/Puppet, RightScale
200+ employees globally (US, Eastern Europe, Argentina, UK, Denmark,
Switzerland, Norway)
Vertical application experience:
• Automated device analytics
• Advertising analytics
• Big data warehouse
Customers
Partners
3. About Joyent
The high-performance public cloud infrastructure provider
Cloud IaaS virtual machines: Linux, Windows, BSD, SmartOS (fka Solaris) with Zones
Core founding sponsors of Node.js
Four global datacenters
Key markets: big data, mobile, e-commerce, financial services, SaaS
Open source contributions: Node.js, KVM, DTrace, ZFS, SmartOS
4. The Problem
Running bare-metal only practical for some organizations
Performance varies significantly across various job types
In fact, for many jobs, less = more
Utilization of most clusters in production is low
Optimizing Hadoop/MapReduce performance is hard
5. Hadoop Vendors
Get upset when truth comes out!
Biased (to the shiny side of the coin)
Often add controversy and confusion
6. Goals of the Study
- For Hadoop, what is the impact of container-based virtualization vs hardware emulation (KVM)?
- What are the Hadoop optimization strategies? Is there a “rule of thumb” for determining the optimization approach?
- What are the optimal Hadoop cluster settings for the 1TB TeraSort benchmark on 100 and 400 node clusters running Linux and SmartOS on the Joyent Public Cloud?
7. Factors Influencing Performance
Physical (disks, cpu, network)
OS/Hypervisor (especially for virtualized environments)
Hadoop/MapReduce (tons of settings)
Algorithmic (data structures, join strategies, big-O…)
Implementation (code efficiency, architecture decisions that fit all other factors)
8. Benchmarking tool set:
Ubuntu: operating system based on the Debian Linux distribution and distributed as free and open source software.
SmartOS: open source Unix operating system based on the active fork of OpenSolaris technology (illumos) for the cloud. Uses containerized OS virtualization, called Zones (think a mature LXC with secure RBAC and auditing).
Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. Derived from Google's MapReduce and Google File System (GFS) papers, Hadoop enables applications to work with thousands of computationally independent computers and petabytes of data.
9. Benchmarking tool set:
Written by Opscode and released as open source under the Apache License 2.0, Chef is a DevOps tool used for configuring cloud services or for streamlining the task of configuring a company's internal servers. Chef automatically sets up and tweaks the operating systems and programs that run in massive data centers.
Developed by the creators of the Starfish project at Duke University, Unravel brings run-time profiling of Hadoop jobs followed by cost-based database query optimization. Unravel connects to streams of Hadoop and system instrumentation data, and applies statistical machine learning to optimize the cost of Hadoop jobs and increase cluster utilization.
10. Comparing I/O Path on Bare Metal Unix Vs Zones Vs KVM
Bare metal
Kernel virtualization (KVM)
• KVM is encapsulated by the hypervisor
• Code path is much more circuitous in a KVM process
• Performance is impacted
OS virtualization (Zones)
• Zones partition at the OS level
• Code path is essentially the same as bare metal
• Performance is higher
11. No overhead for Zones
Stack traces show how a network packet is transmitted on Bare Metal vs a Joyent Zone vs a Fedora VM on KVM.
Note that a Joyent Zone is exactly the same as “Bare Metal”. It skips stepping through the 39 functions required when Fedora is running on KVM/qemu.
Bare Metal:
mac`mac_tx+0xda
dld`str_mdata_fastpath_put+0x53
ip`ip_xmit+0x82d
ip`ire_send_wire_v4+0x3e9
ip`conn_ip_output+0x190
ip`tcp_send_data+0x59
ip`tcp_output+0x58c
ip`squeue_enter+0x426
ip`tcp_sendmsg+0x14f
sockfs`so_sendmsg+0x26b
sockfs`socket_sendmsg+0x48
sockfs`socket_vop_write+0x6c
genunix`fop_write+0x8b
genunix`write+0x250
genunix`write32+0x1e
unix`_sys_sysenter_post_swapgs+0x14
Joyent Zone (aka SmartMachine):
mac`mac_tx+0xda
dld`str_mdata_fastpath_put+0x53
ip`ip_xmit+0x82d
ip`ire_send_wire_v4+0x3e9
ip`conn_ip_output+0x190
ip`tcp_send_data+0x59
ip`tcp_output+0x58c
ip`squeue_enter+0x426
ip`tcp_sendmsg+0x14f
sockfs`so_sendmsg+0x26b
sockfs`socket_sendmsg+0x48
sockfs`socket_vop_write+0x6c
genunix`fop_write+0x8b
genunix`write+0x250
genunix`write32+0x1e
unix`_sys_sysenter_post_swapgs+0x14
Fedora VM on KVM:
kernel`start_xmit
kernel`dtrace_int3_handler+0xd2
kernel`kmem_cache_free+0x2f
kernel`dtrace_int3+0x3a
kernel`eth_header
kernel`__kfree_skb+0x47
kernel`start_xmit+0x1
kernel`dev_hard_start_xmit+0x322
kernel`sch_direct_xmit+0xef
kernel`dev_queue_xmit+0x184
kernel`eth_header+0x3a
kernel`neigh_resolve_output+0x11e
kernel`nf_hook_slow+0x75
kernel`ip_finish_output
kernel`ip_finish_output+0x17e
kernel`ip_output+0x98
kernel`__ip_local_out+0xa4
kernel`ip_local_out+0x29
kernel`ip_queue_xmit+0x14f
kernel`tcp_transmit_skb+0x3e4
kernel`__kmalloc_node_track_caller+0x185
kernel`sk_stream_alloc_skb+0x41
kernel`tcp_write_xmit+0xf7
kernel`__alloc_skb+0x8c
kernel`__tcp_push_pending_frames+0x26
kernel`tcp_sendmsg+0x895
kernel`inet_sendmsg+0x64
kernel`sock_aio_write+0x13a
kernel`do_sync_write+0xd2
kernel`security_file_permission+0x2c
kernel`rw_verify_area+0x61
kernel`vfs_write+0x16d
kernel`sys_write+0x4a
kernel`sys_rt_sigprocmask+0x84
kernel`system_call_fastpath+0x16
igb`igb_tx_ring_send+0x33
mac`mac_hwring_tx+0x1d
mac`mac_tx_send+0x5dc
mac`mac_tx_single_ring_mode+0x6e
mac`mac_tx+0xda
dld`str_mdata_fastpath_put+0x53
ip`ip_xmit+0x82d
ip`ire_send_wire_v4+0x3e9
ip`conn_ip_output+0x190
ip`tcp_send_data+0x59
ip`tcp_output+0x58c
ip`squeue_enter+0x426
ip`tcp_sendmsg+0x14f
sockfs`so_sendmsg+0x26b
sockfs`socket_sendmsg+0x48
sockfs`socket_vop_write+0x6c
genunix`fop_write+0x8b
genunix`write+0x250
genunix`write32+0x1e
unix`_sys_sysenter_post_swapgs+0x149
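One way to quantify the comparison on this slide is to count how many frames a virtualized trace executes before it joins the path shared with bare metal. The sketch below is illustrative only: the function name `extra_frames` is hypothetical, the sample traces are a shortened subset of the frames shown on the slide, and offsets are ignored so that only function names are compared.

```python
def function_name(frame):
    """Strip the +0x... offset so frames compare by module`function only."""
    return frame.split("+")[0]


def extra_frames(virtualized, bare_metal):
    """Count frames in `virtualized` that come before the path shared with `bare_metal`.

    Both traces are listed as on the slide, so the shared path is a
    common suffix of the two lists.
    """
    shared = 0
    while (shared < len(virtualized) and shared < len(bare_metal)
           and function_name(virtualized[-1 - shared]) == function_name(bare_metal[-1 - shared])):
        shared += 1
    return len(virtualized) - shared


# Shortened, illustrative subset of the frames on the slide.
bare_metal_sample = ["mac`mac_tx+0xda", "ip`ip_xmit+0x82d", "genunix`write+0x250"]
kvm_sample = [
    "kernel`start_xmit",
    "kernel`sys_write+0x4a",
    "igb`igb_tx_ring_send+0x33",
    "mac`mac_tx+0xda",
    "ip`ip_xmit+0x82d",
    "genunix`write+0x250",
]

print(extra_frames(kvm_sample, bare_metal_sample))
```

Applied to the full traces, the same comparison should reproduce the count of additional functions quoted on the slide for the KVM path.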
12. Benchmarking setup:
Three identical Apache Hadoop 1.0.4 clusters were provisioned on Joyent infrastructure using the Joyent REST API and Opscode Chef.
Each cluster was tuned for optimal performance following best practices for the TeraSort benchmark.
13. Benchmarking:
1) Cluster of 100 virtual machines
A script launches the virtual machines and stores information about them in a JSON file.
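The launch step might look roughly like the sketch below. It is not the script used in the study: `launch_machine` is a hypothetical stand-in for a call to the provider's provisioning API, and the package name and inventory fields are assumptions made for illustration.

```python
import json
from pathlib import Path


def launch_machine(name, image, package):
    """Hypothetical stand-in for a provisioning API call.

    A real script would send an authenticated request here and wait
    for the machine to report that it is running.
    """
    return {"name": name, "image": image, "package": package, "state": "provisioning"}


def launch_cluster(prefix, count, image, package, inventory_path):
    """Launch `count` machines and record them in a JSON inventory file."""
    machines = [launch_machine(f"{prefix}-{i}", image, package)
                for i in range(1, count + 1)]
    Path(inventory_path).write_text(json.dumps(machines, indent=2))
    return machines


if __name__ == "__main__":
    # Image name taken from the cluster specification; the rest is illustrative.
    launch_cluster("hadoop-smartos-r", 100, "sdc:sdc:base64:1.8.1",
                   "example-32gb-package", "cluster.json")
```

Keeping the inventory in a single JSON file lets the later configuration and benchmark steps look up machines without querying the provider again.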
14. Benchmarking:
2) We used Chef to install and configure Hadoop
Each machine in the cluster is configured according to its role in the cluster using Chef cookbooks.
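Role-based configuration can be pictured as a mapping from each machine to a Chef run list. The sketch below is an assumption about how such a mapping could be built, not the cookbooks used in the study: the role names `hadoop-master` and `hadoop-worker` are hypothetical, and the first machine is arbitrarily chosen as the master (NameNode and JobTracker in Hadoop 1.x), with the rest acting as workers (DataNode and TaskTracker).

```python
def assign_run_lists(machine_names):
    """Return {machine name: Chef run list} with one master and the rest workers.

    Role names are hypothetical; `role[...]` is Chef's run-list syntax.
    """
    if len(machine_names) < 2:
        raise ValueError("need at least one master and one worker")
    master, *workers = machine_names
    run_lists = {master: ["role[hadoop-master]"]}
    for worker in workers:
        run_lists[worker] = ["role[hadoop-worker]"]
    return run_lists


if __name__ == "__main__":
    names = [f"hadoop-smartos-r-{i}" for i in range(1, 101)]
    for name, run_list in list(assign_run_lists(names).items())[:3]:
        print(name, run_list)
```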
15. Benchmarking:
3) We ran the TeraGen program to generate 1TB of data
As part of the TeraSort benchmark, a dataset is generated using the TeraGen utility included in Apache Hadoop.
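TeraGen takes the number of rows to generate rather than a size in bytes, and each row is 100 bytes, so a 1TB dataset corresponds to 10 billion rows. The sketch below works through that arithmetic and builds a command line. It assumes 1TB means 10^12 bytes; the jar file name and the output directory are assumptions that depend on the installation.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(total_bytes):
    """Number of 100-byte rows needed to produce `total_bytes` of data."""
    return total_bytes // ROW_BYTES


def teragen_command(total_bytes, output_dir, examples_jar="hadoop-examples-1.0.4.jar"):
    """Build the argument list for a TeraGen run (jar name is an assumption)."""
    return ["hadoop", "jar", examples_jar, "teragen",
            str(teragen_rows(total_bytes)), output_dir]


if __name__ == "__main__":
    one_terabyte = 10**12  # assumes decimal terabytes
    print(" ".join(teragen_command(one_terabyte, "/benchmarks/terasort-input")))
```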
16. Benchmarking:
4) We ran the TeraSort benchmark
A Hadoop TeraSort job using the previously generated dataset is submitted on one of the nodes.
17. Benchmarking:
6) The Hadoop output file was as follows
See: Hadoop job_201210261134_0010 on hadoop-smartos-r-1.html
The key difference between the two clusters was revealed when monitoring I/O and CPU utilization. The Ubuntu cluster was spending too much time in the OS kernel while performing I/O operations, as shown in Figure 1.
18. Hadoop Cluster Specifications for Linux and SmartOS
The SmartOS cluster used CPU much more efficiently and was able to utilize a larger number of Hadoop mappers and reducers. Key configuration parameters for Hadoop:
Parameter                                  Linux                       SmartOS
Operating System Base                      Linux                       SmartOS
Memory                                     32 GB                       32 GB
Image                                      sdc:jpc:ubuntu12.04:2.0.2   sdc:sdc:base64:1.8.1
CPUs                                       4 VCPUs                     4 VCPUs
Nodes (Virtual Instances)                  98                          100
Input Size                                 1 TB                        1 TB
Run time (seconds, lower is better)        819                         360
Mappers                                    6                           10
Reducers                                   3                           8
io.sort.mb                                 610                         610
io.sort.factor                             300                         300
dfs.block.size (MB)                        512                         512
mapred.reduce.child.java.opts              -Xmx=2700m                  -Xmx=2500m
mapred.job.shuffle.input.buffer.percent    1                           1
mapred.reduce.slowstart.completed.maps     0.5                         0.5
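The run times and task counts above can be turned into a few derived figures. The sketch below is a back-of-the-envelope calculation, not part of the original study. It assumes the 1TB input is 10^12 bytes and that the Mappers and Reducers values are per-node task slots, which the slide does not state explicitly.

```python
# Values copied from the cluster specification above.
INPUT_BYTES = 10**12  # assumption: "1T" means 10**12 bytes

RUNS = {
    "Linux": {"nodes": 98, "seconds": 819, "mappers": 6, "reducers": 3},
    "SmartOS": {"nodes": 100, "seconds": 360, "mappers": 10, "reducers": 8},
}


def throughput_mb_per_s(run):
    """Aggregate sort throughput in megabytes per second."""
    return INPUT_BYTES / run["seconds"] / 1e6


def cluster_slots(run):
    """Total (map, reduce) slots, assuming the table values are per node."""
    return run["nodes"] * run["mappers"], run["nodes"] * run["reducers"]


speedup = RUNS["Linux"]["seconds"] / RUNS["SmartOS"]["seconds"]

if __name__ == "__main__":
    for name, run in RUNS.items():
        maps, reduces = cluster_slots(run)
        print(f"{name}: {throughput_mb_per_s(run):.0f} MB/s, "
              f"{maps} map slots, {reduces} reduce slots")
    print(f"Run time ratio (Linux / SmartOS): {speedup:.3f}")
```

Under these assumptions the SmartOS cluster finished the sort in well under half the time of the Linux cluster while running considerably more concurrent tasks.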
23. OS/hypervisor choice matters – more benchmarks coming?
The key difference between the clusters was revealed when monitoring I/O and CPU utilization. The Ubuntu cluster was spending too much time in the OS kernel while performing I/O. (For copies of config files and job reports, email renat.k@altoros.com.)
24. Simple ways to increase Hadoop performance
1) Basic cluster configuration is key (one time effort for typical workloads)
DATA DISK SCALING
COMPRESSION
JVM REUSE POLICY
HDFS BLOCK SIZE
MAP-SIDE SPILLS
COPY/SHUFFLE PHASE TUNING
REDUCE-SIDE SPILLS
2) Tune the number of map and reduce tasks appropriately
3) Consider GPU for some workloads
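Several of the knobs listed above correspond to Hadoop 1.x properties: `io.sort.mb` and `io.sort.factor` govern map-side spills, `mapred.job.shuffle.input.buffer.percent` and `mapred.reduce.slowstart.completed.maps` affect the copy/shuffle phase, and `dfs.block.size` sets the HDFS block size. The sketch below renders such settings into Hadoop's `<configuration>` XML format. It is an illustration, not the configuration files from the study: the helper names are hypothetical, the values are the SmartOS figures from the cluster specification, and the heap option is written in the JVM's `-Xmx2500m` form.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def block_size_bytes(megabytes):
    """dfs.block.size is expressed in bytes in Hadoop 1.x."""
    return megabytes * 1024 * 1024


# Values taken from the SmartOS column of the cluster specification.
smartos_mapred = {
    "io.sort.mb": 610,
    "io.sort.factor": 300,
    "mapred.reduce.child.java.opts": "-Xmx2500m",
    "mapred.job.shuffle.input.buffer.percent": 1,
    "mapred.reduce.slowstart.completed.maps": 0.5,
}
smartos_hdfs = {"dfs.block.size": block_size_bytes(512)}

if __name__ == "__main__":
    print(render_site_xml(smartos_mapred))  # contents for mapred-site.xml
    print(render_site_xml(smartos_hdfs))    # contents for hdfs-site.xml
```

Generating the files from one dictionary per cluster makes it easier to keep the settings of two clusters side by side when comparing runs.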
25. Joyent’s Brendan Gregg’s performance book
Systems Performance
• Forthcoming in October
• Includes cloud performance
• Co-author of the DTrace book
• More here on his techniques:
• http://dtrace.org/blogs/brendan/