Talk held at a combined meeting of the Web Performance Karlsruhe (http://www.meetup.com/Karlsruhe-Web-Performance-Group/events/153207062) & Big Data Karlsruhe/Stuttgart (http://www.meetup.com/Big-Data-User-Group-Karlsruhe-Stuttgart/events/162836152) user groups.
Agenda:
- Why Hadoop 2?
- HDFS 2
- YARN
- YARN Apps
- Write your own YARN App
- Tez, Hive & Stinger Initiative
6. Hadoop 1
Built for web-scale batch apps
[Diagram] Several independent silos: each single app runs as a batch workload on its own HDFS instance.
7. MapReduce is good for…
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
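To make the "summing, grouping" case concrete, a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API (class names and tokenization are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every token in the input split - embarrassingly parallel.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reduce: sum the counts per word - the grouping/summing step.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));
  }
}
```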
8. MapReduce is OK for…
• Iterative jobs (e.g., graph algorithms)
  – Each iteration must read/write data to disk
  – I/O and latency cost of an iteration is high
9. MapReduce is not good for…
• Jobs that need shared state/coordination
– Tasks are shared-nothing
– Shared-state requires scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
10. Hadoop 1 limitations
• Scalability
  – Maximum cluster size ~4,500 nodes
  – Maximum concurrent tasks ~40,000
  – Coarse synchronization in JobTracker
• Availability
  – Failure kills all queued and running jobs
• Hard partition of resources into map & reduce slots
  – Low resource utilization
• Lacks support for alternate paradigms and services
  – Iterative applications implemented using MapReduce are 10x slower
11. A brief history of Hadoop 2
• Originally conceived & architected by the team at Yahoo!
  – Arun Murthy created the original JIRA in 2008 and is now the Hadoop 2 release manager
• The community has been working on Hadoop 2 for over 4 years
• Hadoop 2 based architecture running at scale at Yahoo!
  – Deployed on 35,000+ nodes for 6+ months
12. Hadoop 2: Next-gen platform
Hadoop 1 (single-use system, batch apps):
• MapReduce: cluster resource mgmt. + data processing
• HDFS: redundant, reliable storage
Hadoop 2 (multi-purpose platform: batch, interactive, streaming, …):
• MapReduce & others: data processing
• YARN: cluster resource management
• HDFS 2: redundant, reliable storage
13. Taking Hadoop beyond batch
Store all data in one place, interact with it in multiple ways, with applications running natively in Hadoop:
• Batch: MapReduce
• Interactive: Tez
• Online: HOYA
• Streaming: Storm, …
• Graph: Giraph
• In-Memory: Spark
• Other: Search, …
All on top of YARN (cluster resource management) and HDFS 2 (redundant, reliable storage).
15. HDFS 2: In a nutshell
• Removes tight coupling of Block Storage and Namespace
• High Availability
• Scalability & Isolation
• Increased performance
Details: https://issues.apache.org/jira/browse/HDFS-1052
16. HDFS 2: Federation
[Diagram] Multiple independent NameNodes, each managing only a slice of the namespace (e.g. logs, finance, insights); the NameNodes do not talk to each other. Block storage becomes a generic storage service: blocks are grouped into block pools (Pool 1, Pool 2) on common storage, and the DataNodes send block reports and can store blocks managed by any NameNode.
17. HDFS 2: Architecture
[Diagram] An Active NameNode and a Standby NameNode (each with its own block map and edits file) share state through a shared directory on NFS. Only the active NameNode writes the edits; the standby simultaneously reads and applies them. DataNodes report to both NameNodes but listen only to the orders from the active one.
18. 13.02.2014
Only the active
writes edits
HDFS 2: Quorum based storage
Journal
Node
Journal
Node
Active NameNode
Block
Map
DataNode
Edits
File
DataNode
Journal
Node
Standby NameNode
Block
Map
DataNode
Edits
File
DataNode
The state is shared
on a quorum of
journal nodes
The Standby
simultaneously
reads and applies
the edits
DataNode
DataNodes report to both NameNodes but listen
only to the orders from the active one
19. HDFS 2: High Availability
[Diagram] A ZKFailoverController runs next to each NameNode and monitors its health; when healthy, it holds a persistent session in the ZooKeeper ensemble, and the controller of the active NameNode additionally holds a special lock znode. NameNode state is shared via the JournalNode quorum (edits file), and DataNodes send block reports to both the Active and the Standby NameNode.
20. HDFS 2: Snapshots
• Admin can create point-in-time snapshots of HDFS
  • Of the entire file system
  • Of a specific data set (a sub-tree of the file system)
• Restore the state of the entire file system or a data set to a snapshot (like Apple Time Machine)
  • Protects against user errors
• Snapshot diffs identify changes made to a data set
  • Keep track of how raw or derived/analytical data changes over time
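A minimal sketch of driving snapshots programmatically through the DistributedFileSystem API (the /data/finance path is just an example, and the code assumes fs.defaultFS points at HDFS; the admin step is normally done via `hdfs dfsadmin -allowSnapshot`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;

Configuration conf = new Configuration();
Path dir = new Path("/data/finance");                       // example data set (sub-tree)
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

dfs.allowSnapshot(dir);                                     // admin: mark the directory snapshottable
dfs.createSnapshot(dir, "before-cleanup");                  // point-in-time snapshot

// ... jobs modify or delete data under /data/finance ...

// Snapshot diff: what changed between the snapshot and the current state ("")
SnapshotDiffReport diff = dfs.getSnapshotDiffReport(dir, "before-cleanup", "");
System.out.println(diff);
```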
21. HDFS 2: NFS Gateway
• Supports NFS v3 (NFS v4 is work in progress)
• Supports all HDFS commands
  • List files
  • Copy, move files
  • Create and delete directories
• Ingest for large scale analytical workloads
  • Load immutable files as source for analytical processing
  • No random writes
• Stream files into HDFS
  • Log ingest by applications writing directly to an HDFS client mount
22. HDFS 2: Performance
• Many improvements
  • Write pipeline (e.g. new primitives hflush, hsync)
  • Read path improvements for fewer memory copies
  • Short-circuit local reads for 2-3x faster random reads
  • I/O improvements using posix_fadvise()
  • libhdfs improvements for zero-copy reads
• Significant improvements: I/O 2.5 - 5x faster
24. YARN: Design Goals
• Build a new abstraction layer by splitting up the two major functions of the JobTracker
  • Cluster resource management
  • Application life-cycle management
• Allow other processing paradigms
  • Flexible API for implementing YARN apps
  • MapReduce becomes a YARN app
  • Lots of different YARN apps
25. YARN: Concepts
• Application
  – A job submitted to the YARN framework
  – Example: MapReduce job, Storm topology, …
• Container
  – Basic unit of allocation
  – Fine-grained resource allocation across multiple resource types (RAM, CPU, disk, network, GPU, etc.), e.g.
    • container_0 = 4 GB, 1 CPU
    • container_1 = 512 MB, 6 CPUs
  – Replaces the fixed map/reduce slots (see the sketch below)
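A sketch of how an ApplicationMaster asks for containers of those sizes through the AMRMClient API (the values are just the examples from above):

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
rmClient.init(new YarnConfiguration());
rmClient.start();
// (a real AM would first call rmClient.registerApplicationMaster(...))

// container_0 = 4 GB, 1 CPU  (memory in MB, CPU as virtual cores)
Resource big = Resource.newInstance(4096, 1);
// container_1 = 512 MB, 6 CPUs
Resource wide = Resource.newInstance(512, 6);

Priority prio = Priority.newInstance(0);
rmClient.addContainerRequest(new ContainerRequest(big, null, null, prio));   // no node/rack preference
rmClient.addContainerRequest(new ContainerRequest(wide, null, null, prio));
// Allocated containers arrive via rmClient.allocate(progress) in the AM's heartbeat loop.
```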
26. YARN: Architecture
• ResourceManager
  – Global resource scheduler
  – Hierarchical queues
• NodeManager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring
• ApplicationMaster
  – Per-application
  – Manages application scheduling and task execution
  – e.g. MapReduce ApplicationMaster
27. YARN: Architectural Overview
Split up the two major functions of the JobTracker: cluster resource management & application life-cycle management.
[Diagram] A single ResourceManager (with its Scheduler) coordinates many NodeManagers. Each application gets its own ApplicationMaster (AM 1, AM 2) running in a container on one of the NodeManagers, and its tasks run in further containers (Container 1.1, 1.2, 2.1, 2.2, 2.3) spread across the NodeManagers.
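For the client side of this picture, a minimal sketch of submitting an application to the ResourceManager via YarnClient; com.example.MyAppMaster is a placeholder for your own ApplicationMaster class:

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(new YarnConfiguration());
yarnClient.start();

YarnClientApplication app = yarnClient.createApplication();
ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
ctx.setApplicationName("my-yarn-app");

// Container spec for the ApplicationMaster: local resources, environment, launch command
ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
    Collections.<String, LocalResource>emptyMap(),           // local resources (app jar etc.)
    Collections.<String, String>emptyMap(),                  // environment variables
    Collections.singletonList("java -Xmx256m com.example.MyAppMaster"),
    null, null, null);                                       // service data, tokens, ACLs
ctx.setAMContainerSpec(amContainer);
ctx.setResource(Resource.newInstance(512, 1));               // 512 MB, 1 vcore for the AM

ApplicationId appId = yarnClient.submitApplication(ctx);
System.out.println("Submitted application " + appId);
```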
34. MapReduce 2: In a nutshell
• MapReduce is now a YARN app
  • No more map and reduce slots, it's containers now
  • No more JobTracker, it's the YarnAppmaster library now
• Multiple versions of MapReduce
  • The older mapred APIs work without modification or recompilation
  • The newer mapreduce APIs may need to be recompiled
• Still has one master server component: the Job History Server
  • The Job History Server stores the execution history of jobs
  • Used to audit prior execution of jobs
  • Will also be used by the YARN framework to store charge-backs at that level
• Better cluster utilization
• Increased scalability & availability
35. MapReduce 2: Shuffle
• Faster Shuffle
  • Better embedded server: Netty
• Encrypted Shuffle
  • Secures the shuffle phase as data moves across the cluster
  • Requires 2-way HTTPS, certificates on both sides
  • Causes significant CPU overhead; reserve 1 core for this work
  • Certificates stored on each node (provisioned with the cluster), refreshed every 10 secs
• Pluggable Shuffle / Sort
  • Shuffle is the first phase in MapReduce that is guaranteed to not be data-local
  • Pluggable Shuffle/Sort allows application developers or hardware developers to intercept the network-heavy workload and optimize it
  • Typical implementations have hardware components like fast networks and software components like sorting algorithms
  • API will change with future versions of Hadoop
36. MapReduce 2: Performance
• Key optimizations
  • No hard segmentation of resources into map and reduce slots
  • The YARN scheduler is more efficient
  • The MR2 framework has become more efficient than MR1: the shuffle phase in MRv2 is more performant with the usage of Netty
• 40,000+ nodes running YARN across over 365 PB of data
• About 400,000 jobs per day for about 10 million hours of compute time
• Estimated 60% - 150% improvement in node usage per day
• Got rid of a whole 10,000-node datacenter because of the increased utilization
40. HOYA: Deployment of HBase
[Diagram] The HOYA client (a YarnClient with a HOYA-specific API) asks the ResourceManager/Scheduler to launch the HOYA ApplicationMaster in a container on a NodeManager. The HOYA AM then requests additional containers and starts the HBase Master and the Region Servers in them, spread across the NodeManagers.
41. 13.02.2014
HOYA Client
YARNClient
HOYA
specific API
HOYA: Bind via ZooKeeper
ResourceManager
Scheduler
NodeManager
NodeManager
Region Server
Container
HBase
Client
HOYA
Application
Master
Container
NodeManager
Container
HBase Master
Container
Container
Region Server
Container
ZooKeeper
43. Storm: In a nutshell
• Stream processing
• Real-time processing
• Developed as a standalone application
  • https://github.com/nathanmarz/storm
• Ported to YARN
  • https://github.com/yahoo/storm-yarn
44. Storm: Conceptual view
• Spout: source of streams
• Bolt: consumer of streams; processes tuples, possibly emits new tuples
• Stream: unbounded sequence of tuples
• Tuple: list of name-value pairs
• Topology: network of spouts & bolts as the nodes and streams as the edges
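A minimal sketch of wiring spouts and bolts into a topology with the (pre-Apache) Storm Java API; EventSpout, ParseBolt and CountBolt are hypothetical classes:

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new EventSpout(), 2);                          // source of the stream
builder.setBolt("parse", new ParseBolt(), 4).shuffleGrouping("events");   // consumes tuples, emits new ones
builder.setBolt("count", new CountBolt(), 4)
       .fieldsGrouping("parse", new Fields("word"));                      // route tuples by field value

Config conf = new Config();
conf.setNumWorkers(3);
StormSubmitter.submitTopology("word-counter", conf, builder.createTopology());
```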
46. Spark: In a nutshell
• High-speed in-memory analytics over Hadoop and Hive
• Separate MapReduce-like engine
  – Speedup of up to 100x
  – On-disk queries 5-10x faster
• Compatible with Hadoop's Storage API
• Available as a standalone application
  – https://github.com/mesos/spark
• Experimental support for YARN since 0.6
  – http://spark.incubator.apache.org/docs/0.6.0/running-on-yarn.html
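A small Java sketch of the in-memory idea; the SparkConf/JavaSparkContext style shown here belongs to releases newer than the 0.6 version mentioned above, and the HDFS path is an example:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("yarn-client");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> lines = sc.textFile("hdfs:///data/logs");  // read via Hadoop's storage API
lines.cache();                                             // keep the RDD in memory

long first = lines.count();    // first action scans HDFS and fills the cache
long second = lines.count();   // later actions are served from memory - this is where the
                               // large speedup for iterative workloads comes from
sc.stop();
```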
49. Giraph: In a nutshell
• Giraph is a framework for processing semi-structured graph data on a massive scale
• Giraph is loosely based upon Google's Pregel
• Giraph performs iterative calculations on top of an existing Hadoop cluster
• Available on GitHub
  – https://github.com/apache/giraph
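A sketch of the Pregel-style "think like a vertex" model Giraph implements: in each superstep a vertex consumes messages, updates its value and may message its neighbours. The example propagates the maximum value through the graph; it follows the BasicComputation API of newer Giraph releases, so details may differ in older versions:

```java
import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Each superstep: take the maximum of the vertex value and all incoming messages;
// if it changed, tell the neighbours. Halts once no value changes any more.
public class MaxValueComputation
    extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                      Iterable<DoubleWritable> messages) throws IOException {
    double max = vertex.getValue().get();
    for (DoubleWritable msg : messages) {
      max = Math.max(max, msg.get());
    }
    if (getSuperstep() == 0 || max > vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(max));
      sendMessageToAllEdges(vertex, new DoubleWritable(max));
    }
    vertex.voteToHalt();
  }
}
```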
53. YARN: More things to consider
• Container / Tasks
• Client
• UI
• Recovery
• Container-to-AM communication
• Application history
54. YARN: Testing / Debugging
• MiniYARNCluster (see the sketch below)
  • Cluster within the process in threads (HDFS, MR, HBase)
  • Useful for regression tests
  • Avoid for unit tests: too slow
• Unmanaged AM
  • Support to run the AM outside of a YARN cluster for manual testing
  • Debugger, heap dumps, etc.
• Logs
  • Log aggregation support to push all logs into HDFS
  • Accessible via CLI, UI
  • Useful to enable all application logs by default
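A minimal sketch of a regression test that boots an in-process YARN cluster via MiniYARNCluster (the hadoop-yarn-server-tests test artifact needs to be on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

public class MiniClusterIT {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    // test name, number of NodeManagers, local dirs and log dirs per NM
    MiniYARNCluster cluster = new MiniYARNCluster("regression-test", 2, 1, 1);
    cluster.init(conf);
    cluster.start();

    // The cluster's configuration now contains the dynamically chosen RM address;
    // hand it to a YarnClient or a MapReduce job under test.
    Configuration clusterConf = cluster.getConfig();
    System.out.println("RM at " + clusterConf.get(YarnConfiguration.RM_ADDRESS));

    // ... submit and verify applications here ...

    cluster.stop();
  }
}
```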
56. Apache Tez: In a nutshell
• Distributed execution framework that works on computations represented as dataflow graphs
• Tez is Hindi for "speed"
• Naturally maps to execution plans produced by query optimizers
• Highly customizable to meet a broad spectrum of use cases and to enable dynamic performance optimizations at runtime
• Built on top of YARN
57. Apache Tez: The new primitive
[Diagram] Hadoop 1, MapReduce as the base: Pig, Hive and other tools compile to MapReduce, which handles cluster resource mgmt. + data processing on top of HDFS (redundant, reliable storage). Hadoop 2, Apache Tez as the base: Pig, Hive and others run on the Tez execution engine, real-time workloads run on Storm, and all of them share YARN (cluster resource management) and HDFS (redundant, reliable storage).
58. Apache Tez: Architecture
• Task with pluggable Input, Processor & Output
  • "Classical" Map = Task with HDFS Input, Map Processor, Sorted Output
  • "Classical" Reduce = Task with Shuffle Input, Reduce Processor, HDFS Output
• A YARN ApplicationMaster runs the DAG of Tez tasks (see the sketch below)
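To show how those pluggable pieces get wired together, a sketch of building a two-vertex word-count DAG. The factory-style API below follows later Tez releases (it changed noticeably between early versions), and com.example.TokenProcessor / com.example.SumProcessor stand in for user-written Processor implementations:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;

// Vertex 1 tokenizes the input, Vertex 2 sums the counts; the edge between them is a
// sorted, partitioned key-value shuffle - the Tez counterpart of the MapReduce shuffle.
DAG dag = DAG.create("word-count");

Vertex tokenizer = Vertex.create("Tokenizer",
    ProcessorDescriptor.create("com.example.TokenProcessor"));        // hypothetical processor
Vertex summation = Vertex.create("Summation",
    ProcessorDescriptor.create("com.example.SumProcessor"), 1);       // hypothetical processor

OrderedPartitionedKVEdgeConfig shuffle = OrderedPartitionedKVEdgeConfig
    .newBuilder(Text.class.getName(), IntWritable.class.getName(),
                HashPartitioner.class.getName())
    .build();

dag.addVertex(tokenizer)
   .addVertex(summation)
   .addEdge(Edge.create(tokenizer, summation, shuffle.createDefaultEdgeProperty()));
// The DAG is then submitted through a TezClient and executed by the Tez ApplicationMaster.
```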
59. Apache Tez: Tez Service
• MapReduce query startup is expensive:
  – Job-launch & task-launch latencies are fatal for short queries (in the order of 5s to 30s)
• Solution:
  – Tez Service (= pre-allocated ApplicationMaster)
    • Removes job-launch overhead (ApplicationMaster)
    • Removes task-launch overhead (pre-warmed containers)
  – Hive (or Pig) submits the query plan to the Tez Service
  – Native Hadoop service, not ad-hoc
60. Apache Tez: Performance
Performance gains over MapReduce
• Eliminate the replicated write barrier between successive computations
• Eliminate the job-launch overhead of workflow jobs
• Eliminate the extra stage of map reads in every workflow job
• Eliminate queue & resource contention suffered by workflow jobs that are started after a predecessor job completes
61. Apache Tez: Performance
Optimal resource management
• Reuse YARN containers to launch new tasks
• Reuse YARN containers to enable shared objects across tasks
[Diagram] On a Tez task host, the Tez ApplicationMaster starts Tez Task 1 in a YARN container; once the task finishes, the same container is reused for Tez Task 2, and both tasks can use shared objects held in the container.
62. Apache Tez: Performance
Execution plan reconfiguration at runtime
• Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality
• Advanced changes in dataflow graph structure
• Progressive graph construction in concert with the user optimizer
[Diagram] Example: Stage 1 runs 50 mappers producing 100 partitions from HDFS blocks and YARN resources; Stage 2 then runs with 10 reducers if the intermediate data is < 10 GB, or 100 reducers if it is > 10 GB. Decision made at runtime!
63. Apache Tez: Performance
Dynamic physical data flow decisions
• Decide the type of physical byte movement and storage on the fly
• Store intermediate data on a distributed store, a local store, or in-memory
• Transfer bytes via block files, streaming, or anything in between
[Diagram] Example: the producer hands data to the consumer in-memory if it is < 32 GB, or via a local file if it is > 32 GB. Decision made at runtime!
64. Apache Tez: Performance
SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Existing Hive:       Parse Query 0.5s → Create Plan 0.5s → Launch MapReduce 20s → Process MapReduce 10s → Total 31s
Hive/Tez:            Parse Query 0.5s → Create Plan 0.5s → Launch MapReduce 20s → Process MapReduce 2s → Total 23s
Tez & Hive Service:  Parse Query 0.5s → Create Plan 0.5s → Submit to Tez Service 0.5s → Process MapReduce 2s → Total 3.5s

* No exact numbers, for illustration only
65. Hive: Current Focus Area
[Diagram] Query-latency spectrum (0-5s / 5s - 1m / 1m - 1h / 1h+) by workload type:
• Real-Time (0-5s): online systems, real-time analytics, CEP
• Interactive (5s - 1m): parameterized reports, drilldown, visualization, exploration
• Non-Interactive (1m - 1h): data preparation, incremental batch processing, dashboards / scorecards
• Batch (1h+): operational batch processing, enterprise reports, data mining
The current Hive sweet spot lies in the non-interactive and batch ranges.
66. Stinger: Extending the sweet spot
[Diagram] Same latency spectrum and workload types as before; the planned Hive expansion extends the sweet spot from the non-interactive/batch ranges into the interactive (5s - 1m) range.
Improve Latency & Throughput
• Query engine improvements
• New "Optimized RCFile" column store
• Next-gen runtime (eliminates MapReduce latency)
Extend Deep Analytical Ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases
69. Stinger: Tez revisited
SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram] Hive on MapReduce executes this query as three chained MapReduce jobs (Job 1, Job 2, Job 3) with an I/O synchronization barrier between each pair of jobs; Hive on Tez executes it as a single job, avoiding those barriers.
71. Stinger: ORCFile in a nutshell
• Optimized Row Columnar File
• Columns stored separately
• Knows types
  • Uses type-specific encoders
  • Stores statistics (min, max, sum, count)
• Has a light-weight index
  • Skip over blocks of rows that don't matter
• Larger blocks
  • 256 MB by default
  • Has an index for block boundaries
72. Stinger: ORCFile Layout
[Diagram] The columnar format arranges columns adjacent within the file for compression and fast access; the large block size is well suited for HDFS.
73. Stinger: ORCFile advantages
• High Compression
  • Many tricks used out-of-the-box to ensure high compression rates
  • RLE, dictionary encoding, etc.
• High Performance
  • Inline indexes record value ranges within blocks of ORCFile data
  • Filter pushdown allows efficient scanning during precise queries
• Flexible Data Model
  • All Hive types, including maps, structs and unions
75. Stinger: More optimizations
• Join Optimizations
  • New join types added or improved in Hive 0.11
  • More efficient query plan generation
• Vectorization
  • Make the most use of L1 and L2 caches
  • Avoid branching whenever possible
• Optimized Query Planner
  • Automatically determine optimal execution parameters
• Buffering
  • Cache hot data in memory
• …