Talk held at a combined meeting of the Web Performance Karlsruhe (http://www.meetup.com/Karlsruhe-Web-Performance-Group/events/153207062) & Big Data Karlsruhe/Stuttgart (http://www.meetup.com/Big-Data-User-Group-Karlsruhe-Stuttgart/events/162836152) user groups.
Agenda:
- Why Hadoop 2?
- HDFS 2
- YARN
- YARN Apps
- Write your own YARN App
- Tez, Hive & Stinger Initiative
6. Hadoop 1
Built for web-scale batch apps
[Diagram] Several independent silos: each single app runs as a batch workload on its own HDFS instance.
7. MapReduce is good for…
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
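To make the "summing, grouping" case concrete, a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API (class names and tokenization are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every token in the input split - embarrassingly parallel.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reduce: sum the counts per word - the grouping/summing step.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));
  }
}
```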
8. MapReduce is OK for…
• Iterative jobs (e.g., graph algorithms)
  – Each iteration must read/write data to disk
  – I/O and latency cost of an iteration is high
9. MapReduce is not good for…
• Jobs that need shared state/coordination
– Tasks are shared-nothing
– Shared-state requires scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
10. Hadoop 1 limitations
• Scalability
  – Maximum cluster size ~4,500 nodes
  – Maximum concurrent tasks ~40,000
  – Coarse synchronization in JobTracker
• Availability
  – Failure kills all queued and running jobs
• Hard partition of resources into map & reduce slots
  – Low resource utilization
• Lacks support for alternate paradigms and services
  – Iterative applications implemented using MapReduce are 10x slower
11. A brief history of Hadoop 2
• Originally conceived & architected by the team at Yahoo!
  – Arun Murthy created the original JIRA in 2008 and is now the Hadoop 2 release manager
• The community has been working on Hadoop 2 for over 4 years
• Hadoop 2 based architecture running at scale at Yahoo!
  – Deployed on 35,000+ nodes for 6+ months
12. Hadoop 2: Next-gen platform
Hadoop 1 (single-use system, batch apps):
• MapReduce: cluster resource mgmt. + data processing
• HDFS: redundant, reliable storage
Hadoop 2 (multi-purpose platform: batch, interactive, streaming, …):
• MapReduce & others: data processing
• YARN: cluster resource management
• HDFS 2: redundant, reliable storage
13. Taking Hadoop beyond batch
Store all data in one place, interact with it in multiple ways, with applications running natively in Hadoop:
• Batch: MapReduce
• Interactive: Tez
• Online: HOYA
• Streaming: Storm, …
• Graph: Giraph
• In-Memory: Spark
• Other: Search, …
All on top of YARN (cluster resource management) and HDFS 2 (redundant, reliable storage).
15. HDFS 2: In a nutshell
• Removes tight coupling of Block Storage and Namespace
• High Availability
• Scalability & Isolation
• Increased performance
Details: https://issues.apache.org/jira/browse/HDFS-1052
16. HDFS 2: Federation
[Diagram] Multiple independent NameNodes, each managing only a slice of the namespace (e.g. logs, finance, insights); the NameNodes do not talk to each other. Block storage becomes a generic storage service: blocks are grouped into block pools (Pool 1, Pool 2) on common storage, and the DataNodes send block reports and can store blocks managed by any NameNode.
17. HDFS 2: Architecture
[Diagram] An Active NameNode and a Standby NameNode (each with its own block map and edits file) share state through a shared directory on NFS. Only the active NameNode writes the edits; the standby simultaneously reads and applies them. DataNodes report to both NameNodes but listen only to the orders from the active one.
18. 13.02.2014
Only the active
writes edits
HDFS 2: Quorum based storage
Journal
Node
Journal
Node
Active NameNode
Block
Map
DataNode
Edits
File
DataNode
Journal
Node
Standby NameNode
Block
Map
DataNode
Edits
File
DataNode
The state is shared
on a quorum of
journal nodes
The Standby
simultaneously
reads and applies
the edits
DataNode
DataNodes report to both NameNodes but listen
only to the orders from the active one
19. HDFS 2: High Availability
[Diagram] A ZKFailoverController runs next to each NameNode and monitors its health; when healthy, it holds a persistent session in the ZooKeeper ensemble, and the controller of the active NameNode additionally holds a special lock znode. NameNode state is shared via the JournalNode quorum (edits file), and DataNodes send block reports to both the Active and the Standby NameNode.
20. HDFS 2: Snapshots
• Admin can create point-in-time snapshots of HDFS
  • Of the entire file system
  • Of a specific data set (a sub-tree of the file system)
• Restore the state of the entire file system or a data set to a snapshot (like Apple Time Machine)
  • Protects against user errors
• Snapshot diffs identify changes made to a data set
  • Keep track of how raw or derived/analytical data changes over time
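A minimal sketch of driving snapshots programmatically through the DistributedFileSystem API (the /data/finance path is just an example, and the code assumes fs.defaultFS points at HDFS; the admin step is normally done via `hdfs dfsadmin -allowSnapshot`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;

Configuration conf = new Configuration();
Path dir = new Path("/data/finance");                       // example data set (sub-tree)
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

dfs.allowSnapshot(dir);                                     // admin: mark the directory snapshottable
dfs.createSnapshot(dir, "before-cleanup");                  // point-in-time snapshot

// ... jobs modify or delete data under /data/finance ...

// Snapshot diff: what changed between the snapshot and the current state ("")
SnapshotDiffReport diff = dfs.getSnapshotDiffReport(dir, "before-cleanup", "");
System.out.println(diff);
```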
21. HDFS 2: NFS Gateway
• Supports NFS v3 (NFS v4 is work in progress)
• Supports all HDFS commands
  • List files
  • Copy, move files
  • Create and delete directories
• Ingest for large scale analytical workloads
  • Load immutable files as source for analytical processing
  • No random writes
• Stream files into HDFS
  • Log ingest by applications writing directly to an HDFS client mount
22. HDFS 2: Performance
• Many improvements
  • Write pipeline (e.g. new primitives hflush, hsync)
  • Read path improvements for fewer memory copies
  • Short-circuit local reads for 2-3x faster random reads
  • I/O improvements using posix_fadvise()
  • libhdfs improvements for zero-copy reads
• Significant improvements: I/O 2.5 - 5x faster
24. YARN: Design Goals
• Build a new abstraction layer by splitting up the two major functions of the JobTracker
  • Cluster resource management
  • Application life-cycle management
• Allow other processing paradigms
  • Flexible API for implementing YARN apps
  • MapReduce becomes a YARN app
  • Lots of different YARN apps
25. YARN: Concepts
• Application
  – A job submitted to the YARN framework
  – Example: MapReduce job, Storm topology, …
• Container
  – Basic unit of allocation
  – Fine-grained resource allocation across multiple resource types (RAM, CPU, disk, network, GPU, etc.), e.g.
    • container_0 = 4 GB, 1 CPU
    • container_1 = 512 MB, 6 CPUs
  – Replaces the fixed map/reduce slots (see the sketch below)
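A sketch of how an ApplicationMaster asks for containers of those sizes through the AMRMClient API (the values are just the examples from above):

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
rmClient.init(new YarnConfiguration());
rmClient.start();
// (a real AM would first call rmClient.registerApplicationMaster(...))

// container_0 = 4 GB, 1 CPU  (memory in MB, CPU as virtual cores)
Resource big = Resource.newInstance(4096, 1);
// container_1 = 512 MB, 6 CPUs
Resource wide = Resource.newInstance(512, 6);

Priority prio = Priority.newInstance(0);
rmClient.addContainerRequest(new ContainerRequest(big, null, null, prio));   // no node/rack preference
rmClient.addContainerRequest(new ContainerRequest(wide, null, null, prio));
// Allocated containers arrive via rmClient.allocate(progress) in the AM's heartbeat loop.
```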
26. YARN: Architecture
• ResourceManager
  – Global resource scheduler
  – Hierarchical queues
• NodeManager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring
• ApplicationMaster
  – Per-application
  – Manages application scheduling and task execution
  – e.g. MapReduce ApplicationMaster
27. YARN: Architectural Overview
Split up the two major functions of the JobTracker: cluster resource management & application life-cycle management.
[Diagram] A single ResourceManager (with its Scheduler) coordinates many NodeManagers. Each application gets its own ApplicationMaster (AM 1, AM 2) running in a container on one of the NodeManagers, and its tasks run in further containers (Container 1.1, 1.2, 2.1, 2.2, 2.3) spread across the NodeManagers.
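For the client side of this picture, a minimal sketch of submitting an application to the ResourceManager via YarnClient; com.example.MyAppMaster is a placeholder for your own ApplicationMaster class:

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(new YarnConfiguration());
yarnClient.start();

YarnClientApplication app = yarnClient.createApplication();
ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
ctx.setApplicationName("my-yarn-app");

// Container spec for the ApplicationMaster: local resources, environment, launch command
ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
    Collections.<String, LocalResource>emptyMap(),           // local resources (app jar etc.)
    Collections.<String, String>emptyMap(),                  // environment variables
    Collections.singletonList("java -Xmx256m com.example.MyAppMaster"),
    null, null, null);                                       // service data, tokens, ACLs
ctx.setAMContainerSpec(amContainer);
ctx.setResource(Resource.newInstance(512, 1));               // 512 MB, 1 vcore for the AM

ApplicationId appId = yarnClient.submitApplication(ctx);
System.out.println("Submitted application " + appId);
```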
34. MapReduce 2: In a nutshell
• MapReduce is now a YARN app
  • No more map and reduce slots, it's containers now
  • No more JobTracker, it's the YarnAppmaster library now
• Multiple versions of MapReduce
  • The older mapred APIs work without modification or recompilation
  • The newer mapreduce APIs may need to be recompiled
• Still has one master server component: the Job History Server
  • The Job History Server stores the execution history of jobs
  • Used to audit prior execution of jobs
  • Will also be used by the YARN framework to store charge-backs at that level
• Better cluster utilization
• Increased scalability & availability
35. MapReduce 2: Shuffle
• Faster Shuffle
  • Better embedded server: Netty
• Encrypted Shuffle
  • Secures the shuffle phase as data moves across the cluster
  • Requires 2-way HTTPS, certificates on both sides
  • Causes significant CPU overhead; reserve 1 core for this work
  • Certificates stored on each node (provisioned with the cluster), refreshed every 10 secs
• Pluggable Shuffle / Sort
  • Shuffle is the first phase in MapReduce that is guaranteed to not be data-local
  • Pluggable Shuffle/Sort allows application developers or hardware developers to intercept the network-heavy workload and optimize it
  • Typical implementations have hardware components like fast networks and software components like sorting algorithms
  • API will change with future versions of Hadoop
36. MapReduce 2: Performance
• Key optimizations
  • No hard segmentation of resources into map and reduce slots
  • The YARN scheduler is more efficient
  • The MR2 framework has become more efficient than MR1: the shuffle phase in MRv2 is more performant with the usage of Netty
• 40,000+ nodes running YARN across over 365 PB of data
• About 400,000 jobs per day for about 10 million hours of compute time
• Estimated 60% - 150% improvement in node usage per day
• Got rid of a whole 10,000-node datacenter because of the increased utilization
40. HOYA: Deployment of HBase
[Diagram] The HOYA client (a YarnClient with a HOYA-specific API) asks the ResourceManager/Scheduler to launch the HOYA ApplicationMaster in a container on a NodeManager. The HOYA AM then requests additional containers and starts the HBase Master and the Region Servers in them, spread across the NodeManagers.
41. 13.02.2014
HOYA Client
YARNClient
HOYA
specific API
HOYA: Bind via ZooKeeper
ResourceManager
Scheduler
NodeManager
NodeManager
Region Server
Container
HBase
Client
HOYA
Application
Master
Container
NodeManager
Container
HBase Master
Container
Container
Region Server
Container
ZooKeeper
43. Storm: In a nutshell
• Stream processing
• Real-time processing
• Developed as a standalone application
  • https://github.com/nathanmarz/storm
• Ported to YARN
  • https://github.com/yahoo/storm-yarn
44. Storm: Conceptual view
• Spout: source of streams
• Bolt: consumer of streams; processes tuples, possibly emits new tuples
• Stream: unbounded sequence of tuples
• Tuple: list of name-value pairs
• Topology: network of spouts & bolts as the nodes and streams as the edges
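A minimal sketch of wiring spouts and bolts into a topology with the (pre-Apache) Storm Java API; EventSpout, ParseBolt and CountBolt are hypothetical classes:

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new EventSpout(), 2);                          // source of the stream
builder.setBolt("parse", new ParseBolt(), 4).shuffleGrouping("events");   // consumes tuples, emits new ones
builder.setBolt("count", new CountBolt(), 4)
       .fieldsGrouping("parse", new Fields("word"));                      // route tuples by field value

Config conf = new Config();
conf.setNumWorkers(3);
StormSubmitter.submitTopology("word-counter", conf, builder.createTopology());
```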
46. Spark: In a nutshell
• High-speed in-memory analytics over Hadoop and Hive
• Separate MapReduce-like engine
  – Speedup of up to 100x
  – On-disk queries 5-10x faster
• Compatible with Hadoop's Storage API
• Available as a standalone application
  – https://github.com/mesos/spark
• Experimental support for YARN since 0.6
  – http://spark.incubator.apache.org/docs/0.6.0/running-on-yarn.html
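A small Java sketch of the in-memory idea; the SparkConf/JavaSparkContext style shown here belongs to releases newer than the 0.6 version mentioned above, and the HDFS path is an example:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("yarn-client");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> lines = sc.textFile("hdfs:///data/logs");  // read via Hadoop's storage API
lines.cache();                                             // keep the RDD in memory

long first = lines.count();    // first action scans HDFS and fills the cache
long second = lines.count();   // later actions are served from memory - this is where the
                               // large speedup for iterative workloads comes from
sc.stop();
```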
49. Giraph: In a nutshell
• Giraph is a framework for processing semi-structured graph data on a massive scale
• Giraph is loosely based upon Google's Pregel
• Giraph performs iterative calculations on top of an existing Hadoop cluster
• Available on GitHub
  – https://github.com/apache/giraph
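A sketch of the Pregel-style "think like a vertex" model Giraph implements: in each superstep a vertex consumes messages, updates its value and may message its neighbours. The example propagates the maximum value through the graph; it follows the BasicComputation API of newer Giraph releases, so details may differ in older versions:

```java
import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Each superstep: take the maximum of the vertex value and all incoming messages;
// if it changed, tell the neighbours. Halts once no value changes any more.
public class MaxValueComputation
    extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                      Iterable<DoubleWritable> messages) throws IOException {
    double max = vertex.getValue().get();
    for (DoubleWritable msg : messages) {
      max = Math.max(max, msg.get());
    }
    if (getSuperstep() == 0 || max > vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(max));
      sendMessageToAllEdges(vertex, new DoubleWritable(max));
    }
    vertex.voteToHalt();
  }
}
```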
53. YARN: More things to consider
• Container / Tasks
• Client
• UI
• Recovery
• Container-to-AM communication
• Application history
54. YARN: Testing / Debugging
• MiniYARNCluster (see the sketch below)
  • Cluster within the process in threads (HDFS, MR, HBase)
  • Useful for regression tests
  • Avoid for unit tests: too slow
• Unmanaged AM
  • Support to run the AM outside of a YARN cluster for manual testing
  • Debugger, heap dumps, etc.
• Logs
  • Log aggregation support to push all logs into HDFS
  • Accessible via CLI, UI
  • Useful to enable all application logs by default
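A minimal sketch of a regression test that boots an in-process YARN cluster via MiniYARNCluster (the hadoop-yarn-server-tests test artifact needs to be on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

public class MiniClusterIT {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    // test name, number of NodeManagers, local dirs and log dirs per NM
    MiniYARNCluster cluster = new MiniYARNCluster("regression-test", 2, 1, 1);
    cluster.init(conf);
    cluster.start();

    // The cluster's configuration now contains the dynamically chosen RM address;
    // hand it to a YarnClient or a MapReduce job under test.
    Configuration clusterConf = cluster.getConfig();
    System.out.println("RM at " + clusterConf.get(YarnConfiguration.RM_ADDRESS));

    // ... submit and verify applications here ...

    cluster.stop();
  }
}
```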
56. Apache Tez: In a nutshell
• Distributed execution framework that works on computations represented as dataflow graphs
• Tez is Hindi for "speed"
• Naturally maps to execution plans produced by query optimizers
• Highly customizable to meet a broad spectrum of use cases and to enable dynamic performance optimizations at runtime
• Built on top of YARN
57. Apache Tez: The new primitive
[Diagram] Hadoop 1, MapReduce as the base: Pig, Hive and other tools compile to MapReduce, which handles cluster resource mgmt. + data processing on top of HDFS (redundant, reliable storage). Hadoop 2, Apache Tez as the base: Pig, Hive and others run on the Tez execution engine, real-time workloads run on Storm, and all of them share YARN (cluster resource management) and HDFS (redundant, reliable storage).
58. Apache Tez: Architecture
• Task with pluggable Input, Processor & Output
  • "Classical" Map = Task with HDFS Input, Map Processor, Sorted Output
  • "Classical" Reduce = Task with Shuffle Input, Reduce Processor, HDFS Output
• A YARN ApplicationMaster runs the DAG of Tez tasks (see the sketch below)
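To show how those pluggable pieces get wired together, a sketch of building a two-vertex word-count DAG. The factory-style API below follows later Tez releases (it changed noticeably between early versions), and com.example.TokenProcessor / com.example.SumProcessor stand in for user-written Processor implementations:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;

// Vertex 1 tokenizes the input, Vertex 2 sums the counts; the edge between them is a
// sorted, partitioned key-value shuffle - the Tez counterpart of the MapReduce shuffle.
DAG dag = DAG.create("word-count");

Vertex tokenizer = Vertex.create("Tokenizer",
    ProcessorDescriptor.create("com.example.TokenProcessor"));        // hypothetical processor
Vertex summation = Vertex.create("Summation",
    ProcessorDescriptor.create("com.example.SumProcessor"), 1);       // hypothetical processor

OrderedPartitionedKVEdgeConfig shuffle = OrderedPartitionedKVEdgeConfig
    .newBuilder(Text.class.getName(), IntWritable.class.getName(),
                HashPartitioner.class.getName())
    .build();

dag.addVertex(tokenizer)
   .addVertex(summation)
   .addEdge(Edge.create(tokenizer, summation, shuffle.createDefaultEdgeProperty()));
// The DAG is then submitted through a TezClient and executed by the Tez ApplicationMaster.
```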
59. Apache Tez: Tez Service
• MapReduce query startup is expensive:
  – Job-launch & task-launch latencies are fatal for short queries (in the order of 5s to 30s)
• Solution:
  – Tez Service (= pre-allocated ApplicationMaster)
    • Removes job-launch overhead (ApplicationMaster)
    • Removes task-launch overhead (pre-warmed containers)
  – Hive (or Pig) submits the query plan to the Tez Service
  – Native Hadoop service, not ad-hoc
60. Apache Tez: Performance
Performance gains over MapReduce
• Eliminate the replicated write barrier between successive computations
• Eliminate the job-launch overhead of workflow jobs
• Eliminate the extra stage of map reads in every workflow job
• Eliminate queue & resource contention suffered by workflow jobs that are started after a predecessor job completes
61. Apache Tez: Performance
Optimal resource management
• Reuse YARN containers to launch new tasks
• Reuse YARN containers to enable shared objects across tasks
[Diagram] On a Tez task host, the Tez ApplicationMaster starts Tez Task 1 in a YARN container; once the task finishes, the same container is reused for Tez Task 2, and both tasks can use shared objects held in the container.
62. Apache Tez: Performance
Execution plan reconfiguration at runtime
• Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality
• Advanced changes in dataflow graph structure
• Progressive graph construction in concert with the user optimizer
[Diagram] Example: Stage 1 runs 50 mappers producing 100 partitions from HDFS blocks and YARN resources; Stage 2 then runs with 10 reducers if the intermediate data is < 10 GB, or 100 reducers if it is > 10 GB. Decision made at runtime!
63. Apache Tez: Performance
Dynamic physical data flow decisions
• Decide the type of physical byte movement and storage on the fly
• Store intermediate data on a distributed store, a local store, or in-memory
• Transfer bytes via block files, streaming, or anything in between
[Diagram] Example: the producer hands data to the consumer in-memory if it is < 32 GB, or via a local file if it is > 32 GB. Decision made at runtime!
64. Apache Tez: Performance
SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Existing Hive:       Parse Query 0.5s → Create Plan 0.5s → Launch MapReduce 20s → Process MapReduce 10s → Total 31s
Hive/Tez:            Parse Query 0.5s → Create Plan 0.5s → Launch MapReduce 20s → Process MapReduce 2s → Total 23s
Tez & Hive Service:  Parse Query 0.5s → Create Plan 0.5s → Submit to Tez Service 0.5s → Process MapReduce 2s → Total 3.5s

* No exact numbers, for illustration only
65. Hive: Current Focus Area
[Diagram] Query-latency spectrum (0-5s / 5s - 1m / 1m - 1h / 1h+) by workload type:
• Real-Time (0-5s): online systems, real-time analytics, CEP
• Interactive (5s - 1m): parameterized reports, drilldown, visualization, exploration
• Non-Interactive (1m - 1h): data preparation, incremental batch processing, dashboards / scorecards
• Batch (1h+): operational batch processing, enterprise reports, data mining
The current Hive sweet spot lies in the non-interactive and batch ranges.
66. Stinger: Extending the sweet spot
[Diagram] Same latency spectrum and workload types as before; the planned Hive expansion extends the sweet spot from the non-interactive/batch ranges into the interactive (5s - 1m) range.
Improve Latency & Throughput
• Query engine improvements
• New "Optimized RCFile" column store
• Next-gen runtime (eliminates MapReduce latency)
Extend Deep Analytical Ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases
69. Stinger: Tez revisited
SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram] Hive on MapReduce executes this query as three chained MapReduce jobs (Job 1, Job 2, Job 3) with an I/O synchronization barrier between each pair of jobs; Hive on Tez executes it as a single job, avoiding those barriers.
71. Stinger: ORCFile in a nutshell
• Optimized Row Columnar File
• Columns stored separately
• Knows types
  • Uses type-specific encoders
  • Stores statistics (min, max, sum, count)
• Has a light-weight index
  • Skip over blocks of rows that don't matter
• Larger blocks
  • 256 MB by default
  • Has an index for block boundaries
72. Stinger: ORCFile Layout
[Diagram] The columnar format arranges columns adjacent within the file for compression and fast access; the large block size is well suited for HDFS.
73. Stinger: ORCFile advantages
• High Compression
  • Many tricks used out-of-the-box to ensure high compression rates
  • RLE, dictionary encoding, etc.
• High Performance
  • Inline indexes record value ranges within blocks of ORCFile data
  • Filter pushdown allows efficient scanning during precise queries
• Flexible Data Model
  • All Hive types, including maps, structs and unions
75. Stinger: More optimizations
• Join Optimizations
  • New join types added or improved in Hive 0.11
  • More efficient query plan generation
• Vectorization
  • Make the most use of L1 and L2 caches
  • Avoid branching whenever possible
• Optimized Query Planner
  • Automatically determine optimal execution parameters
• Buffering
  • Cache hot data in memory
• …