4. Current Limitations
• Scalability
– Maximum cluster size – 4,000 nodes
– Maximum concurrent tasks – 40,000
– Coarse synchronization in JobTracker
• Single point of failure
– A JobTracker failure kills all queued and running jobs
– Jobs need to be re-submitted by users
• Restart is very tricky due to complex state
5. Current Limitations
• Hard partition of resources into map and reduce slots
– Low resource utilization
• Lacks support for alternate paradigms
– Iterative applications implemented using MapReduce are 10x slower
– Hacks needed for the likes of MPI and graph processing
• Lack of wire-compatible protocols
– Client and cluster must be of same version
– Applications and workflows cannot migrate to different clusters
6. Requirements
• Reliability
• Availability
• Utilization
• Wire Compatibility
• Agility & Evolution – Ability for customers to control upgrades to the grid software stack
• Scalability – Clusters of 6,000-10,000 machines
– Each machine with 16 cores, 48 GB/96 GB RAM, 24 TB/36 TB disks
– 100,000+ concurrent tasks
– 10,000 concurrent jobs
7. Design Centre
• Split up the two major functions of JobTracker
– Cluster resource management
– Application life-cycle management
• MapReduce becomes a user-land library
8. Concepts
• Application
– An application is a job submitted to the framework
– Example – a MapReduce job
• Container
– Basic unit of allocation
– Example – container A = 2 GB RAM, 1 CPU (see the sketch below)
– Fine-grained resource allocation
– Replaces the fixed map/reduce slots
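For concreteness, here is a minimal sketch of how such a container could be requested through the Hadoop 2.x (YARN) Java client API. The 2 GB / 1 CPU figures mirror the example above; the priority value and the absence of locality constraints are illustrative assumptions, not anything prescribed by these slides.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerExample {
  public static ContainerRequest requestContainer() {
    // Fine-grained allocation: 2048 MB of memory and 1 virtual core,
    // replacing the old fixed-size map/reduce slot.
    Resource capability = Resource.newInstance(2048, 1);
    Priority priority = Priority.newInstance(0);

    // No locality constraints here; node and rack names can be passed
    // in place of the nulls when data locality matters.
    return new ContainerRequest(capability, null, null, priority);
  }
}
```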
9. Architecture
• Resource Manager
– Global resource scheduler
– Hierarchical queues
• Node Manager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring
• Application Master
– Per-application
– Manages application scheduling and task execution (see the sketch below)
– E.g. MapReduce Application Master
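To make the three roles concrete, the following is a minimal, illustrative ApplicationMaster sketch, assuming the Hadoop 2.x AMRMClient/NMClient APIs: it registers with the ResourceManager (the global scheduler), asks it for one container, and launches a command via the NodeManager hosting the allocated container. The `echo` command, the resource sizes, and the omitted security/localization details are placeholder assumptions.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class MinimalApplicationMaster {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    // Resource Manager side: global scheduling of containers.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    // Node Manager side: per-machine agent that launches and monitors containers.
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();

    // Ask the RM for one 2 GB / 1 vcore container (no locality preference).
    rmClient.addContainerRequest(new ContainerRequest(
        Resource.newInstance(2048, 1), null, null, Priority.newInstance(0)));

    // Heartbeat loop: each allocate() call doubles as the AM heartbeat.
    boolean launched = false;
    while (!launched) {
      AllocateResponse response = rmClient.allocate(0.1f);
      for (Container container : response.getAllocatedContainers()) {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList("echo hello-yarn")); // placeholder command
        nmClient.startContainer(container, ctx);
        launched = true;
      }
      Thread.sleep(1000);
    }

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
  }
}
```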
11. Improvements vis-à-vis classic MapReduce
• Utilization
– Generic resource model
• Data Locality (node, rack, etc.; see the request sketch after this list)
• Memory
• CPU
• Disk b/w
• Network b/w
– Removes the fixed partition into map and reduce slots
• Scalability
– Application life-cycle management is very expensive when centralized in the JobTracker
– Partition resource management and application life-cycle management
– Application management is distributed
– Hardware trends – currently run clusters of 4,000 machines
• 6,000 machines in 2012 > 12,000 machines in 2009
• <16+ cores, 48/96 GB RAM, 24 TB disks> vs. <8 cores, 16 GB RAM, 4 TB disks>
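As an illustration of the generic resource model and its locality dimension (node, rack), here is a hedged sketch of a locality-aware container request using the Hadoop 2.x AMRMClient API; the node and rack names, resource sizes, and priority are hypothetical.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LocalityAwareRequest {
  public static ContainerRequest build() {
    // Generic resource model: memory and CPU are requested explicitly,
    // rather than implied by a fixed map or reduce slot.
    Resource capability = Resource.newInstance(4096, 2);

    // Locality preferences: prefer the node holding the data, then its rack.
    // "datanode17" and "/rack-03" are hypothetical names.
    String[] nodes = { "datanode17" };
    String[] racks = { "/rack-03" };

    return new ContainerRequest(capability, nodes, racks, Priority.newInstance(1));
  }
}
```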
12. Improvements vis-à-vis classic MapReduce
• Fault Tolerance and Availability
– Resource Manager
• No single point of failure – state saved in ZooKeeper (coming soon)
• Application Masters are restarted automatically on RM restart
– Application Master
• Optional failover via application-specific checkpoint
• MapReduce applications pick up where they left off via state saved in HDFS
• Wire Compatibility
– Protocols are wire-compatible
– Old clients can talk to new servers
– Rolling upgrades
13. Improvements vis-à-vis classic MapReduce
• Innovation and Agility
– MapReduce now becomes a user-land library
– Multiple versions of MapReduce can run in the same cluster (a la Apache Pig; see the configuration sketch below)
• Faster deployment cycles for improvements
– Customers upgrade MapReduce versions on their schedule
– Users can customize MapReduce
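As one concrete illustration of the user-land-library point, later Hadoop 2.x releases let each job ship the MapReduce framework version it wants to run against. The sketch below assumes the `mapreduce.application.framework.path` and `mapreduce.application.classpath` properties and uses a hypothetical HDFS path for the framework tarball; the exact property values are illustrative, not prescribed by these slides.

```java
import org.apache.hadoop.conf.Configuration;

public class PerJobFrameworkConfig {
  public static Configuration build() {
    Configuration conf = new Configuration();

    // Run on YARN rather than the classic JobTracker.
    conf.set("mapreduce.framework.name", "yarn");

    // Ship a specific MapReduce version with the job (hypothetical HDFS path),
    // so different jobs on the same cluster can use different MR libraries.
    conf.set("mapreduce.application.framework.path",
        "hdfs:///apps/mapreduce/mr-framework-2.x.tar.gz#mr-framework");
    conf.set("mapreduce.application.classpath",
        "mr-framework/share/hadoop/mapreduce/*:mr-framework/share/hadoop/mapreduce/lib/*");

    return conf;
  }
}
```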