SlideShare a Scribd company logo
1 of 21
Download to read offline
YARN, the Apache Hadoop
Platform for Streaming,
Realtime and Batch Processing
Eric Charles [http://echarles.net] @echarles
Datalayer [http://datalayer.io] @datalayerio
FOSDEM 02 Feb 2014 – NoSQL DevRoom
@datalayerio | hacks@datalayer.io | https://github.com/datalayer
eric@apache.org
Eric Charles (@echarles)
Java Developer
Apache Member
Apache James Committer
Apache Onami Committer
Apache HBase Contributor
Worked in London with Hadoop, Hive,
Cascading, HBase, Cassandra,
Elasticsearch, Kafka and Storm
Just founded Datalayer

@datalayerio | hacks@datalayer.io | https://github.com/datalayer


Map Reduce V1 Limits


Scalability






Availability







Maximum Cluster size – 4,000 nodes
Maximum concurrent tasks – 40,000
Coarse synchronization in JobTracker
Job Tracker failure kills all queued and running jobs

No alternate paradigms and services
Iterative applications implemented using MapReduce are
slow (HDFS read/write)

Map Reduce V2 (= “NextGen”) based on YARN


(not 'mapred' vs 'mapreduce' package)
@datalayerio | hacks@datalayer.io | https://github.com/datalayer
YARN as a Layer
All problems in computer science can be solved
by another level of indirection
– David Wheeler
Hive

Pig

Map Reduce V2

Storm
Spark

Giraph
Hama
OpenMPI

HBase

Mem
cached

YARN

Cluster and Resource Management
HDFS
λ
λ YARN a.k.a. Hadoop 2.0 separates
λ the cluster and resource management
λ from the
λ processing components
@datalayerio | hacks@datalayer.io | https://github.com/datalayer

...
Components








A global Resource
Manager
A per-node slave Node
Manager
A per-application
Application Master
running on a Node
Manager
A per-application
Container running on a
Node Manager

@datalayerio | hacks@datalayer.io | https://github.com/datalayer
Yahoo!
Yahoo! has been running
35000 nodes of YARN in
production for over 8 months
now since begin 2013

[http://strata.oreilly.com/2013/06/moving-from-batch-to-continuous-computing-at-yahoo.html ]

@datalayerio | hacks@datalayer.io | https://github.com/datalayer
Twitter

@datalayerio | hacks@datalayer.io | https://github.com/datalayer
Get It!


Download




http://www.apache.org/dyn/closer.cgi/hadoop/comm
on/

Unzip and configure


mapred-site.xml




mapreduce.framework.name = yarn

yarn-site.xml



yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.aux-services.mapreduce_shuffle.clas
s = org.apache.hadoop.mapred.ShuffleHandler
@datalayerio | hacks@datalayer.io | https://github.com/datalayer


Namenode

http://namenode:50070



Namenode Browser

http://namenode:50075/logs



Secondary Namenode

http://snamenode:50090



Resource Manager



Application Status



Resource Node Manager



Mapreduce JobHistory Server http://manager:19888

http://manager:8088/cluster
http://manager:8089/proxy/<app-id>
http://manager:8042/node

@datalayerio | hacks@datalayer.io | https://github.com/datalayer
YARNed


Batch











Spark



Hive / Pig /
Cascading / ...

Storm



Map Reduce

Graph

Streaming

Kafka

Realtime



Giraph



HBase



Hama



Memcached



OpenMPI
@datalayerio | hacks@datalayer.io | https://github.com/datalayer
Batch
Apache Tez : Fast response times and extreme
throughput to execute complex DAG of tasks
“The future of #Hadoop runs on #Tez”

MR-V1

MR V2
Hive

Hive

Pig

Casca
ding

Map Reduce

HDFS

Pig

Under Development
Casca
ding

MR

Hive

Pig

MR V2

Tez

YARN

YARN

HDFS

HDFS

@datalayerio | hacks@datalayer.io | https://github.com/datalayer

Casca
ding
Streaming


Storm-YARN enables Storm applications to utilize the
computational resources in a Hadoop cluster along with
accessing Hadoop storage resources such as HBase and
HDFS

Spark





YARN

Storm [https://github.com/yahoo/storm-yarn]




Storm / Spark / Kafka

Need to build a YARN-Enabled Assembly JAR
Goal is more to integrate Map Reduce e.g. SIMR supports
MRV1

Kafka with Samza [http://samza.incubator.apache.org]


Implements StreamTask



Execution Engine: YARN
Storage Layer: Kafka, not HDFS
@datalayerio | hacks@datalayer.io | https://github.com/datalayer
@Yahoo!

From “Storm and Hadoop: Convergence of Big-Data
and Low-Latency Processing | YDN Blog - Yahoo.html”
@datalayerio | hacks@datalayer.io | https://github.com/datalayer
HBase


HBase

YARN

YARN Resource Manager
Hoya [https://github.com/hortonworks/hoya.git]
YARN Node Manager



Allows users to create on-demand HBase clusters

Hoya Client




Hoya AM [HBase Master]

HDFS

Allow different users/applications to run different versions
of HBase
HDFS

Allow users to configure different HBase instances
differently
YARN Node Manager
YARN Node Manager
Stop / Suspend / Resume clusters as needed Server
HBase Region



HBase Region Server



HBase Region Server



CLIHDFS
based

Expand / shrink clusters as needed
HDFS

@datalayerio | hacks@datalayer.io | https://github.com/datalayer
Graph


Giraph / Hama

YARN

Giraph


Offline batch processing of semi-structured graph data on a massive scale






Compatible with Hadoop 2.x
"Pure YARN" build profile

Manages Failure Scenarios


Worker/container failure during a job?



What happens if our App Master fails during a job?



Application Master allows natural bootstrapping of Giraph jobs



Next Steps




Own Management WEB UI





Zookeeper in AM
...

Abstracting the Giraph framework logic away from MapReduce has made porting
Giraph to other platforms like Mesos possible
(from “Giraph on YARN - Qcon SF”)
@datalayerio | hacks@datalayer.io | https://github.com/datalayer
Options


Apache Mesos




Can run Hadoop, Jenkins, Spark, Aurora...



http://www.quora.com/How-does-YARN-compare-to-Mesos





Cluster manager

http://hortonworks.com/community/forums/topic/yarn-vs-mesos/

Apache Helix






Generic cluster management framework
YARN automates service deployment, resource allocation, and code
distribution. However, it leaves state management and fault-handling
mostly to the application developer.
Helix focuses on service operation but relies on manual hardware
provisioning and service deployment.

@datalayerio | hacks@datalayer.io | https://github.com/datalayer
You Looser!



More Devops and IO



Tuning and Debugging the Application Master and Container is hard



Both AM and RM based on an asynchronous event framework




Deal with RPC Connection loose - Split Brain, AM Recovery... !!!





No flow control
What happens if a worker/container or a App Master fails?

New Application Master per MR Job - No JVM Reuse for MR


Tez-on-Yarn will fix these



No Long living Application Master (see YARN-896)



New application code development difficult



Resource Manager SPOF (chuch... don't even ask this)



No mixed V1/V2 Map Reduce (supported by some commecrial
distribution)
@datalayerio | hacks@datalayer.io | https://github.com/datalayer
You Rocker!


Sort and Shuffle speed gain for Map Reduce



Real-time processing with Batch Processing Collocation brings





Elasticity to share resource (Memory/CPU/...)
Sharing data between realtime and batch - Reduce network transfers
and total cost of acquiring the data

High expectations from #Tez





Long Living Sessions
Avoid HDFS Read/Write

High expectations from #Twill


Remote Procedure Calls between containers



Lifecycle Management



Logging
@datalayerio | hacks@datalayer.io | https://github.com/datalayer
Your App?


Your App?

YARN

WHY porting your App on YARN?


Benefit from existing *-yarn projects



Reuse unused cluster resource



Common Monitoring, Management and
Security framework



Avoid HDFS write on reduce (via Tez)



Abstract and Port to other platforms



@datalayerio | hacks@datalayer.io | https://github.com/datalayer
Summary


YARN brings


One component, One responsiblity!!!









Resource Management
Data Processing

Multiple applications and patterns in Hadoop

Many organizations are already building and
using applications on YARN
Try YARN and Contribute!
@datalayerio | hacks@datalayer.io | https://github.com/datalayer
Thank You!
Questions ?

(Special Thx to @acmurthy and @steveloughran for helping tweets)

@echarles @datalayerio
http://datalayer.io/hacks
http://datalayer.io/jobs
@datalayerio | hacks@datalayer.io | https://github.com/datalayer

More Related Content

What's hot

Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceDataWorks Summit
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Sujee Maniyam
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesDataWorks Summit
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
Operationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudOperationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudDataWorks Summit/Hadoop Summit
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystemJakub Stransky
 
Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer  Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer Mary Kypreos
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made EasyDataWorks Summit
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Hortonworks
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigDataWorks Summit/Hadoop Summit
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
Apache REEF - stdlib for big data
Apache REEF - stdlib for big dataApache REEF - stdlib for big data
Apache REEF - stdlib for big dataSergiy Matusevych
 

What's hot (20)

Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Operationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudOperationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the Cloud
 
Module01
 Module01 Module01
Module01
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystem
 
Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer  Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made Easy
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Apache REEF - stdlib for big data
Apache REEF - stdlib for big dataApache REEF - stdlib for big data
Apache REEF - stdlib for big data
 

Similar to 20140202 fosdem-nosql-devroom-hadoop-yarn

Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2Big Data Joe™ Rossi
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableStefan Kupstaitis-Dunkler
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLAdam Muise
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopPOSSCON
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7Ted Dunning
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]Shweta Patnaik
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSandish Kumar H N
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Mich Talebzadeh (Ph.D.)
 

Similar to 20140202 fosdem-nosql-devroom-hadoop-yarn (20)

Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Huhadoop - v1.1
Huhadoop - v1.1Huhadoop - v1.1
Huhadoop - v1.1
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in Hadoop
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Devops Spark Streaming
Devops Spark StreamingDevops Spark Streaming
Devops Spark Streaming
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto Comparition
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
 

Recently uploaded

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 

Recently uploaded (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

20140202 fosdem-nosql-devroom-hadoop-yarn

  • 1. YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing Eric Charles [http://echarles.net] @echarles Datalayer [http://datalayer.io] @datalayerio FOSDEM 02 Feb 2014 – NoSQL DevRoom @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 2. eric@apache.org Eric Charles (@echarles) Java Developer Apache Member Apache James Committer Apache Onami Committer Apache HBase Contributor Worked in London with Hadoop, Hive, Cascading, HBase, Cassandra, Elasticsearch, Kafka and Storm Just founded Datalayer @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 3.  Map Reduce V1 Limits  Scalability     Availability     Maximum Cluster size – 4,000 nodes Maximum concurrent tasks – 40,000 Coarse synchronization in JobTracker Job Tracker failure kills all queued and running jobs No alternate paradigms and services Iterative applications implemented using MapReduce are slow (HDFS read/write) Map Reduce V2 (= “NextGen”) based on YARN  (not 'mapred' vs 'mapreduce' package) @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 4. YARN as a Layer All problems in computer science can be solved by another level of indirection – David Wheeler Hive Pig Map Reduce V2 Storm Spark Giraph Hama OpenMPI HBase Mem cached YARN Cluster and Resource Management HDFS λ λ YARN a.k.a. Hadoop 2.0 separates λ the cluster and resource management λ from the λ processing components @datalayerio | hacks@datalayer.io | https://github.com/datalayer ...
  • 5. Components     A global Resource Manager A per-node slave Node Manager A per-application Application Master running on a Node Manager A per-application Container running on a Node Manager @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 6. Yahoo! Yahoo! has been running 35000 nodes of YARN in production for over 8 months now since begin 2013 [http://strata.oreilly.com/2013/06/moving-from-batch-to-continuous-computing-at-yahoo.html ] @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 7. Twitter @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 8. Get It!  Download   http://www.apache.org/dyn/closer.cgi/hadoop/comm on/ Unzip and configure  mapred-site.xml   mapreduce.framework.name = yarn yarn-site.xml   yarn.nodemanager.aux-services = mapreduce_shuffle yarn.nodemanager.aux-services.mapreduce_shuffle.clas s = org.apache.hadoop.mapred.ShuffleHandler @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 9.  Namenode http://namenode:50070  Namenode Browser http://namenode:50075/logs  Secondary Namenode http://snamenode:50090  Resource Manager  Application Status  Resource Node Manager  Mapreduce JobHistory Server http://manager:19888 http://manager:8088/cluster http://manager:8089/proxy/<app-id> http://manager:8042/node @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 10. YARNed  Batch       Spark  Hive / Pig / Cascading / ... Storm  Map Reduce Graph Streaming Kafka Realtime  Giraph  HBase  Hama  Memcached  OpenMPI @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 11. Batch Apache Tez : Fast response times and extreme throughput to execute complex DAG of tasks “The future of #Hadoop runs on #Tez” MR-V1 MR V2 Hive Hive Pig Casca ding Map Reduce HDFS Pig Under Development Casca ding MR Hive Pig MR V2 Tez YARN YARN HDFS HDFS @datalayerio | hacks@datalayer.io | https://github.com/datalayer Casca ding
  • 12. Streaming  Storm-YARN enables Storm applications to utilize the computational resources in a Hadoop cluster along with accessing Hadoop storage resources such as HBase and HDFS Spark    YARN Storm [https://github.com/yahoo/storm-yarn]   Storm / Spark / Kafka Need to build a YARN-Enabled Assembly JAR Goal is more to integrate Map Reduce e.g. SIMR supports MRV1 Kafka with Samza [http://samza.incubator.apache.org]  Implements StreamTask   Execution Engine: YARN Storage Layer: Kafka, not HDFS @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 13. @Yahoo! From “Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing | YDN Blog - Yahoo.html” @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 14. HBase  HBase YARN YARN Resource Manager Hoya [https://github.com/hortonworks/hoya.git] YARN Node Manager  Allows users to create on-demand HBase clusters Hoya Client   Hoya AM [HBase Master] HDFS Allow different users/applications to run different versions of HBase HDFS Allow users to configure different HBase instances differently YARN Node Manager YARN Node Manager Stop / Suspend / Resume clusters as needed Server HBase Region  HBase Region Server  HBase Region Server  CLIHDFS based Expand / shrink clusters as needed HDFS @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 15. Graph  Giraph / Hama YARN Giraph  Offline batch processing of semi-structured graph data on a massive scale    Compatible with Hadoop 2.x "Pure YARN" build profile Manages Failure Scenarios  Worker/container failure during a job?  What happens if our App Master fails during a job?  Application Master allows natural bootstrapping of Giraph jobs  Next Steps   Own Management WEB UI   Zookeeper in AM ... Abstracting the Giraph framework logic away from MapReduce has made porting Giraph to other platforms like Mesos possible (from “Giraph on YARN - Qcon SF”) @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 16. Options  Apache Mesos   Can run Hadoop, Jenkins, Spark, Aurora...  http://www.quora.com/How-does-YARN-compare-to-Mesos   Cluster manager http://hortonworks.com/community/forums/topic/yarn-vs-mesos/ Apache Helix    Generic cluster management framework YARN automates service deployment, resource allocation, and code distribution. However, it leaves state management and fault-handling mostly to the application developer. Helix focuses on service operation but relies on manual hardware provisioning and service deployment. @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 17. You Looser!  More Devops and IO  Tuning and Debugging the Application Master and Container is hard  Both AM and RM based on an asynchronous event framework   Deal with RPC Connection loose - Split Brain, AM Recovery... !!!   No flow control What happens if a worker/container or a App Master fails? New Application Master per MR Job - No JVM Reuse for MR  Tez-on-Yarn will fix these  No Long living Application Master (see YARN-896)  New application code development difficult  Resource Manager SPOF (chuch... don't even ask this)  No mixed V1/V2 Map Reduce (supported by some commecrial distribution) @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 18. You Rocker!  Sort and Shuffle speed gain for Map Reduce  Real-time processing with Batch Processing Collocation brings    Elasticity to share resource (Memory/CPU/...) Sharing data between realtime and batch - Reduce network transfers and total cost of acquiring the data High expectations from #Tez    Long Living Sessions Avoid HDFS Read/Write High expectations from #Twill  Remote Procedure Calls between containers  Lifecycle Management  Logging @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 19. Your App?  Your App? YARN WHY porting your App on YARN?  Benefit from existing *-yarn projects  Reuse unused cluster resource  Common Monitoring, Management and Security framework  Avoid HDFS write on reduce (via Tez)  Abstract and Port to other platforms  @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 20. Summary  YARN brings  One component, One responsiblity!!!      Resource Management Data Processing Multiple applications and patterns in Hadoop Many organizations are already building and using applications on YARN Try YARN and Contribute! @datalayerio | hacks@datalayer.io | https://github.com/datalayer
  • 21. Thank You! Questions ? (Special Thx to @acmurthy and @steveloughran for helping tweets) @echarles @datalayerio http://datalayer.io/hacks http://datalayer.io/jobs @datalayerio | hacks@datalayer.io | https://github.com/datalayer