SlideShare una empresa de Scribd logo
1 de 37
Descargar para leer sin conexión
Introduction To YARN
Adam Kawa, Spotify
          th
The 9 Meeting of Warsaw Hadoop User Group




2/23/13
About Me
Data Engineer at Spotify, Sweden
Hadoop Instructor at Compendium (Cloudera Training Partner)
+2.5 year of experience in Hadoop




2/23/13
“Classical” Apache Hadoop Cluster
Can consist of one, five, hundred or 4000 nodes




2/23/13
Hadoop Golden Era
One of the most exciting open-source projects on the planet
Becomes the standard for large-scale processing
Successfully deployed by hundreds/thousands of companies
Like a salutary virus
... But it has some little drawbacks



2/23/13
HDFS Main Limitations

Single NameNode that
   keeps all metadata in RAM
   performs all metadata operations
   becomes a single point of failure (SPOF)
(not the topic of this presentation… )

2/23/13
“Classical” MapReduce Limitations
Limited scalability
Poor resource utilization
Lack of support for the alternative frameworks
Lack of wire-compatible protocols




2/23/13
Limited Scalability
Scales to only
    ~4,000-node cluster
    ~40,000 concurrent tasks
Smaller and less-powerful clusters must be created




2/23/13
Image source: http://www.flickr.com/photos/rsms/660332848/sizes/l/in/set-72157607427138760/
Poor Cluster Resources Utilization




                       Hard-coded values. TaskTrackers
                       must be restarted after a change.



2/23/13
Poor Cluster Resources Utilization




   Map tasks are waiting for the slots   Hard-coded values. TaskTrackers
   which are                             must be restarted after a change.
   NOT currently used by Reduce tasks

2/23/13
Resources Needs That Varies Over Time
                             How to
                             hard-code the
                             number of map
                             and reduce slots
                             efficiently?




2/23/13
(Secondary) Benefits Of Better Utilization
The same calculation on more efficient Hadoop cluster
    Less data center space
    Less silicon waste
    Less power utilizations
    Less carbon emissions
    =
    Less disastrous environmental consequences
2/23/13
Image source: http://globeattractions.com/wp-content/uploads/2012/01/green-leaf-drops-green-hd-leaf-nature-wet.jpg
“Classical” MapReduce Daemons




2/23/13
JobTracker Responsibilities
Manages the computational resources (map and reduce slots)
Schedules all user jobs
   Schedules all tasks that belongs to a job
   Monitors tasks executions
   Restarts failed and speculatively runs slow tasks
   Calculates job counters totals

2/23/13
Simple Observation
JobTracker has lots tasks...




2/23/13
JobTracker Redesign Idea
Reduce responsibilities of JobTracker
   Separate cluster resource management from job
   coordination
   Use slaves (many of them!) to manage jobs life-cycle
Scale to (at least) 10K nodes, 10K jobs, 100K tasks



2/23/13
Disappearance Of The Centralized JobTracker




2/23/13
YARN – Yet Another Resource Negotiator




2/23/13
Resource Manager and Application Master




2/23/13
YARN Applications
MRv2 (simply MRv1 rewritten to run on top of YARN)
   No need to rewrite existing MapReduce jobs
Distributed shell
Spark, Apache S4 (a real-time processing)
Hamster (MPI on Hadoop)
Apache Giraph (a graph processing)
Apache HAMA (matrix, graph and network algorithms)
   ... and http://wiki.apache.org/hadoop/PoweredByYarn
2/23/13
NodeManager
More flexible and efficient than TaskTracker
Executes any computation that makes sense to ApplicationMaster
   Not only map or reduce tasks
Containers with variable resource sizes (e.g. RAM, CPU, network, disk)
   No hard-coded split into map and reduce slots




2/23/13
NodeManager Containers
NodeManager creates a container for each task
Container contains variable resources sizes
   e.g. 2GB RAM, 1 CPU, 1 disk
Number of created containers is limited
   By total sum of resources available on NodeManager



2/23/13
Application Submission In YARN




2/23/13
Resource Manager Components




2/23/13
Uberization
Running the small jobs in the same JVM as AplicationMaster

mapreduce.job.ubertask.enable                     true (false is default)
mapreduce.job.ubertask.maxmaps                    9*
mapreduce.job.ubertask.maxreduces 1*
mapreduce.job.ubertask.maxbytes                   dfs.block.size*


* Users may override these values, but only downward

2/23/13
Interesting Differences
Task progress, status, counters are passed directly to the
  Application Master
Shuffle handlers (long-running auxiliary services in Node
  Managers) serve map outputs to reduce tasks
YARN does not support JVM reuse for map/reduce tasks




2/23/13
Fault-Tollerance
Failure of the running tasks or NodeManagers is similar as in MRv1
Applications can be retried several times
      yarn.resourcemanager.am.max-retries         1
      yarn.app.mapreduce.am.job.recovery.enable   false
Resource Manager can start Application Master in a new container
      yarn.app.mapreduce.am.job.recovery.enable   false



2/23/13
Resource Manager Failure
Two interesting features
   RM restarts without the need to re-run currently
   running/submitted applications [YARN-128]
   ZK-based High Availability (HA) for RM [YARN-149] (still in
   progress)




2/23/13
Wire-Compatible Protocol
Client and cluster may use different versions
More manageable upgrades
   Rolling upgrades without disrupting the service
   Active and standby NameNodes upgraded independently
Protocol buffers chosen for serialization (instead of Writables)



2/23/13
YARN Maturity
Still considered as both production and not production-ready
   The code is already promoted to the trunk
   Yahoo! runs it on 2000 and 6000-node clusters
      Using Apache Hadoop 2.0.2 (Alpha)
Anyway, it will replace MRv1 sooner than later



2/23/13
How To Quickly Setup The YARN Cluster




2/23/13
Basic YARN Configuration Parameters
yarn.nodemanager.resource.memory-mb        8192
yarn.nodemanager.resource.cpu-cores        8


yarn.scheduler.minimum-allocation-mb       1024
yarn.scheduler.maximum-allocation-mb       8192
yarn.scheduler.minimum-allocation-vcores   1
yarn.scheduler.maximum-allocation-vcores   32



2/23/13
Basic MR Apps Configuration Parameters
mapreduce.map.cpu.vcores     1
mapreduce.reduce.cpu.vcores 1


mapreduce.map.memory.mb      1536
mapreduce.map.java.opts      -Xmx1024M
mapreduce.reduce.memory.mb   3072
mapreduce.reduce.java.opts   -Xmx2560M
mapreduce.task.io.sort.mb    512

2/23/13
Apache Whirr Configuration File




2/23/13
Running And Destroying Cluster
$ whirr launch-cluster --config hadoop-yarn.properties
$ whirr destroy-cluster --config hadoop-yarn.properties




2/23/13
It Might Be A Demo
But what about running
production applications
on the real cluster
consisting of hundreds of nodes?


Join Spotify!
jobs@spotify.com
2/23/13
2/23/13
Image source: http://allthingsd.com/files/2012/07/10Questions.jpeg
Thank You!




2/23/13

Más contenido relacionado

La actualidad más candente

Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Simplilearn
 
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Simplilearn
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 

La actualidad más candente (20)

Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
An Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop YarnAn Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop Yarn
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
 
Introduction to Ansible
Introduction to AnsibleIntroduction to Ansible
Introduction to Ansible
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 

Destacado

Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Adam Kawa
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
DataWorks Summit
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
Adam Kawa
 

Destacado (20)

Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS Federation
 
Apache Hadoop Java API
Apache Hadoop Java APIApache Hadoop Java API
Apache Hadoop Java API
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Introduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUGIntroduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUG
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
 
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
Multi risk factor model
Multi risk factor model Multi risk factor model
Multi risk factor model
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Yarn
YarnYarn
Yarn
 
Hadoop Migration from 0.20.2 to 2.0
Hadoop Migration from 0.20.2 to 2.0Hadoop Migration from 0.20.2 to 2.0
Hadoop Migration from 0.20.2 to 2.0
 
Hadoop YARN overview
Hadoop YARN overviewHadoop YARN overview
Hadoop YARN overview
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 

Similar a Apache Hadoop YARN

Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Yahoo Developer Network
 

Similar a Apache Hadoop YARN (20)

E031201032036
E031201032036E031201032036
E031201032036
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
H04502048051
H04502048051H04502048051
H04502048051
 
Map Reduce along with Amazon EMR
Map Reduce along with Amazon EMRMap Reduce along with Amazon EMR
Map Reduce along with Amazon EMR
 
Presentation on Large Scale Data Management
Presentation on Large Scale Data ManagementPresentation on Large Scale Data Management
Presentation on Large Scale Data Management
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map Reduce
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
 
YARN (2).pptx
YARN (2).pptxYARN (2).pptx
YARN (2).pptx
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch training
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Taming Latency: Case Studies in MapReduce Data Analytics
Taming Latency: Case Studies in MapReduce Data AnalyticsTaming Latency: Case Studies in MapReduce Data Analytics
Taming Latency: Case Studies in MapReduce Data Analytics
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
Hadoop with Lustre WhitePaper
Hadoop with Lustre WhitePaperHadoop with Lustre WhitePaper
Hadoop with Lustre WhitePaper
 

Último

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Último (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

Apache Hadoop YARN

  • 1. Introduction To YARN Adam Kawa, Spotify th The 9 Meeting of Warsaw Hadoop User Group 2/23/13
  • 2. About Me Data Engineer at Spotify, Sweden Hadoop Instructor at Compendium (Cloudera Training Partner) +2.5 year of experience in Hadoop 2/23/13
  • 3. “Classical” Apache Hadoop Cluster Can consist of one, five, hundred or 4000 nodes 2/23/13
  • 4. Hadoop Golden Era One of the most exciting open-source projects on the planet Becomes the standard for large-scale processing Successfully deployed by hundreds/thousands of companies Like a salutary virus ... But it has some little drawbacks 2/23/13
  • 5. HDFS Main Limitations Single NameNode that keeps all metadata in RAM performs all metadata operations becomes a single point of failure (SPOF) (not the topic of this presentation… ) 2/23/13
  • 6. “Classical” MapReduce Limitations Limited scalability Poor resource utilization Lack of support for the alternative frameworks Lack of wire-compatible protocols 2/23/13
  • 7. Limited Scalability Scales to only ~4,000-node cluster ~40,000 concurrent tasks Smaller and less-powerful clusters must be created 2/23/13 Image source: http://www.flickr.com/photos/rsms/660332848/sizes/l/in/set-72157607427138760/
  • 8. Poor Cluster Resources Utilization Hard-coded values. TaskTrackers must be restarted after a change. 2/23/13
  • 9. Poor Cluster Resources Utilization Map tasks are waiting for the slots Hard-coded values. TaskTrackers which are must be restarted after a change. NOT currently used by Reduce tasks 2/23/13
  • 10. Resources Needs That Varies Over Time How to hard-code the number of map and reduce slots efficiently? 2/23/13
  • 11. (Secondary) Benefits Of Better Utilization The same calculation on more efficient Hadoop cluster Less data center space Less silicon waste Less power utilizations Less carbon emissions = Less disastrous environmental consequences 2/23/13 Image source: http://globeattractions.com/wp-content/uploads/2012/01/green-leaf-drops-green-hd-leaf-nature-wet.jpg
  • 13. JobTracker Responsibilities Manages the computational resources (map and reduce slots) Schedules all user jobs Schedules all tasks that belongs to a job Monitors tasks executions Restarts failed and speculatively runs slow tasks Calculates job counters totals 2/23/13
  • 14. Simple Observation JobTracker has lots tasks... 2/23/13
  • 15. JobTracker Redesign Idea Reduce responsibilities of JobTracker Separate cluster resource management from job coordination Use slaves (many of them!) to manage jobs life-cycle Scale to (at least) 10K nodes, 10K jobs, 100K tasks 2/23/13
  • 16. Disappearance Of The Centralized JobTracker 2/23/13
  • 17. YARN – Yet Another Resource Negotiator 2/23/13
  • 18. Resource Manager and Application Master 2/23/13
  • 19. YARN Applications MRv2 (simply MRv1 rewritten to run on top of YARN) No need to rewrite existing MapReduce jobs Distributed shell Spark, Apache S4 (a real-time processing) Hamster (MPI on Hadoop) Apache Giraph (a graph processing) Apache HAMA (matrix, graph and network algorithms) ... and http://wiki.apache.org/hadoop/PoweredByYarn 2/23/13
  • 20. NodeManager More flexible and efficient than TaskTracker Executes any computation that makes sense to ApplicationMaster Not only map or reduce tasks Containers with variable resource sizes (e.g. RAM, CPU, network, disk) No hard-coded split into map and reduce slots 2/23/13
  • 21. NodeManager Containers NodeManager creates a container for each task Container contains variable resources sizes e.g. 2GB RAM, 1 CPU, 1 disk Number of created containers is limited By total sum of resources available on NodeManager 2/23/13
  • 24. Uberization Running the small jobs in the same JVM as AplicationMaster mapreduce.job.ubertask.enable true (false is default) mapreduce.job.ubertask.maxmaps 9* mapreduce.job.ubertask.maxreduces 1* mapreduce.job.ubertask.maxbytes dfs.block.size* * Users may override these values, but only downward 2/23/13
  • 25. Interesting Differences Task progress, status, counters are passed directly to the Application Master Shuffle handlers (long-running auxiliary services in Node Managers) serve map outputs to reduce tasks YARN does not support JVM reuse for map/reduce tasks 2/23/13
  • 26. Fault-Tollerance Failure of the running tasks or NodeManagers is similar as in MRv1 Applications can be retried several times yarn.resourcemanager.am.max-retries 1 yarn.app.mapreduce.am.job.recovery.enable false Resource Manager can start Application Master in a new container yarn.app.mapreduce.am.job.recovery.enable false 2/23/13
  • 27. Resource Manager Failure Two interesting features RM restarts without the need to re-run currently running/submitted applications [YARN-128] ZK-based High Availability (HA) for RM [YARN-149] (still in progress) 2/23/13
  • 28. Wire-Compatible Protocol Client and cluster may use different versions More manageable upgrades Rolling upgrades without disrupting the service Active and standby NameNodes upgraded independently Protocol buffers chosen for serialization (instead of Writables) 2/23/13
  • 29. YARN Maturity Still considered as both production and not production-ready The code is already promoted to the trunk Yahoo! runs it on 2000 and 6000-node clusters Using Apache Hadoop 2.0.2 (Alpha) Anyway, it will replace MRv1 sooner than later 2/23/13
  • 30. How To Quickly Setup The YARN Cluster 2/23/13
  • 31. Basic YARN Configuration Parameters yarn.nodemanager.resource.memory-mb 8192 yarn.nodemanager.resource.cpu-cores 8 yarn.scheduler.minimum-allocation-mb 1024 yarn.scheduler.maximum-allocation-mb 8192 yarn.scheduler.minimum-allocation-vcores 1 yarn.scheduler.maximum-allocation-vcores 32 2/23/13
  • 32. Basic MR Apps Configuration Parameters mapreduce.map.cpu.vcores 1 mapreduce.reduce.cpu.vcores 1 mapreduce.map.memory.mb 1536 mapreduce.map.java.opts -Xmx1024M mapreduce.reduce.memory.mb 3072 mapreduce.reduce.java.opts -Xmx2560M mapreduce.task.io.sort.mb 512 2/23/13
  • 34. Running And Destroying Cluster $ whirr launch-cluster --config hadoop-yarn.properties $ whirr destroy-cluster --config hadoop-yarn.properties 2/23/13
  • 35. It Might Be A Demo But what about running production applications on the real cluster consisting of hundreds of nodes? Join Spotify! jobs@spotify.com 2/23/13