Apache Hadoop MapReduce
What next?


Arun C. Murthy
Founder & Architect
@acmurthy (@hortonworks)




Hello! I’m Arun
• Founder/Architect at Hortonworks Inc.
  – Lead, Map-Reduce
  – Formerly, Architect Hadoop MapReduce, Yahoo
  – Responsible for running Hadoop MR as a service for all of Yahoo
    (50k nodes footprint)

• Apache Hadoop, ASF
  – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
  – Long-term Committer/PMC member (full time >6 years)
  – Release Manager for hadoop-2




Agenda

• Hadoop MapReduce, State of the Art
• Hadoop YARN
   – Overview
   – State of the art

• Art of the possible
   – YARN Runtime
   – MapReduce Framework

• Q&A



Hadoop MapReduce
State of the Art




Hadoop MapReduce Classic
• JobTracker
  – Manages cluster resources and job scheduling
• TaskTracker
  – Per-node agent
  – Manages tasks
Hadoop 1 – Enterprise Ready
• Hadoop 1.x is the most stable & reliable version of
  Hadoop MapReduce ever
   – Proven to be reliable at the most demanding Hadoop clusters
     in the world
• CapacityScheduler for Multi-tenancy
   – Share clusters at scale
   – Resource & User limits for fine-grained control
   – Queue & Job ACLs
   – Resilient to misbehaving/rogue applications, users etc.,
     helping drive SLA for applications, pipelines etc.




Hadoop 1 – Availability for MR
• JobTracker Restart
  – Enhanced to restart all jobs on rare JT failures


• JobTracker Safemode
  – Admin driven for known issues
  – Auto-monitoring of HDFS for full-stack availability




Hadoop YARN
Overview & Status Quo




MapReduce - Areas for Improvement
 • Utilization
 • Scalability
    – Maximum Cluster size – 4,000 nodes
    – Maximum concurrent tasks – 40,000
 • Hard partition of resources into map and reduce slots
 • Lacks support for alternate paradigms
 • Lack of wire-compatible protocols




Requirements
• Reliability
• Availability
• Utilization
• Wire Compatibility
• Agility & Evolution – Ability for customers to control
  upgrades to the grid software stack.
• Scalability - Clusters of 6,000-10,000 machines
   – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB
     disks
   – 100,000+ concurrent tasks
   – 10,000 concurrent jobs
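
Rough arithmetic on the scalability figures, as a sketch only. The numbers come from the slides; the variable names are illustrative. It compares concurrent tasks per node at the classic limits (4,000 nodes, 40,000 tasks) with the target of 100,000 tasks, taken here as the lower bound of "100,000+", on 6,000 and 10,000 machines.

```python
# Figures taken from the slides; names are illustrative.
classic_nodes, classic_tasks = 4_000, 40_000
target_tasks = 100_000  # lower bound of "100,000+"

# Concurrent tasks per node at the classic MapReduce limits.
classic_per_node = classic_tasks / classic_nodes

# Concurrent tasks per machine at both ends of the target cluster size.
target_per_node = {machines: target_tasks / machines for machines in (6_000, 10_000)}

print(classic_per_node)
for machines, per_node in target_per_node.items():
    print(machines, round(per_node, 1))
```

At the small end of the target range this works out to roughly one task per core on a 16-core machine.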

Design Centre
• Split up the two major functions of JobTracker
   – Cluster resource management
   – Application life-cycle management
• MapReduce becomes a user-land library




Concepts
• Application
   – Application is a job submitted to the framework
   – Example – MapReduce job
• Container
   – Basic unit of allocation
   – Example – container A = 2GB, 1CPU
   – Replaces the fixed map/reduce slots
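
A minimal sketch of why containers help utilization compared with fixed slots. This is a toy model, not YARN code. The 48 GB / 16-core node and the 2 GB / 1 CPU container come from the slides; the 8 map + 8 reduce slot split and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node model used only for this sketch."""
    memory_gb: int
    cpus: int


def slot_capacity(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Classic model: a map task may only use a map slot, a reduce task only a reduce slot."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def container_capacity(node, container_gb, container_cpus, pending_tasks):
    """Container model: any task may use any free capacity on the node."""
    fit = min(node.memory_gb // container_gb, node.cpus // container_cpus)
    return min(fit, pending_tasks)


node = Node(memory_gb=48, cpus=16)

# Map-heavy phase: 20 maps waiting, no reduces yet.
classic = slot_capacity(map_slots=8, reduce_slots=8, pending_maps=20, pending_reduces=0)
yarn = container_capacity(node, container_gb=2, container_cpus=1, pending_tasks=20)

print(classic, yarn)  # 8 16
```

In the slot model the eight reduce slots sit idle during the map-heavy phase; in the container model the same capacity runs more map tasks.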




Architecture
• Resource Manager
   – Global resource scheduler
   – Hierarchical queues
• Node Manager
   – Per-machine agent
   – Manages the life-cycle of containers
   – Container resource monitoring
• Application Master
   – Per-application
   – Manages application scheduling and task execution
   – E.g. MapReduce Application Master
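
The division of labour between the three roles can be sketched as a toy simulation. This is not the YARN API: class and method names are hypothetical, only memory is tracked, and queues, heartbeats and failures are ignored. The point is that the Resource Manager only hands out capacity, while each Application Master decides what to run in it.

```python
class NodeManager:
    """Per-machine agent: runs the containers it is asked to launch."""

    def __init__(self, name, memory_gb):
        self.name = name
        self.free_gb = memory_gb
        self.running = []

    def launch(self, app_id, memory_gb):
        self.running.append((app_id, memory_gb))


class ResourceManager:
    """Global scheduler: reserves capacity on a node, knows nothing about tasks."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_gb):
        for node in self.nodes:
            if node.free_gb >= memory_gb:
                node.free_gb -= memory_gb
                return node
        return None  # no capacity right now


class ApplicationMaster:
    """Per-application: requests containers and manages its own tasks."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.placements = []
        self.pending = 0

    def run_tasks(self, num_tasks, memory_gb):
        for _ in range(num_tasks):
            node = self.rm.allocate(memory_gb)
            if node is None:
                self.pending += 1  # would be retried later
            else:
                node.launch(self.app_id, memory_gb)
                self.placements.append(node.name)


nodes = [NodeManager("nm1", 4), NodeManager("nm2", 4)]
rm = ResourceManager(nodes)
am = ApplicationMaster("job_1", rm)
am.run_tasks(num_tasks=5, memory_gb=2)

print(am.placements, am.pending)
```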

Architecture

[Diagram: Clients, the Resource Manager, and several Node Managers hosting App Mstr and Container processes. Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request]

How do I get it?

• Available in hadoop-2.0.0-alpha release




Performance

• 2x+ across the board (HDFS, YARN, MapReduce)
• MapReduce
  – Unlock lots of improvements from Terasort record (Owen/Arun, 2009)
      – Shuffle 30%+
      – Merge improvements
  – Small Jobs – Uber AM
  – Re-use task slots (containers)

   http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/




Resources

hadoop-2.0.0 (alpha release):
http://hadoop.apache.org/common/releases.html

Release Documentation:
http://hadoop.apache.org/common/docs/r2.0.0-alpha/




Art of the possible
YARN Runtime
MapReduce Framework




Looking ahead

• YARN
  – Runtime improvements
  – Alternate programming models
  – Long(er) running services

• MapReduce
  – Framework enhancements
  – Unpack!




YARN - Roadmap

• Scheduler
  – Multi-dimensional resource scheduling (MAPREDUCE-4327)
  – Preemption (MAPREDUCE-3938)
  – Gang scheduling

• Runtime improvements
  – Container isolation (MAPREDUCE-4334)
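
A small sketch of two of the scheduler ideas, under stated assumptions. It is not taken from the referenced JIRAs. "Multi-dimensional" is read here as: a request fits only if every requested resource fits. "Gang scheduling" is read in its usual sense of all-or-nothing placement. The function names and the dictionary layout are hypothetical.

```python
def fits(free, request):
    """Multi-dimensional check: every requested resource must be available."""
    return all(free.get(resource, 0) >= amount for resource, amount in request.items())


def gang_allocate(nodes, request, count):
    """Place `count` identical containers or none at all.

    `nodes` is a list of dicts of free resources. It is updated in place
    only if the whole gang fits; otherwise it is left untouched.
    """
    trial = [dict(node) for node in nodes]  # work on copies
    placement = []
    for _ in range(count):
        for index, free in enumerate(trial):
            if fits(free, request):
                for resource, amount in request.items():
                    free[resource] -= amount
                placement.append(index)
                break
        else:
            return None  # one member cannot be placed, so nothing is placed
    nodes[:] = trial
    return placement


nodes = [{"memory_gb": 4, "cpus": 1}, {"memory_gb": 4, "cpus": 2}]
request = {"memory_gb": 2, "cpus": 1}

too_big = gang_allocate(nodes, request, count=4)
ok = gang_allocate(nodes, request, count=3)

print(too_big, ok)  # None [0, 1, 1]
```

A memory-only scheduler would have placed two containers on the first node; the CPU dimension limits it to one.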




YARN - Data Processing Applications

• OpenMPI on Hadoop
• Spark (UC Berkeley)
  – Shark is Hive-on-Spark
• Real-time data processing
  – Storm (Twitter)
  – Apache S4
• Graph processing – Apache Giraph




YARN - Beyond Data Processing Apps

• Apache HBase
  – Deployment via YARN (HBASE-4329)
  – Co-processors via YARN (HBASE-4047)
• Simple deployment for cluster services




MapReduce – Way Forward

• MapReduce Framework Runtime
   – Monolithic software
• MR Runtime?
   – Sort, Merge, Shuffle et al.
• Unpack into smaller building blocks!
   – Allow applications and Pig/Hive to ‘plug-n-play’
   – MR framework, as we know it today, becomes a particular
     configuration of the building blocks




MapReduce – Pluggable Sort

• Pig & Hive benefit from hash-based aggregation
  – Several queries don’t need full-sort of map-outputs
  – Aggregation suffices
  – Allow for pluggable MapOutputBuffer in MapTask
  – Sort Avoidance – MAPREDUCE-4039
  – External sort plugin – MAPREDUCE-2454
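
To illustrate why aggregation can replace a full sort, here is a toy map-side pipeline in which the output buffer is a parameter. It is a sketch of the idea only: the functions below are hypothetical and unrelated to the actual MapOutputBuffer class or to the patches in the JIRAs above.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sorting_buffer(pairs):
    """Classic behaviour: fully sort map output by key, then combine equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, sum(value for _, value in group))
            for key, group in groupby(ordered, key=itemgetter(0))]


def hashing_buffer(pairs):
    """Hash-based aggregation: one pass, no sort, keys come out unordered."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def run_map_side(records, map_fn, output_buffer):
    """The buffer is pluggable: the caller chooses sorting or hashing."""
    pairs = [pair for record in records for pair in map_fn(record)]
    return output_buffer(pairs)


def word_count_map(line):
    return [(word, 1) for word in line.split()]


lines = ["pig hive pig", "hive pig"]
sorted_out = run_map_side(lines, word_count_map, sorting_buffer)
hashed_out = run_map_side(lines, word_count_map, hashing_buffer)

print(sorted_out)
print(dict(hashed_out) == dict(sorted_out))  # True
```

Both buffers produce the same totals; only the sorting one guarantees key order, which a pure aggregation query does not need.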




MapReduce – Pluggable Shuffle

• Push vs. pull shuffle
• Plug shuffle implementation (already in hadoop-2)
   – E.g. RDMA for shuffle
   – MAPREDUCE-4049
• Collation tasks
   – Sailfish – Yahoo Research (includes auto-tuning of reduces)




MapReduce – More ideas

• Allow for Map-Reduce-Reduce
  – Allow for reduce output to be sorted/shuffled
  – JOIN followed by ORDER BY
  – Really big deal for Pig/Hive
• DAG Management for Pig/Hive
  – Scheduling improvements
  – Restart semantics
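
A toy illustration of Map-Reduce-Reduce for the "JOIN followed by ORDER BY" case. It is a sketch with made-up data and hypothetical function names, not Hadoop code. The first reduce performs the join; its output goes straight through a second shuffle into a second reduce, without an intermediate job and identity map.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key and return the groups in key order (stand-in for sort/shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def join_reduce(user_id, values):
    """First reduce: join a user's name with the sum of that user's order amounts."""
    name = next(value for tag, value in values if tag == "name")
    total = sum(value for tag, value in values if tag == "amount")
    # Emit keyed by total so that the next shuffle orders rows by it.
    yield total, name


def order_reduce(total, names):
    """Second reduce: groups arrive ordered by total; emit rows in that order."""
    for name in sorted(names):
        yield name, total


# Made-up input, already tagged as a map phase would do.
users = [(1, ("name", "ann")), (2, ("name", "bob"))]
orders = [(1, ("amount", 30)), (2, ("amount", 5)), (1, ("amount", 10))]

stage1 = [row for key, values in shuffle(users + orders) for row in join_reduce(key, values)]
stage2 = [row for key, values in shuffle(stage1) for row in order_reduce(key, values)]

print(stage2)  # [('bob', 5), ('ann', 40)]
```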




MapReduce – How do we get there?

• Multiple, concurrent implementations of MapReduce
  – YARN is a really big deal…
  – Allows for safe experiments, much less risky!
  – Exposure surface is highly limited




Questions?




Thank You.
@acmurthy





Apache Hadoop MapReduce - What Next? Hadoop Summit 2012
