SlideShare a Scribd company logo
1 of 27
Apache Hadoop MapReduce
What is next?


Arun C. Murthy
Founder & Architect
@acmurthy (@hortonworks)




                           Page 1
Hello! I’m Arun
• Founder/Architect at Hortonworks Inc.
  – Lead, Map-Reduce
  – Formerly, Architect Hadoop MapReduce, Yahoo
  – Responsible for running Hadoop MR as a service for all of Yahoo
    (50k nodes footprint)

• Apache Hadoop, ASF
  – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
  – Long-term Committer/PMC member (full time >6 years)
  – Release Manager for hadoop-2




                                                                 Page 2
Agenda

• Yesterday: Hadoop MapReduce, circa 2011
• Today: Hadoop YARN
   – Overview
   – State of the art

•  Art of the possible
   – YARN Runtime
   – MapReduce Framework

• Q&A



                                            Page 3
Hadoop MapReduce
circa 2011




                   Page 4
Hadoop MapReduce Classic
•  JobTracker
   –  Manages cluster resources and job scheduling
•  TaskTracker
   –  Per-node agent

   –  Manage tasks
Current Limitations
•  Utilization

•  Scalability
   –  Maximum Cluster size – 4,000 nodes
   –  Maximum concurrent tasks – 40,000
   –  Coarse synchronization in JobTracker

•  Single point of failure
   –  Failure kills all queued and running jobs
   –  Jobs restarted on bounce




                             6
Current Limitations
•  Hard partition of resources into map and reduce slots
   –  Low resource utilization

•  Lacks support for alternate paradigms
   –  Iterative applications implemented using MapReduce are
      10x slower
   –  Hacks for the likes of MPI/Graph Processing

•  Lack of wire-compatible protocols
   –  Client and cluster must be of same version
   –  Applications and workflows cannot migrate to different
      clusters



                             7
Hadoop YARN
Overview




              Page 8
Requirements
•  Reliability
•  Availability
•  Utilization
•  Wire Compatibility
•  Agility & Evolution – Ability for customers to control
   upgrades to the grid software stack.
•  Scalability - Clusters of 6,000-10,000 machines
   –  Each machine with 16 cores, 48G/96G RAM, 24TB/36TB
      disks
   –  100,000+ concurrent tasks
   –  10,000 concurrent jobs

                            9
Design Centre
•  Split up the two major functions of JobTracker
   –  Cluster resource management
   –  Application life-cycle management

•  MapReduce becomes user-land library




                           10
Architecture
•  Application
   –  Application is a job submitted to the framework
   –  Example – Map Reduce Job

•  Container
   –  Basic unit of allocation
   –  Example – container A = 2GB, 1CPU
   –  Replaces the fixed map/reduce slots




                            11
Architecture
•  Resource Manager
   –  Global resource scheduler
   –  Hierarchical queues

•  Node Manager
   –  Per-machine agent
   –  Manages the life-cycle of container
   –  Container resource monitoring

•  Application Master
   –  Per-application
   –  Manages application scheduling and task execution
   –  E.g. MapReduce Application Master

                            12
Architecture

                                             Node
                                             Node
                                            Manager
                                            Manager


                                      Container   App Mstr
                                                  App Mstr


       Client

                           Resource          Node
                                             Node
                           Resource
                           Manager
                           Manager          Manager
                                            Manager
       Client
        Client

                                      App Mstr    Container
                                                  Container




        MapReduce Status                     Node
                                             Node
        MapReduce Status
                                            Manager
                                            Manager
          Job Submission
         Job Submission
           Node Status
          Node Status
        Resource Request
        Resource Request              Container   Container
How do I get it?

•  Available in hadoop-2.0.0-alpha release




                          14
Performance

• 2x+ across the board
• MapReduce
    – Unlock lots of improvements from Terasort record (Owen/Arun,
      2009)
          – Shuffle 30%+
          – Merge improvements
    – Small Jobs – Uber AM
    – Re-use task slots (containers)
More details: http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/




                                                                                           Page 15
Resources

hadoop-2.0.0 (alpha release):
http://hadoop.apache.org/common/releases.html

Release Documentation:
http://hadoop.apache.org/common/docs/r2.0.0-alpha/




                                                     Page 16
Art of the possible
YARN Runtime
MapReduce Framework




                      Page 17
Looking ahead

• YARN
  – Runtime Improvements
  – Alternate programming models
  – Long(er) running services

• MapReduce
  – Framework enhancements
  – Unpack!




                                   Page 18
YARN - Roadmap

• Scheduler
  – Multi-dimensional resource scheduling (MAPREDUCE-4327)
  – Preemption (MAPREDUCE-3938)
  – Gang scheduling



• Runtime improvements
  – Container Isolation (MAPREDUCE-4334)




                                                        Page 19
YARN - Data Processing Applications

• OpenMPI on Hadoop
• Spark (UC Berkeley)
   – Shark is Hive-on-Spark
• Real-time data processing
   – Storm (Twitter)
   – Apache S4
•  Graph processing – Apache Giraph




                                      Page 20
YARN - Beyond Data Processing Apps

• Apache Hbase
  – Deployment via YARN (HBASE-4329)
  – Co-processors via YARN (HBASE-4047)
• Simple deployment for cluster services




                                           Page 21
MapReduce – Way Forward

• MapReduce Framework Runtime
  – Monolithic software
• MR Runtime?
  – Sort, Merge, Shuffle et al
• Unpack into smaller building blocks!
  – Allow applications and Pig/Hive to ‘plug-n-play’
  – MR framework, as we know today, becomes a particular
    configuration of the building blocks




                                                           Page 22
MapReduce – Pluggable Sort

• Pig & Hive benefit from hash-based aggregation
  – Several queries don’t need full-sort of map-outputs
  – Aggregation suffices
  – Allow for pluggable MapOutputBuffer in MapTask
  – Sort Avoidance - MAPREDUCE-4039
  – External sort plugin – MAPREDUCE-2454




                                                          Page 23
MapReduce – Pluggable Shuffle

• Push v/s Pull shuffle
• Plug shuffle implementation (already in hadoop-2)
   – E.g. RDMA for shuffle
   – MAPREDUCE-4049
• Collation tasks
   – Sailfish - Yahoo Research (includes auto-tuning of reduces)




                                                                   Page 24
MapReduce – More ideas

• Allow for Map-Reduce-Reduce
  – Allow for reduce output to be sorted/shuffled
  – JOIN followed by ORDER BY
  – Really big deal for Pig/Hive




                                                    Page 25
MapReduce – How do we get there?

• Multiple, concurrent implementations of MapReduce
  – YARN is a really big deal…
  – Allows for safe experiments, much less risky!
  – Exposure surface is highly limited




                                                      Page 26
Questions?




Thank You.
@acmurthy




             Page 27

More Related Content

What's hot

Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
DataWorks Summit
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
DataWorks Summit
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?
Rajarshi Guha
 

What's hot (20)

Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
 
Hadoop 2.0, MRv2 and YARN - Module 9
Hadoop 2.0, MRv2 and YARN - Module 9Hadoop 2.0, MRv2 and YARN - Module 9
Hadoop 2.0, MRv2 and YARN - Module 9
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
YARN - Presented At Dallas Hadoop User Group
YARN - Presented At Dallas Hadoop User GroupYARN - Presented At Dallas Hadoop User Group
YARN - Presented At Dallas Hadoop User Group
 
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]
 
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...
 
Towards SLA-based Scheduling on YARN Clusters
Towards SLA-based Scheduling on YARN ClustersTowards SLA-based Scheduling on YARN Clusters
Towards SLA-based Scheduling on YARN Clusters
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache Tez
 
Hadoop - Past, Present and Future - v2.0
Hadoop - Past, Present and Future - v2.0Hadoop - Past, Present and Future - v2.0
Hadoop - Past, Present and Future - v2.0
 
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
 
YARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformYARN: a resource manager for analytic platform
YARN: a resource manager for analytic platform
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
Anatomy of Hadoop YARN
Anatomy of Hadoop YARNAnatomy of Hadoop YARN
Anatomy of Hadoop YARN
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?
 

Viewers also liked

Viewers also liked (10)

Harmony eng(1)
Harmony eng(1)Harmony eng(1)
Harmony eng(1)
 
My Career Profile
My Career ProfileMy Career Profile
My Career Profile
 
*7* Marketing Basic Fundamentals on How to INCREASE Your Revenue
*7* Marketing Basic Fundamentals on How to INCREASE Your Revenue*7* Marketing Basic Fundamentals on How to INCREASE Your Revenue
*7* Marketing Basic Fundamentals on How to INCREASE Your Revenue
 
5. ejercicios de respiración programada
5.  ejercicios de respiración programada5.  ejercicios de respiración programada
5. ejercicios de respiración programada
 
Augueries of rhapsodies
Augueries of rhapsodiesAugueries of rhapsodies
Augueries of rhapsodies
 
Pirmais kredīts bezmaksas
Pirmais kredīts bezmaksasPirmais kredīts bezmaksas
Pirmais kredīts bezmaksas
 
Humanity weeps
Humanity weepsHumanity weeps
Humanity weeps
 
Ātrie kredīti no 20 gadiem
Ātrie kredīti no 20 gadiemĀtrie kredīti no 20 gadiem
Ātrie kredīti no 20 gadiem
 
Fiestario
FiestarioFiestario
Fiestario
 
2. fonema r preparatorio
2.  fonema r preparatorio2.  fonema r preparatorio
2. fonema r preparatorio
 

Similar to Apache Hadoop MapReduce: What's Next

Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hortonworks
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
Hortonworks
 
Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduce
huguk
 
YARN Hadoop Summit Bangalore 2011
YARN Hadoop Summit Bangalore 2011YARN Hadoop Summit Bangalore 2011
YARN Hadoop Summit Bangalore 2011
Sharad Agarwal
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
hdhappy001
 

Similar to Apache Hadoop MapReduce: What's Next (20)

Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
 
Hadoop bangalore-meetup-dec-2011-hadoop nextgen
Hadoop bangalore-meetup-dec-2011-hadoop nextgenHadoop bangalore-meetup-dec-2011-hadoop nextgen
Hadoop bangalore-meetup-dec-2011-hadoop nextgen
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Hadoop World 2011: Apache Hadoop 0.23 - Arun Murthy, Horton Works
Hadoop World 2011: Apache Hadoop 0.23 - Arun Murthy, Horton WorksHadoop World 2011: Apache Hadoop 0.23 - Arun Murthy, Horton Works
Hadoop World 2011: Apache Hadoop 0.23 - Arun Murthy, Horton Works
 
Apache Hadoop 0.23 at Hadoop World 2011
Apache Hadoop 0.23 at Hadoop World 2011Apache Hadoop 0.23 at Hadoop World 2011
Apache Hadoop 0.23 at Hadoop World 2011
 
Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduce
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
YARN Hadoop Summit Bangalore 2011
YARN Hadoop Summit Bangalore 2011YARN Hadoop Summit Bangalore 2011
YARN Hadoop Summit Bangalore 2011
 
Yarnthug2014
Yarnthug2014Yarnthug2014
Yarnthug2014
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdf
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 

Apache Hadoop MapReduce: What's Next

  • 1. Apache Hadoop MapReduce What is next? Arun C. Murthy Founder & Architect @acmurthy (@hortonworks) Page 1
  • 2. Hello! I’m Arun • Founder/Architect at Hortonworks Inc. – Lead, Map-Reduce – Formerly, Architect Hadoop MapReduce, Yahoo – Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes footprint) • Apache Hadoop, ASF – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC) – Long-term Committer/PMC member (full time >6 years) – Release Manager for hadoop-2 Page 2
  • 3. Agenda • Yesterday: Hadoop MapReduce, circa 2011 • Today: Hadoop YARN – Overview – State of the art •  Art of the possible – YARN Runtime – MapReduce Framework • Q&A Page 3
  • 5. Hadoop MapReduce Classic •  JobTracker –  Manages cluster resources and job scheduling •  TaskTracker –  Per-node agent –  Manage tasks
  • 6. Current Limitations •  Utilization •  Scalability –  Maximum Cluster size – 4,000 nodes –  Maximum concurrent tasks – 40,000 –  Coarse synchronization in JobTracker •  Single point of failure –  Failure kills all queued and running jobs –  Jobs restarted on bounce 6
  • 7. Current Limitations •  Hard partition of resources into map and reduce slots –  Low resource utilization •  Lacks support for alternate paradigms –  Iterative applications implemented using MapReduce are 10x slower –  Hacks for the likes of MPI/Graph Processing •  Lack of wire-compatible protocols –  Client and cluster must be of same version –  Applications and workflows cannot migrate to different clusters 7
  • 9. Requirements •  Reliability •  Availability •  Utilization •  Wire Compatibility •  Agility & Evolution – Ability for customers to control upgrades to the grid software stack. •  Scalability - Clusters of 6,000-10,000 machines –  Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks –  100,000+ concurrent tasks –  10,000 concurrent jobs 9
  • 10. Design Centre •  Split up the two major functions of JobTracker –  Cluster resource management –  Application life-cycle management •  MapReduce becomes user-land library 10
  • 11. Architecture •  Application –  Application is a job submitted to the framework –  Example – Map Reduce Job •  Container –  Basic unit of allocation –  Example – container A = 2GB, 1CPU –  Replaces the fixed map/reduce slots 11
  • 12. Architecture •  Resource Manager –  Global resource scheduler –  Hierarchical queues •  Node Manager –  Per-machine agent –  Manages the life-cycle of container –  Container resource monitoring •  Application Master –  Per-application –  Manages application scheduling and task execution –  E.g. MapReduce Application Master 12
  • 13. Architecture Node Node Manager Manager Container App Mstr App Mstr Client Resource Node Node Resource Manager Manager Manager Manager Client Client App Mstr Container Container MapReduce Status Node Node MapReduce Status Manager Manager Job Submission Job Submission Node Status Node Status Resource Request Resource Request Container Container
  • 14. How do I get it? •  Available in hadoop-2.0.0-alpha release 14
  • 15. Performance • 2x+ across the board • MapReduce – Unlock lots of improvements from Terasort record (Owen/Arun, 2009) – Shuffle 30%+ – Merge improvements – Small Jobs – Uber AM – Re-use task slots (containers) More details: http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/ Page 15
  • 16. Resources hadoop-2.0.0 (alpha release): http://hadoop.apache.org/common/releases.html Release Documentation: http://hadoop.apache.org/common/docs/r2.0.0-alpha/ Page 16
  • 17. Art of the possible YARN Runtime MapReduce Framework Page 17
  • 18. Looking ahead • YARN – Runtime Improvements – Alternate programming models – Long(er) running services • MapReduce – Framework enhancements – Unpack! Page 18
  • 19. YARN - Roadmap • Scheduler – Multi-dimensional resource scheduling (MAPREDUCE-4327) – Preemption (MAPREDUCE-3938) – Gang scheduling • Runtime improvements – Container Isolation (MAPREDUCE-4334) Page 19
  • 20. YARN - Data Processing Applications • OpenMPI on Hadoop • Spark (UC Berkeley) – Shark is Hive-on-Spark • Real-time data processing – Storm (Twitter) – Apache S4 •  Graph processing – Apache Giraph Page 20
  • 21. YARN - Beyond Data Processing Apps • Apache Hbase – Deployment via YARN (HBASE-4329) – Co-processors via YARN (HBASE-4047) • Simple deployment for cluster services Page 21
  • 22. MapReduce – Way Forward • MapReduce Framework Runtime – Monolithic software • MR Runtime? – Sort, Merge, Shuffle et al • Unpack into smaller building blocks! – Allow applications and Pig/Hive to ‘plug-n-play’ – MR framework, as we know today, becomes a particular configuration of the building blocks Page 22
  • 23. MapReduce – Pluggable Sort • Pig & Hive benefit from hash-based aggregation – Several queries don’t need full-sort of map-outputs – Aggregation suffices – Allow for pluggable MapOutputBuffer in MapTask – Sort Avoidance - MAPREDUCE-4039 – External sort plugin – MAPREDUCE-2454 Page 23
  • 24. MapReduce – Pluggable Shuffle • Push v/s Pull shuffle • Plug shuffle implementation (already in hadoop-2) – E.g. RDMA for shuffle – MAPREDUCE-4049 • Collation tasks – Sailfish - Yahoo Research (includes auto-tuning of reduces) Page 24
  • 25. MapReduce – More ideas • Allow for Map-Reduce-Reduce – Allow for reduce output to be sorted/shuffled – JOIN followed by ORDER BY – Really big deal for Pig/Hive Page 25
  • 26. MapReduce – How do we get there? • Multiple, concurrent implementations of MapReduce – YARN is a really big deal… – Allows for safe experiments, much less risky! – Exposure surface is highly limited Page 26