Tuning Hadoop for Performance
                      Srigurunath Chakravarthi
                      Performance Engineering,
                          Yahoo! Bangalore
                             Doc Ver 1.0
                            March 5, 2010

Yahoo! Confidential                              1
Outline


  •         Why worry about performance?
  •         Recap of Hadoop Design
          –           Control Flow (Map, Shuffle, Reduce phases)
  •         Key performance considerations
  •         Thumb rules for tuning Hadoop
          –           Cluster level
          –           Application level
  •         Wrap up




Why Worry About Performance?

  Why Measure/Track Performance?
  •         Tells you the ROI on your hardware.
  •         Surfaces silent performance regressions from
          –           Faulty and “slow” (malfunctioning) disks/NICs/CPUs
          –           Software/Configuration Upgrades, etc.
  Why Improve Performance?
  •         Faster results and better ROI :-)
  •         There are non-obvious, yet simple ways to
          –           Push up cluster/app performance without adding hardware
          –           Unlock cluster/app performance by mitigating bottlenecks


  And The Good News Is… Hadoop is designed to be tunable by users
          –           25+ performance influencing tunable parameters
          –           Cluster-wide and Job-specific controls
Recap of Hadoop Design


  [Figure: control flow across three nodes. On each node, Map Tasks read input from HDFS and write to Local Disk; each node has a Task Tracker; Reduce Tasks use Local Disk and write output to HDFS.]

Key Performance influencing factors

  Multiple Orthogonal factors
  •         Cluster Hardware Configuration
          –           # cores, RAM, # disks per node, disk speeds, network topology, etc.
                  Example: If your app is data intensive, can you drive sufficiently good disk throughput?
                      Do you have sufficient RAM (to decrease # trips to disk)?
  •         Application logic related
          –           Degree of Parallelism: M-R favors embarrassingly parallel apps
          –           Load Balance: Slowest tasks impact M-R job completion time.
  •         System Bottlenecks
          –           Thrashing your CPU/memory/disks/network degrades performance severely
  •         Resource Under-utilization
          –           Your app may not be pushing system limits enough.
  •         Scale
          –           Bottlenecks from centralized components (Job Tracker and Name Node).


Key Performance influencing factors
                           Tuning Opportunities

  •         Cluster Hardware Configuration
          –           Hardware purchase/upgrade time decision. (Outside scope of this presentation.)
  •         Application logic related
          –           Tied to app logic. (Outside scope of this presentation.)
          –           Countering Load Imbalance:
                  •       Typically mitigated by adapting the user algorithm to avoid “long tails”.
                  •       Examples: Re-partitioning; imposing per-task hard limits on input/output sizes.
          –           Handling Non-Parallelism:
                  •       Run the app as a pipeline of M-R jobs, with sequential portions as single reducers.
          –           Record Combining:
                  •       Map-side and reduce-side combiners
  •         System Bottlenecks & Resource Under-utilization
          –           These can be mitigated by tuning Hadoop (covered in the rest of this presentation).
  •         Scale
          –           Relevant to large (1000+ node) clusters. (Outside scope of this presentation.)
System Usage Characteristics
                        Resource Intensiveness
  M-R Step                  CPU    Memory   Network   Disk   Notes
  Serve Map Input                           Yes*      Yes    *For remote maps (minority)
  Execute Map Function      Yes*                             *Depends on app
  Store Map Output          Yes*   Yes+               Yes    *If compression is ON; +Memory sensitive
  Shuffle                          Yes+     Yes       Yes    +Memory sensitive
  Execute Reduce Function   Yes*                             *Depends on app
  Store Reduce Output       Yes*            Yes+      Yes    *If compression is ON; +For replication factor > 1




Cluster Level Tuning – CPU & Memory

  Map and Reduce task execution: Pushing Up CPU Utilization
  Tunables
  –           mapred.tasktracker.map.tasks.maximum: The maximum number of map tasks that will
              be run simultaneously by a task tracker (aka “map slots” / “M”).
  –           mapred.tasktracker.reduce.tasks.maximum: The maximum number of reduce tasks that
              will be run simultaneously by a task tracker (aka “reduce slots” / “R”).


  Thumb Rules for Tuning
  –           Over-subscribe cores (Set total “slots” > num cores)
  –           Throw more slots at the dominant phase.
  –           Don’t exceed the memory limit and hit swap! (Adjust Java heap via mapred.child.java.opts)
  –           Example:
          –           8 cores. Assume map tasks account for 75% of CPU time.
          –           Per Over-subscribing rule: Total Slots (M+R) = 10 (on 8 cores)
          –           Per Biasing rule: Create more Map Slots than Reduce Slots. E.g., M,R = (8, 2) or (7,3)
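
  The two thumb rules above can be sketched as a small calculation. This is only an illustration: the function name plan_slots and the oversubscription factor of 1.25 are assumptions, chosen so that 8 cores yield the 10 total slots used in the example.

```python
import math


def plan_slots(cores, map_cpu_share, oversubscription=1.25):
    """Return (map_slots, reduce_slots) for one task tracker.

    oversubscription is a hypothetical knob, not a Hadoop setting;
    1.25 reproduces the example's 10 slots on 8 cores.
    """
    # Over-subscribing rule: more total slots than cores.
    total = math.ceil(cores * oversubscription)
    # Biasing rule: give the dominant phase its share of the slots,
    # while keeping at least one slot for each phase.
    map_slots = min(total - 1, max(1, int(total * map_cpu_share)))
    return map_slots, total - map_slots


m, r = plan_slots(cores=8, map_cpu_share=0.75)
print(f"mapred.tasktracker.map.tasks.maximum={m}")
print(f"mapred.tasktracker.reduce.tasks.maximum={r}")
```

  With these inputs the sketch lands on (M, R) = (7, 3), one of the two splits suggested above. Whether (8, 2) does better is an empirical question, and the memory rule still has to be checked separately.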

Cluster Level Tuning – DFS Throughput

  DFS Data Read/Write: Pushing up throughput
  Tunables
  –           dfs.block.size: The default block size for new files (aka “DFS Block Size”).



  Thumb Rules for Tuning
  –           The default of 128 MB is normally a good size. Lower if disk-space is a crunch.
  –           Size it so that a map task is not served multiple blocks, which may forsake data locality.
  –           Alternately tailor the number of map tasks at the job level.
  –           Example:
          –           If the data sets that logically go to a single map are ~180-190 MB in size, set the block
                      size to 196 MB.
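
  A quick way to check a candidate block size against the example above is to count how many blocks one map's input would span. The helper names below are hypothetical; the only Hadoop fact relied on is that dfs.block.size is given in bytes.

```python
import math

MB = 1024 * 1024


def blocks_per_input(input_mb, block_mb):
    """Number of DFS blocks needed to hold one map's logical input."""
    return math.ceil(input_mb / block_mb)


def block_size_bytes(block_mb):
    """dfs.block.size is expressed in bytes."""
    return block_mb * MB


for block_mb in (128, 196):
    spans = [blocks_per_input(size, block_mb) for size in (180, 190)]
    print(f"block size {block_mb} MB -> blocks per input: {spans}")

print(f"dfs.block.size={block_size_bytes(196)}")
```

  At 128 MB each 180-190 MB input spans two blocks; at 196 MB it fits in one.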




Job Level Tuning – Task Granularity

  Setting optimal number of Map and Reduce tasks
  Tunables
  –         # map tasks in your job (“m”) – controlled via input splits.
  –         “mapred.reduce.tasks”: # reduce tasks in your job (“r”)



  Thumb Rules for Tuning
  –         Set # map tasks so that each reads approximately 1 DFS block worth of data.
  –         Use multiple “map waves”, to hide shuffle latency.
  –         Look for a “sweet range” of # of waves (this is empirical).
  # Reduce tasks:
  –         Use a single reducer wave. Second wave adds extra shuffle latency.
  –         Use multiple reducer waves, iff reducer task can’t scale in memory.


  Num “map waves” = Total # of map tasks / Total # of map slots in cluster
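
  The wave formula above is easy to evaluate for a concrete job. The sketch below assumes every node in the cluster runs a task tracker with the same number of map slots; the figures (800 map tasks, 64 nodes, 8 map slots per node) are illustrative.

```python
import math


def map_waves(total_map_tasks, nodes, map_slots_per_node):
    """Num map waves = total # of map tasks / total # of map slots in cluster."""
    total_slots = nodes * map_slots_per_node
    return total_map_tasks / total_slots


waves = map_waves(total_map_tasks=800, nodes=64, map_slots_per_node=8)
print(f"map waves: {waves:.2f}")
print(f"scheduling rounds needed: {math.ceil(waves)}")
```

  A fractional result such as 1.56 means the last wave only partly fills the cluster, which is one reason the "sweet range" has to be found empirically.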
Job Level Tuning – io.sort.mb

  Buffering to Minimize Disk Writes
  Tunables
  –         io.sort.mb Size of map-side buffer to store and merge map output before spilling to
            disk. (Map-side buffer)
  –         fs.inmemory.size.mb Size of reduce-side buffer for storing & merging multi-map
            output before spilling to disk. (Reduce-side buffer)


  Thumb Rules for Tuning
  –         Set these to ~70% of Java heap size. Pick heap sizes to utilize ~80% RAM across
            all processes (maps, reducers, TT, DN, other)
  –         Set it small enough to avoid swap activity, but
  –         Set it large enough to minimize disk spills.
  –         Ensure that io.sort.factor is set large enough to allow full use of buffer space.
  –         Balance space for output records (default 95%) & record meta-data (5%)
                  •   Use io.sort.spill.percent and io.sort.record.percent
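
  The 80% / 70% rules above can be turned into a rough sizing calculation. This is a sketch under stated assumptions: the function name, the 16 GB node, the 2 GB set aside for the TT and DN daemons, and the 10 task slots are all hypothetical, and it sizes for the worst case where every slot is busy.

```python
def size_buffers(ram_mb, daemon_mb, slots, ram_fraction=0.8, buffer_fraction=0.7):
    """Return (heap_mb, sort_mb) per task.

    ram_fraction:    share of physical RAM to hand to all processes (~80%).
    buffer_fraction: share of each task's heap given to the sort buffer (~70%).
    """
    usable_mb = int(ram_mb * ram_fraction)
    # Whatever the daemons do not use is split evenly across task slots.
    heap_mb = (usable_mb - daemon_mb) // slots
    sort_mb = int(heap_mb * buffer_fraction)
    return heap_mb, sort_mb


heap_mb, sort_mb = size_buffers(ram_mb=16384, daemon_mb=2048, slots=10)
print(f"mapred.child.java.opts=-Xmx{heap_mb}m")
print(f"io.sort.mb={sort_mb}")
```

  The result is a starting point only; watch for swap activity and spill counts and adjust from there.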
Job Level Tuning – Compression

  Compression: Trades off CPU cycles to reduce disk/network traffic.
  Tunables
  –         mapred.compress.map.output Should intermediate map output be compressed?
  –         mapred.output.compress Should final (reducer) output be compressed?



  Thumb Rules for Tuning
  –         Turn them on unless CPU is your bottleneck.
  –         Use BLOCK compression: Set mapred.(map).output.compression.type to BLOCK
  –         LZO does better than default (Zlib) – mapred.(map).output.compression.codec
  –         Try Intel® IPP libraries for even better compression speed on Intel platforms.


  Turn map output compression ON cluster-wide. Compression invariably improves
       performance of apps handling large data on modern multi-core systems.
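
  For a per-job trial, the compression tunables above can be passed as -D generic options. The helper below is a hypothetical convenience that only formats the arguments; it assumes the job's driver parses generic options (for example via ToolRunner). The codec property is left out because the right codec class depends on what is installed on the cluster.

```python
def to_generic_options(props):
    """Render properties as '-D key=value' arguments, sorted by key."""
    args = []
    for key in sorted(props):
        args += ["-D", f"{key}={props[key]}"]
    return args


compression_props = {
    "mapred.compress.map.output": "true",      # compress intermediate map output
    "mapred.output.compress": "true",          # compress final (reducer) output
    "mapred.output.compression.type": "BLOCK",
}

print(" ".join(to_generic_options(compression_props)))
```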


Tuning multiple parameters

  •           Multiple tunables for memory, CPU, disk and network.
  •           Only the prominent ones were covered here.
  •           Inter-dependent. Can’t tune them independently.
  •           Meta rules to help multi-tune:
          -           Avoid swap. Cost of swapping is high.
          -           Minimize spills. Spilling is not as evil as swapping.
          -           It generally pays to compress and to over-subscribe cores.
  •           Several other tunable parameters exist. Look them up in config/
          –           core-default.xml, mapred-default.xml, hdfs-default.xml
          –           core-site.xml, mapred-site.xml, hdfs-site.xml




Sample tuning gains for a 60-job app pipeline
                        (“Mini Webmap on 64 node cluster”)

   Setting     #Maps (m)                  #Reduces (r)   M,R slots   io.sort.mb   Job exec time (sec)   Improvement over Baseline
   Baseline    Two Heaviest Apps: 1215    243            4,4         500          7682                  -
               All Other Apps: 243
   Tuned1      Two Heaviest Apps: 800     243            8,3         1000         7084                  7.78%
               All Other Apps: 243
   Tuned2      Two Heaviest Apps: 800     200            8,3         1000         6496                  15.43%
               All Other Apps: 200
   Tuned3      Two Heaviest Apps: 800     150            8,3         1000         5689                  22.42%
               All Other Apps: 150

   Contribution to improvement:
               major                      moderate       moderate    minor


Acknowledgements

  Many of the observations presented here came as learnings and insights from
  •         Webmap Performance Engineers @ Y!
          –           Mahadevan Iyer, Arvind Murthy, Rohit Jalan
  •         Grid Performance Engineers @ Y!
          –           Rajesh Balamohan, Harish Mallipeddi, Janardhana Reddy
  •         Hadoop Dev Engineers @ Y!
          –           Devaraj Das, Jothi Padmanabhan, Hemanth Yamijala




  Questions: sriguru@yahoo-inc.com





Más contenido relacionado

La actualidad más candente

Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopMike Pittaro
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Emilio Coppa
 
Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsDataWorks Summit
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideDouglas Bernardini
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopDataWorks Summit
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanNarayana B
 
Hadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHanborq Inc.
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataDataWorks Summit
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated ArchitectureDatabricks
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataDataWorks Summit
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 

La actualidad más candente (20)

Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for Hadoop
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
 
Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of Tradeoffs
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config Guide
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for Hadoop
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
Hadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep Insight
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big Data
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 

Similar a Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application

MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...Cloudera, Inc.
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloudelliando dias
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and DeploymentCisco Canada
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopHadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopYahoo Developer Network
 
SCM dashobard using Hadoop, Mongodb, Django
SCM dashobard using Hadoop, Mongodb, DjangoSCM dashobard using Hadoop, Mongodb, Django
SCM dashobard using Hadoop, Mongodb, Djangoprakash_ranade
 
SCM Dashboard
SCM DashboardSCM Dashboard
SCM DashboardPerforce
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 

Similar a Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application (20)

MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
Hadoop World 2011: Proven Tools to Manage Hadoop Environments - Joey Jablonsk...
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloud
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopHadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
 
Hadoop by sunitha
Hadoop by sunithaHadoop by sunitha
Hadoop by sunitha
 
SCM dashobard using Hadoop, Mongodb, Django
SCM dashobard using Hadoop, Mongodb, DjangoSCM dashobard using Hadoop, Mongodb, Django
SCM dashobard using Hadoop, Mongodb, Django
 
SCM Dashboard
SCM DashboardSCM Dashboard
SCM Dashboard
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Presentation
PresentationPresentation
Presentation
 

Más de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application

  • 1. Tuning Hadoop for Performance. Srigurunath Chakravarthi, Performance Engineering, Yahoo! Bangalore. Doc Ver 1.0, March 5, 2010. Yahoo! Confidential
  • 2. Outline
      • Why worry about performance?
      • Recap of Hadoop Design
          – Control Flow (Map, Shuffle, Reduce phases)
      • Key performance considerations
      • Thumb rules for tuning Hadoop
          – Cluster level
          – Application level
      • Wrap up
  • 3. Why Worry About Performance?
      Why Measure/Track Performance?
      • Tells you the ROI on your hardware.
      • Surfaces silent performance regressions from
          – Faulty and “slow” (malfunctioning) disks/NICs/CPUs
          – Software/configuration upgrades, etc.
      Why Improve Performance?
      • Faster results and better ROI
      • There are non-obvious, yet simple ways to
          – Push up cluster/app performance without adding hardware
          – Unlock cluster/app performance by mitigating bottlenecks
      And the good news is: Hadoop is designed to be tunable by users
          – 25+ performance-influencing tunable parameters
          – Cluster-wide and job-specific controls
  • 4. Recap of Hadoop Design [Diagram: Task Trackers running Map and Reduce tasks, with HDFS and Local Disk storage.]
  • 5. Key Performance Influencing Factors: Multiple Orthogonal Factors
      • Cluster hardware configuration
          – # cores; RAM; # disks per node; disk speeds; network topology, etc.
          – Example: If your app is data intensive, can you drive sufficiently good disk throughput? Do you have sufficient RAM (to decrease # trips to disk)?
      • Application logic
          – Degree of parallelism: M-R favors embarrassingly parallel apps.
          – Load balance: the slowest tasks determine M-R job completion time.
      • System bottlenecks
          – Thrashing your CPU/memory/disks/network degrades performance severely.
      • Resource under-utilization
          – Your app may not be pushing system limits enough.
      • Scale
          – Bottlenecks from centralized components (Job Tracker and Name Node).
  • 6. Key Performance Influencing Factors: Tuning Opportunities
      • Cluster hardware configuration
          – A decision made at hardware purchase/upgrade time. (Outside the scope of this presentation.)
      • Application logic
          – Tied to app logic. (Outside the scope of this presentation.)
          – Countering load imbalance: typically mitigated by adapting the user algorithm to avoid “long tails”. Examples: re-partitioning; imposing per-task hard limits on input/output sizes.
          – Handling non-parallelism: run the app as a pipeline of M-R jobs, with sequential portions as single reducers.
          – Record combining: map-side and reduce-side combiners.
      • System bottlenecks and resource under-utilization
          – These can be mitigated by tuning Hadoop (discussed further in the following slides).
      • Scale
          – Relevant to large (1000+ node) clusters. (Outside the scope of this presentation.)
  • 7. System Usage Characteristics: Resource Intensiveness
      M-R Step                | CPU  | Memory | Network | Disk | Notes
      Serve Map Input         |      |        | Yes*    | Yes  | *For remote maps (minority)
      Execute Map Function    | Yes* |        |         |      | *Depends on app
      Store Map Output        | Yes* | Yes+   |         | Yes  | *If compression is ON; +Memory sensitive
      Shuffle                 |      | Yes+   | Yes     | Yes  | +Memory sensitive
      Execute Reduce Function | Yes* |        |         |      | *Depends on app
      Store Reduce Output     | Yes* |        | Yes+    | Yes  | *If compression is ON; +For replication factor > 1
  • 8. Cluster Level Tuning: CPU & Memory
      Map and Reduce task execution: pushing up CPU utilization
      Tunables
          – mapred.tasktracker.map.tasks.maximum: The maximum number of map tasks that a task tracker will run simultaneously (aka “map slots” / “M”).
          – mapred.tasktracker.reduce.tasks.maximum: The maximum number of reduce tasks that a task tracker will run simultaneously (aka “reduce slots” / “R”).
      Thumb Rules for Tuning
          – Over-subscribe cores (set total “slots” > number of cores).
          – Throw more slots at the dominant phase.
          – Don’t exceed the memory limit and hit swap! (Adjust the Java heap via mapred.child.java.opts.)
      Example
          – 8 cores. Assume map tasks account for 75% of CPU time.
          – Per the over-subscribing rule: total slots (M+R) = 10 (on 8 cores).
          – Per the biasing rule: create more map slots than reduce slots, e.g., (M, R) = (8, 2) or (7, 3).
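The two slot rules on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `plan_slots` and the 1.25 over-subscription factor are assumptions (the factor is picked so that it reproduces the slide's "10 slots on 8 cores" example), not values prescribed by Hadoop.

```python
def plan_slots(cores, map_cpu_share, oversubscription=1.25):
    """Split task slots between maps and reduces on one node.

    cores            -- number of CPU cores on the node
    map_cpu_share    -- fraction of CPU time spent in map tasks (0..1)
    oversubscription -- assumed ratio of total slots to cores (> 1)
    """
    # Over-subscribing rule: more slots than cores.
    total_slots = int(cores * oversubscription + 0.5)
    # Biasing rule: give the dominant phase more slots,
    # while keeping at least one slot for each phase.
    map_slots = int(total_slots * map_cpu_share + 0.5)
    map_slots = max(1, min(total_slots - 1, map_slots))
    reduce_slots = total_slots - map_slots
    return map_slots, reduce_slots


# The slide's example: 8 cores, maps take ~75% of CPU time.
m, r = plan_slots(8, 0.75)
print(f"mapred.tasktracker.map.tasks.maximum={m}")
print(f"mapred.tasktracker.reduce.tasks.maximum={r}")
```

Whatever split comes out still has to be checked against memory: the number of slots multiplied by the per-task Java heap must fit in RAM without swapping.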
  • 9. Cluster Level Tuning: DFS Throughput
      DFS data read/write: pushing up throughput
      Tunables
          – dfs.block.size: The default block size for new files (aka “DFS block size”).
      Thumb Rules for Tuning
          – The default of 128 MB is normally a good size. Lower it if disk space is a constraint.
          – Size it to avoid serving multiple blocks to a map task, which may forsake data locality.
          – Alternatively, tailor the number of map tasks at the job level.
      Example
          – If the data sets that logically go to a single map are ~180-190 MB in size, set the block size to 196 MB.
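A small sketch of the arithmetic behind the block-size example: it counts how many DFS blocks a single map's input would span for a given block size. The function name `blocks_per_map_input` is hypothetical, and the sketch assumes the input starts on a block boundary.

```python
import math


def blocks_per_map_input(input_mb, block_mb):
    """Number of DFS blocks needed to hold one map's input."""
    return math.ceil(input_mb / block_mb)


# A ~190 MB logical data set under the two block sizes from the slide.
for block_mb in (128, 196):
    count = blocks_per_map_input(190, block_mb)
    print(f"dfs.block.size={block_mb} MB -> {count} block(s) per map")
```

With the larger block size the whole input fits in one block, so the map task can be scheduled next to all of its data.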
  • 10. Job Level Tuning: Task Granularity
      Setting the optimal number of map and reduce tasks
      Tunables
          – # map tasks in your job (“m”): controlled via input splits.
          – mapred.reduce.tasks: # reduce tasks in your job (“r”).
      Thumb Rules for Tuning
          – Set # map tasks so that each reads approximately 1 DFS block worth of data.
          – Use multiple “map waves” to hide shuffle latency.
          – Look for a “sweet range” of # of waves (this is empirical).
        # Reduce tasks:
          – Use a single reducer wave. A second wave adds extra shuffle latency.
          – Use multiple reducer waves only if a reducer task can’t scale in memory.
      Num “map waves” = Total # of map tasks / Total # of map slots in cluster
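The wave formula can be tried out on the figures from the sample tuning table later in this deck (64 nodes; 1215 or 800 maps; 4 or 8 map slots; 3 reduce slots). The sketch below assumes every one of the 64 nodes runs a task tracker with the listed slot counts, which the slides do not state explicitly; the function name `waves` is hypothetical.

```python
def waves(total_tasks, nodes, slots_per_node):
    """Number of task waves = total tasks / total slots in the cluster."""
    return total_tasks / (nodes * slots_per_node)


NODES = 64  # cluster size quoted for the sample pipeline

# Map waves for the two heaviest apps, before and after tuning.
print("baseline map waves:", waves(1215, NODES, 4))
print("tuned map waves:   ", waves(800, NODES, 8))

# Reducer waves with 3 reduce slots per node: a value above 1.0 means
# some reducers have to wait for a second wave.
for reducers in (243, 200, 150):
    print(f"r={reducers}: reducer waves = {waves(reducers, NODES, 3):.2f}")
```

Under these assumptions only r=150 fits in a single reducer wave, which is in line with the "use a single reducer wave" rule.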
  • 11. Job Level Tuning: io.sort.mb
      Buffering to minimize disk writes
      Tunables
          – io.sort.mb: Size of the map-side buffer used to store and merge map output before spilling to disk.
          – fs.inmemory.size.mb: Size of the reduce-side buffer used to store and merge output from multiple maps before spilling to disk.
      Thumb Rules for Tuning
          – Set these to ~70% of the Java heap size. Pick heap sizes to utilize ~80% of RAM across all processes (maps, reducers, TT, DN, other).
          – Set them small enough to avoid swap activity, but large enough to minimize disk spills.
          – Ensure that io.sort.factor is set large enough to allow full use of buffer space.
          – Balance space for output records (default 95%) and record meta-data (5%) using io.sort.spill.percent and io.sort.record.percent.
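The 80% and 70% rules can be combined into a rough sizing sketch. The node figures used here (16 GB of RAM, 10 task slots, 2 GB set aside for the TaskTracker, DataNode and other processes) are illustrative assumptions, and `size_buffers` is a hypothetical helper, not part of Hadoop.

```python
def size_buffers(ram_mb, task_slots, daemon_mb, ram_pct=80, buffer_pct=70):
    """Return (heap_mb, sort_mb) for one task slot.

    ram_pct    -- share of RAM to use across all processes (~80%)
    buffer_pct -- share of the task heap given to the sort buffer (~70%)
    """
    budget_mb = ram_mb * ram_pct // 100            # RAM we allow ourselves
    heap_mb = (budget_mb - daemon_mb) // task_slots  # per-task Java heap
    sort_mb = heap_mb * buffer_pct // 100          # buffer inside that heap
    return heap_mb, sort_mb


# Assumed node: 16 GB RAM, 10 slots (M+R), 2 GB for TT, DN and other processes.
heap_mb, sort_mb = size_buffers(16384, 10, 2048)
print(f"mapred.child.java.opts=-Xmx{heap_mb}m")
print(f"io.sort.mb={sort_mb}")
```

These numbers are a starting point only; the slide's advice is to then watch for swap activity and disk spills and adjust.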
  • 12. Job Level Tuning: Compression
      Compression trades off CPU cycles to reduce disk/network traffic.
      Tunables
          – mapred.compress.map.output: Should intermediate map output be compressed?
          – mapred.output.compress: Should final (reducer) output be compressed?
      Thumb Rules for Tuning
          – Turn them on unless CPU is your bottleneck.
          – Use BLOCK compression: set mapred.(map).output.compression.type to BLOCK.
          – LZO does better than the default (Zlib): set it via mapred.(map).output.compression.codec.
          – Try Intel® IPP libraries for even better compression speed on Intel platforms.
      Turn map output compression ON cluster-wide. Compression invariably improves performance of apps handling large data on modern multi-core systems.
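As a sketch of how these settings might be passed to a single job, the snippet below renders them as `-D` generic options. It assumes the job's driver parses generic options (for example via ToolRunner) and that an LZO codec library is installed separately on the cluster; the codec class name shown comes from the third-party hadoop-lzo project and is an assumption about your environment.

```python
# Compression settings from the slide, expressed as job properties.
COMPRESSION_PROPS = {
    "mapred.compress.map.output": "true",
    # Assumed: LZO codec class provided by a separately installed library.
    "mapred.map.output.compression.codec": "com.hadoop.compression.lzo.LzoCodec",
    "mapred.output.compress": "true",
    "mapred.output.compression.type": "BLOCK",
}


def as_generic_options(props):
    """Render properties as -Dkey=value arguments for a job command line."""
    return [f"-D{key}={value}" for key, value in props.items()]


print(" ".join(as_generic_options(COMPRESSION_PROPS)))
```

The same key/value pairs could instead go into mapred-site.xml to apply them cluster-wide, as the slide recommends for map output compression.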
  • 13. Tuning Multiple Parameters
      • There are multiple tunables for memory, CPU, disk and network; only the prominent ones were covered here.
      • They are inter-dependent and can’t be tuned independently.
      • Meta rules to help multi-tune:
          – Avoid swap. The cost of swapping is high.
          – Minimize spills. Spilling is not as evil as swapping.
          – It generally pays to compress and to over-subscribe cores.
      • Several other tunable parameters exist. Look them up in config/
          – core-default.xml, mapred-default.xml, hdfs-default.xml
          – core-site.xml, mapred-site.xml, hdfs-site.xml
  • 14. Sample tuning gains for a 60-job app pipeline (“Mini Webmap on 64 node cluster”)
      Setting  | #Maps (m)                                    | #Reduces (r) | M,R slots | io.sort.mb | Job exec time (sec) | Improvement over Baseline
      Baseline | Two heaviest apps: 1215; all other apps: 243 | 243          | 4,4       | 500        | 7682                | -
      Tuned1   | Two heaviest apps: 800; all other apps: 243  | 243          | 8,3       | 1000       | 7084                | 7.78%
      Tuned2   | Two heaviest apps: 800; all other apps: 200  | 200          | 8,3       | 1000       | 6496                | 15.43%
      Tuned3   | Two heaviest apps: 800; all other apps: 150  | 150          | 8,3       | 1000       | 5689                | 22.42%
      Contribution to improvement | major             | moderate     | moderate  | minor      |                     |
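The improvement column can be recomputed from the listed execution times as (baseline - tuned) / baseline. In the sketch below, Tuned1 reproduces the slide's 7.78% and Tuned2 comes out at about 15.44% (the slide's 15.43% looks truncated rather than rounded). For Tuned3 the listed time of 5689 s gives about 25.94%, which does not match the 22.42% on the slide; the source does not indicate which of the two figures is the intended one.

```python
BASELINE_SEC = 7682
RUNS = {"Tuned1": 7084, "Tuned2": 6496, "Tuned3": 5689}


def improvement_pct(baseline_sec, tuned_sec):
    """Percentage reduction in job execution time relative to the baseline."""
    return 100.0 * (baseline_sec - tuned_sec) / baseline_sec


for name, seconds in RUNS.items():
    print(f"{name}: {improvement_pct(BASELINE_SEC, seconds):.2f}%")
```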
  • 15. Acknowledgements
      Many of the observations presented here came as learnings and insights from
      • Webmap Performance Engineers @ Y!: Mahadevan Iyer, Arvind Murthy, Rohit Jalan
      • Grid Performance Engineers @ Y!: Rajesh Balamohan, Harish Mallipeddi, Janardhana Reddy
      • Hadoop Dev Engineers @ Y!: Devaraj Das, Jothi Padmanabhan, Hemanth Yamijala
      Questions: sriguru@yahoo-inc.com