Monitoring and Troubleshooting
  7/6/2012

© 2012 MapR Technologies   Troubleshooting 1
Monitoring & Troubleshooting
   Agenda
   • Cluster Monitoring Tools
   • Troubleshooting MapReduce Jobs
   • Troubleshooting Scenarios
   • Working with MapR Support
   • Things to Avoid




Monitoring & Troubleshooting
   Objectives
   At the end of this module you will be able to:
   • Identify the tools you can use to monitor your cluster
   • Explain how MapR central logging can help you monitor MapReduce jobs
   • Describe several common troubleshooting scenarios and how to resolve
     issues based on these scenarios
   • List the tools you can use to work with MapR Support




Cluster Monitoring Tools




Monitoring Tools

         Built-In Tools
          – MapR Control System
          – MapR Metrics

         3rd Party Tools
          – Nagios
          – Ganglia




MapR Control System

         MapR Control System
          –   Dashboard with cluster overview
              • Node health
              • MapR-FS and available disks
              • Resource utilization
                  –   bandwidth
                  –   disk space
                  –   CPU
              • MapReduce job status
              • Alarms




MapR Metrics

         MapR Metrics
          –   View performance information about Hadoop jobs
              • Predict cluster usage
              • Measure which jobs consume resources
              • Troubleshoot failures & performance issues
          –   Metrics provided on
              •   Cumulative CPU/memory usage
              •   # of running/failed tasks/attempts
              •   Speed of input, output, and shuffle
              •   Duration of task attempts
              •   Data read, written, or shuffled
              •   Memory in use
              •   Number of records skipped/spilled

3rd Party Tools

          Nagios
           –   Configuration script generator
          Ganglia
           –   The CLDB reports the cluster metrics
           –   Uses MapRGangliaContext
           –   gmond is only needed on the CLDB node




MapR Service Logs

          /opt/mapr/logs
          For example:
           – CLDB
           – Warden
           – FileServer (mfs)
           – NFS
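
As a rough illustration of working with the service logs under /opt/mapr/logs, the Python sketch below counts the lines that contain a marker string in each log file of a directory. The function name, the `*.log` glob, and the `ERROR` marker are assumptions made for this example; actual file names and message formats depend on the service and release.

```python
from pathlib import Path


def count_marker_lines(log_dir, marker="ERROR", pattern="*.log"):
    """Return {file name: number of lines containing marker} for log_dir.

    The marker and the glob pattern are illustrative assumptions, not a
    documented MapR log format.
    """
    counts = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes.
        with path.open("r", errors="replace") as handle:
            counts[path.name] = sum(1 for line in handle if marker in line)
    return counts
```

Calling `count_marker_lines("/opt/mapr/logs")` on a node would give a quick per-file tally to decide which service log to read first.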




Troubleshooting MapReduce Jobs

Central Logging

          MapR 2.0 introduces central logging
           –   Log files written to “local” volume on MapR-FS
               •   replication factor = 1
                   –   I/O confined to node
           – /var/mapr/local/<host>/logs/mapred/userlogs
           – Configurable via JobTracker variable
               •   mapr.localvolumes.path




Central Logging

          New CLI for MapReduce logs
               maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [ -jobconf <pathToJobXml> ]
           – Creates a job-centric view of all logs on all involved TaskTracker nodes
           – Creates the following structure under <maprfsDir> for every <jobid>
             matching <jobPattern>
               •   <jobid>/hosts/<host>/
                   –   symbolic links to log directories of tasks executed for <jobid> on <host>
               •   <jobid>/mappers/
                   –   symbolic links to log directories of all map task attempts for <jobid> across the
                       cluster
               •   <jobid>/reducers/
                   –   symbolic links to log directories of all reduce task attempts for <jobid> across the
                       cluster
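
The command syntax above can be wrapped in a small helper when log linking is scripted. This is a minimal Python sketch, assuming `maprcli` is on the PATH of the node where it runs; the function names are hypothetical, and only the three options shown on the slide (`-jobid`, `-todir`, `-jobconf`) are used.

```python
import subprocess


def linklogs_command(job_pattern, to_dir, job_conf=None):
    """Build the argument list for `maprcli job linklogs`.

    Only the options shown on the slide are emitted.
    """
    if not job_pattern or not to_dir:
        raise ValueError("job_pattern and to_dir are required")
    argv = ["maprcli", "job", "linklogs",
            "-jobid", job_pattern, "-todir", to_dir]
    if job_conf:
        argv += ["-jobconf", job_conf]
    return argv


def link_job_logs(job_pattern, to_dir, job_conf=None):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(linklogs_command(job_pattern, to_dir, job_conf),
                          check=True)
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.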


Troubleshooting Scenarios

          Slow nodes
          Out of memory
          Out of disk space
          Time skew
          No ZooKeeper quorum
          Contention for resources
          Requirements not met




Identifying Slow Nodes

          Before installation:
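           –   Use dd to benchmark read/write speed (the write test overwrites the disk)
               •   Write: dd bs=4M if=/dev/zero of=/dev/sd<x>
               •   Read:  dd bs=4M if=/dev/sd<x> of=/dev/null

           –   Compare performance across nodes to test network throughput
               (interrupt with Ctrl-C, or add count=<n> to bound the transfer):
               •   dd bs=4M if=/dev/zero | sudo ssh root@node 'dd bs=4M of=/dev/null'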

          After installation:
           – Look at task starting and completion times
           – Look in system logs for memory or CPU problems
           – Look at the performance of writes to the local volume
             (where intermediate data goes)
          Slow disks are identified based on a threshold in mfs.conf
           –   The underlying cause may actually be a slow NIC
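
One way to act on the "look at task starting and completion times" advice is to compare per-node task durations. The Python sketch below is an illustration only: the input format, the function name, and the 1.5x factor are assumptions, not MapR defaults, and the durations would have to be gathered from job history or logs by other means.

```python
from statistics import median


def find_slow_nodes(durations_by_node, factor=1.5):
    """Return nodes whose median task duration exceeds factor x the
    median of all per-node medians.

    durations_by_node maps a host name to a list of task durations in
    seconds (completion time minus start time). The factor is an
    arbitrary illustrative threshold.
    """
    node_medians = {
        node: median(values)
        for node, values in durations_by_node.items()
        if values  # skip nodes that ran no tasks
    }
    if not node_medians:
        return []
    cluster_median = median(node_medians.values())
    return sorted(
        node for node, value in node_medians.items()
        if value > factor * cluster_median
    )
```

With the hypothetical input `{"node1": [30, 32, 31], "node2": [29, 33, 30], "node3": [70, 65, 80]}` the per-node medians are 31, 30, and 70, so only `node3` exceeds 1.5 times the cluster median of 31.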


Out of Memory

          Make sure there is enough swap space
          See if a memory-intensive job is running
          Use ulimit to make sure there are no limits on the number of file
           descriptors, resource usage, and the number of processes
          Excessive JVM garbage collection can surface as out-of-memory errors
           (for example, "GC overhead limit exceeded")
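
The limits that `ulimit` reports can also be read programmatically. The following Python sketch uses the standard `resource` module (Unix only) to show the soft and hard limits for open files, processes, and address space; the choice of these three limits and the helper names are assumptions for illustration.

```python
import resource


def current_limits():
    """Return {limit name: (soft, hard)} for the calling process."""
    names = ["RLIMIT_NOFILE", "RLIMIT_NPROC", "RLIMIT_AS"]
    limits = {}
    for name in names:
        # Not every platform defines all of these constants.
        constant = getattr(resource, name, None)
        if constant is not None:
            limits[name] = resource.getrlimit(constant)
    return limits


def describe(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report():
    """Return one human-readable line per limit."""
    return [
        f"{name}: soft={describe(soft)} hard={describe(hard)}"
        for name, (soft, hard) in current_limits().items()
    ]
```

The values apply to the process that runs the script, so it should be run as the same user, and from the same kind of session, as the service being investigated.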




Out of Disk Space

          MapR logs go to /opt/mapr/logs
           – If this partition is too small, space can run out
           – Set up a cron job to clean out old logs
           – Move to a larger partition
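
A cron job for log cleanup needs two pieces of information: how full the partition is and which files are old enough to go. This Python sketch shows both under stated assumptions: the function names and the age threshold are hypothetical, and the code only lists candidates rather than deleting anything.

```python
import os
import shutil
import time


def free_fraction(path):
    """Return the fraction of the partition holding path that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def old_log_files(log_dir, max_age_days, now=None):
    """Return files under log_dir last modified more than max_age_days ago.

    Only lists candidates; removing them is left to the operator.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            full = os.path.join(root, name)
            if os.path.getmtime(full) < cutoff:
                stale.append(full)
    return sorted(stale)
```

Reviewing the output of `old_log_files("/opt/mapr/logs", 30)` before wiring any deletion into cron avoids removing logs that MapR Support may still ask for.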




Time Skew

          NTP is your friend
          A 20-second differential between node clocks is the maximum allowed
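
The 20-second limit can be checked once clock readings have been collected from each node. This Python sketch assumes the readings are epoch timestamps sampled at roughly the same instant (how they are gathered is outside the example) and uses the median reading as the reference clock, which is an illustrative choice.

```python
from statistics import median

MAX_SKEW_SECONDS = 20  # maximum differential stated on the slide


def skewed_nodes(clock_by_node, max_skew=MAX_SKEW_SECONDS):
    """Return {node: offset in seconds} for nodes whose clock differs
    from the median reading by more than max_skew.

    clock_by_node maps a host name to an epoch timestamp.
    """
    if not clock_by_node:
        return {}
    reference = median(clock_by_node.values())
    return {
        node: reading - reference
        for node, reading in clock_by_node.items()
        if abs(reading - reference) > max_skew
    }
```

For the hypothetical readings `{"n1": 1000.0, "n2": 1003.0, "n3": 1045.0}` the reference is 1003.0, so only `n3` is reported, with an offset of 42.0 seconds.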




No ZooKeeper Quorum

          Not enough ZooKeepers running
          configure.sh run improperly
           –   Different ZooKeeper or CLDB nodes specified
          Network problem
           –   Hostname resolution
           –   Physical connection down
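
"Not enough ZooKeepers running" comes down to simple arithmetic: a ZooKeeper ensemble needs a strict majority of its servers to be up. The Python sketch below works that out; the function names are hypothetical.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def has_quorum(running, ensemble_size):
    """True if the running servers form a majority of the ensemble."""
    return running >= quorum_size(ensemble_size)


def tolerated_failures(ensemble_size):
    """How many servers can be down while a quorum still exists."""
    return ensemble_size - quorum_size(ensemble_size)
```

A three-node ensemble needs two servers and tolerates one failure; a five-node ensemble needs three and tolerates two. An even-sized ensemble tolerates no more failures than the next smaller odd size, which is why odd sizes are the usual choice.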




Contention for Resources

          Make sure there’s no limit on file descriptors, processes
          Make sure the service layout follows good guidelines
           – Don’t run ZooKeeper with CLDB or JobTracker
           – Fewer task slots when running TaskTracker with CLDB or ZooKeeper
           – Avoid running the active JobTracker on the primary CLDB node

        Don’t run unrelated software on cluster nodes
        Don’t mix distributions




Requirements Not Met

          Use Sun Java JDK
          Same users/groups with same UID/GID numbers on all nodes
          Proper licensing
          Host resolution between all nodes
           –   DNS or /etc/hosts
        Keyless ssh between all nodes for the root user
        All necessary ports open
           –   Watch out for iptables and selinux
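
The "same UID/GID numbers on all nodes" requirement lends itself to an automated comparison. This Python sketch assumes the passwd-format text of each node has already been collected (for example, over the keyless ssh mentioned above) and reports users whose numeric IDs differ or who are missing on some node. The function names and sample users are hypothetical.

```python
def parse_passwd(text):
    """Map user name -> (uid, gid) from passwd-format text."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # not a well-formed passwd entry
        users[fields[0]] = (int(fields[2]), int(fields[3]))
    return users


def inconsistent_users(passwd_by_node):
    """Return {user: {node: (uid, gid) or None}} for users whose IDs
    differ between nodes or who are absent from at least one node."""
    parsed = {node: parse_passwd(text)
              for node, text in passwd_by_node.items()}
    all_users = set().union(*parsed.values())
    problems = {}
    for user in sorted(all_users):
        seen = {node: users.get(user) for node, users in parsed.items()}
        if len(set(seen.values())) > 1:
            problems[user] = seen
    return problems
```

An empty result means every user found on any node has the same UID and GID on all of them; group consistency in /etc/group would need a similar check.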




Working with MapR Support

          mapr-support-collect and mapr-support-dump
          fsck and gfsck




Things to Avoid

          Removing ZooKeeper data manually
          Formatting disks (unless you are sure)
          Running configure.sh incorrectly
          Using dd on an installed node
          Modifying configuration files
           – Without a good reason
           – Inconsistently




Questions




© 2012 MapR Technologies   Troubleshooting 28

Más contenido relacionado

La actualidad más candente

NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsJason Shao
 
MapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR HadoopMapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR Hadoopabord
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityEdureka!
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distributionmcsrivas
 
Spark tunning in Apache Kylin
Spark tunning in Apache KylinSpark tunning in Apache Kylin
Spark tunning in Apache KylinShi Shao Feng
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to YarnOmid Vahdaty
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterAltoros
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7Ted Dunning
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Shivkumar Babshetty
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoopmcsrivas
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HAHortonworks
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranMapR Technologies
 
Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014Tsuyoshi OZAWA
 

La actualidad más candente (20)

12a architecture
12a architecture12a architecture
12a architecture
 
Hadoop Internals
Hadoop InternalsHadoop Internals
Hadoop Internals
 
10c introduction
10c introduction10c introduction
10c introduction
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
MapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR HadoopMapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR Hadoop
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distribution
 
Spark tunning in Apache Kylin
Spark tunning in Apache KylinSpark tunning in Apache Kylin
Spark tunning in Apache Kylin
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop Cluster
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoop
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HA
 
Anatomy of Hadoop YARN
Anatomy of Hadoop YARNAnatomy of Hadoop YARN
Anatomy of Hadoop YARN
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014
 

Destacado

Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008athusoo
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 HiveNamit Jain
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on HadoopPaco Nathan
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 

Destacado (8)

Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 

Similar a 70a monitoring & troubleshooting

10c introduction
10c introduction10c introduction
10c introductionInyoung Cho
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Taming Latency: Case Studies in MapReduce Data Analytics
Taming Latency: Case Studies in MapReduce Data AnalyticsTaming Latency: Case Studies in MapReduce Data Analytics
Taming Latency: Case Studies in MapReduce Data AnalyticsEMC
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecturesaipriyacoool
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Etu Solution
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...Big Data Montreal
 
BDAS RDD study report v1.2
BDAS RDD study report v1.2BDAS RDD study report v1.2
BDAS RDD study report v1.2Stefanie Zhao
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillTomer Shiran
 
Coredns nodecache - A highly-available Node-cache DNS server
Coredns nodecache - A highly-available Node-cache DNS serverCoredns nodecache - A highly-available Node-cache DNS server
Coredns nodecache - A highly-available Node-cache DNS serverYann Hamon
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around HadoopDataWorks Summit
 

Similar a 70a monitoring & troubleshooting (20)

48a tuning
48a tuning48a tuning
48a tuning
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
22 configuration
22 configuration22 configuration
22 configuration
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
10c introduction
10c introduction10c introduction
10c introduction
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Taming Latency: Case Studies in MapReduce Data Analytics
Taming Latency: Case Studies in MapReduce Data AnalyticsTaming Latency: Case Studies in MapReduce Data Analytics
Taming Latency: Case Studies in MapReduce Data Analytics
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
HBase with MapR
HBase with MapRHBase with MapR
HBase with MapR
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
 
BDAS RDD study report v1.2
BDAS RDD study report v1.2BDAS RDD study report v1.2
BDAS RDD study report v1.2
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Coredns nodecache - A highly-available Node-cache DNS server
Coredns nodecache - A highly-available Node-cache DNS serverCoredns nodecache - A highly-available Node-cache DNS server
Coredns nodecache - A highly-available Node-cache DNS server
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
 

Más de mapr-academy

42 lab-managing services
42 lab-managing services42 lab-managing services
42 lab-managing servicesmapr-academy
 
41a managing services
41a managing services41a managing services
41a managing servicesmapr-academy
 
30a accessing your cluster
30a accessing your cluster30a accessing your cluster
30a accessing your clustermapr-academy
 
3 map r installation & setup administration course description
3 map r installation & setup administration course description3 map r installation & setup administration course description
3 map r installation & setup administration course descriptionmapr-academy
 

Más de mapr-academy (8)

53 lab-nfs
53 lab-nfs53 lab-nfs
53 lab-nfs
 
51 lab-volumes
51 lab-volumes51 lab-volumes
51 lab-volumes
 
50a volumes
50a volumes50a volumes
50a volumes
 
42 lab-managing services
42 lab-managing services42 lab-managing services
42 lab-managing services
 
41a managing services
41a managing services41a managing services
41a managing services
 
30a accessing your cluster
30a accessing your cluster30a accessing your cluster
30a accessing your cluster
 
14 lab-planing
14 lab-planing14 lab-planing
14 lab-planing
 
3 map r installation & setup administration course description
3 map r installation & setup administration course description3 map r installation & setup administration course description
3 map r installation & setup administration course description
 

Último

Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 

Último (20)

Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...

70a monitoring & troubleshooting

  • 1. Monitoring and Troubleshooting 7/6/2012 © 2012 MapR Technologies Troubleshooting 1
  • 2. Monitoring & Troubleshooting: Agenda
      • Cluster Monitoring Tools
      • Troubleshooting MapReduce Jobs
      • Troubleshooting Scenarios
      • Working with MapR Support
      • Things to Avoid
  • 3. Monitoring & Troubleshooting: Objectives
      At the end of this module you will be able to:
      • Identify the tools you can use to monitor your cluster
      • Explain how MapR central logging can help you monitor MapReduce jobs
      • Describe several common troubleshooting scenarios and how to resolve issues based on these scenarios
      • List the tools you can use to work with MapR Support
  • 4. Cluster Monitoring Tools
  • 5. Monitoring Tools
      • Built-In Tools
          – MapR Control System
          – MapR Metrics
      • 3rd Party Tools
          – Nagios
          – Ganglia
  • 6. MapR Control System
      • Dashboard with cluster overview
          – Node health
          – MapR-FS and available disks
          – Resource utilization
              • bandwidth
              • disk space
              • CPU
          – MapReduce job status
          – Alarms
  • 7. MapR Control System
  • 8. MapR Metrics
      • View performance information about Hadoop jobs
          – Predict cluster usage
          – Measure which jobs consume resources
          – Troubleshoot failures & performance issues
      • Metrics provided on
          – Cumulative CPU/memory usage
          – # of running/failed tasks/attempts
          – Speed of input, output, and shuffle
          – Duration of task attempts
          – Data read, written, or shuffled
          – Memory in use
          – Number of records skipped/spilled
  • 9. MapR Metrics
  • 10. 3rd Party Tools
      • Nagios
          – Configuration script generator
      • Ganglia
          – The CLDB reports the metrics
          – MapRGangliaContext
          – Only need gmond on CLDB node
  • 11. MapR Service Logs
      • /opt/mapr/logs
      • For example:
          – CLDB
          – Warden
          – FileServer (mfs)
          – NFS
  • 12. Troubleshooting MapReduce Jobs
  • 13. Central Logging
      • MapR 2.0 introduces central logging
          – Log files written to “local” volume on MapR-FS
              • replication factor = 1
          – I/O confined to node
          – /var/mapr/local/<host>/logs/mapred/userlogs
          – Configurable via JobTracker variable
              • mapr.localvolumes.path
  • 14. Central Logging
      • New CLI for MapReduce logs
          maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [ -jobconf <pathToJobXml> ]
          – Creates a job-centric view of all logs on all involved TaskTracker nodes
          – Creates the following structure under <maprfsDir> for all <jobid>s matching <jobPattern>
              • <jobid>/hosts/<host>/
                  – symbolic links to log directories of tasks executed for <jobid> on <host>
              • <jobid>/mappers/
                  – symbolic links to log directories of all map task attempts for <jobid> across the cluster
              • <jobid>/reducers/
                  – symbolic links to log directories of all reduce task attempts for <jobid> across the cluster
  • 15. Troubleshooting Scenarios
  • 16. Troubleshooting Scenarios
      • Slow nodes
      • Out of memory
      • Out of disk space
      • Time skew
      • No ZooKeeper quorum
      • Contention for resources
      • Requirements not met
  • 17. Identifying Slow Nodes
      • Before installation:
          – Use dd to benchmark read/write speed
              • dd bs=4M if=/dev/zero of=/dev/sd<x>
          – Compare performance across nodes to test network throughput:
              • dd bs=4M if=/dev/zero | sudo ssh root@node 'dd bs=4M of=/dev/foo'
      • After installation:
          – Look at task starting and completion times
          – Look in system logs for memory or CPU problems
          – Look at the performance of writes to the local volume (where intermediate data goes)
      • Slow disks are identified based on a threshold in mfs.conf
          – The real cause may be a slow NIC
  • 18. Out of Memory
      • Make sure there is enough swap space
      • See if a memory-intensive job is running
      • Use ulimit to make sure there are no limits on the number of file descriptors, resource usage, and the number of processes
      • Garbage collection can result in out-of-memory errors
  • 19. Out of Disk Space
      • MapR logs go to /opt/mapr/logs
          – If this partition is too small, space can run out
          – Set up a cron job to clean out old logs
          – Move to a larger partition
  • 20. Time Skew
      • NTP is your friend
      • A 20-second differential is the maximum allowed
  • 21. No ZooKeeper Quorum
      • Not enough ZooKeepers running
      • configure.sh run improperly
          – Different ZooKeeper or CLDB nodes specified
      • Network problem
          – Hostname resolution
          – Physical connection down
  • 22. Contention for Resources
      • Make sure there’s no limit on file descriptors, processes
      • Make sure the service layout follows good guidelines
          – Don’t run ZooKeeper with CLDB or JobTracker
          – Fewer task slots when running TaskTracker with CLDB or ZooKeeper
          – Avoid running the active JobTracker on the primary CLDB node
      • Don’t run other random things on cluster nodes
      • Don’t mix distributions
  • 23. Requirements Not Met
      • Use Sun Java JDK
      • Same users/groups with same UID/GID numbers on all nodes
      • Proper licensing
      • Host resolution between all nodes
          – DNS or /etc/hosts
      • Keyless ssh between all nodes for the root user
      • All necessary ports open
          – Watch out for iptables and selinux
  • 24. Working with MapR Support
  • 25. Working with MapR Support
      • mapr-support-collect and mapr-support-dump
      • fsck and gfsck
  • 26. Things to Avoid
  • 27. Things to Avoid
      • Removing ZooKeeper data manually
      • Formatting disks (unless you are sure)
      • Running configure.sh incorrectly
      • Using dd on an installed node
      • Modifying configuration files
          – Without a good reason
          – Inconsistently
  • 28. Questions
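The "Out of Disk Space" slide suggests a cron job to clean out old logs. A minimal sketch of what such a job might call is shown here. The function name `clean_old_logs`, the 7-day retention, and the script path in the cron line are assumptions for illustration and are not taken from the deck; check what your site needs to retain before deleting anything.

```shell
#!/bin/sh
# Hypothetical helper (not a MapR tool): remove regular files under a log
# directory whose last modification is more than N full days in the past.
# find descends into subdirectories, so nested log folders are covered too.
clean_old_logs() {
    log_dir="$1"        # directory to clean, e.g. /opt/mapr/logs
    max_age_days="$2"   # retention in days (assumed value, pick your own)

    # Refuse to run if the directory does not exist.
    [ -d "$log_dir" ] || return 1

    find "$log_dir" -type f -mtime +"$max_age_days" -exec rm -f {} +
}

# Example cron entry (assumed path and schedule), running daily at 03:00:
#   0 3 * * * /usr/local/bin/clean-mapr-logs.sh
# where that script sources this function and calls:
#   clean_old_logs /opt/mapr/logs 7
```

Deleting by modification time leaves files that services are still writing to untouched, which is why the sketch uses `-mtime` rather than a file-name pattern.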
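The "Time Skew" slide gives 20 seconds as the maximum allowed differential. As a rough sketch, the check can be expressed as a comparison of two epoch timestamps. The function names `clock_skew` and `skew_ok` are hypothetical, and the ssh call in the usage comment is only one assumed way of reading a remote clock; its network latency adds a small error to the measurement.

```shell
#!/bin/sh
# Maximum clock difference between nodes, in seconds (from the slide).
MAX_SKEW_SECONDS=20

# Print the absolute difference between two epoch timestamps (seconds).
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo "$diff"
}

# Succeed (exit status 0) when the two clocks are within the allowed skew.
skew_ok() {
    [ "$(clock_skew "$1" "$2")" -le "$MAX_SKEW_SECONDS" ]
}

# Usage sketch (the host name "node2" is an assumption):
#   skew_ok "$(date +%s)" "$(ssh node2 date +%s)" || echo "node2: time skew too large"
```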
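For the "No ZooKeeper Quorum" slide, "not enough ZooKeepers running" can be made concrete: a ZooKeeper ensemble needs a strict majority of its configured servers to be up. The sketch below only does that arithmetic. The function names are hypothetical, and how you count the running servers on your cluster is left as an assumption.

```shell
#!/bin/sh
# Print the smallest number of servers that forms a strict majority
# of an ensemble with the given configured size.
zk_quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed when the number of running servers ($2) reaches a majority
# of the configured ensemble size ($1).
zk_has_quorum() {
    [ "$2" -ge "$(zk_quorum_size "$1")" ]
}

# Usage sketch: with 3 configured ZooKeepers and 1 running, quorum is lost.
#   zk_has_quorum 3 1 || echo "no ZooKeeper quorum"
```

The arithmetic also shows why ensembles are usually sized with an odd number of servers: 4 servers need 3 running, so they tolerate one failure, the same as 3 servers.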