MANAGING YOUR HEROES
     The People Aspect of Monitoring
(a.k.a. Dealing with Outages and Failures)
               Alex Solomon
             alex@pagerduty.com
WHO AM I?
     Alex Solomon

    • Founder / CEO of PagerDuty

    • Intersect Inc.

    • Amazon.com
DEFINITIONS



Service Level Agreement (SLA)

 Mean Time To Resolution (MTTR)

      Mean Time To Response

Mean Time Between Failures (MTBF)
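As a rough illustration of how the three time-based metrics above could be computed from incident records, here is a minimal Python sketch. The `Incident` record, its field names, and the sample data are assumptions made for this example, not something defined in the talk. MTBF is approximated here as the mean gap between one incident being fixed and the next one being detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for illustration.
    detected: datetime   # monitoring detected the issue
    responded: datetime  # an engineer started working on the issue
    resolved: datetime   # the issue was fixed


def _minutes(delta):
    return delta.total_seconds() / 60


def mean_time_to_response(incidents):
    """Average minutes from detection to an engineer starting work."""
    return mean(_minutes(i.responded - i.detected) for i in incidents)


def mean_time_to_resolution(incidents):
    """Average minutes from detection to the fix (MTTR)."""
    return mean(_minutes(i.resolved - i.detected) for i in incidents)


def mean_time_between_failures(incidents):
    """Average minutes between one fix and the next detection (MTBF)."""
    ordered = sorted(incidents, key=lambda i: i.detected)
    return mean(_minutes(b.detected - a.resolved) for a, b in zip(ordered, ordered[1:]))


# Made-up sample data: two incidents ten hours apart.
t0 = datetime(2012, 9, 1, 2, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=40)),
    Incident(t0 + timedelta(hours=10), t0 + timedelta(hours=10, minutes=20), t0 + timedelta(hours=11)),
]

print(mean_time_to_response(incidents))
print(mean_time_to_resolution(incidents))
print(mean_time_between_failures(incidents))
```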



OUTAGES



Can we prevent them?




PREVENTING OUTAGES

  Single Points of Failure (SPOFs) ➞ Redundant systems

  Complex, monolithic systems ➞ Service-oriented architecture
Netflix distributed SOA system




PREVENTING OUTAGES

                 Change

 (not much you can do about this one)




OUTAGES


FAILURE LIFECYCLE



Monitoring (detect failure) ➞ Alert ➞ Investigate ➞ Fix ➞ Root-cause Analysis
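Critical Incident Timeline

Issue is detected ➞ [Alert] ➞ Engineer starts working on issue ➞ [Investigate] ➞ [Fix] ➞ Issue is fixed

RESPONSE TIME: from “issue is detected” to “engineer starts working on issue”
RESOLUTION TIME: from “issue is detected” to “issue is fixed”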
MONITOR



MONITOR EVERYTHING!
All levels of the stack
•   Data center
•   Network
•   Servers
•   Database
•   Application
•   Website
•   Business Metrics

WHY MONITOR EVERYTHING?


                 Metrics!

                 Metrics!

                 Metrics!




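TOOLS
•   Internal monitoring (behind the firewall): [tool logos not captured in transcript]
•   External monitoring (SaaS-based): [tool logos not captured in transcript]
•   Metrics:
    •   Graphite or [tool logo not captured in transcript]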
ALERT



Best Practice: Categorize alerts by severity.




SEVERITIES
Define severities based on business impact:

2 critical severities:
• sev1 - large-scale business loss
• sev2 - small to medium business loss

2 non-critical severities:
• sev3 - no immediate business loss, customers may be impacted
• sev4 - no business loss, no customers impacted
Each severity level should have its own standard operating procedure (SOP):
•   Who gets notified

•   How they are notified

•   Expected response time
•   Sev1: Major outage, all hands on deck
    •   Notify the entire team via phone and SMS
    •   Response time: 5 min
•   Sev2: Critical issue
    •   Notify the on-call person via phone and SMS
    •   Response time: 15 min
•   Sev3: Non-critical issue
    •   Notify the on-call person via email
    •   Response time: next day during business hours
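The per-severity procedures above could be captured as a small lookup table, so that the alerting logic only has to ask "who, how, and how fast" for a given severity. This is a minimal sketch under stated assumptions: the `Severity` and `Sop` types and the function names are hypothetical, and sev4 is left without an SOP because the slides do not give one.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Sop:
    # Hypothetical structure mirroring the three SOP questions.
    who: str
    how: Tuple[str, ...]
    response_minutes: Optional[int]  # None = next day during business hours


# Values taken from the slide; sev4 is not specified there.
SOPS = {
    Severity.SEV1: Sop("entire team", ("phone", "sms"), 5),
    Severity.SEV2: Sop("on-call person", ("phone", "sms"), 15),
    Severity.SEV3: Sop("on-call person", ("email",), None),
}


def is_critical(sev):
    """sev1 and sev2 are the two critical severities."""
    return sev <= Severity.SEV2


def should_page(sev):
    """Only page (phone/SMS) on critical issues; everything else can wait."""
    return is_critical(sev)


for sev, sop in SOPS.items():
    print(sev.name, sop.who, sop.how, sop.response_minutes, should_page(sev))
```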



•   Sev1 incidents
    •   Rare
    •   Rarely auto-generated
    •   Frequently start as sev2 which are upgraded to sev1




•   Sev2 incidents
    •   More common
    •   Mostly auto-generated




•   Sev3 incidents
    •   Non-critical incidents
    •   Can be auto-generated
    •   Can also be manually generated




•   Severities can be downgraded or upgraded
    •   e.g. sev2 ➞ sev1 (problem got worse)
    •   e.g. sev1 ➞ sev2 (problem was partially fixed)
    •   e.g. sev2 ➞ sev3 (critical problem was fixed but we still need to investigate root cause)
One more best-practice:

Alert before your systems fail completely




Main benefit of severities

Only page on critical issues (sev1 or 2)




Preserve sanity




Avoid “The Boy Who Cried Wolf” scenarios
ON-CALL BEST PRACTICES


   Person level      Team level
ON-CALL AT THE PERSON LEVEL


         Cellphone




Cellphone or smartphone, or both
4G / 3G internet
 •   4G hotspot
 •   4G USB modem
 •   3G/4G tethering

     (don’t forget your laptop)
Page multiple times until you respond
 •   Time zero: email and SMS
 •   1 min later: phone call on cell
 •   5 min later: phone call on cell
 •   5 min later: phone call on landline
 •   5 min later: phone call to girlfriend
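The retry sequence above is a list of delays, each relative to the previous step. A minimal sketch of turning it into an absolute timeline follows; the rule list mirrors the slide, while the function names are hypothetical and not part of any real paging product.

```python
from itertools import accumulate

# (minutes to wait after the previous step, channel), as listed on the slide.
NOTIFICATION_RULES = [
    (0, "email and SMS"),
    (1, "phone call on cell"),
    (5, "phone call on cell"),
    (5, "phone call on landline"),
    (5, "phone call to girlfriend"),
]


def notification_timeline(rules):
    """Return (minutes since the alert fired, channel) for every step."""
    offsets = accumulate(delay for delay, _ in rules)
    return [(offset, channel) for offset, (_, channel) in zip(offsets, rules)]


def steps_due(rules, minutes_unacknowledged):
    """Channels that should already have been tried for an unacknowledged alert."""
    return [
        channel
        for offset, channel in notification_timeline(rules)
        if offset <= minutes_unacknowledged
    ]


for offset, channel in notification_timeline(NOTIFICATION_RULES):
    print(f"t+{offset} min: {channel}")
```

Once the person acknowledges the alert, the remaining steps would simply be skipped.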



Bonus: vibrating bluetooth bracelet




ON-CALL AT THE TEAM LEVEL


  Rarely send alerts to the entire team

                sev1 OK
                sev2 NO
On-call schedules:
 •   Simple rotation-based schedule
     •   e.g. weekly: everyone is on-call for a week at a time
 •   Set up a follow-the-sun schedule
     •   people in multiple timezones
     •   no night shifts
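Both schedule types can be sketched in a few lines. The example below is an illustration only: the team names, rotation start date, and UTC shift windows are made up, and the function names are hypothetical.

```python
from datetime import date


def weekly_on_call(team, rotation_start, day):
    """Who is on call on `day` in a simple weekly rotation."""
    weeks_elapsed = (day - rotation_start).days // 7
    return team[weeks_elapsed % len(team)]


def follow_the_sun(shifts, utc_hour):
    """Pick the site whose daytime shift covers `utc_hour`.

    `shifts` maps a site to a (start_hour, end_hour) window in UTC,
    end exclusive; a window may wrap past midnight.
    """
    for site, (start, end) in shifts.items():
        if start <= end:
            covered = start <= utc_hour < end
        else:
            covered = utc_hour >= start or utc_hour < end
        if covered:
            return site
    raise ValueError(f"no shift covers hour {utc_hour}")


# Made-up example data.
TEAM = ["alice", "bob", "carol"]
ROTATION_START = date(2012, 9, 3)  # a Monday
SHIFTS = {"london": (8, 16), "san_francisco": (16, 0), "sydney": (0, 8)}

print(weekly_on_call(TEAM, ROTATION_START, date(2012, 9, 12)))
print(follow_the_sun(SHIFTS, 23))
```

With three sites eight hours apart, every hour of the day falls inside somebody's daytime window, which is what removes the night shifts.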




What happens if the on-call person doesn’t respond at all?




If you care about uptime, you need redundancy in your on-call.

Set up multiple on-call levels with automatic
escalation between them:

  Level 1: Primary on-call
       Escalate after 15 min

  Level 2: Secondary on-call
       Escalate after 20 min

  Level 3: Team on-call (alert entire team)
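One way to picture automatic escalation is as a function from "minutes the alert has gone unacknowledged" to "which level is being alerted now". The sketch below assumes each timeout is counted from the moment that level was first alerted; that reading, the policy structure, and the function name are assumptions for illustration.

```python
# (level, minutes to wait for an acknowledgement before escalating).
# None marks the last level, which has nowhere further to escalate.
ESCALATION_POLICY = [
    ("Primary on-call", 15),
    ("Secondary on-call", 20),
    ("Team on-call (entire team)", None),
]


def current_level(policy, minutes_unacknowledged):
    """Return the level that should be getting alerted right now."""
    remaining = minutes_unacknowledged
    for name, timeout in policy:
        if timeout is None or remaining < timeout:
            return name
        remaining -= timeout
    # Every level timed out: keep alerting the last one.
    return policy[-1][0]


for minutes in (0, 15, 35):
    print(minutes, current_level(ESCALATION_POLICY, minutes))
```

Adding another level, such as a manager, is then just one more entry in the list.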




Best Practice: Put management in the on-call chain
  Level 1: Primary on-call
       Escalate after 15 min

  Level 2: Secondary on-call
       Escalate after 20 min

  Level 3: Team on-call (alert entire team)
       Escalate after 20 min

  Level 4: Manager / Director



Best Practice: put software engineers in the
on-call chain
  •   DevOps model
  •   Devs need to own the systems they write
  •   Getting paged provides a strong incentive to engineer
      better systems




Best Practice: measure on-call performance
“You can’t improve what you don’t measure.”

   •   Measure: mean-time-to-response
   •   Measure: % of issues that were escalated
   •   Set up policies to encourage good performance
       •   Put managers in on-call chain
       •   Pay people extra to do on-call
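The two measurements above can be reported per on-call person from a simple incident log. This is a sketch with made-up data; the log layout and the function name are assumptions, not an export format from any particular tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log rows: (primary on-call, minutes until response, escalated?)
INCIDENT_LOG = [
    ("alice", 4, False),
    ("alice", 6, False),
    ("bob", 12, False),
    ("bob", 30, True),
]


def on_call_report(log):
    """Per-person mean time to response (minutes) and % of issues escalated."""
    by_person = defaultdict(list)
    for person, response_minutes, escalated in log:
        by_person[person].append((response_minutes, escalated))

    report = {}
    for person, rows in by_person.items():
        report[person] = {
            "mean_time_to_response": mean(minutes for minutes, _ in rows),
            "percent_escalated": 100 * sum(esc for _, esc in rows) / len(rows),
        }
    return report


for person, stats in on_call_report(INCIDENT_LOG).items():
    print(person, stats)
```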



Network Operations Center




NOC with lots of Nagios goodness


NOCs:
 •   Reduce the mean-time-to-response drastically
 •   Expensive (staffed 24x7 with multiple people)
 •   Train NOC staff to fix a good percentage of issues
 •   As you scale your org, you may want a hybrid on-call approach (where the NOC handles some issues and teams handle other issues directly)
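Critical Incident Timeline

Issue is detected ➞ [Alert] ➞ Engineer starts working on issue ➞ [Investigate] ➞ [Fix] ➞ Issue is fixed

RESPONSE TIME: from “issue is detected” to “engineer starts working on issue”
RESOLUTION TIME: from “issue is detected” to “issue is fixed”

The Alert phase breaks down into two steps:

Issue is detected ➞ (alerting system gets ahold of somebody) ➞ Engineer is aware of issue ➞ (engineer gets to a computer, connects to internet) ➞ Engineer starts working on issue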
Alert

Step 1: Issue is detected ➞ Engineer is aware of issue
(alerting system gets ahold of somebody)
    How to minimize:
    •   Alert via phone & SMS
    •   Alert multiple times via multiple channels
    •   Failing that, escalate!
    •   Failing that, escalate to manager!

Step 2: Engineer is aware of issue ➞ Engineer starts working on issue
(engineer gets to a computer, connects to internet)
    How to minimize:
    •   Carry 4G internet device + laptop at all times
    •   Set loud ringtone at night
RESEARCH & FIX



How do we reduce the amount of time needed to investigate and fix?

           Investigate ➞ Fix
Set up an Emergency Ops Guide:
  •   When you encounter a new failure, document it in the
      Guide
  •   Document symptoms, research steps, fixes
  •   Use a wiki
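A wiki page per failure is enough, but the shape of each entry matters more than the tool. As a hedged illustration, the sketch below models an entry with the three documented parts (symptoms, research steps, fixes) and a naive symptom search. The `GuideEntry` type, the function name, and the sample entry are all made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GuideEntry:
    # Hypothetical structure: one entry per known failure.
    title: str
    symptoms: List[str]
    research_steps: List[str]
    fixes: List[str]


def find_entries(guide, observed):
    """Return entries that have a symptom mentioning the observed text."""
    needle = observed.lower()
    return [
        entry
        for entry in guide
        if any(needle in symptom.lower() for symptom in entry.symptoms)
    ]


# Made-up sample entry.
GUIDE = [
    GuideEntry(
        title="Disk full on database server",
        symptoms=["Disk usage alert above 95%", "Writes fail with 'No space left on device'"],
        research_steps=["Run `df -h` to find the full volume", "Look for oversized log files"],
        fixes=["Rotate or compress old logs", "Grow the volume"],
    ),
]

for entry in find_entries(GUIDE, "no space left"):
    print(entry.title, "->", entry.fixes)
```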




Automate fixes
           or
Add more fault tolerance
You need the right tools:
 •   Tools to help you diagnose problems faster
     •   Comprehensive monitoring, metrics and dashboards
     •   Tools that help search for problems in log files quickly (e.g. Splunk)
 •   Tools to help your team communicate efficiently
     •   Voice: conference bridge, Skype, Google Hangouts
     •   Chat: HipChat, Campfire
Best Practice: Incident Commander




Incident Commander:
 •   Essential for dealing with sev1 issues
 •   In charge of the situation
     •   Provides leadership, prevents analysis paralysis
     •   He/she directs people to do things
     •   Helps save time making decisions




Questions?


 Alex Solomon
alex@pagerduty.com

 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Nagios Conference 2012 - Alex Solomon - Managing Your Heroes

  • 1. MANAGING YOUR HEROES The People Aspect of Monitoring (a.k.a. Dealing with Outages and Failures) Alex Solomon alex@pagerduty.com
  • 2. WHO AM I? Alex Solomon • Founder / CEO of PagerDuty • Intersect Inc. • Amazon.com
  • 4. Service Level Agreement (SLA) • Mean Time To Resolution (MTTR) • Mean Time To Response • Mean Time Between Failures (MTBF)
  • 6. Can we prevent them?
  • 7. PREVENTING OUTAGES • Single Points of Failure (SPOFs) ➞ Redundant systems • Complex, monolithic systems ➞ Service-oriented architecture
  • 9. PREVENTING OUTAGES • Change (not much you can do about this one)
  • 10. OUTAGES
  • 12. Monitoring (detect failure) ➞ Alert ➞ Investigate ➞ Fix ➞ Root-cause Analysis
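  • 13. Critical Incident Timeline: Issue is detected ➞ Alert ➞ Engineer starts working on issue ➞ Investigate ➞ Fix ➞ Issue is fixed. RESPONSE TIME covers the Alert phase (from the issue being detected until an engineer starts working on it); RESOLUTION TIME runs until the issue is fixed.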
  • 14. MONITOR
  • 15. MONITOR EVERYTHING! All levels of the stack • Data center • Network • Servers • Database • Application • Website • Business Metrics
  • 16. WHY MONITOR EVERYTHING? Metrics! Metrics! Metrics!
  • 17. TOOLS • Internal monitoring (behind the firewall): (tool logos not captured) • External monitoring (SaaS-based): (tool logos not captured) • Metrics: Graphite or (logo not captured)
  • 18. ALERT
  • 19. Best Practice: Categorize alerts by severity.
  • 20. SEVERITIES Define severities based on business impact: • sev1 - large scale business loss • sev2 - small to medium business loss • sev3 - no immediate business loss, customers may be impacted • sev4 - no business loss, no customers impacted. Sev1 and sev2 are the 2 critical severities; sev3 and sev4 are the 2 non-critical severities.
  • 21. Each severity level should have its own standard operating procedure (SOP): • Who • How • Response time
  • 22. Standard operating procedure per severity:
      • Sev1: Major outage, all hands on deck. Notify the entire team via phone and SMS. Response time: 5 min
      • Sev2: Critical issue. Notify the on-call person via phone and SMS. Response time: 15 min
      • Sev3: Non-critical issue. Notify the on-call person via email. Response time: next day during business hours
  • 23. Sev1 incidents • Rare • Rarely auto-generated • Frequently start as sev2 incidents that are upgraded to sev1
  • 24. Sev2 incidents • More common • Mostly auto-generated
  • 25. Sev3 incidents • Non-critical incidents • Can be auto-generated • Can also be manually generated
  • 26. Severities can be downgraded or upgraded • ex. sev2 ➞ sev1 (problem got worse) • ex. sev1 ➞ sev2 (problem was partially fixed) • ex. sev2 ➞ sev3 (critical problem was fixed but we still need to investigate root cause)
  • 27. One more best practice: Alert before your systems fail completely
  • 28. Main benefit of severities: Only page on critical issues (sev1 or sev2)
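The severity procedures on slides 20 to 28 can be sketched as a small lookup table. This is a minimal illustration, not part of the talk: the `Sop` class, the `SOPS` dictionary and `should_page` are hypothetical names, and only the sev1 to sev3 values come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sop:
    who: str            # who gets notified
    channels: tuple     # how they are notified
    response_time: str  # expected response time


# Values for sev1-sev3 are taken from slide 22; the layout is an assumption.
SOPS = {
    1: Sop("entire team", ("phone", "sms"), "5 min"),
    2: Sop("on-call person", ("phone", "sms"), "15 min"),
    3: Sop("on-call person", ("email",), "next day during business hours"),
}


def should_page(severity: int) -> bool:
    """Only the critical severities (sev1 or sev2) page a human."""
    return severity in (1, 2)
```

Keeping the procedure in data rather than in code paths makes it easy to review who is notified, how, and how quickly for each severity.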
  • 30. Avoid “Peter and the wolf” scenarios
  • 31. ON-CALL BEST PRACTICES • Person Level • Team Level
  • 32. ON-CALL AT THE PERSON LEVEL: Cellphone
  • 34. 4G / 3G internet • 4G hotspot • 4G USB modem • 3G/4G tethering (don’t forget your laptop)
  • 35. Page multiple times until you respond • Time zero: email and SMS • 1 min later: phone call on cell • 5 min later: phone call on cell • 5 min later: phone call on landline • 5 min later: phone call to girlfriend
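The repeated-paging rule on slide 35 can be written as a schedule of notification steps. The sketch below assumes that each "N min later" is measured from the previous step (so the steps fire at 0, 1, 6, 11 and 16 minutes); `STEPS`, `schedule` and `due_actions` are hypothetical names.

```python
from itertools import accumulate

# (delay after the previous step in minutes, action), from slide 35.
STEPS = [
    (0, "email and SMS"),
    (1, "phone call on cell"),
    (5, "phone call on cell"),
    (5, "phone call on landline"),
    (5, "phone call to girlfriend"),
]


def schedule(steps):
    """Turn relative delays into minutes since the alert was raised."""
    offsets = accumulate(delay for delay, _ in steps)
    return [(offset, action) for offset, (_, action) in zip(offsets, steps)]


def due_actions(steps, minutes_unacknowledged):
    """Actions that should already have fired for an unacknowledged alert."""
    return [a for t, a in schedule(steps) if t <= minutes_unacknowledged]
```

For example, `due_actions(STEPS, 6)` lists the first three steps, since the second call to the cellphone fires six minutes after the alert.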
  • 37. ON-CALL AT THE TEAM LEVEL Rarely send alerts to the entire team • sev1: OK • sev2: NO
  • 38. On-call schedules: • Simple rotation-based schedule (ex. weekly: everyone is on-call for a week at a time) • Follow-the-sun schedule (people in multiple timezones, no night shifts)
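A simple weekly rotation like the one on slide 38 needs only a list of people and a start date. This is a minimal sketch; the function name, the engineer names and the start date in the usage note are made up for illustration.

```python
from datetime import date


def on_call(engineers, rotation_start, day):
    """Return who is on call on `day` for a weekly rotation.

    `engineers` is the rotation order and `rotation_start` is the first
    day of the first engineer's week.
    """
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]
```

With `engineers = ["ana", "bo", "cy"]` and a rotation starting on `date(2012, 9, 24)`, the call `on_call(engineers, date(2012, 9, 24), date(2012, 10, 1))` falls in the second week of the rotation, so it returns the second person in the list. A follow-the-sun schedule would additionally key on the hour of day and each person's timezone.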
  • 39. What happens if the on-call person doesn’t respond at all?
  • 40. If you care about uptime, you need redundancy in your on-call.
  • 41. Set up multiple on-call levels with automatic escalation between them: • Level 1: Primary on-call ➞ escalate after 15 min • Level 2: Secondary on-call ➞ escalate after 20 min • Level 3: Team on-call (alert entire team)
  • 42. Best Practice: Put management in the on-call chain • Level 1: Primary on-call ➞ escalate after 15 min • Level 2: Secondary on-call ➞ escalate after 20 min • Level 3: Team on-call (alert entire team) ➞ escalate after 20 min • Level 4: Manager / Director
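The escalation chain on slides 41 and 42 can be sketched as a list of levels with timeouts. The code assumes each timeout is counted from the moment that level was notified; `LEVELS` and `current_level` are hypothetical names, while the level names and minutes come from the slides.

```python
# (level name, minutes to wait before escalating; None means last level)
LEVELS = [
    ("Primary on-call", 15),
    ("Secondary on-call", 20),
    ("Team on-call", 20),
    ("Manager / Director", None),
]


def current_level(minutes_unacknowledged, levels=LEVELS):
    """Return the level an unacknowledged alert has escalated to."""
    elapsed = 0
    for name, timeout in levels:
        # Stay on this level until its timeout has fully elapsed.
        if timeout is None or minutes_unacknowledged < elapsed + timeout:
            return name
        elapsed += timeout
    return levels[-1][0]
```

Under that assumption the secondary is notified 15 minutes after the alert, the whole team after 35 minutes, and the manager or director after 55 minutes.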
  • 43. Best Practice: Put software engineers in the on-call chain • DevOps model • Devs need to own the systems they write • Getting paged provides a strong incentive to engineer better systems
  • 44. Best Practice: Measure on-call performance “You can’t improve what you don’t measure.” • Measure: mean time to response • Measure: % of issues that were escalated • Set up policies to encourage good performance • Put managers in the on-call chain • Pay people extra to do on-call
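The two measurements named on slide 44 can be computed from a log of incidents. This is a minimal sketch: the `Incident` record and its field names are assumptions about what an alerting system might store, not something the talk specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected_at: datetime      # when the issue was detected
    acknowledged_at: datetime  # when an engineer responded
    escalated: bool            # whether it left the primary on-call


def mean_time_to_response(incidents):
    """Average of (acknowledged_at - detected_at); needs >= 1 incident."""
    total = sum(
        (i.acknowledged_at - i.detected_at for i in incidents), timedelta()
    )
    return total / len(incidents)


def percent_escalated(incidents):
    """Share of incidents that were escalated, as a percentage."""
    return 100.0 * sum(i.escalated for i in incidents) / len(incidents)
```

Tracking both numbers per team or per rotation over time shows whether policies such as putting managers in the chain actually change behavior.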
  • 46. NOC with lots of Nagios goodness
  • 47. NOCs: • Reduce the mean time to response drastically • Expensive (staffed 24x7 with multiple people) • Train NOC staff to fix a good percentage of issues • As you scale your org, you may want a hybrid on-call approach (where the NOC handles some issues and teams handle other issues directly)
  • 48. Critical Incident Timeline, Alert phase in detail: Issue is detected ➞ (alerting system gets ahold of somebody) ➞ Engineer is aware of issue ➞ (engineer gets to a computer, connects to internet) ➞ Engineer starts working on issue
  • 49. Minimizing the Alert phase:
      • Issue is detected ➞ Engineer is aware of issue (alerting system gets ahold of somebody). How to minimize: • Alert via phone & SMS • Alert multiple times via multiple channels • Set loud ringtone at night • Failing that, escalate!
      • Engineer is aware of issue ➞ Engineer starts working on issue (engineer gets to a computer, connects to internet). How to minimize: • Carry 4G internet device + laptop at all times • Failing that, escalate to manager!
  • 51. How do we reduce the amount of time needed to investigate and fix?
  • 52. Set up an Emergency Ops Guide: • When you encounter a new failure, document it in the Guide • Document symptoms, research steps, fixes • Use a wiki
  • 55. Automate fixes or add more fault tolerance
  • 56. You need the right tools: • Tools to help you diagnose problems faster: comprehensive monitoring, metrics and dashboards; tools that help search for problems in log files quickly (e.g. Splunk) • Tools to help your team communicate efficiently: voice (conference bridge, Skype, Google Hangout); chat (Hipchat, Campfire)
  • 57. Best Practice: Incident Commander
  • 58. Incident Commander: • Essential for dealing with sev1 issues • In charge of the situation • Provides leadership, prevents analysis paralysis • He/she directs people to do things • Helps save time making decisions