Responding to Outages Maturely

John Allspaw
SVP, Tech Ops
Code As Craft, Berlin

Tuesday, April 24, 12
OPERABILITY




PRODUCTION




http://WhoOwnsMyAvailability.com




How important is this?


How important is this?


How Can This Happen?


Complicated?
                           Complex?




Complex Systems
    •       Cascading Failures
    •       Difficult to determine boundaries
    •       Complex systems may be open
    •       Complex systems may have a memory
    •       Complex systems may be nested
    •       Dynamic network of multiplicity
    •       May produce emergent phenomena
    •       Relationships are non-linear
    •       Relationships contain feedback loops

How Can This Happen?
It does happen.
And it will again.
And again.

Optimization
                        MTBF
                        MTTR
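
The slide names the two levers without spelling out how they combine. A minimal sketch, assuming the standard steady-state definition availability = MTBF / (MTBF + MTTR); the function name and the sample numbers are illustrative and do not come from the talk. Under that assumption, halving MTTR buys the same availability as doubling MTBF.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only (assumptions, not figures from the talk).
baseline = availability(mtbf_hours=100.0, mttr_hours=1.0)
fewer_failures = availability(mtbf_hours=200.0, mttr_hours=1.0)   # double MTBF
faster_recovery = availability(mtbf_hours=100.0, mttr_hours=0.5)  # halve MTTR

print(f"baseline:    {baseline:.4%}")
print(f"double MTBF: {fewer_failures:.4%}")
print(f"halve MTTR:  {faster_recovery:.4%}")
```

Which lever is cheaper to pull depends on the system; the sketch only shows that, under this definition, the two are interchangeable in their effect on availability.
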
http://www.flickr.com/photos/sparktography/75499095/

How does team troubleshooting happen?

[Figure: incident timeline along a Time axis. Phases, in order: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear, PostMortem.]

[Figure: the same incident timeline, annotated with “Stress”.]
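
One way to make the phases on the timeline above measurable is to record when each one began and derive intervals from those marks. The sketch below is an illustration under stated assumptions, not tooling described in the talk: the class, its method names, and every timestamp are hypothetical; only the phase names come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Phase names follow the timeline slide; everything else is hypothetical.
PHASES = [
    "problem_starts", "detection", "evaluation", "response",
    "stable", "confirmation", "all_clear", "postmortem",
]


@dataclass
class IncidentTimeline:
    """Timestamps at which each phase of one incident began."""
    marks: dict[str, datetime]

    def elapsed(self, start: str, end: str) -> timedelta:
        """Time between the starts of two recorded phases."""
        return self.marks[end] - self.marks[start]

    def time_to_detect(self) -> timedelta:
        return self.elapsed("problem_starts", "detection")

    def time_to_stable(self) -> timedelta:
        return self.elapsed("problem_starts", "stable")

    def is_ordered(self) -> bool:
        """True if the recorded phases occur in the order shown on the slide."""
        times = [self.marks[p] for p in PHASES if p in self.marks]
        return all(a <= b for a, b in zip(times, times[1:]))


t0 = datetime(2012, 4, 24, 9, 0)  # made-up incident start
incident = IncidentTimeline(marks={
    "problem_starts": t0,
    "detection": t0 + timedelta(minutes=4),
    "evaluation": t0 + timedelta(minutes=6),
    "response": t0 + timedelta(minutes=15),
    "stable": t0 + timedelta(minutes=40),
    "confirmation": t0 + timedelta(minutes=55),
    "all_clear": t0 + timedelta(minutes=70),
    "postmortem": t0 + timedelta(days=2),
})

print("time to detect:", incident.time_to_detect())
print("time to stable:", incident.time_to_stable())
```

Recording the marks during the incident, rather than reconstructing them afterwards, is a design choice that would also feed the full timelines the postmortem slides ask for.
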

Forced beyond learned roles
          Actions whose consequences are both important and difficult to see
          Cognitively and perceptually noisy
          Coordinative load increases exponentially

So What Can We Do?

We Learn From Others

Characteristics of response to escalating scenarios

     • ...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment
     • ...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate)
     • ...inclined to think in causal series, instead of causal nets: “A, therefore B” instead of “A, therefore B and C (therefore D and E)”, etc.

“On the Difficulties People Have in Dealing With Complexity”, Dietrich Doerner, 1980


Pitfalls

     • Thematic Vagabonding
     • Goal Fixation (encystment)
     • Refusal to make decisions

Heroism
          Non-communicating lone wolf-isms


Distraction
          Irrelevant noise in comm channels


Jens Rasmussen, 1983
Senior Member, IEEE

“Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models”
IEEE Transactions On Systems, Man, and Cybernetics, May 1983




SKILL-BASED: Simple, routine
RULE-BASED: Knowable, but unfamiliar
KNOWLEDGE-BASED: WTF IS GOING ON?
(Reason, 1990)

Team Troubleshooting
       • Which causes did you consider first?
       • Which ones did you not consider at all?
       • How much of what you considered comes from recent history?
       • How much comes from observations from other team members?

Team Troubleshooting

       • How effective is the response team in communicating to other groups? Users?
       • How long does it take to exhaust obvious cause(s)?



Team Dynamics



High Reliability Organizations

       • Air Traffic Control
       • Naval Air Operations At Sea
       • Electrical Power Systems
       • Etc.

       • Complex Socio-Technical systems
       • Efficiency <-> Thoroughness
       • Time/Resource Constrained
       • Engineering-driven

“The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea”
Rochlin, La Porte, and Roberts. Naval War College Review, 1987
http://govleaders.org/reliability.htm




Close interdependence between groups

Close reciprocal coordination and information sharing, resulting in overlapping knowledge

High redundancy: multiple people observing the same event and sharing information

Broad definition of who belongs to the team.

Teammates are included in the communication loops rather than excluded.

Lots of error correction.

High levels of situation comprehension: maintain constant awareness of the possibility of accidents.

High levels of interpersonal skills

Maintenance of detailed records of past incidents that are closely examined with a view to learning from them.

Patterns of authority are changed to meet the demands of the events: organizational flexibility.

The reporting of errors and faults is rewarded, not punished.

So What Else Can We Do?

We Drill

We GameDay

We Learn To Improvise



IMPROVISATION



We Learn From Our Mistakes

Postmortems

       • Full timelines: what happened, when, and who was involved
       • Review in public, everyone invited
       • Search for “second stories” instead of “human error”
       • Cultivating a blameless environment
       • Giving requisite authority to individuals to improve things


Qualifying Response
         High signal:noise in comm channels?
         Troubleshooting fatigue?
         Troubleshooting handoff?
         All tools on-hand and working?
         Improvised tooling or solutions?
         Metrics visibility?
         Collaborative and skillful communication?

Remediation




We Share Near-Miss Events

Near Misses
       Hey everybody -
       Don’t be like me. I tried to X, but that wasn’t a good idea.
       It almost exploded everyone.

       So, don’t do: (details about X)
       Love,
       Joe

Near Misses
       • Can act like “vaccines” - help system safety without actually hurting anything
       • Happen more often, so provide more data on latent failures
       • Powerful reminder of hazards, and slows down the process of forgetting to be afraid

Practice!
      •        How we troubleshoot in the moment, as a distributed team
      •        How we handle time pressure
      •        How we Observe/Orient/Decide/Act
      •        How we communicate during emergencies
      •        How we trust (or not) each other during emergencies
      •        How we relate to emergencies when things are normal
      •        How we could detect how we are protected during normal times (i.e., why aren’t we going down RIGHT NOW?)

Resilient Response
  •       Can learn from other fields
  •       Can train for outages
  •       Can learn from mistakes
  •       Can learn from successes as well as failures

http://www.flickr.com/photos/sparktography/75499095/

THE END




A parting word
    A parting challenge


Two Propositions



100 changes
          6 change-related issues

100 > 6

Proposition #1
          “Ways in which things go right are special cases
          of the ways in which things go wrong.”




Proposition #1
                        Successes = failures gone wrong
                        Study the failures, generalize from that.
                          Potential data sources: 6 out of 100

Proposition #2
          “Ways in which things go wrong are special
          cases of the ways in which things go right.”




Proposition #2
          Failures = successes gone wrong
          Study the successes, generalize from that
          Potential data sources: 94 out of 100

94/100 ? OR 6/100 ?

What and WHY Do Things Go RIGHT?

Not just: why did we fail?
But also: why did we succeed?

Mature Role of Automation

        “Ironies of Automation” - Lisanne Bainbridge
           http://www.bainbrdg.demon.co.uk/Papers/Ironies.html




Mature Role of Automation
         •       Moves humans from manual operator to supervisor
         •       Extends and augments human abilities, doesn’t replace them
         •       Doesn’t remove “human error”
         •       Is brittle
         •       Recognizes that there is always discretionary space for humans
         •       Recognizes the Law of Stretched Systems

Law of Stretched Systems
          “Every system is stretched to operate at its capacity; as soon as there is some improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity”

   D. Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns”, 2006
