Ops Meta-Metrics
The Currency You Use to Pay For Change




John Allspaw
VP Operations
  Etsy.com
                                   http://www.flickr.com/photos/wwarby/3296379139
Warning

Graphs and numbers in this presentation are sort of made up
/usr/nagios/libexec/check_ops.pl
How R U Doing?
            http://www.flickr.com/photos/a4gpa/190120662/
We track bugs already...




       Example: https://issues.apache.org/jira/browse/TS
We should track
     these, too...

Changes (Who/What/When/Type)
Incidents (Type/Severity)
Response to Incidents (TTR/TTD)
trepidation
noun
1 a feeling of fear or agitation about something that may
happen : the men set off in fear and trepidation.
2 archaic trembling motion.
DERIVATIVES
trepidatious               adjective
ORIGIN late 15th cent.: from Latin trepidatio(n-), from
trepidare ‘be agitated, tremble,’ from trepidus ‘alarme
Change

Required.
Often feared.
Why?



                http://www.flickr.com/photos/20408885@N03/3570184759/
This is why
[Graph over time: la de da, everything’s fine → change happens → OMGWTF OUTAGES!!!1!!]
Change
 PTSD?




         http://www.flickr.com/photos/tzofia/270800047/
Brace For Impact?
But wait....
[Graph over time: la de da, everything’s fine → change happens → (OMGWTF); a brace marks the change]
How much change is this?
What kind of change?
How often does this happen?
Need to raise confidence that


change != outage
...incidents can be
    handled well




                 http://www.flickr.com/photos/axiepics/3181170364/
...root causes can be fixed
        quickly enough




                   http://www.flickr.com/photos/ljv/213624799/
...change can be
  safe enough




     http://www.flickr.com/photos/marksetchell/43252686/
But how?
How do we have confidence in anything
in our infrastructure?



          We measure it.
          And graph it.
          And alert on it.
Tracking Change
1. Type
2. Frequency/Size
3. Results of those changes
Types of Change

Layers            Examples

App code          PHP/Rails/etc or ‘front-end’ code
Services code     Apache, MySQL, DB schema, PHP/Ruby versions, etc.
Infrastructure    OS/Servers, Switches, Routers, Datacenters, etc.

(you decide what these are for your architecture)
Code Deploys: Who/What/When

WHEN    WHO                            WHAT
        (guy who pushed the button)    (link to diff)

(http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/)
Code Deploys:
            Who/What/When

                      Last 2 prod deploys
Last 2 Chef changes
other changes




(insert whatever ticketing/tracking you have)
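One way to capture the who/what/when/type of each change is a small record appended to a log. The sketch below is only an assumption about shape: the field names, the `ChangeType` categories (borrowed from the layers listed earlier), the `changes_per_day` helper, and the example entries are all hypothetical, not Etsy's deploy tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class ChangeType(Enum):
    # Hypothetical categories mirroring the "Types of Change" layers; pick your own.
    APP_CODE = "app code"
    SERVICES = "services code"
    INFRASTRUCTURE = "infrastructure"


@dataclass(frozen=True)
class Change:
    when: datetime           # when the change went out
    who: str                 # who pushed the button
    what: str                # link to the diff or ticket
    change_type: ChangeType  # which layer was touched


def changes_per_day(changes):
    """Count changes per calendar day, a simple frequency measure."""
    counts = {}
    for change in changes:
        day = change.when.date()
        counts[day] = counts.get(day, 0) + 1
    return counts


# Hypothetical log entries, for illustration only.
log = [
    Change(datetime(2010, 6, 1, 10, 15), "alice", "https://example.com/diff/1", ChangeType.APP_CODE),
    Change(datetime(2010, 6, 1, 14, 40), "bob", "https://example.com/diff/2", ChangeType.SERVICES),
    Change(datetime(2010, 6, 2, 9, 5), "alice", "https://example.com/diff/3", ChangeType.APP_CODE),
]
per_day = changes_per_day(log)
```

Whatever ticketing or tracking system you already have can likely hold the same four fields; the point is only that they are recorded for every change.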
Frequency
Size
Tracking Incidents
        http://www.flickr.com/photos/47684393@N00/4543311558/
Incident Frequency
Incident Size


      Big Outage
     TTR still going
Tracking Incidents

1. Frequency
2. Severity
3. Root Cause
4. Time-To-Detect (TTD)
5. Time-To-Resolve (TTR)
The How
Doesn’t
Matter




          http://www.flickr.com/photos/matsuyuki/2328829160/
Incident/Degradation Tracking

Date     Start Time   Detect Time   Resolve Time   Severity   Root Cause   PostMortem Done?
1/2/08   12:30 ET     12:32 ET      12:45 ET       Sev1       DB Change    Yes
3/7/08   18:32 ET     18:40 ET      18:47 ET       Sev2       Capacity     Yes
5/3/08   17:55 ET     17:55 ET      18:14 ET       Sev3       Hardware     Yes
These will give you context for your rates of change.
(You’ll need them for postmortems, anyway.)
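The start, detect, and resolve times in the tracking table are enough to derive TTD and TTR. A minimal sketch, assuming all three times fall on the same day and that TTR is measured from detection to resolution (some teams measure it from the start instead); `minutes_between` is a hypothetical helper name.

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes from start to end; both are 'HH:MM' strings on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# (date, start, detect, resolve, severity, root cause) from the tracking table.
incidents = [
    ("1/2/08", "12:30", "12:32", "12:45", "Sev1", "DB Change"),
    ("3/7/08", "18:32", "18:40", "18:47", "Sev2", "Capacity"),
    ("5/3/08", "17:55", "17:55", "18:14", "Sev3", "Hardware"),
]

for date, start, detect, resolve, severity, cause in incidents:
    ttd = minutes_between(start, detect)    # Time-To-Detect
    ttr = minutes_between(detect, resolve)  # Time-To-Resolve (assumed to start at detection)
    print(f"{date} {severity} {cause}: TTD={ttd} min, TTR={ttr} min")
```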
Change:Incident Ratio

  Important.
  Not because all changes are equal.
  Not because all incidents are equal, or change-related.
  But because humans will irrationally make a permanent connection between the two.
               http://www.flickr.com/photos/michelepedrolli/449572596/
Severity
Not all incidents are created equal.
Something like:



SEV1 Full outage, or effectively unusable.
SEV2 Significant degradation for subset of users.
SEV3 Minor impact on user experience.
SEV4 No impact, but time-sensitive failure.
Root Cause?
          (Not all incidents are change related)

          Something like:


                         1. Hardware Failure
                         2. Datacenter Issue
                         3. Change: Code Issue
                         4. Change: Config Issue
                         5. Capacity/Traffic Issue
                         6. Other
Note: this can be difficult to categorize.
http://en.wikipedia.org/wiki/Root_cause_analysis
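The severity levels and root-cause categories can be written down as fixed vocabularies so that every incident is tagged consistently. A sketch under assumptions: the enum and function names are hypothetical, the category text comes from the slides, and the sample list is invented purely to exercise the code.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "Full outage, or effectively unusable"
    SEV2 = "Significant degradation for subset of users"
    SEV3 = "Minor impact on user experience"
    SEV4 = "No impact, but time-sensitive failure"


class RootCause(Enum):
    HARDWARE_FAILURE = "Hardware Failure"
    DATACENTER_ISSUE = "Datacenter Issue"
    CHANGE_CODE = "Change: Code Issue"
    CHANGE_CONFIG = "Change: Config Issue"
    CAPACITY_TRAFFIC = "Capacity/Traffic Issue"
    OTHER = "Other"

    @property
    def is_change_related(self):
        # Only the two "Change:" categories count as change-related.
        return self in (RootCause.CHANGE_CODE, RootCause.CHANGE_CONFIG)


def percent_change_related(causes):
    """Percentage of incidents whose root cause is a change."""
    if not causes:
        return 0.0
    related = sum(1 for cause in causes if cause.is_change_related)
    return 100.0 * related / len(causes)


# Hypothetical incident list, for illustration only.
sample = [RootCause.CHANGE_CODE, RootCause.HARDWARE_FAILURE,
          RootCause.CHANGE_CONFIG, RootCause.OTHER]
print(percent_change_related(sample))
```

Because categorizing is difficult, an explicit `OTHER` bucket keeps the data honest when an incident does not fit.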
Recording Your Response




                (worth the hassle)


              http://www.flickr.com/photos/mattblaze/2695044170/
[Timeline, built up over several slides]
la de da, everything’s fine
change happens
Noticed there was a problem
Figured out what the cause is
Fixed the problem
  • rolled back
  • rolled forward
  • temporary solution
  • etc

• Coordinate troubleshooting/diagnosis
  • Communicate to support/community/execs

• Coordinate responses*
  • Communicate to support/community/execs
  * usually, “One Thing At A Time” responses

• Confirm stability, resolving steps
  • Communicate to support/community/execs
Communications
http://etsystatus.com




twitter.com/etsystatus
[Timeline: la de da, everything’s fine → change happens → Noticed there was a problem → Figured out what the cause is → Fixed the problem (rolled back, rolled forward, temporary solution, etc) → PostMortem]
[Timeline: la de da, everything’s fine → change happens → Time To Detect (TTD) → Time To Resolve (TTR) → la de da, everything’s fine]
Hypothetical Example:
 “We’re So Nimble!”
Nimble, But Stumbling?
Is There Any Pattern?
Nimble, But Stumbling?
Maybe this is too much suck?
Maybe you’re changing too much at once?
Happening too often?
What percentage of incidents are related to
change?




                            http://www.flickr.com/photos/78364563@N00/2467989781/
What percentage of change-
related incidents are “off-hours”?




Do they have higher or
lower TTR?




                             http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
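Both questions can be answered from the incident log once each change-related incident has a start time and a TTR. A rough sketch: the definition of "off-hours" (outside 09:00-18:00 on weekdays, or any time on a weekend), the function names, and the sample data are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import mean

# Assumption: business hours are 09:00-18:00, Monday through Friday.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 18


def is_off_hours(start):
    if start.weekday() >= 5:  # Saturday=5, Sunday=6
        return True
    return not (BUSINESS_START_HOUR <= start.hour < BUSINESS_END_HOUR)


def off_hours_summary(incidents):
    """incidents: list of (start datetime, TTR minutes) for change-related incidents.

    Returns (percent off-hours, mean TTR off-hours, mean TTR on-hours);
    a mean is None when its group is empty.
    """
    off = [ttr for start, ttr in incidents if is_off_hours(start)]
    on = [ttr for start, ttr in incidents if not is_off_hours(start)]
    percent_off = 100.0 * len(off) / len(incidents) if incidents else 0.0
    return percent_off, (mean(off) if off else None), (mean(on) if on else None)


# Hypothetical data, for illustration only.
sample = [
    (datetime(2010, 6, 1, 11, 0), 10),   # Tuesday, business hours
    (datetime(2010, 6, 1, 23, 30), 40),  # Tuesday night
    (datetime(2010, 6, 5, 14, 0), 30),   # Saturday
    (datetime(2010, 6, 2, 15, 0), 20),   # Wednesday, business hours
]
summary = off_hours_summary(sample)
```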
What types of change have the worst success rates?
Which ones have the best success rates?
                                 http://www.flickr.com/photos/lwr/2257949828/
Does your TTD/TTR increase depending on the:

-   SIZE?
-   FREQUENCY?




                               http://www.flickr.com/photos/45409431@N00/2521827947/
Side effect is that you’re also tracking successful changes to production as well




http://www.flickr.com/photos/wwworks/2313927146
Q2 2010

Type              Successes   Failures   Success Rate   Incident Minutes (Sev1/2)
App code          420         5          98.81          8
Config            404         3          99.26          5
DB Schema         15          1          93.33          10
DNS               45          0          100            0
Network (misc)    5           0          100            0
Network (core)    1           0          100            0
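The success-rate column can be recomputed from the counts. The figures in the table appear consistent with 1 - failures/successes, which is an inference from the numbers rather than anything the slides state; the more common successes/(successes + failures) would give slightly different values (98.82 and 93.75 for the first and third rows). The sketch below follows the table's apparent formula; the function names are hypothetical.

```python
# (type, successes, failures, incident minutes for Sev1/2) from the Q2 2010 table.
q2_2010 = [
    ("App code", 420, 5, 8),
    ("Config", 404, 3, 5),
    ("DB Schema", 15, 1, 10),
    ("DNS", 45, 0, 0),
    ("Network (misc)", 5, 0, 0),
    ("Network (core)", 1, 0, 0),
]


def success_rate(successes, failures):
    """Success rate in percent, computed as the table appears to: 1 - failures/successes."""
    return round(100.0 * (1 - failures / successes), 2)


def worst_first(rows):
    """Sort change types from lowest to highest success rate."""
    return sorted(rows, key=lambda row: success_rate(row[1], row[2]))


for name, ok, failed, minutes in worst_first(q2_2010):
    print(f"{name:15} {success_rate(ok, failed):6.2f}%  {minutes} incident min")
```

Sorting worst-first puts the change type that most deserves extra care at the top of the report.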
Some Observations
Incident Observations
[Graph: Morale (y-axis) vs. Length of Incident/Outage (x-axis)]
[Graph: Mistakes (y-axis) vs. Length of Incident/Outage (x-axis)]
Change Observations
[Graph: Change Size (y-axis) vs. Change Frequency (x-axis)]
Huge changesets deployed rarely (high TTR)
Tiny changesets deployed often (low TTR)
Specifically....
[Graph over time: la de da, everything’s fine → change happens; a brace marks the change]
What if this was only 5 lines of code that were changed?
Does that feel safer? (it should)
Pay attention to this stuff
                http://www.flickr.com/photos/plasticbag/2461247090/
We’re Hiring Ops!
SF & NYC
In May:

-   $22.9M of goods were sold by the community
-   1,895,943 new items listed
-   239,340 members joined
The End
Bonus Time!!1!
Continuous Deployment
Described in 6 graphs
(Originally Cal Henderson’s idea)
Más contenido relacionado

La actualidad más candente

5 Practices for an Agile Mindset
5 Practices for an Agile Mindset5 Practices for an Agile Mindset
5 Practices for an Agile MindsetMichael Sahota
 
The Paved Road at Netflix
The Paved Road at NetflixThe Paved Road at Netflix
The Paved Road at NetflixDianne Marsh
 
Redesigning everything ITARC Stockholm 2021
Redesigning everything ITARC Stockholm 2021Redesigning everything ITARC Stockholm 2021
Redesigning everything ITARC Stockholm 2021Alberto Brandolini
 
Тестирование требований и документации
Тестирование требований и документацииТестирование требований и документации
Тестирование требований и документацииUladzimir Kryvenka
 
ITKonekt 2023: The Busy Platform Engineers Guide to API Gateways
ITKonekt 2023: The Busy Platform Engineers Guide to API GatewaysITKonekt 2023: The Busy Platform Engineers Guide to API Gateways
ITKonekt 2023: The Busy Platform Engineers Guide to API GatewaysDaniel Bryant
 
Event storming Notes
Event storming NotesEvent storming Notes
Event storming NotesArnauld Loyer
 
Patterns of Kanban Maturity
Patterns of Kanban MaturityPatterns of Kanban Maturity
Patterns of Kanban MaturityDavid Anderson
 
C'est quoi le Software Craftsmanship ?
C'est quoi le Software Craftsmanship ?C'est quoi le Software Craftsmanship ?
C'est quoi le Software Craftsmanship ?Jean-Pierre Lambert
 
ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and Design
ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and DesignITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and Design
ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and DesignErginBilgin3
 
Definition of Done Canvas.pptx
Definition of Done Canvas.pptxDefinition of Done Canvas.pptx
Definition of Done Canvas.pptxKaizenko
 
voip2day 2016: mediasoup, powerful WebRTC SFU for Node.js
voip2day 2016: mediasoup, powerful WebRTC SFU for Node.jsvoip2day 2016: mediasoup, powerful WebRTC SFU for Node.js
voip2day 2016: mediasoup, powerful WebRTC SFU for Node.jsIñaki Baz Castillo
 
DevOps to DevSecOps Journey..
DevOps to DevSecOps Journey..DevOps to DevSecOps Journey..
DevOps to DevSecOps Journey..Siddharth Joshi
 
Software design as a cooperative game with EventStorming
Software design as a cooperative game with EventStormingSoftware design as a cooperative game with EventStorming
Software design as a cooperative game with EventStormingAlberto Brandolini
 
Agile Dependency Management
Agile Dependency ManagementAgile Dependency Management
Agile Dependency ManagementKmanthei
 
David anderson kanban when is it not appropriate
David anderson   kanban when is it not appropriateDavid anderson   kanban when is it not appropriate
David anderson kanban when is it not appropriateAGILEMinds
 
Kanban Avançado - Além de Visualizações e Limites
Kanban Avançado - Além de Visualizações e LimitesKanban Avançado - Além de Visualizações e Limites
Kanban Avançado - Além de Visualizações e LimitesRodrigo Yoshima
 
Everything as Code
Everything as CodeEverything as Code
Everything as CodeWayne Walls
 

La actualidad más candente (20)

5 Practices for an Agile Mindset
5 Practices for an Agile Mindset5 Practices for an Agile Mindset
5 Practices for an Agile Mindset
 
The Paved Road at Netflix
The Paved Road at NetflixThe Paved Road at Netflix
The Paved Road at Netflix
 
Redesigning everything ITARC Stockholm 2021
Redesigning everything ITARC Stockholm 2021Redesigning everything ITARC Stockholm 2021
Redesigning everything ITARC Stockholm 2021
 
Тестирование требований и документации
Тестирование требований и документацииТестирование требований и документации
Тестирование требований и документации
 
мир без Jsp. thymeleaf 2.0
мир без Jsp. thymeleaf 2.0мир без Jsp. thymeleaf 2.0
мир без Jsp. thymeleaf 2.0
 
The gordian knot
The gordian knotThe gordian knot
The gordian knot
 
ITKonekt 2023: The Busy Platform Engineers Guide to API Gateways
ITKonekt 2023: The Busy Platform Engineers Guide to API GatewaysITKonekt 2023: The Busy Platform Engineers Guide to API Gateways
ITKonekt 2023: The Busy Platform Engineers Guide to API Gateways
 
Event storming Notes
Event storming NotesEvent storming Notes
Event storming Notes
 
Patterns of Kanban Maturity
Patterns of Kanban MaturityPatterns of Kanban Maturity
Patterns of Kanban Maturity
 
Pair programming
Pair programmingPair programming
Pair programming
 
C'est quoi le Software Craftsmanship ?
C'est quoi le Software Craftsmanship ?C'est quoi le Software Craftsmanship ?
C'est quoi le Software Craftsmanship ?
 
ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and Design
ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and DesignITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and Design
ITkonekt 2019 | Robert C. Martin (Uncle Bob), Clean Architecture and Design
 
Definition of Done Canvas.pptx
Definition of Done Canvas.pptxDefinition of Done Canvas.pptx
Definition of Done Canvas.pptx
 
voip2day 2016: mediasoup, powerful WebRTC SFU for Node.js
voip2day 2016: mediasoup, powerful WebRTC SFU for Node.jsvoip2day 2016: mediasoup, powerful WebRTC SFU for Node.js
voip2day 2016: mediasoup, powerful WebRTC SFU for Node.js
 
DevOps to DevSecOps Journey..
DevOps to DevSecOps Journey..DevOps to DevSecOps Journey..
DevOps to DevSecOps Journey..
 
Software design as a cooperative game with EventStorming
Software design as a cooperative game with EventStormingSoftware design as a cooperative game with EventStorming
Software design as a cooperative game with EventStorming
 
Agile Dependency Management
Agile Dependency ManagementAgile Dependency Management
Agile Dependency Management
 
David anderson kanban when is it not appropriate
David anderson   kanban when is it not appropriateDavid anderson   kanban when is it not appropriate
David anderson kanban when is it not appropriate
 
Kanban Avançado - Além de Visualizações e Limites
Kanban Avançado - Além de Visualizações e LimitesKanban Avançado - Além de Visualizações e Limites
Kanban Avançado - Além de Visualizações e Limites
 
Everything as Code
Everything as CodeEverything as Code
Everything as Code
 

Destacado

GameDay: Creating Resiliency Through Destruction - LISA11
GameDay: Creating Resiliency Through Destruction - LISA11GameDay: Creating Resiliency Through Destruction - LISA11
GameDay: Creating Resiliency Through Destruction - LISA11Jesse Robbins
 
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
10+ Deploys Per Day: Dev and Ops Cooperation at FlickrJohn Allspaw
 
Considerations for Alert Design
Considerations for Alert DesignConsiderations for Alert Design
Considerations for Alert DesignJohn Allspaw
 
DevOps: A Culture Transformation, More than Technology
DevOps: A Culture Transformation, More than TechnologyDevOps: A Culture Transformation, More than Technology
DevOps: A Culture Transformation, More than TechnologyCA Technologies
 
Continuous Delivery
Continuous DeliveryContinuous Delivery
Continuous DeliveryJez Humble
 
AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)
AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)
AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)Amazon Web Services
 
DEVOPS - La synthèse
DEVOPS - La synthèseDEVOPS - La synthèse
DEVOPS - La synthèseCOMPETENSIS
 
Tester c'est douter - Linkvalue tech
Tester c'est douter - Linkvalue techTester c'est douter - Linkvalue tech
Tester c'est douter - Linkvalue techMarine Karam
 
Introduction to Continuous Delivery
Introduction to Continuous DeliveryIntroduction to Continuous Delivery
Introduction to Continuous DeliveryKmanthei
 
Strava Insights 2015 révèle le paysage running en France
Strava Insights 2015 révèle le paysage running en FranceStrava Insights 2015 révèle le paysage running en France
Strava Insights 2015 révèle le paysage running en FranceNicolas Raybaud
 
Cas agence Damart
Cas agence DamartCas agence Damart
Cas agence DamartSportlab
 
Lean startup (méthode Running Lean)
Lean startup (méthode Running Lean)Lean startup (méthode Running Lean)
Lean startup (méthode Running Lean)Camille Roux
 
DevOps et tendances Monitoring
DevOps et tendances MonitoringDevOps et tendances Monitoring
DevOps et tendances MonitoringFrançois
 
General Continuous Delivery for Agile Practitioners Meetup May 2014
General Continuous Delivery for Agile Practitioners Meetup May 2014General Continuous Delivery for Agile Practitioners Meetup May 2014
General Continuous Delivery for Agile Practitioners Meetup May 2014Chris Hilton
 
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroDevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroGaurav "GP" Pal
 
906702 Enhancing Business Processes Using Enterprise Information Systems
906702 Enhancing Business Processes Using Enterprise Information Systems906702 Enhancing Business Processes Using Enterprise Information Systems
906702 Enhancing Business Processes Using Enterprise Information Systemssiroros
 
11. Huccet I Imaniye
11. Huccet I  Imaniye11. Huccet I  Imaniye
11. Huccet I ImaniyeAhmet Türkan
 

Destacado (20)

GameDay: Creating Resiliency Through Destruction - LISA11
GameDay: Creating Resiliency Through Destruction - LISA11GameDay: Creating Resiliency Through Destruction - LISA11
GameDay: Creating Resiliency Through Destruction - LISA11
 
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
 
Considerations for Alert Design
Considerations for Alert DesignConsiderations for Alert Design
Considerations for Alert Design
 
DevOps
DevOpsDevOps
DevOps
 
DevOps: A Culture Transformation, More than Technology
DevOps: A Culture Transformation, More than TechnologyDevOps: A Culture Transformation, More than Technology
DevOps: A Culture Transformation, More than Technology
 
Continuous Delivery
Continuous DeliveryContinuous Delivery
Continuous Delivery
 
AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)
AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)
AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)
 
DEVOPS - La synthèse
DEVOPS - La synthèseDEVOPS - La synthèse
DEVOPS - La synthèse
 
Tester c'est douter - Linkvalue tech
Tester c'est douter - Linkvalue techTester c'est douter - Linkvalue tech
Tester c'est douter - Linkvalue tech
 
Introduction to Continuous Delivery
Introduction to Continuous DeliveryIntroduction to Continuous Delivery
Introduction to Continuous Delivery
 
Strava Insights 2015 révèle le paysage running en France
Strava Insights 2015 révèle le paysage running en FranceStrava Insights 2015 révèle le paysage running en France
Strava Insights 2015 révèle le paysage running en France
 
Cas agence Damart
Cas agence DamartCas agence Damart
Cas agence Damart
 
Lean startup (méthode Running Lean)
Lean startup (méthode Running Lean)Lean startup (méthode Running Lean)
Lean startup (méthode Running Lean)
 
Attention getters
Attention gettersAttention getters
Attention getters
 
DevOps et tendances Monitoring
DevOps et tendances MonitoringDevOps et tendances Monitoring
DevOps et tendances Monitoring
 
General Continuous Delivery for Agile Practitioners Meetup May 2014
General Continuous Delivery for Agile Practitioners Meetup May 2014General Continuous Delivery for Agile Practitioners Meetup May 2014
General Continuous Delivery for Agile Practitioners Meetup May 2014
 
Path to continuous delivery
Path to continuous deliveryPath to continuous delivery
Path to continuous delivery
 
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroDevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
 
906702 Enhancing Business Processes Using Enterprise Information Systems
906702 Enhancing Business Processes Using Enterprise Information Systems906702 Enhancing Business Processes Using Enterprise Information Systems
906702 Enhancing Business Processes Using Enterprise Information Systems
 
11. Huccet I Imaniye
11. Huccet I  Imaniye11. Huccet I  Imaniye
11. Huccet I Imaniye
 

Ops Meta-Metrics: The Currency You Pay For Change

  • 1. Ops Meta-Metrics The Currency You Use to Pay For Change John Allspaw VP Operations Etsy.com http://www.flickr.com/photos/wwarby/3296379139
  • 2. Warning Graphs and numbers in this presentation are sort of made up
  • 4. How R U Doing? http://www.flickr.com/photos/a4gpa/190120662/
  • 5. We track bugs already... Example: https://issues.apache.org/jira/browse/TS
  • 6. We should track these, too...
  • 7. We should track these, too... Changes (Who/What/When/Type)
  • 8. We should track these, too... Changes (Who/What/When/Type) Incidents (Type/Severity)
  • 9. We should track these, too... Changes (Who/What/When/Type) Incidents (Type/Severity) Response to Incidents (TTR/TTD)
  • 10. trepidation noun 1 a feeling of fear or agitation about something that may happen : the men set off in fear and trepidation. 2 archaic trembling motion. DERIVATIVES trepidatious adjective ORIGIN late 15th cent.: from Latin trepidatio(n-), from trepidare ‘be agitated, tremble,’ from trepidus ‘alarme
  • 11. Change Required. Often feared. Why? http://www.flickr.com/photos/20408885@N03/3570184759/
  • 12. This is why OMGWTF OUTAGES!!!1!! la de da, everything’s fine change happens
  • 13. Change PTSD? http://www.flickr.com/photos/tzofia/270800047/
  • 16. But wait.... (OMGWTF) la de da, everything’s fine change happens
  • 17. But wait.... (OMGWTF) la de da, everything’s fine; change happens. How much change is this?
  • 18. But wait.... (OMGWTF) la de da, everything’s fine; change happens. How much change is this? What kind of change?
  • 19. But wait.... (OMGWTF) la de da, everything’s fine; change happens. How much change is this? What kind of change? How often does this happen?
  • 20. Need to raise confidence that change != outage
  • 21. ...incidents can be handled well http://www.flickr.com/photos/axiepics/3181170364/
  • 22. ...root causes can be fixed quick enough http://www.flickr.com/photos/ljv/213624799/
  • 23. ...change can be safe enough http://www.flickr.com/photos/marksetchell/43252686/
  • 24. But how? How do we have confidence in anything in our infrastructure? We measure it. And graph it. And alert on it.
  • 25. Tracking Change 1. Type 2. Frequency/Size 3. Results of those changes
  • 26. Types of Change
        Layers           Examples
        App code         PHP/Rails/etc or ‘front-end’ code
        Services code    Apache, MySQL, DB schema, PHP/Ruby versions, etc.
        Infrastructure   OS/Servers, Switches, Routers, Datacenters, etc.
        (you decide what these are for your architecture)
  • 27. Code Deploys: Who/What/When. WHEN, WHO (guy who pushed the button), WHAT (link to diff) (http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/)
  • 28. Code Deploys: Who/What/When Last 2 prod deploys Last 2 Chef changes
  • 29. other changes (insert whatever ticketing/tracking you have)
  • 33. Size
  • 34. Tracking Incidents http://www.flickr.com/photos/47684393@N00/4543311558/
  • 36. Incident Size Big Outage TTR still going
  • 37. Tracking Incidents 1. Frequency 2. Severity 3. Root Cause 4. Time-To-Detect (TTD) 5. Time-To-Resolve (TTR)
  • 38. The How Doesn’t Matter http://www.flickr.com/photos/matsuyuki/2328829160/
  • 39. Incident/Degradation Tracking
        Date     Start Time   Detect Time   Resolve Time   Severity   Root Cause   PostMortem Done?
        1/2/08   12:30 ET     12:32 ET      12:45 ET       Sev1       DB Change    Yes
        3/7/08   18:32 ET     18:40 ET      18:47 ET       Sev2       Capacity     Yes
        5/3/08   17:55 ET     17:55 ET      18:14 ET       Sev3       Hardware     Yes
  • 40. Incident/Degradation Tracking (columns: Date, Start Time, Detect Time, Resolve Time, Severity, Root Cause, PostMortem Done?) These will give you context for your rates of change. (You’ll need them for postmortems, anyway.)
  • 42. Change:Incident Ratio Important.
  • 43. Change:Incident Ratio Important. Not because all changes are equal.
  • 44. Change:Incident Ratio Important. Not because all changes are equal. Not because all incidents are equal, or change-related.
  • 45. Change:Incident Ratio But because humans will irrationally make a permanent connection between the two. http://www.flickr.com/photos/michelepedrolli/449572596/
  • 47. Severity Not all incidents are created equal.
  • 48. Severity Not all incidents are created equal. Something like:
  • 49. Severity Not all incidents are created equal. Something like:
  • 50. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable.
  • 51. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable. SEV2 Significant degradation for subset of users.
  • 52. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable. SEV2 Significant degradation for subset of users. SEV3 Minor impact on user experience.
  • 53. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable. SEV2 Significant degradation for subset of users. SEV3 Minor impact on user experience. SEV4 No impact, but time-sensitive failure.
  • 54. Root Cause? (Not all incidents are change related) Something like: Note: this can be difficult to categorize. http://en.wikipedia.org/wiki/Root_cause_analysis
  • 55. Root Cause? (Not all incidents are change related) Something like: 1. Hardware Failure 2. Datacenter Issue 3. Change: Code Issue 4. Change: Config Issue 5. Capacity/Traffic Issue 6. Other Note: this can be difficult to categorize. http://en.wikipedia.org/wiki/Root_cause_analysis
  • 56. Recording Your Response (worth the hassle) http://www.flickr.com/photos/mattblaze/2695044170/
  • 57. Time
  • 58. Timeline: la de da, everything’s fine
  • 59. Timeline: la de da, everything’s fine → change happens
  • 60. Timeline: la de da, everything’s fine → change happens → Noticed there was a problem
  • 61. Timeline: la de da, everything’s fine → change happens → Noticed there was a problem → Figured out what the cause is
  • 62. Timeline: la de da, everything’s fine → change happens → Noticed there was a problem → Figured out what the cause is → Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  • 63. Timeline (as on slide 62)
  • 64. Timeline (as on slide 62), with: Coordinate troubleshooting/diagnosis
  • 65. Timeline (as on slide 62), with: Coordinate troubleshooting/diagnosis; Communicate to support/community/execs
  • 66. Timeline (as on slide 62)
  • 67. Timeline (as on slide 62), with: Coordinate responses* (* usually, “One Thing At A Time” responses)
  • 68. Timeline (as on slide 62), with: Coordinate responses*; Communicate to support/community/execs (* usually, “One Thing At A Time” responses)
  • 69. Timeline (as on slide 62)
  • 70. Timeline (as on slide 62), with: Confirm stability, resolving steps
  • 71. Timeline (as on slide 62), with: Confirm stability, resolving steps; Communicate to support/community/execs
  • 73. Timeline (as on slide 62), followed by: PostMortem
  • 74. Timeline: la de da, everything’s fine → change happens → Time To Detect (TTD) → Time To Resolve (TTR) → la de da, everything’s fine
  • 77. Is There Any Pattern?
  • 80. Maybe this is too much suck? Maybe you’re changing too much at once? Happening too often?
  • 81. What percentage of incidents are related to change? http://www.flickr.com/photos/78364563@N00/2467989781/
  • 82. What percentage of change-related incidents are “off-hours”? http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
  • 83. What percentage of change-related incidents are “off-hours”? Do they have higher or lower TTR? http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
  • 84. What types of change have the worst success rates? http://www.flickr.com/photos/lwr/2257949828/
  • 85. What types of change have the worst success rates? Which ones have the best success rates? http://www.flickr.com/photos/lwr/2257949828/
  • 86. Does your TTD/TTR increase depending on the: - SIZE? - FREQUENCY? http://www.flickr.com/photos/45409431@N00/2521827947/
  • 87. Side effect is that you’re also tracking successful changes to production as well http://www.flickr.com/photos/wwworks/2313927146
  • 88. Q2 2010
        Type             Successes   Failures   Success Rate   Incident Minutes (Sev1/2)
        App code         420         5          98.81          8
        Config           404         3          99.26          5
        DB Schema        15          1          93.33          10
        DNS              45          0          100            0
        Network (misc)   5           0          100            0
        Network (core)   1           0          100            0
  • 89. Q2 2010 (same table as slide 88, with a “!” callout)
  • 91. Incident Observations Morale Length of Incident/Outage
  • 92. Incident Observations Mistakes Length of Incident/Outage
  • 94. Change Observations Huge changesets deployed rarely Change Size Change Frequency
  • 95. Change Observations Huge changesets (high TTR) deployed rarely Change Size Change Frequency
  • 96. Change Observations Huge changesets (high TTR) deployed rarely Change Size Tiny changesets deployed often Change Frequency
  • 97. Change Observations Huge changesets (high TTR) deployed rarely Change Size Tiny changesets deployed often (low TTR) Change Frequency
  • 98. Specifically.... la de da, everything’s fine; change happens. What if this was only 5 lines of code that were changed? Does that feel safer? (it should)
  • 99. Pay attention to this stuff http://www.flickr.com/photos/plasticbag/2461247090/
  • 100. We’re Hiring Ops! SF & NYC In May: - $22.9M of goods were sold by the community - 1,895,943 new items listed - 239,340 members joined
  • 103. Continuous Deployment Described in 6 graphs (Originally Cal Henderson’s idea)
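
The incident log on slide 39 is enough to derive Time-To-Detect and Time-To-Resolve per incident. What follows is a minimal sketch in Python, not anything shown in the deck: the names `INCIDENTS`, `minutes_between` and `response_times` are hypothetical, and it assumes TTD = detect time minus start time and TTR = resolve time minus detect time (the slides draw the two spans side by side but never define them exactly). The rows are copied from the slide, which the deck itself warns are "sort of made up".

```python
from datetime import datetime

# Rows copied from the slide 39 example log (illustrative numbers only).
# (date, start, detect, resolve, severity, root cause)
INCIDENTS = [
    ("1/2/08", "12:30", "12:32", "12:45", "Sev1", "DB Change"),
    ("3/7/08", "18:32", "18:40", "18:47", "Sev2", "Capacity"),
    ("5/3/08", "17:55", "17:55", "18:14", "Sev3", "Hardware"),
]


def minutes_between(date, earlier, later):
    """Whole minutes between two clock times on the same date."""
    fmt = "%m/%d/%y %H:%M"
    t0 = datetime.strptime(f"{date} {earlier}", fmt)
    t1 = datetime.strptime(f"{date} {later}", fmt)
    return int((t1 - t0).total_seconds() // 60)


def response_times(incidents):
    """Attach TTD and TTR (in minutes) to each incident row.

    Assumption: TTD runs from start to detection, and TTR runs from
    detection to resolution.
    """
    results = []
    for date, start, detect, resolve, severity, cause in incidents:
        results.append({
            "date": date,
            "severity": severity,
            "root_cause": cause,
            "ttd_min": minutes_between(date, start, detect),
            "ttr_min": minutes_between(date, detect, resolve),
        })
    return results


if __name__ == "__main__":
    for row in response_times(INCIDENTS):
        print(row["date"], row["severity"], row["root_cause"],
              "TTD", row["ttd_min"], "TTR", row["ttr_min"])
```

A spreadsheet does the same job; the point is only that the start, detect and resolve times have to be written down for any of this to be computable.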

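The success rates in the Q2 2010 table on slide 88 can be recomputed from the two count columns. The sketch below is an illustration, not the author's tooling: `Q2_2010`, `success_rate` and `worst_first` are hypothetical names. One assumption is inferred from the arithmetic rather than stated in the slides: the published rates (98.81, 99.26, 93.33) match when the "Successes" column is read as the total number of changes, so the rate is computed as (changes - failures) / changes.

```python
# (change type, count from the "Successes" column, failures), from slide 88.
# Assumption: the count column is treated as total changes, because that
# reading reproduces the success rates printed on the slide.
Q2_2010 = [
    ("App code", 420, 5),
    ("Config", 404, 3),
    ("DB Schema", 15, 1),
    ("DNS", 45, 0),
    ("Network (misc)", 5, 0),
    ("Network (core)", 1, 0),
]


def success_rate(changes, failures):
    """Percentage of changes that did not result in an incident."""
    if changes <= 0:
        raise ValueError("need at least one change to compute a rate")
    return round(100.0 * (changes - failures) / changes, 2)


def worst_first(rows):
    """Change types ordered from lowest to highest success rate."""
    rated = [(name, success_rate(changes, failures))
             for name, changes, failures in rows]
    return sorted(rated, key=lambda item: item[1])


if __name__ == "__main__":
    for name, rate in worst_first(Q2_2010):
        print(f"{name}: {rate}")
```

Sorting worst-first answers the slide 84 question ("What types of change have the worst success rates?") directly; with very small counts, such as a single core network change, the rate says little on its own.
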
Editor's notes

  1. This is about metrics about YOU! Metrics *about* the metrics-makers!
  2. They are basically taken from both Flickr and Etsy.
  3. HOW MANY: write public-facing app code? maintain the release tools? release process? respond to incidents? have had an outage or notable degradation this month? that was change-related?
  4. Too fast? Too loose? Too many issues? Too many upset and stressed out humans?
  5. Everyone is used to bug tracking, it’s something worthwhile....
  6. If this is a feeling you have often, please read on.
  7. All you need is to see this happen once, and it’s hard to get out of your memory. No wonder some people can start to think “code deploy = outage”.
  8. Mild version of “Critical Incident Stress Management”? Change = risk, and sometimes risk = outage. And outages are stressful.
  9. Not supposed to feel like this.
  10. Details about the change play a huge role in your ability to respond to change-related incidents.
  13. We do this by tracking our responses to outages and incidents.
  14. We can do this by tracking our change, and learning from the results.
  15. We need to raise confidence that we’re moving as fast as we can while still being safe enough to do so. And we can adjust the change to meet our requirements...
  16. Why should change and results of changes be any different?
  17. Type = code, schema, infrastructure, etc. Frequency/Size = how often each type is changed (implies risk). Results = how often each change results in an incident/degradation.
  18. Lots of different types here. Might be different for everyone. Not all types of change bring the same amount of risk.
  19. This info should be considered mandatory. This should also be done for db schema changes, network changes, changes in any part of the stack, really.
  20. The header of our metrics tools has these statistics, too.
  21. The tricky part: getting all prod changes written down without too much hassle.
  22. Here’s one type of change....
  23. Here’s another type of change....
  24. Here’s yet another type of change...
  25. Size does turn out to be important. Size = lines of code, level of SPOF risk, etc.
  26. This seems like something you should do. Also: “incidents” = outages or degradations.
  27. Just an example. This looks like it’s going well! Getting better!
  28. Maybe I can’t say that it’s getting better, actually....
  29. Some folks have Techcrunch as their incident log keeper. You could just use a spreadsheet.
  30. An example!
  31. You *are* doing postmortems on incidents that happen, right? Doing them comes at a certain point in your evolution.
  32. Without the statistics, even a rare but severe outage can give the impression that change == outage.
  33. Just examples. It’s important to categorize these things so you can count the ones that matter for the user’s experience. #4 Loss of redundancy
  40. Just examples. This normally comes from a postmortem meeting. Good pointers on Root Cause Analysis are Eric Ries’ material on Five Whys and the Wikipedia page for RCA.
  41. http://www.flickr.com/photos/mattblaze/2695044170/
  42. What happens in our response to a change-related incident is just as important as the occurrence of the incident.
  47. Th
  48. Th
  49. This might also be known as a ‘diagnose’ point.
  51. These events usually spawn other events.
  53. This should be standard operating procedure at this point.
  54. These events usually spawn other events.
  55. Some folks might notice a “Time To Diagnose” missing here. ALSO: it’s usually more complex than this, but this is the gist of it.
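A minimal sketch of turning incident timestamps into response metrics. It assumes TTD means time to detect (incident start to detection) and TTR means time to resolve (detection to resolution); if you define TTR from the start of the incident instead, adjust accordingly. The `response_times` function and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Tuple


def response_times(started: datetime, detected: datetime,
                   resolved: datetime) -> Tuple[timedelta, timedelta]:
    """Return (TTD, TTR) for one incident.

    Assumed definitions: TTD = detected - started, TTR = resolved - detected.
    """
    if not (started <= detected <= resolved):
        raise ValueError("expected started <= detected <= resolved")
    return detected - started, resolved - detected


# Made-up incident timeline.
ttd, ttr = response_times(
    started=datetime(2010, 6, 1, 14, 0),
    detected=datetime(2010, 6, 1, 14, 7),
    resolved=datetime(2010, 6, 1, 14, 45),
)
print(f"TTD: {ttd}, TTR: {ttr}")
```

As the note says, real incidents are usually messier than three timestamps, but even this much gives you numbers to trend over time.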
  56. Do incidents increase with size of change? With frequency? With frequency/size of different types?
  57. If you don’t track: Change, Incidents, and Responses, you’ll never have answers for these questions.
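Once changes and incidents are both tracked, questions like these become simple counting. The sketch below assumes each change has an ID and a type, and that each incident is attributed (say, in a postmortem) to the ID of the change that caused it. The function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def incident_rate_by_type(changes: Iterable[Tuple[int, str]],
                          incident_change_ids: Iterable[int]) -> Dict[str, float]:
    """Fraction of changes of each type that resulted in an incident.

    changes: (change_id, change_type) pairs.
    incident_change_ids: IDs of changes blamed for an incident.
    """
    changes = list(changes)
    type_of = dict(changes)
    totals = Counter(ctype for _, ctype in changes)
    # Count each incident-causing change once, even if it caused several incidents.
    bad = Counter(type_of[cid] for cid in set(incident_change_ids) if cid in type_of)
    return {ctype: bad[ctype] / total for ctype, total in totals.items()}


# Made-up data: five changes, two of which caused incidents.
changes = [(1, "code"), (2, "code"), (3, "schema"), (4, "code"), (5, "infrastructure")]
rates = incident_rate_by_type(changes, incident_change_ids=[2, 3])
print(rates)
```

The same grouping works for size buckets or deploy frequency instead of type, provided those fields were recorded with each change.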
  58. Reasonable questions.
  59. *YOU* get to decide what is “small” and “frequent”.
  60. THIS is what can help give you confidence. Or not.
  61. The longer an outage lasts, the bigger a bummer it is for everyone working on fixing it.
  62. The longer an outage lasts, the more mistakes people make. And as the night gets longer: red herrings...
  63. put two points on this graph
  67. It should, because it is.
  68. How we feel about change and how it can (or not) cause outages is important. Some of the nastiest relationships emerge between dev and ops because of these things.
  69. “Normal” = lots of change done at regular intervals, change = big, time = long.
  70. 2 weeks? 5000 lines?
  71. Scary Monster of Change! Each incident-causing deploy has only one recourse: roll it all back. Even code that was ok and unrelated to the incident. Boo!
  72. Silly Monster of Nothing to Be Afraid Of Because His Teeth Are Small.
  73. Problem? Roll that little piece back. Or better yet, roll it forward!
  74. This looks like an adorable monster. Like a Maurice Sendak monster.