Incident Management in the Age of DevOps and SRE

Damon Edwards, co-founder of Rundeck, presents at Salt Lake City DevOps Meetup, November 13, 2019.

There is no doubt that DevOps has changed how we deliver software. But what about after deployment? Whether you are in a traditional operations organization or a “you build it, you run it” team, how do you mobilize, resolve, and learn from incidents? This talk looks at how high-performing organizations have applied DevOps and SRE practices to shorten incidents and reduce escalations. Less frustration for the engineers. Lower costs for the business. Everybody wins.

See a demo of Rundeck Enterprise:
https://www.rundeck.com/see-demo

--or--

Download Rundeck Open Source here:
https://rundeck.com/open-source

Connect:
Stack Overflow community: https://stackoverflow.com/questions/tagged/rundeck
GitHub: https://github.com/rundeck/rundeck/issues
Twitter: https://twitter.com/Rundeck
Facebook: https://www.facebook.com/RundeckInc/
LinkedIn: https://www.linkedin.com/company/rundeck-inc

Published in: Technology

Incident Management in the Age of DevOps and SRE

  1. 1. Damon Edwards Incident Management in the Age of DevOps and SRE Salt Lake City DevOps Nov 13, 2019
  2. 2. Assertion: The ability to respond to and resolve incidents is the true indicator of an organization’s operational capabilities
  3. 3. Assertion 2: Everybody now works in “Operations”
  4. 4. What Is an Incident? An unplanned disruption impacting customers or business operations
  5. 5. What Is an Incident? An unplanned disruption impacting customers or business operations Outages Service Degradation
  6. 6. What Is an Incident? An unplanned disruption impacting customers or business operations Outages Service Degradation Work interruption Delay/Waiting “Short-Notice” Requests
  7. 7. Board
  8. 8. Integrated Board
  9. 9. Integrated Responsive Board
  10. 10. Integrated Responsive Everywhere Board
  11. 11. Integrated Responsive Everywhere Always Board
  12. 12. Integrated Responsive Everywhere Always Board Tech Org Execution
  13. 13. Integrated Responsive Everywhere Always Board Tech Org Execution
  14. 14. Kubernetes AWS GCP Azure Docker Consul Terraform Istio Zipkin Envoy Serverless OpenShift Kafka Lambda Prometheus Containerd Helm Cloud Foundry Linkerd Etcd CoreDNS MongoDB Redis InfluxDB Jaeger gRPC CRI-O Cognito Fargate Cloud Functions Cosmos BigQuery Spark Rook Ceph NGINX HAProxy Open vSwitch NSX Sensu Vault Aurora Nomad
  15. 15. Kubernetes AWS GCP Azure Docker Consul Terraform Istio Zipkin Envoy Serverless OpenShift Kafka Lambda Prometheus Containerd Helm Cloud Foundry Linkerd Etcd CoreDNS MongoDB Redis InfluxDB Jaeger gRPC CRI-O Cognito Fargate Cloud Functions Cosmos BigQuery Spark Rook Ceph NGINX HAProxy Open vSwitch NSX Sensu Vault Aurora Nomad
  16. 16. Kubernetes AWS GCP Azure Docker Consul Terraform Istio Zipkin Envoy Serverless OpenShift Kafka Lambda Prometheus Containerd Helm Cloud Foundry Linkerd Etcd CoreDNS MongoDB Redis InfluxDB Jaeger gRPC CRI-O Cognito Fargate Cloud Functions Cosmos BigQuery Spark Rook Ceph NGINX HAProxy Open vSwitch NSX Sensu Vault Aurora Nomad
  17. 17. Kubernetes AWS GCP Azure Docker Consul Terraform Istio Zipkin Envoy Serverless OpenShift Kafka Lambda Prometheus Containerd Helm Cloud Foundry Linkerd Etcd CoreDNS MongoDB Redis InfluxDB Jaeger gRPC CRI-O Cognito Fargate Cloud Functions Cosmos BigQuery Spark Rook Ceph NGINX HAProxy Open vSwitch NSX Sensu Vault Aurora Nomad SAIL/cornell.edu
  18. 18. Adrian Cockcroft Developer Developer Developer Developer Developer Old Release Still Running Release Plan Release Plan Release Plan Release Plan Deploy Feature to Production Deploy Feature to Production Deploy Feature to Production Deploy Feature to Production Bugs Deploy Feature to Production Immutable microservice deployment scales, is faster with large teams and diverse platform components DockerCon EU 2014 Architecture enables speed. Speed is the advantage.
  19. 19. The Three Ways (2013)
  20. 20. The Three Ways (2013) The Five Ideals (2019)
  21. 21. DEV
  22. 22. Go! Go! Go! DEV
  23. 23. Go! Go! Go! DEV …OPS?
  24. 24. 0000 Go! Go! Go! DEV …OPS?
  25. 25. 0000 Go! Go! Go! DEV …OPS? Operations: The Last Mile
  26. 26. 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload Principles of SRE
  27. 27. 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload Principles of SRE
  28. 28. 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload Principles of SRE
  29. 29. DevOps + SRE: Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native
  30. 30. DevOps + SRE: Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native
  31. 31. Dev Ops Cross-Functional Team Cross-Functional Team DevOps + SRE: Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native
  32. 32. Dev Ops Cross-Functional Team Cross-Functional Team DevOps + SRE: Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native “Value-Aligned” and Self-Regulating Shared Responsibility Model
  33. 33. Traditional ITSM
  34. 34. Traditional ITSM ITIL 1989 - ?
  35. 35. Traditional ITSM ITIL 1989 - ?
  36. 36. Traditional ITSM Unintentionally Encourages Silos ITIL 1989 - ?
  37. 37. Traditional ITSM Unintentionally Encourages Silos ITIL 1989 - ?
  38. 38. Traditional ITSM Unintentionally Encourages Silos Encourages command & control management ITIL 1989 - ?
  39. 39. Traditional ITSM Unintentionally Encourages Silos Encourages command & control management ITIL 1989 - ?
  40. 40. Old Way New Way
  41. 41. Old Way New Way
  42. 42. +
  43. 43. REDeploy.io
  44. 44. There is no root cause. (That’s just a political distinction) REDeploy.io
  45. 45. Why? Why? Why? Why? Why? There is no root cause. (That’s just a political distinction) REDeploy.io
  46. 46. Why? Why? Why? Why? Why? There is no root cause. (That’s just a political distinction) Right, Wrong, Safety II, and You. REDeploy.io
  47. 47. Why? Why? Why? Why? Why? There is no root cause. (That’s just a political distinction) Right, Wrong, Safety II, and You. Incidents = unplanned investments REDeploy.io
  48. 48. You Not
  49. 49. 18 Million IT Ops 22.3 Million Developers
  50. 50. Col. John Boyd OODA Loop
  51. 51. Monitoring Spotting the knowns
  52. 52. Monitoring Spotting the knowns Observability Interrogating the unknowns
  53. 53. Observability Interrogating the unknowns
  54. 54. Observability Interrogating the unknowns Logging: The event
  55. 55. Observability Interrogating the unknowns Logging: The event Metrics: Data points over time
  56. 56. Observability Interrogating the unknowns Logging: The event Metrics: Data points over time Tracing: Events in context of a single request
  57. 57. Observability Interrogating the unknowns Logging: The event Metrics: Data points over time Tracing: Events in context of a single request
  58. 58. Automated Governance Objective automated attestation of GRC controls
  59. 59. Automated Governance Objective automated attestation of GRC controls
  60. 60. Automated Governance Objective automated attestation of GRC controls
  61. 61. Monitoring Observability Governance Everyone Everyone Everyone Everyone
  62. 62. Incident Command Mobilization, Coordination, Communication
  63. 63. Incident Command Mobilization, Coordination, Communication Incident Command System (FEMA)
  64. 64. Incident Command Mobilization, Coordination, Communication Incident Command System (FEMA)
  65. 65. Incident Command Mobilization, Coordination, Communication Incident Command System (FEMA)
  66. 66. Incident Command Mobilization, Coordination, Communication Incident Command System (FEMA)
  67. 67. Incident Command Mobilization, Coordination, Communication Incident Command System (FEMA) GitHub: PagerDuty/incident-response-docs
  68. 68. Ops = Platform Eng + SRE Divide and conquer
  69. 69. Ops = Platform Eng + SRE Divide and conquer
  70. 70. Ops Platform Eng + SRE Divide and conquer SRE: Expert Operators (distributed) Platform Eng: Build and Operate Platform Services (centralized)
  71. 71. Ops Platform Eng + SRE Divide and conquer SRE: Expert Operators (distributed) Platform Eng: Build and Operate Platform Services (centralized)
  72. 72. Ops Platform Eng + SRE Divide and conquer SRE: Expert Operators (distributed) Platform Eng: Build and Operate Platform Services (centralized)
  73. 73. New Views on Escalations Avoid… but swarm if you do Support at the edge Swarm
  74. 74. Diagnose: Health checks, exploratory actions Take Action! Restore: Restart, repair actions, rollback
  75. 75. The Return of Runbooks A while ago Not that long ago Now
  76. 76. The Return of Runbooks A while ago Not that long ago Now Runbooks (Mostly Manual)
  77. 77. The Return of Runbooks A while ago Not that long ago Now Runbooks (Mostly Manual) …
  78. 78. The Return of Runbooks A while ago Not that long ago Now Runbooks (Mostly Manual) Runbooks (Automate!…How?)… Thanks SRE!
  79. 79. Runbook Automation Safe self-service access to the expert knowledge you need to take action.
  80. 80. Runbook Automation Safe self-service access to the expert knowledge you need to take action.
  81. 81. Runbook Automation Safe self-service access to the expert knowledge you need to take action.
  82. 82. Runbook Automation Safe self-service access to the expert knowledge you need to take action. Moving the bits is the easy part!
  83. 83. Runbook Automation Safe self-service access to the expert knowledge you need to take action.
  84. 84. Empower those closest to the action! Runbook Automation Safe self-service access to the expert knowledge you need to take action.
  85. 85. Runbook Automation Safe self-service access to the expert knowledge you need to take action.
  86. 86. De-risk! Runbook Automation Safe self-service access to the expert knowledge you need to take action.
  87. 87. Before Runbook Automation…
  88. 88. Before Runbook Automation… 3 options:
  89. 89. 1. Decipher the wiki Before Runbook Automation… 3 options:
  90. 90. 1. Decipher the wiki 2. Ad-hoc tool/script usage Before Runbook Automation… 3 options:
  91. 91. 1. Decipher the wiki 2. Ad-hoc tool/script usage 3. ESCALATE! Before Runbook Automation… 3 options:
  92. 92. …with Runbook Automation
  93. 93. Shorter Incidents. Fewer Escalations. Before RBA
  94. 94. Shorter Incidents. Fewer Escalations. Before RBA
  95. 95. With RBA Shorter Incidents. Fewer Escalations.
  96. 96. With RBA Shorter Incidents. Fewer Escalations.
  97. 97. Before RBA Shorter Incidents. Fewer Escalations.
  98. 98. With RBA Shorter Incidents. Fewer Escalations.
  99. 99. Solve Difficult Security & Compliance Problems Before RBA
  100. 100. Solve Difficult Security & Compliance Problems With RBA
  101. 101. Everything Through a SDLC Promote
  102. 102. Runbooks as a Service
  103. 103. Incidents = unplanned investments …the ROI is up to you.
  104. 104. Recap! Elevate the Human.
  105. 105. @damonedwards damon@rundeck.com Let’s talk… Special thanks to
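
Slides 26 and 29 name "Service Level Objectives, with consequences" and "Error Budgets" without showing the arithmetic. The sketch below is one way that calculation might look; it is not taken from the talk. The function names, the 30-day window, and the "pause feature releases once the budget is spent" consequence are all assumptions chosen for illustration.

```python
# Hypothetical illustration of an error budget derived from an availability SLO.
# Names, the 30-day window, and the release-freeze policy are assumptions,
# not something the slides specify.

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


def release_allowed(slo_target: float, downtime_minutes: float,
                    window_days: int = 30) -> bool:
    """Assumed consequence: feature releases pause once the budget is spent."""
    return budget_remaining(slo_target, downtime_minutes, window_days) > 0.0


if __name__ == "__main__":
    # A 99.9% SLO over 30 days tolerates about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    print(release_allowed(0.999, downtime_minutes=50.0))
```

Under these assumptions, every minute of incident time is drawn from a fixed budget, which is one way to read the talk's closing line that incidents are "unplanned investments".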
