Incident Management in the Age of DevOps and SRE

Damon Edwards, co-founder of Rundeck, presents to the Northern Virginia Linux Users Group on February 8, 2020.

Contains parts of
"Operations: The Last Mile":
https://www.youtube.com/watch?v=1zUtBLZ4Lus

"Incident Management in the Age of DevOps and SRE":
https://www.infoq.com/presentations/incident-management-devops-sre/

Published in: Technology

  1. 1. Incident Management in the Age of DevOps & SRE Damon Edwards @damonedwards
  2. 2. Community Ops Improvement DevOps in Enterprise Ops Tools
  3. 3. 1. Operations: The Last Mile 2. Start with Incident Management
  5. 5. Developers have had an unfair advantage.
  6. 6. Ops Ah-ha! Dev Ka-ching!
  7. 7. Ops Ah-ha! Dev Ka-ching! Agile 2001
  8. 8. Ops Ah-ha! Dev Ka-ching! Agile 2001 ITIL 1989
  9. 9. 2020. Ops. Business Idea, Dev, Ops, Running Services. Digital and DevOps: Shorter Time-to-Market, Fast Feedback from Users, Improved Quality. Availability, Auditing, Security, Compliance. “Go faster!” “Open up!” “Lock it down!”
  10. 10. Story time….
  11. 11. CHANGE: Digital, Agile, DevOps, SRE, Cloud, Docker, Kubernetes, Microservices. “Wow.” “That is cool.” “I wish I could work there.”
  12. 12. But nobody was talking about what happened after deployment…
  13. 13. It was just another Tuesday…
  14. 14. 9:30am. NOC: “Hasn’t there been some intermittent errors this week?” “Yes, but this looks different.” “v3?!” Biz Manager: “Escalate!” 10:00am. NOC (Bob): Open Incident Ticket. Ticket Context Wagon: NOC (Bob), Biz Manager.
  15. 15. NOC (Bob): Open Incident Ticket. Bridge Call: App-specific SREs (“Try this.” “Try that.”), SysAdmin with Prod Access (Steve), Biz Manager (“Fixed? Fixed?”). Ticket Context Wagon: NOC (Bob), Biz Manager, SysAdmin (Steve), 7 x SRE.
  16. 16. 12:00pm. Bridge Call. SRE: “It’s a problem with the Foo service.” Biz Manager: “Can you fix it?” Foo SRE: “No.” NOC (Bob): Update Ticket, + add Foo Lead Dev. Ticket Context Wagon: NOC (Bob), Biz Manager, Foo SRE.
  17. 17. Scrum. Foo Lead Dev (Karen): ding! Ignore. App Manager: “Hey, did you see that ticket?” Foo Lead Dev (Karen): “Sigh. I’ll take a look.” Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE.
  18. 18. Foo Lead Dev (Karen): “I’m going to need more log files.” Update Ticket, + add SysAdmin Team. Chat: “Can someone with access to Foo Service in Prod01 help me with ticket #42516?” SysAdmin (Lee), Ticket: “logs attached”. Foo Lead Dev (Karen), Ticket: “no the other ones”. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee).
  19. 19. 2:00pm. Foo Lead Dev (Karen), Logs: “Who restarted these services? (and why?)” “They didn’t use the correct environment variables!” “This entire service pool needs to be restarted!” Update Ticket. NOC (Bob): Update Ticket, + add Middleware Team: “Middleware, please urgent restart this entire app pool with the correct environment variable”.
  20. 20. 2:30pm. Middleware Manager (Melissa) to NOC (Bob): “No way. It’s the middle of the day! You need business approval.” NOC (Bob): Update Ticket, + add SVP for Line of Business. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager.
  21. 21. 5:00pm. Update Ticket, + add SVP for Line of Business. SVP (Susan), Chief of Staff, Tech VP, Tech VP: “Customer impact?” Update Ticket: “Restart approved”. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP.
  22. 22. 5:00pm. Middleware Manager (Melissa): “Who knows these production services the best?” “Ellen!” Ellen to Europe office. Middleware (Scott): SharePoint .doc, trial and error. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott).
  23. 23. 6:00pm. Middleware (Scott): SharePoint .doc, trial and error. Bar Service: 10 min, waiting for Acme Service. Acme startup failed. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott).
  24. 24. Middleware (Scott): “Come on... no, no, no. What? Why?”
  27. 27. 6:45pm. Middleware (Scott): Update Ticket, + add Bar SRE: “Bar app startup timed out. Error says can’t connect to Acme service. I looked at Acme but it seems to be running. Is this error message correct? Why can’t Bar connect?” Bar SRE (Linda). “The new environment pre-flight check is preventing startup. Looks like Bar’s connection to Acme is being blocked.” Middleware (Scott): Update Ticket, + add Network SRE Team: “URGENT: Network connection issue between Bar and Acme”. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda).
  28. 28. 6:45pm. Bar SRE (Linda), Middleware (Scott): Update Ticket, + add Network SRE Team, Bar Lead Dev: “URGENT: Network connection issue between Bar and Acme”. “The new environment pre-flight check is preventing startup. Looks like Bar’s connection to Acme is being blocked.” Bar Lead Dev (Liu): “I can comment out the test… But the CD pipeline only goes to QA ENV!” Business Managers: “Customers are calling. What is going on?” Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda).
  29. 29. Network Dir (Carlos), Middleware (Scott), Middleware Manager (Melissa). “Carlos, I need a favor. Can you escalate?” “Customers are calling. What is going on?” “Last week...” Net SRE, VP, VP: “Priority! Different incident!” Net SRE, Net SRE, Net SRE. “It’s the network!” Business Managers: “Your network is broken!” “We are already working on it!” Network VPs.
  30. 30. 8:00pm. Network SRE (Hari): “The firewall is blocking the traffic. You’ll have to take it up with the Firewall Team.” Open Firewall Ticket, + add Firewall Team: “URGENT: Firewall is blocking connection between Bar and Acme”. Paging on-call… Open bridge… Firewall Engineer (Freddie): “Can’t be the firewall, it hasn’t changed since last Thursday.” Middleware (Scott): “No, it’s the firewall.” Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda), Network PM (Carlos), Network SRE (Bob).
  31. 31. 8:00pm. Firewall Engineer (Freddie): “Can’t be the firewall, it hasn’t changed since last Thursday.” Middleware (Scott): “No, it’s the firewall.” “There was a rule change last Thursday that would stop Bar from talking to Acme.” “Can you change it back?” “Sure, we make changes on Thursday…” Chief of Staff: “SVP and VPs are livid… this was supposed to be a safe change!!” “Freddie, we’ve got customers calling.” Firewall Engineer (Freddie): Update Firewall Ticket.
  32. 32. 9:00pm. Firewall Engineer (Freddie): Update Firewall Ticket, + add NetSec: “ESCALATE: Emergency production firewall rule change review”. Paging on-call… NetSec (Nicole): “This is production so I’ll have to get others on the Network CAB…” Chief of Staff, Firewall (Freddie), Middleware (Scott), Middleware Manager, VP, VP, Bar Lead Dev. “Customer outage!” “… I’ll call SVP Susan”.
  33. 33. 9:30pm. Chief of Staff, Firewall (Freddie), Middleware (Scott), Middleware Manager, VP, VP, Bar Lead Dev. “Customer outage!” “… I’ll call SVP Susan”. NetSec (Nicole): Update Firewall Ticket: “APPROVE: Emergency firewall rule change”. Firewall (Freddie), Net L2 (Bob), Middleware (Scott): Firewall change, Restart Bar. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda), Network PM (Carlos), Network SRE (Bob), Firewall (Freddie), NetSec (Nicole).
  34. 34. 9:45pm. Firewall (Freddie), Net L2 (Bob), Middleware (Scott): Firewall change, Restart Bar. “I think we are good!” Middleware Manager, VP, VP, Bar Lead Dev: “You ‘think?’” Policy!! Middleware (Scott): Update Ticket, + add Customer Engagement Manager: “Ready for API tests”.
  35. 35. 11:00pm. “Ready for API tests”. Customer Engagement Manager (Varsha), NOC (Bob). Customer Engagement Manager (Varsha): Update Ticket: “APIs OK”.
  36. 36. 11:30pm. Update Ticket: “APIs OK”. Middleware (Scott): Update Ticket: “Services restarted OK”. NOC: “Lights are green… I guess it is fixed.” NOC (Bob): Close Ticket. Zzz. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda), Network PM (Carlos), Network SRE (Bob), Firewall (Freddie), NetSec (Nicole), Cust. Engmt. (Varsha).
  38. 38. NOC: “Lights are green… I guess it is fixed.” NOC (Bob): Close Ticket. Zzz. Next Day. SVP (Susan) to VPs and Directors: “Whose fault is this?! Why are we so bad at change? What additional processes and approvals are you adding to never let this happen again?!”
  39. 39. Later…
  40. 40. Executive Team: “We’ve invested in Cloud, Agile, DevOps, Containers… Why does everything still take too long and cost too much?” Our transformation has largely ignored Ops.
  41. 41. Traditionally we’ve chased the symptoms…
  42. 42. …by following the conventional wisdom:
  43. 43. “We need better tools” …by following the conventional wisdom:
  44. 44. “We need better tools” “We need more people” …by following the conventional wisdom:
  45. 45. “We need better tools” “We need more people” “We need more discipline and attention to detail” …by following the conventional wisdom:
  46. 46. “We need better tools” “We need more people” “We need more discipline and attention to detail” “We need more change reviews/approvals” …by following the conventional wisdom:
  48. 48. Challenge the conventional wisdom about operations work
  49. 49. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust
  50. 50. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust. Operations: The Last Mile https://www.youtube.com/watch?v=1zUtBLZ4Lus
  51. 51. 1. Operations: The Last Mile 2. Start with Incident Management
  52. 52. Why focus on incident management?
  53. 53. Why focus on incident management? Detecting, diagnosing, and resolving incidents is the true measure of an organization’s operational capabilities.
  54. 54. Why now?
  55. 55. Complexity
  56. 56. Complexity Consequence
  57. 57. Complexity Consequence Speed
  58. 58. Full presentation: https://www.infoq.com/presentations/incident-management-devops-sre/
  59. 59. What Is an Incident? An unplanned disruption impacting customers or business operations
  60. 60. What Is an Incident? An unplanned disruption impacting customers or business operations: Outages, Service Degradation
  61. 61. What Is an Incident? An unplanned disruption impacting customers or business operations: Outages, Service Degradation, Work interruption, Delay/Waiting, “Short-Notice” Requests
  62. 62. Col. John Boyd OODA Loop
  63. 63. Monitoring: Spotting the knowns
  64. 64. Monitoring: Spotting the knowns. Observability: Interrogating the unknowns
  65. 65. Observability: Interrogating the unknowns
  66. 66. Observability: Interrogating the unknowns. Logging: The event
  67. 67. Observability: Interrogating the unknowns. Logging: The event. Metrics: Data points over time
  68. 68. Observability: Interrogating the unknowns. Logging: The event. Metrics: Data points over time. Tracing: Events in context of a single request
  70. 70. Automated Governance Objective automated attestation of GRC controls
  73. 73. Monitoring Observability Governance Everyone Everyone Everyone Everyone
  74. 74. Incident Command Mobilization, Coordination, Communication
  75. 75. Incident Command Mobilization, Coordination, Communication Incident Command System (FEMA)
  79. 79. Incident Command Mobilization, Coordination, Communication Incident Command System (FEMA) GitHub: PagerDuty/incident-response-docs
  80. 80. Stage 1: Dev + Ops
  82. 82. Stage 2: Ops Platform Eng + SRE. Divide and conquer. SRE: Expert Operators (distributed). Platform Eng: Build and Operate Platform Services (centralized).
  85. 85. New Views on Escalations: Avoid… but swarm if you do. Support at the edge. Swarm.
  86. 86. Take Action! Diagnose: Health checks, exploratory actions. Restore: Restart, repair actions, rollback.
  87. 87. The Return of Runbooks. A while ago. Not that long ago. Now.
  88. 88. The Return of Runbooks. A while ago: Runbooks (Mostly Manual). Not that long ago. Now.
  89. 89. The Return of Runbooks. A while ago: Runbooks (Mostly Manual). Not that long ago: … Now.
  90. 90. The Return of Runbooks. A while ago: Runbooks (Mostly Manual). Not that long ago: … Now: Runbooks (Automate!… How?) Thanks SRE!
  91. 91. Runbook Automation: Safe self-service access to the expert knowledge you need to take action.
  94. 94. Runbook Automation Safe self-service access to the expert knowledge you need to take action. Moving the bits is the easy part!
  96. 96. Empower those closest to the action! Runbook Automation Safe self-service access to the expert knowledge you need to take action.
  98. 98. De-risk! Runbook Automation Safe self-service access to the expert knowledge you need to take action.
  99. 99. Before Runbook Automation…
  100. 100. Before Runbook Automation… 3 options:
  101. 101. Before Runbook Automation… 3 options: 1. Decipher the wiki
  102. 102. Before Runbook Automation… 3 options: 1. Decipher the wiki 2. Ad-hoc tool/script usage
  103. 103. Before Runbook Automation… 3 options: 1. Decipher the wiki 2. Ad-hoc tool/script usage 3. ESCALATE!
  104. 104. …with Runbook Automation
  105. 105. Shorter Incidents. Fewer Escalations. Before RBA
  107. 107. With RBA Shorter Incidents. Fewer Escalations.
  109. 109. Before RBA Shorter Incidents. Fewer Escalations.
  110. 110. With RBA Shorter Incidents. Fewer Escalations.
  111. 111. Solve Difficult Security & Compliance Problems With RBA
  112. 112. Everything Through a SDLC Promote
  113. 113. Runbooks as a Service
  114. 114. Incidents = unplanned investments …the ROI is up to you.
  115. 115. damon@rundeck.com Let’s talk… @damonedwards
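
The "Take Action!" slide (86) splits incident work into Diagnose (health checks, exploratory actions) and Restore (restart, repair actions, rollback), and the Runbook Automation slides describe wrapping that expert knowledge in safe self-service. The following is a minimal Python sketch of that idea, not Rundeck's API or anything shown in the talk: the `Step` and `Runbook` classes, the simulated `state` dictionary, and the `bar-service` name are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One named runbook action; `action` returns True on success."""
    name: str
    action: Callable[[], bool]


@dataclass
class Runbook:
    """Hypothetical runbook: diagnose first, then try restore steps in order."""
    name: str
    diagnose: List[Step]  # health checks, exploratory actions
    restore: List[Step]   # restart, repair actions, rollback
    log: List[str] = field(default_factory=list)  # audit trail of what ran

    def _run(self, step: Step) -> bool:
        ok = step.action()
        self.log.append(f"{step.name}: {'ok' if ok else 'failed'}")
        return ok

    def healthy(self) -> bool:
        # Run every check (no short-circuit) so the log shows the full picture.
        results = [self._run(step) for step in self.diagnose]
        return all(results)

    def execute(self) -> bool:
        """Return True once all checks pass; False means escalate (or swarm)."""
        if self.healthy():
            return True
        for step in self.restore:
            if self._run(step) and self.healthy():
                return True
        return False


# Simulated service state standing in for a real system (an assumption).
state = {"healthy": False}


def check_health() -> bool:
    return state["healthy"]


def restart() -> bool:
    # The restart itself succeeds but does not fix the underlying problem.
    return True


def rollback() -> bool:
    # Rolling back the bad change restores the service.
    state["healthy"] = True
    return True


runbook = Runbook(
    name="bar-service",
    diagnose=[Step("health check", check_health)],
    restore=[Step("restart", restart), Step("rollback", rollback)],
)
resolved = runbook.execute()
for entry in runbook.log:
    print(entry)
```

Recording every step in `log` is a deliberate choice: it gives whoever picks up the incident the context that, in the story above, had to be reconstructed from ticket updates and bridge calls.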
