SysAdmin to SRE: Solving the Last Mile Problem

Presented by Damon Edwards, co-founder of Rundeck, at DevOps Days Dallas on August 20, 2019.

Some DevOps transformations flourish, but others are stalling. Why is that? This talk will make the case that Operations is the most predictable differentiator.

So much of the energy in DevOps has been about activities that start in Dev and move towards Ops: continuous delivery, deployment pipelines, automated testing, and of course, the unofficial mantra of “deploy, deploy, deploy.” However, post-deployment, too many DevOps transformations maintain the status quo and leave questionable Operations practices in place.

Now along comes a new vision for Operations called SRE (a.k.a. Site Reliability Engineering)… But SRE seems almost too good to be true!

SREs cover much of what systems administrators used to do, but get to spend most of their time doing engineering work that adds enduring value to their company? How is it that SREs don’t get caught up in the interruptions, repetitive work, and drudgery that consume so much of our time? And how do companies use SRE to do so much more with the same or less headcount?

This talk will take a close look at what SRE is, what SRE isn’t, and how SRE avoids the pitfalls that have plagued traditional Ops work. Finally, we’ll break down the principles behind the SRE movement and highlight how early examples are proving that DevOps + SRE = the end-to-end speed and quality promised since the early days of DevOps.

See a Demo of Rundeck Enterprise:
https://www.rundeck.com/see-demo

--or--

Download Rundeck Open Source here:
https://rundeck.com/open-source

Connect:
Stack Overflow community: https://stackoverflow.com/questions/tagged/rundeck
GitHub: https://github.com/rundeck/rundeck/issues
Twitter: https://twitter.com/Rundeck
Facebook: https://www.facebook.com/RundeckInc/
LinkedIn: www.linkedin.com/company/rundeck-inc

Published in: Technology

SysAdmin to SRE: Solving the Last Mile Problem

  1. SysAdmin to SRE: Solving the Last Mile Problem. Damon Edwards, @damonedwards
  2. Operations: The Last Mile
  3. Operations: The Last Mile. Silos, Queues, Excessive Toil, Low Trust
  4. Operations: The Last Mile. https://www.youtube.com/watch?v=1zUtBLZ4Lus Silos, Queues, Excessive Toil, Low Trust
  5. SRE (Site Reliability Engineering)
  6. “SRE… When you ask software engineers to do operations” “SRE… Next-generation, cloud-native Operations” “Class SRE implements DevOps” “SRE… When Ops does more engineering than Ops”
  7. “SRE… When you ask software engineers to do operations” “SRE… Next-generation, cloud-native Operations” “Class SRE implements DevOps” “SRE… When Ops does more engineering than Ops” SRE
  8. Why SRE? Simon Sinek: Start with “why?”
  9. Story time…
  10. It was just another Thursday…
  11. Thursday 10:00am PDT (1200 Agents). Call Center Agent, Call Center Agent, Customers: “My browser times out!” “Wow, this is so slow!” “I can’t login” “What a c#@p service!” “I can’t login” “Barely works” “It’s broken”
  12. Call Center Agent, Technical Support, Service Desk. Many tickets, many calls. Customers, VIP Customers: “Stuff isn’t working”
  13. Call Center Agent, Technical Support, Service Desk. Many tickets, many calls. “Stuff isn’t working” “…but monitoring is all green” (Service Desk: OK, OK, OK, OK, OK). Ops, Ops
  14. “…but monitoring is all green” (OK, OK, OK, OK, OK). 3:30pm: Call Center Agent, Customer: “Now it works” “Now it works”. Service Desk: “?” Ops, Ops
  15. The next day…
  16. Friday 9:00am PDT. Call Center Agent, Call Center Agent, Customers: “My browser times out!” “Wow, this is so slow!” “I can’t login” “Are you kidding me?” “How hard is it to run a website?” “Soo Sloooow” “It’s broken”
  17. Call Center Agent, Technical Support, Service Desk. Many tickets, many calls. Customers, VIP Customers: “Stuff isn’t working” “…but monitoring is all green” (Service Desk: OK, OK, OK, OK, OK)
  18. Service Desk: “Escalate!” Incident Commander, Ticket: “Launch the incident bridge”. Bridge Call: Ops, Incident Commander, Ops, Dev, Sec, Ops. Ops: “Not me…” “Not me…” “Not me…” “Not me…” “No code updates” “Probably not the new server hardening process or the network changes…” Headcount: 40
  19. Dev: “No code updates” “Probably not the new server hardening process or the network changes…” Ops, Ops, Ops: “Uhh.. WHAT new server hardening process and network changes?” Sec: “We were going to fail audit… you didn’t get the email?”
  20. Bridge Call: Dev: “No code updates”. War Room: SysAdmin, Platform, Network, Security, Storage, SysEng (each: “Try this”, then Test). Incident Commander: “Theory: new security updates”. 3:30pm: Call Center Agent, Customer: “Now it works” “Now it works”. Call Center Manager: “What is going on?” Headcount: 30
  21. Over the weekend: Ops, Ops, Sec, Ops, Ops, Ops, QA. Rollback: OS changes, network changes. Headcount: 10
  22. Monday morning…
  23. Monday 10:00am PDT. Call Center Agent, Call Center Agent, Customers: “… so frustrating” “Not again…” “I can’t login” “Are you kidding me?” “How hard is it to run a website?” “Soo Sloooow” “It’s broken”
  24. Call Center Agent, Technical Support, Service Desk. Many tickets, many calls. Customers, VIP Customers: “Stuff isn’t working” “…but monitoring is all green” (Service Desk: OK, OK, OK, OK, OK)
  25. “…but monitoring is all green” (Service Desk: OK, OK, OK, OK, OK). Customer Systems Lead Dev (Scrum): ding! “Ignore.” Incident Commander: “Hey did you see that ticket?”
  26. Customer Systems Lead Dev (Scrum): “Ignore.” Incident Commander: “Hey did you see that ticket?” Customer Systems Lead Dev: “sigh. I’ll take a look”
  27. Customer Systems Lead Dev: “sigh. I’ll take a look” Customer Systems Lead Dev: “Something is wrong with the database connection… But our code didn’t change.” DBA: “No recent database updates.”
  28. Bridge Call: Dev: “No code updates”. War Room: DBA, SysAdmin, Network, Security, SysEng (each: “Try this”, then Test). Incident Commander: “New Theory: It’s the database connection”. Call Center Manager: “What is going on?” DBA: “No recent database updates.” Headcount: 20
  29. Bridge Call: Dev: “No code updates”. War Room: DBA, SysAdmin, Network, Security, SysEng (each: “Try this”, then Test). Incident Commander: “New Theory: It’s the database connection”. 4:00pm: Call Center Agent, Customer: “Now it works” “Now it works”. Call Center Manager: “What is going on?” Headcount: 20
  30. The next day…
  31. Tuesday 10:00am PDT. Call Center Agent, Call Center Agent, Customers: “… so frustrating” “Not again…” “I can’t login” “Are you kidding me?” “How hard is it to run a website?” “Soo Sloooow” “It’s broken”. Bridge Call: Dev: “No code updates”. War Room: DBA, DBA, SysAdmin, SysEng, SysEng (each: “Try this”, then Test). Incident Commander: “New Theory: problem with stored procedures… but not sure what”. Incident Commander: “DB Vendor phone support isn’t cutting it.” Call Center Manager: “What is going on?” Call Center Director: “What is being done?”
  32. Dev: “No code updates”. War Room: Test, Test, Test, Test, Test. Incident Commander, Incident Commander, Vendor Management: “DB Vendor phone support isn’t cutting it.” “We only paid for bronze support”. Call Center Manager: “What is going on?” Call Center Director: “What is being done?” Approval Request: “Need to upgrade support”. Finance: “??”
  33. The next day…
  34. Wednesday 10:00am PDT. Call Center Agent, Call Center Agent, Customers: “… so frustrating” “Not again…” “I can’t login” “Are you kidding me?” “How hard is it to run a website?” “Soo Sloooow” “It’s broken”. Bridge Call: Dev: “No code updates”. War Room: “Let’s see what the vendor consultant says”. Vendor Consultant: “OK, let me take a look.” Call Center Manager: “What is going on?” Call Center Director: “What is being done?” Headcount: 15
  35. 3:00pm. Dev: “No code updates”. War Room. Call Center Manager: “What is going on?” Call Center Director: “What is being done?” “So?” Vendor Consultant: “Someone toggled on the new performance analysis feature”. DBA. Headcount: 15
  36. “So?” Vendor Consultant: “It’s been choking on a particular stored procedure you use everywhere… This stored procedure has almost 400 parameters. It’s 1 million lines of code” “but… it’s been working for years!” ? ? ? DBA, Dev
  37-38. “but… it’s been working for years!” ? ? ? 1:00am: Ops, SysEng, QA, Ops, QA, DBA, Dev: change config, load test. Headcount: 10
  39. Post mortem…
  40-45. Dev vs Ops. Vendor Consultant. Dir Finance: “No budget”. GM, Line of Business: “Stay on schedule”. “You should really fix that…” Ops: “It’s not fixed. It’s just turned off.” VP Ops: “I’m told bug #8543 is P1, but was rejected?” Ops: “Refactor it before it bites us again.” VP Dev: “It’s not a bug. You already have a fix.” Dev: “No time.” Dev: “Their change broke it.” Dev wins. Dev wins.
  46. Recap: the full incident timeline from slides 11-38 on a single slide (Thursday 10:00am PDT through Wednesday).
  47. Recap, plus costs. Response labor: $270,000. Lost call center productivity: $620,000. Total: $890,000
  48. Recap, plus costs. Response labor: $270,000. Lost call center productivity: $620,000. Total: $890,000 (+ project delays)
  49. Recap, plus costs. Response labor: $270,000. Lost call center productivity: $620,000. Total: $890,000 (+ project delays) (+ brand damage)
  50. Recap, plus costs. Response labor: $270,000. Lost call center productivity: $620,000. Total: $890,000 (+ project delays) (+ brand damage) > $1,000,000
  51. 51. How did they end up here?
  52. 52. Corporate Plan Annual Budget Project Plan Requirements
  53. 53. Corporate Plan Annual Budget Project Plan Requirements
  54. 54. Corporate Plan Annual Budget Project Plan Requirements
  55. 55. Corporate Plan Annual Budget Project Plan Requirements Context Context Process Process Tooling Tooling Capacity Capacity
  56. 56. What were they thinking?
  57. 57. 26 ITIL Processes Service Validation & Testing Strategy Management for IT Services Supplier Management The 7 Step Improvement Transition Planning & Support Access Management Availability Management Business Relationship Management Capacity Management Change Management Change Evaluation Demand Management Design Coordination Event Management Financial Management for IT Services Incident Management Information Security Management IT Service Continuity Management Knowledge Management Process Problem Management Process Release & Deployment Management Request Fulfillment Process Service Asset & Configuration Management Service Catalog Management Service Level Management Service Portfolio Management ITIL Processes The same as everyone else.
  66. 66. 26 ITIL Processes (the list from slide 57). Encourages silos: Context, Process, Tooling, Capacity. Command and Control Management. Deming: “3. Cease dependence on inspection to achieve quality.” Charity Majors: “Distributed systems have an infinite list of almost impossible failure scenarios”
  67. 67. Is there a different way?
  68. 68. DevOps vs ITIL? “The Rise of a New IT Operations Support Model” (Cameron Haight, March 18, 2011): By 2015, DevOps will evolve from a niche strategy employed by large cloud providers into a mainstream strategy employed by 20% of Global 2000 organizations.
  Why DevOps will emerge:
  • DevOps is not usually driven from the top down and, thus, may be more easily accepted by IT operations teams.
  • ITIL and other best practices frameworks are acknowledged to have not delivered on their goals, enabling IT organizations to look for new models.
  • The growing interest in tools such as Chef, Puppet, etc., will help stimulate demand for OSS-based management.
  Why DevOps will not emerge:
  • Cultural changes are the hardest to implement, and DevOps requires a significant rethinking of IT operations conventional wisdom.
  • There is a large body of work with respect to ITIL and other best practices frameworks that is already accepted within the industry.
  • Open source (OSS) management tools, which are more aligned with this approach, have not seen significant enterprise market share traction.
  71. 71. DevOps…: Product, Not Project; Continuous Delivery; Shift Left (and more!). …then comes SRE: Error Budgets; Toil Limits; Cloud Native (and more!).
  77. 77. Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native: “Value-Aligned” and Self-Regulating. Dev and Ops: Cross-Functional Team. Shared Responsibility Model. “DevOps is a deconstructive movement” (Jon Hall)
  80. 80. Immutable microservice deployment scales, is faster with large teams and diverse platform components (Adrian Cockcroft, DockerCon EU 2014, https://www.youtube.com/watch?v=nMTaS07i3jk). Each developer has a release plan and deploys a feature to production; the old release is still running; bugs. Architecture enables speed. Speed is the advantage. Keeps the people out of their own way!
  81. 81. What is the innovation of SRE?
  84. 84. Principles are what makes SRE different: 1. SRE needs Service Level Objectives, with consequences. (Stephen Thorne, Google, “Principles of SRE,” at DevOps Enterprise Summit London 2018, https://youtu.be/c-w_GYvi0eA)
  88. 88. SLO and Error Budgets: Tools for Shared Responsibility. On a scale of 0 to 100: Service Level Indicator, Service Level Objective, Error Budget* (*Use this to improve the service). Dev, Biz, Ops: SLO takes priority!!
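One way to make the SLO and error budget relationship concrete is the small sketch below. It assumes a request-based SLI (good requests divided by total requests) expressed as a fraction rather than on the slide's 0 to 100 scale. The function names, the 99.9% target, and the "freeze when the budget is spent" rule are illustrative assumptions, not something the talk specifies.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the measurement window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    return 1.0 - bad_events / error_budget(slo_target, total_events)


def slo_takes_priority(remaining: float) -> bool:
    """Assumed policy: once the budget is spent, reliability work comes first."""
    return remaining <= 0.0


if __name__ == "__main__":
    remaining = budget_remaining(slo_target=0.999, total_events=1_000_000, bad_events=250)
    print(f"budget remaining: {remaining:.0%}")
    print("prioritize reliability work:", slo_takes_priority(remaining))
```

The point of the sketch is that Dev, Biz, and Ops can all read the same number: while budget remains it can be spent on change, and once it is gone the SLO takes priority.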
  90. 90. Principles of SRE are what set SRE apart: 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today.
  92. 92. Toil: A Name For a Problem We’ve All Felt. “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” (Vivek Rau, Google)
  93. 93. Toil vs. Engineering Work (toil on the left, engineering work on the right):
  • Lacks Enduring Value vs. Builds Enduring Value
  • Rote, Repetitive vs. Creative, Iterative
  • Tactical vs. Strategic
  • Increases With Scale vs. Enables Scaling
  • Can Be Automated vs. Requires Human Creativity
  96. 96. Excessive Toil Prevents Fixing the System. Toil at a manageable percentage of capacity: engineering work is left over to reduce toil and improve the business. Toil at an unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve the business. Downward spiral is inevitable!
  99. 99. Toil is a Naturally Occurring Force. General Evolution of Automation (Niall Murphy, Microsoft Azure): 1. No automation 2. Externally maintained system-specific automation 3. Externally maintained generic automation 4. Internally maintained system-specific automation 5. Systems that don’t need any automation. Toil appears at every point from Launch (ToDos & Unknowns) to Mature.
  101. 101. Principles of SRE are what set SRE apart: 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. 3. SRE teams have the ability to regulate their workload.
  106. 106. SRE teams have the ability to regulate their workload. SRE can say no. Example: What if handing off responsibility to SRE/Ops wasn’t a right? (Separate “running in production” from “run by SRE/Ops.”)
  108. 108. What's the Difference Between DevOps and SRE? (class SRE implements DevOps) @sethvargo @lizthegrey
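The "class SRE implements DevOps" slogan can be read literally as an interface and one concrete implementation of it. The sketch below is only an illustration of that reading: the method names are hypothetical shorthand, and the returned strings reuse practices named elsewhere in this deck (SLOs, error budgets, toil limits, saying no).

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': outcomes DevOps asks for without prescribing how."""

    @abstractmethod
    def share_responsibility(self) -> str:
        """How Dev, Biz, and Ops agree on what matters."""

    @abstractmethod
    def keep_improving(self) -> str:
        """How the team protects time for lasting improvements."""


class SRE(DevOps):
    """One concrete implementation, using practices named in this deck."""

    def share_responsibility(self) -> str:
        return "Service Level Objectives and error budgets, with consequences"

    def keep_improving(self) -> str:
        return "toil limits, plus the ability to regulate workload (SRE can say no)"


if __name__ == "__main__":
    team = SRE()
    print(team.share_responsibility())
    print(team.keep_improving())
```

Other implementations of the same interface are possible; the slogan only says that SRE is one of them.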
  114. 114. Where to start (the practical approach): 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. 3. SRE teams have the ability to regulate their workload. Two of these call for company-wide culture change (hard!). The practical place to start: Reduce toil. Everybody wins!
  117. 117. Why focus on reducing toil? 1. Lots of value independent of “SRE” 2. Your people are your most expensive assets… stay out of their way!
  119. 119. Start reducing toil today: 1. Track toil levels for each team
  124. 124. Track toil levels for each team
  • Standardize (e.g. meetings and email are “overhead,” not “toil”)
  • Track: self-reporting, periodic surveys, SM or PM interview/sampling
  • Don’t get lost in the time-tracking weeds!
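A minimal sketch of the tracking step, assuming self-reported hours tagged with one of three standardized categories. The category names, the `WorkEntry` record, and the choice to count overhead in the denominator are all assumptions made for illustration; the talk does not prescribe a formula.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Standardized categories: meetings and email count as overhead, not toil.
CATEGORIES = {"toil", "engineering", "overhead"}


@dataclass(frozen=True)
class WorkEntry:
    team: str
    category: str
    hours: float


def toil_fraction_by_team(entries: Iterable[WorkEntry]) -> Dict[str, float]:
    """Return each team's toil hours as a fraction of all reported hours."""
    toil: Dict[str, float] = defaultdict(float)
    total: Dict[str, float] = defaultdict(float)
    for entry in entries:
        if entry.category not in CATEGORIES:
            raise ValueError(f"unknown category: {entry.category!r}")
        total[entry.team] += entry.hours
        if entry.category == "toil":
            toil[entry.team] += entry.hours
    return {team: (toil[team] / hours if hours else 0.0) for team, hours in total.items()}


if __name__ == "__main__":
    survey = [
        WorkEntry("ops", "toil", 30),
        WorkEntry("ops", "engineering", 10),
        WorkEntry("ops", "overhead", 10),
        WorkEntry("platform", "toil", 10),
        WorkEntry("platform", "engineering", 30),
    ]
    for team, fraction in toil_fraction_by_team(survey).items():
        print(f"{team}: {fraction:.0%} toil")
```

Coarse weekly or survey-level numbers are enough for this; the goal is a trend per team, not precise time accounting.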
  128. 128. Start reducing toil today: 1. Track toil levels for each team 2. Set toil limit for each team (50% is conventional wisdom) 3. Fund efforts to reduce toil (with emphasis on teams already over limit). Example process: “Code Yellow,” Michael Kehoe and Todd Palino (LinkedIn), at SREcon Americas 2019.
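Steps 2 and 3 can be sketched as a simple check against the limit. The sketch assumes toil has already been measured as a fraction of each team's capacity; the function name and the sample numbers are hypothetical, and the 50% default is the conventional figure quoted on the slide.

```python
from typing import Dict, List, Tuple


def teams_over_limit(
    toil_fractions: Dict[str, float], limit: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (team, toil fraction) pairs above the limit, worst first."""
    over = [(team, frac) for team, frac in toil_fractions.items() if frac > limit]
    return sorted(over, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    measured = {"ops": 0.6, "platform": 0.25, "dba": 0.8}
    for team, frac in teams_over_limit(measured):
        print(f"fund toil reduction for {team}: {frac:.0%} toil")
```

Sorting worst first reflects the slide's emphasis on teams that are already over the limit.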
  133. 133. Where to focus to reduce toil? Reduce Technical Debt. Re-Engineer Processes. Enable Self-Service.
  136. 136. Self-Service (runbooks): “Do X.” Eliminate Interruptions. Eliminate Waiting. … and a lot less toil
  137. 137. Empower teams to spot and fix the anti-patterns.
  138. 138. “Fix this for me, fix it again, then fix it again.” Before: “I need you to do X,” then a ticket, a wait, and an interruption to your other work; later… you do X (“Done.”), and the cycle repeats for every request. After: the requester does X through Self-Service (“Done.”) each time, and your other work continues (x2, x3).
  141. 141. “I could fix it, but I can’t get to it.” Before: “I could fix it if I could get to it” (the environment is out of reach: wait, interrupt). After: Self-Service access to the environment: “I’ve got this!”
  142. 142. “The dog-pile.” Before: “I think it’s a problem with db07-store2.uswest.acme,” and everyone runs “$ top” on db07-store2.uswest.acme at the same time (!!). After: one Self-Service job, “healthcheck store2 -all,” runs against db07-store2.uswest.acme.
  145. 145. “I’m an expert, I don’t read the wiki.” Before: the docs say “Service has changed. Use this flag or bad things will happen!” and “Pause monitoring first or we all get woken up!” Later, in the environment: “I’ve done this before. I’ve got this…” and the expert runs “restart -doit -now”. After: the same warnings go into an updated restart job (Update Restart Job ✅). Later: “I’ve done this before. I’ve got this.” The expert runs the Self-Service “restart” job.
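The "after" picture can be sketched as a job that carries the wiki's warnings inside it, so the expert who skips the docs still gets the safe sequence. Everything here is hypothetical: the command strings are placeholders rather than real CLI commands, the "resume monitoring" step is an assumption the slide does not mention, and `run` stands in for whatever executor a runbook tool would provide.

```python
from typing import Callable, List


def make_restart_job(run: Callable[[str], None]) -> Callable[[], None]:
    """Build a restart job that always wraps the restart in the safety steps."""

    def restart_job() -> None:
        # From the docs: pause monitoring first or everyone gets woken up.
        run("pause-monitoring")
        try:
            # From the docs: the service changed and now needs this flag.
            run("restart --required-flag")
        finally:
            # Assumed step: turn monitoring back on even if the restart fails.
            run("resume-monitoring")

    return restart_job


if __name__ == "__main__":
    executed: List[str] = []
    job = make_restart_job(executed.append)  # record steps instead of running them
    job()
    print(executed)
```

When the service changes again, the job is what gets updated, and everyone who runs "restart" picks up the change without rereading the wiki.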
  146. 146. “Known issue… doesn’t get permanent fix”
  148. 148. Recap: Make Tomorrow Better Than Today
  • SRE is a new way to think about Ops work: 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  • Beware: impact of traditional management structures
  • Be practical and start focusing on toil
  • Find and fix toil anti-patterns
  • Empower with Self-Service Runbooks
  • Use DevOps and SRE to improve speed and quality
  149. 149. Let’s talk… @damonedwards damon@rundeck.com
