The Theory and Practice, Practice, Practice of AWS Operations - AWS Summit Sydney

At AWS scale, with millions of customers and trillions of actions happening every day, we have learned some hard-earned lessons about how to operate reliable and secure services. In this session we'll cover repeatable patterns, tips, and stories from operating some of AWS's largest services.

  1. AWS Summit Sydney
  2. The theory and practice, practice, practice of AWS Operations. Colm MacCárthaigh, Senior Principal Engineer, Amazon Web Services. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  3. Do you carry a pager?
  4. Industry-wide recovery rate
  6. What is operations?
  7. What is operations? Operations is the doing of things, over and over, better and better. How do we form good habits, and prevent bad habits?
  9. What is operations? Great operations is built on humility. We build robust systems and designs, but anticipate that there will be something that we didn’t think of. Healthy paranoia helps too.
  10. What is operations? Every operational action is reviewed. A “two-person rule” applies to anything that is not routine. We constantly strive to automate routine actions away.
  11. What are we going to cover? What is different at scale? What is operational risk? Compartmentalisation. Deployment safety. The operational mindset. Staying SAFE when things go wrong.
  13. What is different at scale? Something is “broken” all of the time. The stakes are higher. The number of people involved is larger.
  14. What is different at scale? There are greater opportunities to perfect automation and operational practices. There is more experience to learn from. The number of people involved is larger.
  16. How we think about operational risk: Operational risk is really related to change. Key types of change: component failure, environmental failure, increases in load, new code paths being exercised, new modes of operation.
  23. Compartmentalisation
  29. Compartmentalisation: ap-southeast-2, us-east-2, eu-west-1, …
  30. Compartmentalisation: ap-southeast-2a, ap-southeast-2b, ap-southeast-2c; us-east-2a, us-east-2b, us-east-2c; eu-west-1a, eu-west-1b, eu-west-1c; …
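
The Region and Availability Zone names on the compartmentalisation slides suggest a two-level hierarchy: each zone name is its Region name plus a trailing letter. The following is a minimal sketch of that idea, not AWS tooling. The helper names (`region_of`, `group_by_region`, `unaffected`) are hypothetical, and the trailing-letter rule is assumed to hold only for zone names of the form shown on the slides.

```python
from collections import defaultdict

# Zone names as they appear on the slides (the slides' trailing "…" is omitted).
ZONES = [
    "ap-southeast-2a", "ap-southeast-2b", "ap-southeast-2c",
    "us-east-2a", "us-east-2b", "us-east-2c",
    "eu-west-1a", "eu-west-1b", "eu-west-1c",
]


def region_of(zone):
    """Drop the trailing zone letter: 'eu-west-1b' -> 'eu-west-1'."""
    return zone[:-1]


def group_by_region(zones):
    """Map each Region name to the list of its zones, in input order."""
    regions = defaultdict(list)
    for zone in zones:
        regions[region_of(zone)].append(zone)
    return dict(regions)


def unaffected(zones, failed):
    """Zones outside the failed compartment.

    `failed` may be a single zone name or a whole Region name.
    """
    return [z for z in zones if z != failed and region_of(z) != failed]
```

In this toy model, losing one zone leaves eight of the nine compartments untouched, and losing a whole Region leaves six. That is the sense in which compartmentalisation limits how far an issue can spread.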
  35. Deployment Safety
  36. Deployment Safety: New code means risk, so we are incredibly paranoid about deploying it. CI/CD staged deployment process. Promotion testing and monitoring at every stage, with automated rollback. Fast and reliable rollback.
  37. Deployment Safety: (1) code review, (2) check-in, (3) pre-production, (4) one box, (5) one Availability Zone, (6) one Region, (7) onwards …
  38. Deployment Safety: Reliable and fast rollback is key. Random selection is useful to avoid repeat issues. Services have to be designed with phased deployment and rollback in mind. What if we’re making backwards-incompatible changes?
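
The staged rollout on the Deployment Safety slides (pre-production, one box, one Availability Zone, one Region, onwards) can be sketched as a list of waves with a health check after each one. This is an illustrative simulation under stated assumptions, not AWS's deployment system. `plan_waves`, `deploy`, the `healthy` callback and the `box-1` host label are all hypothetical. "Random selection" is read here as picking the first zone and Region at random, which is only one possible interpretation of the slide.

```python
import random

# Hypothetical fleet layout, using the zone names from the slides.
REGIONS = {
    "ap-southeast-2": ["ap-southeast-2a", "ap-southeast-2b", "ap-southeast-2c"],
    "us-east-2": ["us-east-2a", "us-east-2b", "us-east-2c"],
    "eu-west-1": ["eu-west-1a", "eu-west-1b", "eu-west-1c"],
}


def plan_waves(regions, rng):
    """Return ordered (stage, targets) waves; the first zone is chosen at random."""
    first_region = rng.choice(sorted(regions))
    first_zone = rng.choice(sorted(regions[first_region]))
    waves = [
        ("pre-production", ["pre-prod"]),
        ("one-box", [first_zone + "/box-1"]),
        ("one-az", [first_zone]),
        # Remaining zones of the first Region complete the "one Region" stage.
        ("one-region", [z for z in sorted(regions[first_region]) if z != first_zone]),
    ]
    for region in sorted(regions):
        if region != first_region:
            waves.append(("onwards", sorted(regions[region])))
    return waves


def deploy(waves, healthy):
    """Promote wave by wave; on a failed check, roll back in reverse order."""
    deployed = []
    for stage, targets in waves:
        deployed.extend(targets)
        if not all(healthy(t) for t in targets):
            return {
                "status": "rolled_back",
                "failed_stage": stage,
                "rolled_back": list(reversed(deployed)),
            }
    return {"status": "complete", "deployed": deployed}
```

Calling `deploy(plan_waves(REGIONS, random.Random()), healthy)` with a check that fails at the one-box stage stops the rollout before any full zone is touched, which is the point of ordering the stages from smallest to largest.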
  41. The operational mindset: Services are not simply code in an editor. Services are live running systems that respond to input. Services change over time even if you don’t change the code. There is no “done”.
  42. The operational mindset: Every week at AWS we have an “All Hands” operations meeting. We dig into any COEs (Corrections of Error) from the previous week. We choose some services at random and dive into their operational metrics. We look at operational sustainability too.
  43. The operational mindset
  44. The operational mindset: Every team has a corresponding weekly operations meeting to review the previous week’s metrics. We pay close attention to any alarms that fired, and any tickets that were cut. The engineers who were on call give an on-call report.
  46. Staying SAFE when things go wrong: Every team has an active on-call engineer. On-call engineers are automatically engaged for most issues. CloudWatch Alarms -> tracking ticket -> page.
  47. Staying SAFE when things go wrong: For larger issues, or when there is elevated risk, we use voice conference calls to coordinate. Every call has a designated “Call Leader” and a “facilitator” from our technical operations team. Call leaders are experienced and tenured AWS staff and are empowered to make decisions.
  48. When doesn’t apply
  49. The Theory and Practice, Practice, Practice of AWS Operations: Stay calm. Assess the situation. Focus on mitigation. Escalate early and often.
  50. Stay calm: Big events can be overwhelming, but cooler heads prevail. Keep a sense of urgency, but be methodical and avoid a frenetic energy. Most of all, don’t panic! Remember that we have a 100% success rate of mitigating operational events, so take a quick breath if you need to; we will solve each and every challenge. Try to avoid team huddles that are not on a conference call with an experienced designated leader, as they split focus and detract from resolution. Call leaders can help you prioritise and engage others. Use the ticket and Chime or IRC for communication too.
  51. Assess the situation: If you join a call, first read the associated ticket or SIM, especially the summary. If the call has a large number of participants, please don’t interrupt it to check in; use the ticket/SIM instead. Assess your own dashboards, alarms and alerts, and be prepared to report the summary for your own area, service or team. Do this continuously for the duration of the call. Every call participant should be active. If you can, take notes for yourself as you go: the names of other people on the call, key pieces of technology or information you might not be familiar with, and so on.
  52. Focus on mitigation, not root cause analysis: Most issues can be remediated much more quickly with rollbacks, data center flips, database failover, throttling and other operational actions that essentially quench the source of the problem rather than truly “fixing” it in a deeper sense (i.e. patching the code). At Amazon we are happy to roll back speculatively, on the basis that it might fix the problem. More often than not, it is not necessary to truly understand the problem, so defer root cause analysis until you have either exhausted traditional rollback/flipping steps for your service, or until you can do remediation actions and root cause analysis in parallel.
  53. Escalate early and often: The absolute minimum knowledge for every on-call engineer is to know how to escalate: be able to page your secondary, your manager, and any other escalation channels appropriate for your service. It is always OK to escalate! If you yourself are a secondary or an escalation person, be grateful and supportive when you are paged. If an LSE is impacting your service, and if recovery steps are likely, escalate early and get help! Paging early has been shown to save tens of minutes of impact time during typical events.
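
The chain described on the SAFE slides (alarm, then tracking ticket, then page, then escalation to a secondary and a manager) can be modelled in a few lines. This is a hedged sketch only: it does not call CloudWatch or any real paging system. The `Ticket` class, the `handle_alarm` function, the `acknowledges` callback and the three-step chain are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical escalation order, following the slide's "secondary, manager" wording.
ESCALATION_CHAIN = ["primary on-call", "secondary on-call", "manager"]


@dataclass
class Ticket:
    """Tracking ticket opened for an alarm; records who was paged."""
    alarm: str
    pages: list = field(default_factory=list)
    acknowledged_by: Optional[str] = None


def handle_alarm(alarm, acknowledges, chain=ESCALATION_CHAIN):
    """Open a ticket, then page down the chain until someone acknowledges.

    `acknowledges(responder)` stands in for waiting on a page and returns
    True if that responder picked it up in time.
    """
    ticket = Ticket(alarm=alarm)
    for responder in chain:
        ticket.pages.append(responder)
        if acknowledges(responder):
            ticket.acknowledged_by = responder
            break
    return ticket
```

If the primary does not pick up, the ticket shows two pages and the secondary as the acknowledger. If nobody responds, `acknowledged_by` stays `None` and the full chain has been paged, which a real system would treat as a signal to widen the escalation further.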
  54. Takeaways • At AWS, teams own their business, development, and operations • We think of operational risk in terms of change • Compartmentalisation limits the blast radius of issues • Steady deployment safety wrangles risks associated with software and configuration changes • AWS company culture values operational excellence • Staying SAFE when things go wrong
  55. Thank you! Colm MacCárthaigh, colmmacc@amazon.com
