© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Where to inject latency?
Hypothesis:
Function has appropriate timeout on ...
Where to inject latency?
Should be applied to th...
Where to inject latency?
Be mindful of the blast...
Diagram: http client, public-api-a, http client, public-api-...
Hypothesis:
All functions have appropriate timeo...
Where to inject latency?
Large blast radius, can...
Priming (psychology):
Priming is a technique whe...
Use failure injection to program your colleagues...
Where to inject latency?
Make X% of all requests...
Hypothesis:
The client app has appropriate timeo...
Where to inject latency?
STEP 4. Disprove hypothesis
In other words, look...
How to inject latency?
Static weavers (such as P...
https://theburningmonk.com/2015/04/design-for-la...
How to inject latency?
Manually crafted wrapper ...
Configured in SSM Parameter Store
No injected latency
With injected latency
Factory wrapper function (think bluebird’s promisifyAll function)
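The factory wrapper idea could look roughly like the sketch below: a proxy that wraps every public method of a client and delays a share of the calls. This is a minimal illustration, not the code shown in the talk. The `LatencyInjected` class, the config keys (`enabled`, `probability`, `min_delay_ms`, `max_delay_ms`) and the `get_config` callable are all hypothetical; in a real function `get_config` would presumably read the settings from SSM Parameter Store, which is stubbed out here with a plain dict.

```python
import random
import time
from typing import Any, Callable, Dict


class LatencyInjected:
    """Proxy that delays calls to the wrapped client's public methods."""

    def __init__(
        self,
        client: Any,
        get_config: Callable[[], Dict[str, Any]],  # stand-in for an SSM lookup
        sleep: Callable[[float], None] = time.sleep,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self._client = client
        self._get_config = get_config
        self._sleep = sleep
        self._rng = rng

    def __getattr__(self, name: str) -> Any:
        # Only reached for attributes the proxy itself does not define.
        attr = getattr(self._client, name)
        if name.startswith("_") or not callable(attr):
            return attr

        def wrapped(*args: Any, **kwargs: Any) -> Any:
            config = self._get_config()
            if config.get("enabled") and self._rng() < config.get("probability", 0.0):
                delay_ms = random.uniform(config["min_delay_ms"], config["max_delay_ms"])
                self._sleep(delay_ms / 1000)
            return attr(*args, **kwargs)

        return wrapped


class FakeHttpClient:
    """Hypothetical client used only to exercise the sketch."""

    def get(self, url: str) -> str:
        return f"200 OK from {url}"


# Hypothetical experiment settings: delay every call by exactly 100 ms.
chaos_config = {"enabled": True, "probability": 1.0, "min_delay_ms": 100, "max_delay_ms": 100}

delays = []  # records requested sleeps instead of actually sleeping
client = LatencyInjected(
    FakeHttpClient(),
    get_config=lambda: chaos_config,
    sleep=delays.append,
    rng=lambda: 0.0,
)
response = client.get("https://example.com")
```

Because the configuration is read on every call, flipping `enabled` to false stops the experiment without redeploying the function, which fits the containment advice given earlier in the talk.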
Error injection with serverless
Common errors
HTTP 5XX
Amazon DynamoDB provision...
Hypothesis:
Function has appropriate error handl...
Where to inject errors?
Hypothesis:
Function has appropriate error handl...
Where to inject errors?
Induce Lambda throttling...
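Error injection at the application level can follow the same wrapper pattern as latency injection. The sketch below is an assumption-laden illustration rather than the approach shown on the slides: `with_error_injection`, `InjectedError` and the config keys are hypothetical names, and the configuration source is stubbed with a dict.

```python
import functools
import random
from typing import Any, Callable, Dict


class InjectedError(Exception):
    """Marks a failure that was injected on purpose, not a real outage."""


def with_error_injection(
    func: Callable[..., Any],
    get_config: Callable[[], Dict[str, Any]],  # hypothetical config lookup
    rng: Callable[[], float] = random.random,
) -> Callable[..., Any]:
    """Wrap func so that a configured share of calls raises InjectedError."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        config = get_config()
        if config.get("enabled") and rng() < config.get("probability", 0.0):
            raise InjectedError(config.get("message", "injected failure"))
        return func(*args, **kwargs)

    return wrapper


def get_user(user_id: str) -> dict:
    """Hypothetical downstream call."""
    return {"id": user_id}


error_config = {"enabled": True, "probability": 1.0, "message": "HTTP 503 (injected)"}
flaky_get_user = with_error_injection(get_user, lambda: error_config, rng=lambda: 0.0)

try:
    flaky_get_user("u-1")
    outcome = "succeeded"
except InjectedError as err:
    outcome = f"handled: {err}"
```

Raising a dedicated exception type makes injected failures easy to tell apart from real ones in logs and metrics.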
Recap
Failures are INEVITABLE
The only way to truly know your system’s resilience against failures is to test it through CONTROLLED experiments
The goal of chaos engineering is NOT to actually break production
CONTAINMENT should be front and centre of your thinking
STEP 1. Define “steady state”
What does normal, ...
Hypothesize steady state will continue in both c...
STEP 3. Inject realistic failures
For example, s...
STEP 4. Disprove hypothesis
In other words, look...
There is more inherent chaos and complexity in a serverless application
Even without servers, you can still inject CONTROLLED failures at the application level
Thank you!
Yan Cui
@theburningmonk
https://thebu...
Related breakouts
Wednesday, Nov 28
SRV425-R - B...

Applying principles of chaos engineering to serverless (reinvent DVC305)

Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users.

Netflix is undoubtedly the leader in this field, but many of the publicised tools and articles focus on killing EC2 instances, and efforts in the serverless community have been largely limited to moving those tools into AWS Lambda functions.

But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?

These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and we have less control over their runtime behaviour. In short, there are far more unknown unknowns in these systems.

Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?


  1. Applying Principles of Chaos Engineering to Serverless. Yan Cui, Principal Engineer, DAZN. DVC305
  2. Agenda: What is chaos engineering? New challenges with serverless. Applying latency injection to serverless. Applying error injection to serverless.
  3. After the talk: Slides will be shared on Slideshare. Recording will be posted on YouTube within 48 hours. Find the links on https://theburningmonk.com/reinvent2018
  4. What is chaos engineering?
  5. Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. - principlesofchaos.org
  6. Smallpox: Earliest evidence of the disease in a third century BC Egyptian mummy. Estimated 400K deaths per year in eighteenth century Europe.
  7. History of vaccination: First vaccine was developed in 1798 by Edward Jenner. https://en.wikipedia.org/wiki/Edward_Jenner
  8. History of vaccination: WHO certified global eradication of smallpox in 1980. https://en.wikipedia.org/wiki/Edward_Jenner
  9. History of vaccination. https://en.wikipedia.org/wiki/Vaccine
  10. History of vaccination: Vaccination is the most effective method to prevent infectious diseases.
  11. History of vaccination: Vaccines stimulate the immune system to recognize and destroy the disease before contracting it for real.
  12. Chaos engineering: Use controlled experiments to inject failures into our system.
  13. Chaos engineering: Helps us learn about our system’s behavior and uncover unknown failure modes, before they manifest like wildfire in production.
  14. Chaos engineering: Lets us build confidence in its ability to withstand turbulent conditions.
  15. Chaos engineering is the vaccine to frailties in modern software.
  16. Who am I? Principal engineer at DAZN. AWS Serverless Hero. Author of the Production-Ready Serverless* course by Manning. Blogger**, speaker. * https://bit.ly/production-ready-serverless ** https://theburningmonk.com
  17. About DAZN: Available in seven countries: Austria, Switzerland, Germany, Japan, Canada, Italy, and USA. Available on 30+ platforms.
  18. About DAZN: Around 1,000,000 concurrent viewers at peak.
  19. Chaos engineering has an image problem
  20. Chaos engineering has an image problem: Too much emphasis is on breaking things.
  21. Chaos engineering has an image problem: Easy to conflate the action of injecting failures with the payback.
  22. Chaos engineering has an image problem: The goal is to learn about the system and build confidence.
  23. Chaos engineering has an image problem: The goal is not to break things.
  24. Chaos in practice: Four steps to start running chaos experiments yourself.
  25. STEP 1. Define “steady state”: What does the normal, working condition look like?
  26. This is not a steady state.
  27. STEP 2. Hypothesize steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment.
  28. Chaos in practice: Explore unknown unknowns away from production.
  29. Chaos in practice: Experiments that graduate to production should be carefully considered and planned. You should have reasonable confidence in the system before running experiments in production.
  30. Chaos in practice: Treat production with the care it deserves. The goal is not to break things.
  31. Chaos in practice: If you knew the system would break and you did it anyway, then it’s not a chaos experiment! It’s called being irresponsible.
  32. STEP 3. Inject realistic failures: For example, server crash, network error, HD malfunction, and more.
  33. Chaos in practice: Netflix’s Simian Army: https://github.com/Netflix/SimianArmy Chaos Engineering ebook (O’Reilly): http://oreil.ly/2tZU1Sn
  34. STEP 4. Disprove hypothesis: In other words, look for a difference in steady state.
  35. Chaos in practice: Look for evidence that steady state was impacted by the injected failure.
  36. Chaos in practice: Address weaknesses before failures happen for real.
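Step 4 amounts to comparing the steady state of the control group with that of the experiment group. A minimal sketch of that comparison, assuming steady state is summarised by an error rate and a p99 latency; the `SteadyState` type, the `hypothesis_disproved` function and both tolerance values are hypothetical and would need to be chosen per system:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def hypothesis_disproved(
    control: SteadyState,
    experiment: SteadyState,
    max_error_rate_increase: float = 0.01,   # hypothetical tolerance
    max_latency_increase_ms: float = 100.0,  # hypothetical tolerance
) -> bool:
    """Return True if the experiment group drifted away from the control group."""
    error_delta = experiment.error_rate - control.error_rate
    latency_delta = experiment.p99_latency_ms - control.p99_latency_ms
    return (
        error_delta > max_error_rate_increase
        or latency_delta > max_latency_increase_ms
    )


# Illustrative numbers only: the experiment group shows a jump in errors.
control = SteadyState(error_rate=0.001, p99_latency_ms=250.0)
experiment = SteadyState(error_rate=0.040, p99_latency_ms=270.0)
weakness_found = hypothesis_disproved(control, experiment)
```

A result of `True` means the injected failure changed the steady state, which points to a weakness to address before the failure happens for real.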
  37. Containment: Experiments need to be controlled. The goal is not to break things.
  38. Containment: Ensure everyone knows what you are doing. Don’t surprise your teammates.
  39. Containment: Run experiments during office hours.
  40. Containment: Avoid important dates.
  41. Containment: Make the smallest change necessary to prove or disprove the hypothesis.
  42. Containment: Have a rollback plan. Stop the experiment right away if things start to go wrong.
  43. Containment: Don’t start in production. You can learn a lot by running experiments in staging.
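The rollback advice can be built into the experiment runner itself. The sketch below is one possible shape, under the assumption that an experiment is a sequence of small steps and that a health check is available; `run_experiment`, `is_healthy` and `rollback` are hypothetical names.

```python
from typing import Callable, Iterable


def run_experiment(
    steps: Iterable[Callable[[], None]],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply steps one at a time; roll back and stop at the first sign of trouble.

    Returns True only if every step completed while the system stayed healthy.
    """
    for step in steps:
        step()
        if not is_healthy():
            rollback()
            return False
    return True


# Simulated run: the second step degrades the system, so the third never runs.
applied = []
health_checks = iter([True, False])
completed = run_experiment(
    steps=[
        lambda: applied.append("inject 100ms"),
        lambda: applied.append("inject 500ms"),
        lambda: applied.append("inject 2s"),
    ],
    is_healthy=lambda: next(health_checks),
    rollback=applied.clear,
)
```

Starting with the smallest step and checking health after each one keeps the blast radius small.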
  44. By Russ Miles (@russmiles). Source: https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  45. New challenges with serverless
  46. Chaos Monkey kills an Amazon Elastic Compute Cloud (Amazon EC2) instance. Latency Monkey induces artificial delay in APIs. Chaos Gorilla kills an AWS Availability Zone. Chaos Kong kills an entire AWS region.
  48. Serverless challenges: There are no servers that you can access and kill.
  51. There is more inherent chaos and complexity in a serverless architecture.
  52. Serverless challenges: Smaller units of deployment, but a lot more of them.
  53. Serverless challenges: serverful, serverless.
  54. Serverless challenges: Every function needs to be correctly configured and secured.
  55. Serverless challenges: Kinesis, ?, SNS, CloudWatch Events, CloudWatch Logs, IoT Core, DynamoDB, S3, SES.
  56. Serverless challenges: A lot of managed, intermediate services, each with its own set of failure modes.
  57. Serverless challenges: Unknown failure modes in the infrastructure we don’t control.
  58. Serverless challenges: Often there’s little we can do when an outage occurs in the platform.
  59. Common weaknesses
  60. Common weaknesses: Improperly tuned timeouts.
  61. Common weaknesses: Missing error handling.
  62. Common weaknesses: Missing fallback.
  63. Common weaknesses: Missing regional failover.
  64. Latency injection with serverless
  65. STEP 1. Define “steady state”: What does the normal, working condition look like?
  66. Defining steady state: What metrics do you use?
  67. Defining steady state: p95/p99 latencies, error count, backlog size, yield*, harvest**. * percentage of requests completed ** completeness of the returned response
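Most of these steady-state metrics can be derived from per-request records. The sketch below assumes a hypothetical `RequestRecord` shape in which each response reports how many of its expected parts were actually populated; backlog size is left out because it comes from the queue or stream rather than from requests. Percentiles use the nearest-rank method, which is one of several common definitions.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class RequestRecord:
    latency_ms: float
    succeeded: bool
    fields_returned: int  # parts of the response actually populated
    fields_expected: int  # parts a complete response would contain


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(records: Sequence[RequestRecord]) -> Dict[str, float]:
    """Summarise latency, errors, yield and harvest for a batch of requests."""
    if not records:
        raise ValueError("need at least one request record")
    latencies = [r.latency_ms for r in records]
    completed = [r for r in records if r.succeeded]
    expected = sum(r.fields_expected for r in completed)
    returned = sum(r.fields_returned for r in completed)
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_count": len(records) - len(completed),
        # yield: percentage of requests completed
        "yield_pct": 100 * len(completed) / len(records),
        # harvest: completeness of the returned responses
        "harvest_pct": 100 * returned / expected if expected else 0.0,
    }


# Ten made-up requests: the slowest one fails, the fastest returns half a response.
records = [
    RequestRecord(
        latency_ms=10.0 * i,
        succeeded=(i != 10),
        fields_returned=2 if i == 1 else 4,
        fields_expected=4,
    )
    for i in range(1, 11)
]
summary = steady_state(records)
```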
  68. STEP 2. Hypothesize steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment.
  69. Serverless considerations: API Gateway.
  70. Serverless considerations: Consider the effect of cold starts. How do they affect your strategy for handling slow responses?
  71. 71. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Request timeouts Strategy should:
  72. 72. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Request timeouts Strategy should: 1. Give requests the best chance to succeed
  73. 73. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Request timeouts Strategy should: 1. Give requests the best chance to succeed 2. Do not allow slow response to timeout the caller function
  74. 74. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Request timeouts Finding the right timeout value is tricky
  75. 75. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Request timeouts Too short: requests not given the best chance to succeed
  76. 76. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Request timeouts Too long: risk timing out the calling function
  77. 77. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Request timeouts Even more complicated when you have multiple integration points
  78. 78. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Approach 1: Split invocation time equally (for example, 3 requests, 6s function timeout = 2s timeout per request)
  79. 79. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Approach 2: Every request is given nearly all the invocation time (for example, 3 requests, 6s function timeout = 5s timeout per request)
  80. 80. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Request timeouts Proposal: set request timeouts dynamically based on invocation time left
  81. 81. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Request timeouts
  82. 82. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Set timeout based on remaining invocation time
  83. 83. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Set timeout based on remaining invocation time
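The difference between the equal-split approach and the dynamic proposal can be sketched as follows. This is a minimal illustration, assuming a Python Lambda handler whose context object exposes get_remaining_time_in_millis(); the helper names and the 500 ms recovery buffer are assumptions for this example, not values from the talk.

```python
def equal_split_timeout_ms(function_timeout_ms, request_count):
    """Approach 1: divide the function timeout evenly across requests."""
    return function_timeout_ms // request_count


def dynamic_timeout_ms(remaining_ms, recovery_buffer_ms=500):
    """Proposal: give the request all remaining invocation time, minus a
    buffer reserved for recovery steps (logging, metrics, fallback).
    A result of 0 means there is no time left to attempt the request."""
    return max(0, remaining_ms - recovery_buffer_ms)


def request_timeout_seconds(context, recovery_buffer_ms=500):
    """Compute the timeout for the next outbound request from the
    Lambda context, in seconds as most HTTP clients expect."""
    remaining_ms = context.get_remaining_time_in_millis()
    return dynamic_timeout_ms(remaining_ms, recovery_buffer_ms) / 1000
```

With a 6s function timeout and 3 requests, the equal split yields 2000 ms per request. The dynamic version would give the first request about 5400 ms if 5900 ms remain, and later requests whatever is left after the earlier ones finish.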
  84. Recovery steps: Log the timeout with as much context as possible (the API, timeout value, correlation IDs, request object, and more)
  85. Recovery steps: Record custom metrics
  86. Recovery steps: Use fallbacks
  88. Recovery steps: Be mindful when you sacrifice precision for availability. User experience is king.
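The recovery steps above (log with context, record a metric, fall back) might be combined roughly as in the sketch below. The call_with_fallback helper, the in-memory metrics dict, and the log field names are all hypothetical; a real function would publish metrics to its monitoring system instead.

```python
import json
import logging

logger = logging.getLogger("timeouts")


def call_with_fallback(api_name, call, timeout_s, fallback, correlation_id, metrics):
    """Run call(timeout_s); on timeout, log the context, count the
    timeout in the metrics dict, and return the fallback value."""
    try:
        return call(timeout_s)
    except TimeoutError:
        # Log the timeout with as much context as is available.
        logger.warning(json.dumps({
            "message": "request timed out",
            "api": api_name,
            "timeoutSeconds": timeout_s,
            "correlationId": correlation_id,
        }))
        # Record a custom metric (here just an in-memory counter).
        key = api_name + ".timeout"
        metrics[key] = metrics.get(key, 0) + 1
        # Degrade gracefully: a less precise answer beats no answer.
        return fallback
```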
  89. STEP 3. Inject realistic failures. For example: server crash, network error, hard drive malfunction, and more
  90. Where to inject latency?
  91. Hypothesis: The function has an appropriate timeout on its HTTP communications and can degrade gracefully when these requests time out
  92. Where to inject latency?
  93. Where to inject latency? Should be applied to third-party services too: DynamoDB, Twilio, Auth0 …
  94. Where to inject latency?
  95. Where to inject latency? Be mindful of the blast radius of the experiment. The goal is not to break things.
  96. Where to inject latency? (diagram labels: http client, public-api-a, http client, public-api-b, internal-api)
  97. Hypothesis: All functions have an appropriate timeout on their HTTP communications to this internal API and can degrade gracefully when requests time out
  98. Where to inject latency?
  100. Where to inject latency? Large blast radius; can unintentionally cause cascading failures
  102. Priming (psychology): Priming is a technique whereby exposure to one stimulus influences a response to a subsequent stimulus, without conscious guidance or intention. It is a technique in psychology used to train a person's memory in both positive and negative ways.
  103. Use failure injection to program your colleagues into thinking about failure modes early.
  104. Where to inject latency? Make X% of all requests slow in the dev environment
  105. Hypothesis: The client app has an appropriate timeout on its HTTP communication with the server and can degrade gracefully when requests time out
  106. Where to inject latency?
  110. STEP 4. Disprove hypothesis. In other words, look for a difference in steady state
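One way to look for a difference in steady state is to compare the experiment group's metrics against the control group's with a tolerance. The sketch below assumes metrics where a higher value is worse (latency, error count) and a 10% tolerance; both the function name and the tolerance are assumptions chosen for illustration.

```python
def steady_state_deviations(control, experiment, tolerance=0.10):
    """Return the metrics where the experiment group is worse than the
    control group by more than the tolerance. Both arguments map a
    metric name to a value where higher means worse."""
    deviations = {}
    for name, baseline in control.items():
        observed = experiment[name]
        if observed > baseline * (1 + tolerance):
            deviations[name] = (baseline, observed)
    return deviations
```

An empty result means the experiment failed to disprove the hypothesis; any entry points at a metric worth investigating.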
  111. How to inject latency?
  112. How to inject latency? Static weavers (such as PostSharp, AspectJ), dynamic proxies
  113. https://theburningmonk.com/2015/04/design-for-latency-issues/
  114. How to inject latency? Manually crafted wrapper libraries
  120. Configured in SSM Parameter Store
  122. No injected latency
  124. With injected latency
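A configuration-driven latency injector could look roughly like the sketch below. The JSON shape (isEnabled, rate, minDelayMs, maxDelayMs) is an assumption for this example and is not taken from the slides; the string passed in stands for a value that would be read from SSM Parameter Store.

```python
import json
import random
import time


def maybe_inject_latency(config_json, rng=random, sleep=time.sleep):
    """Sleep for a random delay on a configured fraction of calls.
    Returns the injected delay in milliseconds (0 if none)."""
    config = json.loads(config_json)
    if not config.get("isEnabled", False):
        return 0
    # rate is the probability (0.0 to 1.0) that a call is slowed down.
    if rng.random() >= config["rate"]:
        return 0
    delay_ms = rng.randint(config["minDelayMs"], config["maxDelayMs"])
    sleep(delay_ms / 1000)
    return delay_ms
```

Calling this at the start of an outbound request lets the experiment be switched on, off, or tuned by changing the stored configuration rather than redeploying code.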
  126. Factory wrapper function (think bluebird’s promisifyAll function)
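The factory-wrapper idea can be sketched in Python as a proxy that wraps every public method of an existing client, much as promisifyAll wraps every method of a module. The with_chaos name and the before_call hook are hypothetical; the hook is where latency or error injection would run.

```python
import functools


def with_chaos(client, before_call):
    """Return a proxy for client whose public methods invoke
    before_call(method_name) before delegating to the real method."""

    class ChaosProxy:
        def __getattr__(self, name):
            attr = getattr(client, name)
            # Pass through non-callables and private attributes untouched.
            if not callable(attr) or name.startswith("_"):
                return attr

            @functools.wraps(attr)
            def wrapped(*args, **kwargs):
                before_call(name)  # injection point
                return attr(*args, **kwargs)

            return wrapped

    return ChaosProxy()
```

Because the wrapping happens in one place, callers keep using the client's normal interface and do not need to know that failures may be injected.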
  132. Error injection with serverless
  133. Common errors: HTTP 5XX, Amazon DynamoDB provisioned throughput exceeded, throttled AWS Lambda invocations
  134. Hypothesis: The function has appropriate error handling on its HTTP communications and can degrade gracefully when downstream dependencies fail
  135. Where to inject errors?
  136. Hypothesis: The function has appropriate error handling on DynamoDB operations and can degrade gracefully when DynamoDB provisioned throughput is exceeded
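Error injection can follow the same wrapper pattern as latency injection. The sketch below uses a local stand-in exception rather than the real AWS SDK error, and every name in it (InjectedThroughputExceeded, inject_errors, get_profile_with_fallback) is hypothetical.

```python
import random


class InjectedThroughputExceeded(Exception):
    """Stand-in for a DynamoDB provisioned-throughput-exceeded error."""


def inject_errors(rate, error_factory, rng=random):
    """Decorator: raise error_factory() on a fraction `rate` of calls
    instead of running the wrapped function."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            if rng.random() < rate:
                raise error_factory()
            return func(*args, **kwargs)
        return wrapper
    return decorator


def get_profile_with_fallback(load_profile, user_id):
    """The behaviour under test: degrade gracefully when the read fails."""
    try:
        return load_profile(user_id)
    except InjectedThroughputExceeded:
        return {"userId": user_id, "degraded": True}
```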
  137. Where to inject errors?
  138. Where to inject errors? Induce Lambda throttling by temporarily setting reserved concurrency
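Temporarily setting reserved concurrency could be wrapped in a context manager so the limit is always lifted afterwards. This sketch assumes a boto3 Lambda client is passed in (for example boto3.client("lambda")) and uses its put_function_concurrency and delete_function_concurrency operations; the throttled helper itself is hypothetical. Note that it removes the reserved concurrency setting on exit, so it is only suitable for functions that had none configured beforehand.

```python
from contextlib import contextmanager


@contextmanager
def throttled(lambda_client, function_name, reserved=0):
    """Limit a function's reserved concurrency for the duration of the
    block; with reserved=0 every invocation is throttled."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
    try:
        yield
    finally:
        # Always lift the limit, even if the experiment raises.
        lambda_client.delete_function_concurrency(FunctionName=function_name)
```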
  139. Recap
  140. Failures are INEVITABLE
  141. The only way to truly know your system’s resilience against failures is to test it through CONTROLLED experiments
  142. The goal of chaos engineering is NOT to actually break production
  143. CONTAINMENT should be front and centre of your thinking
  144. STEP 1. Define “steady state”: What does the normal working condition look like?
  145. STEP 2. Hypothesize that the steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence that the system would handle the failure before you proceed with the experiment.
  146. STEP 3. Inject realistic failures. For example: server crash, network error, hard drive malfunction, and more
  147. STEP 4. Disprove hypothesis. In other words, look for a difference in steady state
  148. There is more inherent chaos and complexity in a serverless application
  149. Even without servers, you can still inject CONTROLLED failures at the application level
  154. Thank you! Yan Cui, @theburningmonk, https://theburningmonk.com
  155. Related breakouts: Wednesday, Nov 28, SRV425-R - Best Practices for Building Multi-Region, Active-Active Serverless Applications, 4:00PM – 5:00PM | Venetian, Level 4, Lando 4305; Wednesday, Nov 28, SRV343-R - Best Practices for Safe Deployments on AWS Lambda and Amazon API Gateway, 4:45PM – 5:45PM | MGM, Level 1, South Concourse 105; Thursday, Nov 29, ARC308 - Chaos Engineering and Scalability at Audible.com, 1:00PM – 2:00PM | Aria West, Level 3, Ironwood 5
  156. Please complete the session survey in the mobile app.
