Breaking Things on Purpose
Kolton Andrus (@KoltonAndrus)
Context
Effective Failure Testing
“What could go wrong?”
“How likely is this to occur?”
“What is the cost of being wrong?”
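The third question can be made concrete with simple availability arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes revenue accrues evenly over the year, which understates the cost of an outage during a peak period.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Hours of downtime per year implied by an availability fraction such as 0.99."""
    return (1.0 - availability) * HOURS_PER_YEAR


def revenue_at_risk(annual_revenue: float, availability: float) -> float:
    """Yearly revenue lost to downtime, ASSUMING revenue accrues evenly over time."""
    return annual_revenue * (1.0 - availability)


# "Two nines" (99%) allows roughly 87.6 hours (about 3.65 days) of downtime a year.
two_nines_hours = downtime_hours_per_year(0.99)

# "Three nines" (99.9%) allows roughly 8.76 hours a year.
three_nines_hours = downtime_hours_per_year(0.999)

# Under the even-revenue assumption, both of these come to about $1M a year.
loss_small_company = revenue_at_risk(100e6, 0.99)
loss_large_company = revenue_at_risk(1e9, 0.999)
```

Comparing this figure with the expected impact of a controlled failure test is one way to frame the cost/benefit discussion.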
Validating our assumptions
Experiment
Form a hypothesis: If we lose the Ratings service, members will get default ratings
Measurable outcome: This will manifest as increased Hystrix fallbacks
Success criteria: But the overall success rate will remain constant
Abort conditions: Halt immediately if members are unable to stream
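The four parts of the experiment above can be written down as data plus a small evaluation function, which makes the pass/fail and abort decisions explicit before the test starts. This is a minimal sketch, not the tooling used in the talk: the `Metrics` and `Experiment` types, their field names, and all threshold values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """A snapshot of the signals the experiment watches (hypothetical fields)."""
    fallback_rate: float      # fraction of Ratings calls served by a fallback
    success_rate: float       # fraction of member requests that succeed overall
    stream_start_rate: float  # fraction of stream start attempts that succeed


@dataclass(frozen=True)
class Experiment:
    hypothesis: str
    min_fallback_increase: float  # measurable outcome: fallbacks must rise by this much
    max_success_drop: float       # success criteria: tolerated drop in overall success
    min_stream_start_rate: float  # abort condition: halt below this rate

    def should_abort(self, current: Metrics) -> bool:
        """Abort as soon as members appear unable to stream."""
        return current.stream_start_rate < self.min_stream_start_rate

    def evaluate(self, baseline: Metrics, current: Metrics) -> str:
        """Return 'abort', 'pass', or 'fail' for the current observation."""
        if self.should_abort(current):
            return "abort"
        fallbacks_rose = (
            current.fallback_rate - baseline.fallback_rate >= self.min_fallback_increase
        )
        success_held = (
            baseline.success_rate - current.success_rate <= self.max_success_drop
        )
        return "pass" if fallbacks_rose and success_held else "fail"


# Illustrative numbers only; real thresholds depend on the service.
ratings_test = Experiment(
    hypothesis="If we lose the Ratings service, members will get default ratings",
    min_fallback_increase=0.5,
    max_success_drop=0.005,
    min_stream_start_rate=0.99,
)
baseline = Metrics(fallback_rate=0.01, success_rate=0.999, stream_start_rate=0.998)
during_test = Metrics(fallback_rate=0.95, success_rate=0.998, stream_start_rate=0.997)
verdict = ratings_test.evaluate(baseline, during_test)
```

Checking the abort condition first mirrors the slide: halting takes priority over judging whether the hypothesis held.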
Validate
Dial it up!
Test in Prod
Case Studies
Chaos Kong
Why?
Thanks for your time!
@KoltonAndrus
kandrus at gmail
“Required Reading” and References
Antifragile: Things That Gain from Disorder by Nassim Nicholas Taleb
On Designing and Deploying Internet-Scale Services by James Hamilton
Drift into Failure by Sidney Dekker
Photo Credits
http://i.gyazo.com/38b53958cccde98b712acfde6d880336.png
http://www.thedoctorschannel.com/wp-content/uploads/2013/01/Vaccine_Vials_Syringe_Needle.jpg
http://www.horizonservicesinc.com/wp/wp-content/uploads/Explosion.jpg
Star Trek: The Next Generation
http://www.joshuanhook.com/wp-content/uploads/2014/11/broken-communication.jpg
http://s3.amazonaws.com/media.eremedia.com/uploads/2014/01/15174902/THINK-small.jpg
http://sdbn.org/wp-content/uploads/2010/12/dreamstime_volume_11_social_media_roi-258x300.jpg

Microservices Practitioner Summit Jan '15 - Breaking Things On Purpose - Kolton Andrus

Editor's notes

  1. Why failure testing is important, and why you should be running failure tests in production for your microservices. Abstract: Failure testing prepares us, both socially and technically, for how our systems will behave in the face of failure. By proactively testing, we can find and fix problems before they become crises. Practice makes perfect, yet a real calamity is not a good time for training. Knowing how our systems fail is paramount to building a resilient service. At Netflix, we run failure exercises on a regular basis to ensure we are prepared. These efforts hardened our Edge services and helped us to have a quiet holiday season and a smooth global launch. Come and learn how to run an effective “Game Day” and safely test in production. Then sleep peacefully knowing you are ready!
  2. About me: Netflix, Edge Platform Engineer. Amazon, Retail Website Availability and Performance, “Call Leader”. Have led failure test exercises at both. Lead into “Context on why failure testing is important”, even though it is counterintuitive.
  3. What is the opposite of fragile? Robust/Resilient? Those are indifferent to change. We want something that improves with change.
  4. Vaccine Analogy - Injecting a small amount of something bad can make us immune
  5. The downside is the impact of the failure test. The upside is the prevention of future outages. An additional upside is training our organizations to handle failure.
  6. Prepare your organization for what could go wrong. Run on your own terms. During the day, after the caffeine has kicked in. Practice. Train. Answer questions. Know how to turn it off up front.
  7. Failure Scenarios :: Threat Model
  8. Analysis of past events: start with low-hanging fruit. We can’t prepare for everything (Black Swan).
  9. If we run only in one AWS region, and that region goes down, what will happen? Cost/Benefit Analysis -> Prioritize the largest risks first
  10. What is the downside? Lost revenue: two nines (about 3.65 days of downtime a year) for a $100M revenue company is roughly $1M in lost revenue; three nines (about 8.8 hours a year) for a $1B revenue company is also roughly $1M, assuming revenue accrues evenly over the year. What does it cost Target to be down on Black Friday? An hour of downtime is estimated to cost FB $1.7M in lost advertising revenue. Also at stake: brand reputation and customer trust.
  11. Edge Service failure testing. The Edge service is the gateway for all Netflix devices and the website, and it talks to almost all of the streaming services. Process: meet with the team to outline the exercise.
  12. Discuss what could go wrong. Common points: network bounds, loss of a dependency.
  13. Set up communication. Let your team, dependencies, and organization know you are running a test. Invite anyone interested or impacted; many eyes looking will spot errors faster. Share your pass/fail criteria. Where to coordinate: command center, team bullpen, chat room, conference call.
  14. Take the smallest possible step: run it locally, run it in test, run it for a single instance. Validate the expected outcome.
  15. Example of a ‘successful’ CDN selection failure test.
  16. Small scale: find functional failures. Large scale: resource constraints, queuing, cascading failure. Emergent behavior?
  17. Those unwilling to test in production aren't yet confident that the service will continue operating through failures. And, without production testing, recovery won't work when called upon. - James Hamilton
  18. Funny Anecdote - We did have an outage in Q3, and it came one day before the scheduled failure test. So run them early and often!
  19. Used to deploy Netflix services to the cloud. Open-sourced, cloud-independent solution. Very critical piece of infrastructure. Automation is there to help prevent outages; doing it by hand isn’t ideal.
  20. Low-hanging fruit: single points of failure (instance, AZ); lack of monitoring (KPIs, dashboards); lack of alerting (‘normal’ behavior).
  21. Brief Hystrix overview. Leveraging Hystrix for protection: fallbacks for non-critical behavior, the Circuit Breaker pattern, resource isolation (thread pools), separating critical from non-critical. Configuration can be difficult: happy case vs. worst case (timeouts, thread pool usage). Ensuring that fallbacks work on the client device.
  22. Run by the traffic team, this is one of the best examples of the power of failure testing. New learnings every few runs. Ready when called upon: AWS outage, Q4 2015; AWS outage, Q1 2016 (Jan 14th?). Counterpoint: everything is a hammer. It comes up in every outage (should we shift traffic?). Clear in some cases (AWS in one region is having problems), unclear in others (a service in a single region is having problems), bad in some (contaminating another region).
  23. YoY from ’13 to ’14 our team was paged 21% less. YoY from ’14 to ’15 our team was paged 20% less. Perfect uptime over the holidays (busiest period). Great when you’re on call over NYE.
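
The fallback and circuit breaker ideas from note 21 can be illustrated with a small sketch. This is not Hystrix's API (Hystrix is a Java library); the `CircuitBreaker` class, its parameters, and the example values are hypothetical, and thread pool isolation and timeouts are deliberately left out.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Serve a fallback when the primary call keeps failing (illustrative only)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while calls should skip the primary and go straight to the fallback."""
        if self._opened_at is None:
            return False
        # After the reset window, let a trial call through (the "half-open" state).
        return self._clock() - self._opened_at < self.reset_after_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def fetch_ratings() -> str:
    """Stand-in for a call to a failing Ratings dependency."""
    raise RuntimeError("ratings service unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0)
shown = breaker.call(fetch_ratings, lambda: "default rating")
```

In a failure test like the Ratings example, the rate at which the fallback path is taken is the measurable outcome, so it is worth emitting a metric wherever `fallback()` is called.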