5. Netflix is the world’s leading Internet television network, with more than 50 million members in 50 countries enjoying more than one billion hours of TV shows and movies per month. We account for up to 34% of downstream US internet traffic.
Source: http://ir.netflix.com
6. [Architecture diagram: connected devices send 2B requests per day into the Netflix API, which makes 12B outbound requests per day to API dependencies such as the Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews and the A/B Test Engine.]
9. What AWS Provides
• Machine Images (AMI)
• Instances (EC2)
• Elastic Load Balancers
• Security groups / Auto Scaling groups
• Availability zones and regions
10. Our system has become so complex that…
…no (single) body knows how everything works.
11. How AWS Can Go Wrong - 1
• Service goes down in one or more availability zones
• 6/29/12 - storm-related power outage caused loss of EC2 and RDS instances in the Eastern US
• https://gigaom.com/2012/06/29/some-of-amazon-web-services-are-down-again/
12. How AWS Can Go Wrong - 2
• Loss of service in an entire region
• 12/24/12 - operator error caused loss of multiple ELBs in the Eastern US
• http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html
13. How AWS Can Go Wrong - 3
• Large numbers of instances get rebooted
• 9/25/14 to 9/30/14 - rolling reboot of 1000s of instances to patch a security bug
• http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-monkey-cassandra.html
14. Our Goal is Availability
• Members can stream Netflix whenever they want
• New users can explore and sign up
• New members can activate their service and add devices
16. Freedom and Responsibility
• Developers deploy when they want
• They also manage their own capacity and autoscaling
• And are on call to fix anything that breaks at 3am!
17. How the heck do you test this stuff?
18. Failure is All Around Us
• Disks fail
• Power goes out - and your backup generator fails
• Software bugs are introduced
• People make mistakes
19. Design to Avoid Failure
• Exception handling
• Redundancy
• Fallback or degraded experience (circuit breakers)
• But is it enough?
20. It’s Not Enough
• How do we know we’ve succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we avoid drifting into failure?
21. More Testing!
• Unit
• Integration
• Stress
• Exhaustive testing to simulate all failure modes
22. Exhaustive Testing ~ Impossible
• Massive, rapidly changing data sets
• Internet-scale traffic
• Complex interaction and information flow
• Independently controlled services
• All while innovating and building features
23. Another Way
• Cause failure deliberately to validate resiliency
• Test design assumptions by stressing them
• Don’t wait for random failure. Remove its uncertainty by forcing it regularly
26. Chaos Monkey
• The original Monkey (2009)
• Randomly terminates instances in a cluster
• Simulates failures inherent to running in the cloud
• Runs during business hours
• Enabled by default for production services
27. What did we do once we were able to handle Chaos Monkey?
Bring in bigger Monkeys!
29. Chaos Gorilla
• Simulate an Availability Zone becoming unavailable
• Validate multi-AZ redundancy
• Deploy to multiple AZs by default
• Run regularly (but not continually!)
31. Chaos Kong
• “One louder” than Chaos Gorilla
• Simulate an entire region outage
• Used to validate our “active-active” region strategy
• Traffic has to be switched to the backup region
• Run once every few months
33. Latency Monkey
• Simulate degraded instances
• Ensure degradation doesn’t affect other services
• Multiple scenarios: network, CPU, I/O, memory
• Validate that your service can handle degradation
• Find effects on other services, then validate that they can handle it too
35. Conformity Monkey
• Apply a set of conformity rules to all instances
• Notify owners with a list of instances and problems
• Example rules:
• Standard security groups not applied
• Instance age is too old
• No health check URL
36. Some More Monkeys
• Janitor Monkey - clean up unused resources
• Security Monkey - analyze, audit and notify on AWS security profile changes
• Howler Monkey - warn if we’re reaching resource limits
37. Try it out!
• Open sourced and available at https://github.com/Netflix/SimianArmy and https://github.com/Netflix/security_monkey
• Chaos, Conformity, Janitor and Security available now; more to come
• Ported to VMware
38. What’s Next?
• Finer-grained control
• New failure modes
• Make chaos testing as well understood as regular regression testing
39. A message from the owners
“Use Chaos Monkey to induce various kinds of failures in a controlled environment.”
AWS blog post following the mass instance reboot in Sep 2014: http://aws.amazon.com/blogs/aws/ec2-maintenance-update-2/
40. Thank You!
Email: gbowles@{gmail,netflix}.com
Twitter: @garethbowles
LinkedIn: www.linkedin.com/in/garethbowles
We’re hiring - test engineers and chaos engineers! http://jobs.netflix.com
Editor’s notes
Hi, everyone. Thanks for coming today.
I’m going to talk about some testing challenges we face at Netflix. In particular, we have such a large and complex distributed system that testing it exhaustively in an isolated environment is next to impossible. To meet that challenge we came up with the Simian Army, a set of tools that actually run in production and test our ability to survive various types of failure.
I’ll spend a bit of time talking about Netflix and our streaming service to set up the problem space, then go over some of the things we need to test for and how more traditional test practices can fall short. Finally I’ll go into some detail on the Monkeys themselves.
Feel free to ask questions during the talk.
A little bit about me. I’ve been with Netflix for 4 years and I’m part of our Engineering Tools team. We’re responsible for developer productivity, with the goal that any engineer can build and deploy their code with the minimum possible effort.
Before Netflix I spent a long time in test engineering and technical operations, so once I got to Netflix I was fascinated to see how such a complex system gets tested.
Let’s take a look at how it all works.
Who here ISN’T familiar with Netflix?
Any customers? Thanks very much!
Netflix is first & foremost an entertainment company, but you can also look at us as an engineering company that creates all the technology to serve up that entertainment, and also collects a ton of data on who watches what, when they watch it, and how much they watch. We continuously analyze all that data to improve our customers’ entertainment experience by making it easy for them to find things they want to watch, and making sure they have a top quality viewing experience when they get comfy on the sofa (or the bus, or in the park, or wherever they can get connected).
So there’s a lot of engineering that goes on behind the scenes to make all that possible.
Some data that might be new to some of you guys. Our membership is growing fast and we’re now in more than 50 countries - all countries in North and South America, plus most of Europe.
The amount of content our viewers are watching is growing, too. Fun fact - a billion hours of video is about 114,000 years. 114,000 years ago, homo sapiens had just started to migrate out of Africa and you could find hippos in the River Thames in England.
This is an overview of our current architecture.
2 billion requests flow in from all kinds of connected devices - game consoles, PCs and Macs, phones, tablets, TVs, DVD players, and more.
Those generate 12 billion outbound requests to individual services. The diagram shows some of the main ones: the personalization engine that recommends what to watch based on your viewing history and ratings, movie metadata to give you information about what you’re watching, and one of the most important - the A/B test engine that lets us serve up a different customer experience to a set of users and measure its effect on how much they watch, what they watch and when they watch it.
Pause for audience reaction here. Anyone care to suggest what this diagram shows?
We call it the “Shock and Awe” diagram at Netflix! It’s generated by one of our monitoring systems and shows the interconnections between all of our services and data. Don’t look too hard; you won’t make out much detail - it’s only meant to illustrate the complexity of the system.
We run our production systems on Amazon Web Services, which many of you are probably familiar with. We’re one of AWS’ biggest customers, apart from Amazon themselves.
Using Amazon Web Services lets us stop worrying about procurement of hardware - servers, network switches, storage, firewalls, load balancers ...
AWS allows us to scale up and down without worrying about exceeding or underusing data center capacity.
And since every AWS service has an API, we can automate our deployments and throw away all those runbooks.
Each AWS service is available in multiple regions (geographic areas) and multiple availability zones (data centers) within each region.
So now that I’ve sort of described how everything works, here’s a big problem - in actual fact, there’s nobody who knows how it all works in depth. Although we have world experts in areas such as personalization, video encoding and machine learning, there’s just too much going on and it’s changing too fast for any individual to keep up.
AWS is increasingly reliable, but it has had some fairly spectacular outages as well as many smaller ones. When you run on a cloud platform that’s not under your control, you have to be able to cope with these outages.
On June 29th 2012, a storm caused a widespread power outage in Northern Virginia that took out many instances and database servers. Netflix streaming was affected for a while.
An even bigger outage happened on Christmas Eve 2012, when an Amazon engineer made a mistake that took out many Elastic Load Balancers in the US East region, which at that time was Netflix’s primary region for serving traffic. ELBs aren’t replicated between regions; each one applies to a single region. Many Netflix customers were affected; luckily for us, Christmas Eve is a much less busy day than Christmas Day, when everyone gets given Netflix subscriptions, and the problem was fixed by the 25th.
In late September this year, AWS restarted a large number of instances, in multiple regions, to patch a security bug in the virtualization software that the instances run on. This time we were hardly affected at all and there was negligible impact on customers - again, more in our tech blog post.
Given those kinds of problems, we need to work pretty hard to deal with them. Netflix wants our 50-million-plus members to be able to play movies and TV shows whenever and wherever they want. We also want to make it as simple and fast as possible for people to sign up and start using their new subscriptions.
One little digression that’s an important part of how we meet these challenges - we couldn’t do it without our company culture, which we’re quite proud of - proud enough to publish the 126-slide deck I’ve linked here. All those slides can be boiled down to one key takeaway, which we call Freedom and Responsibility.
Here’s how the “freedom & responsibility” principle applies to our technical development and deployment.
Freedom - engineers deploy when and how often they need to, and control their own production capacity and scaling.
Responsibility - every engineer in each service team is in the PagerDuty rotation in case things go wrong.
So how can these teams be confident that their new versions will still work with all their dependencies, under all kinds of failure conditions?
It’s a very tough problem.
Failure is unavoidable. We already saw some ways that our AWS platform can fail. Add to that bugs in our own software, and the inevitable human errors that you get when there are actual people involved in the development and deployment pipeline.
We can do a lot to make our code handle failure gracefully.
We can catch errors as exceptions and make sure they are handled in a way that doesn’t crash the code.
We can run multiple instances of our services to avoid single points of failure.
And we can use technologies such as the circuit breaker pattern to have services provide a degraded experience if one of their dependent services goes offline. For example, if our recommendation service is unavailable, we don’t show you a blank list of recommendations on your Netflix page - we fall back to a list of the most-watched content.
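To make the circuit breaker idea concrete, here is a minimal sketch in Python. It is not Netflix’s actual implementation; the fetch functions named in the usage comment are hypothetical.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after too many consecutive failures,
    stop calling the dependency and serve a fallback instead."""

    def __init__(self, call, fallback, max_failures=3, reset_after=30.0):
        self.call = call                # the remote call being protected
        self.fallback = fallback        # degraded experience served when open
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds before retrying the dependency
        self.failures = 0
        self.opened_at = None

    def __call__(self, *args, **kwargs):
        # While the breaker is open, short-circuit straight to the fallback.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return self.fallback(*args, **kwargs)
            self.opened_at = None       # half-open: try the real call again
            self.failures = 0
        try:
            result = self.call(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            return self.fallback(*args, **kwargs)

# Hypothetical usage, mirroring the recommendations example above:
# recommend = CircuitBreaker(fetch_recommendations, fetch_most_watched)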
But that only gets us so far.
We want to make sure all our features work properly without waiting for customers to tell us they don’t.
We want to know that we are as resilient as we think we are, without waiting for an outage to happen. Given the scale we run at, it’s effectively impossible to create a realistic test system for running load and reliability tests.
We also want to be sure that the configuration of our services doesn’t diverge as we redeploy them over time - this can lead to errors that are very hard to debug given the thousands of instances that we run in production.
So, most of you guys are testers and probably have this reaction - let’s do more testing!
But can we effectively simulate such a large-scale distributed system - and what’s more, can we predict every possible failure mode and encode it into our tests?
Today’s large internet systems have become too big and complex to just rely on traditional testing - but don’t get me wrong, all those types of testing I just mentioned have a very important place at Netflix, and we have some of the best test engineers in the business working on them. But here are some of the things they struggle with.
It’s very hard to find realistic test data.
It would be hugely expensive for us to create a similarly-sized copy of our production system for testing - not quite “copying the internet”, but getting there.
Because teams deploy their own changes on different schedules, it’s difficult to keep up with changes and code them into integration tests.
So we came up with the idea of deliberately triggering failures in production, to augment our more traditional testing. By causing our own failures on a known schedule, we can be prepared to deal with their effects and test our assumptions in a predictable way, rather than having a fire drill when a “real” outage happens.
So with all that context done with, let’s take a look at our lovable monkeys.
Chaos Monkey was the one who started it all.
Chaos Monkey has been around in some form for about 5 years.
It’s a service that looks for groups of instances (known as clusters) of each of our services and picks a random instance to terminate, on a defined schedule and with a defined probability of termination.
This simulates a fairly frequent thing in AWS (although not nearly as frequent as it used to be) where instances are terminated unexpectedly, usually due to a failure in the underlying hardware.
We run Chaos Monkey during business hours so that engineers are on hand to diagnose and fix problems, rather than getting a 3am page.
If you deploy a new Netflix service, Chaos Monkey will be enabled for it unless you explicitly turn it off.
We’ve got to a point where Chaos Monkey instance terminations go virtually unnoticed.
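As a rough illustration of that mechanism, here is a sketch of random instance termination using boto3. The cluster tag name, the probability and the region are assumptions for this example; the real Chaos Monkey has far more configuration and safety logic.

```python
import random
import boto3

def release_the_monkey(cluster_tag, probability=0.5, region="us-east-1"):
    """Pick a random running instance from a cluster and maybe terminate it.
    The 'cluster' tag name and the default probability are assumptions."""
    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:cluster", "Values": [cluster_tag]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    # Terminate with a defined probability, as the real Monkey does on its schedule.
    if instances and random.random() < probability:
        victim = random.choice(instances)
        ec2.terminate_instances(InstanceIds=[victim])
        return victim
    return None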
Gorillas are bigger than monkeys and can carry bigger weapons.
Chaos Gorilla takes out an entire Availability Zone.
AWS has multiple regions in different parts of the world, such as Eastern USA, Western USA, Asia Pacific and Western Europe.
Each region has multiple Availability Zones. These are equivalent to physical data centers in different geographic locations - for example, the US East region is located in Virginia but has 3 separate Availability Zones.
Running Chaos Gorilla ensures that our services are running correctly in multiple Availability Zones, and that we have sufficient capacity in each zone to handle our traffic load.
Runs of Chaos Gorilla are announced ahead of time, and our Reliability Engineering team sets up an incident room where engineers from each service team can watch progress.
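One way to approximate a Chaos Gorilla run, sketched under the assumption of Classic ELBs fronting each service, is to detach the target zone from every load balancer so that no traffic reaches instances there. This is illustrative only, not the real tool:

```python
import boto3

def detach_zone_from_elbs(zone, region="us-east-1"):
    """Simulate an AZ outage by removing the zone from every Classic ELB.
    Sketch only; real Chaos Gorilla runs are announced and closely monitored."""
    elb = boto3.client("elb", region_name=region)
    affected = []
    for lb in elb.describe_load_balancers()["LoadBalancerDescriptions"]:
        if zone in lb["AvailabilityZones"]:
            elb.disable_availability_zones_for_load_balancer(
                LoadBalancerName=lb["LoadBalancerName"],
                AvailabilityZones=[zone],
            )
            affected.append(lb["LoadBalancerName"])
    return affected

# e.g. detach_zone_from_elbs("us-east-1a")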
As we already picked the gorilla, we had to resort to a fictional creature to cause even bigger chaos.
Once we were happy that we could survive an Availability Zone outage, we wanted to go a step further and see if we could cope with an entire region being taken out.
This hasn’t happened in reality yet, but there’s a small possibility that it could - for example, the us-west-1 region is in Northern California, so a really big earthquake could feasibly take out all of its Availability Zones.
And the Elastic Load Balancer outage at the end of 2012 did have the effect of bringing down a key service in an entire region.
To handle this, we had to rearchitect to an “active-active” setup where we have complete copies of our services and data running in two different regions.
If an Availability Zone goes down we just have to make sure we have enough capacity in the surviving zones to handle all our traffic, but if we lose a region we also have to reroute all traffic to the backup region.
Chaos Kong gets an outing every few months, again with an incident room where engineers can watch progress and react to any problems.
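Purely as an illustration of what rerouting all traffic could look like (the real failover tooling isn’t described in this talk), here is a sketch that drains a region by updating weighted DNS records in Route 53. The zone ID, record name and endpoints are hypothetical:

```python
import boto3

def shift_traffic_to_backup(zone_id, record_name, endpoints):
    """Drain one region by re-weighting DNS records so the backup region
    takes all traffic. Zone, record name and endpoints are hypothetical."""
    r53 = boto3.client("route53")
    changes = [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": region,  # one weighted record per region
                "Weight": weight,         # weight 0 drains the failed region
                "TTL": 60,
                "ResourceRecords": [{"Value": endpoint}],
            },
        }
        for region, (endpoint, weight) in endpoints.items()
    ]
    r53.change_resource_record_sets(
        HostedZoneId=zone_id, ChangeBatch={"Changes": changes}
    )

# e.g. drain us-east-1 and send everything to us-west-2:
# shift_traffic_to_backup("Z123EXAMPLE", "api.example.com.",
#     {"us-east-1": ("api-use1.example.com", 0),
#      "us-west-2": ("api-usw2.example.com", 255)})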
So we can deal with instances disappearing individually, or in large numbers. But what if instances are still there, but running in a degraded state? Because our architecture involves so many interdependent services, we have to be careful that problems with one service don’t cascade to other services.
This is where Latency Monkey comes in. It can simulate multiple types of degradation: network connections maxing out, high CPU loads, high disk I/O and running out of memory.
Degradation in a service oriented architecture is extremely hard to test exhaustively. With Latency Monkey we can introduce degradation in a controlled way (like the recommendations example I mentioned earlier), find any problems in dependent services and fix them, then verify those fixes.
Some service teams even discovered dependencies they didn’t know they had by running Latency Monkey and finding unexpected degradation in the dependent services.
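A toy version of the dependency-degradation idea, with invented names and probabilities (the real Latency Monkey degrades network, CPU, I/O and memory on the instances themselves), might look like this:

```python
import functools
import random
import time

def inject_latency(delay_s=2.0, failure_rate=0.1):
    """Decorator that simulates a degraded dependency: every call is slowed
    down, and some calls fail outright. Parameters are invented for the sketch."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            time.sleep(delay_s)                  # simulated slow network
            if random.random() < failure_rate:  # simulated hard failure
                raise ConnectionError(f"{fn.__name__} degraded (injected)")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(delay_s=1.5, failure_rate=0.2)
def fetch_movie_metadata(movie_id):
    # Hypothetical dependency call used only for the demonstration.
    return {"id": movie_id, "title": "Example"}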
One other key aspect of running so many services, all with dozens or hundreds of instances, is that it’s very important to keep all of the instances consistent. They should all have the same system configuration and the same version and configuration of the service, for example. This decreases the complexity and surface area of the testing.
We use Conformity Monkey to automate these checks.
It runs over all services at a fixed interval, and notifies service owners when any of the conditions are not met. An email is sent containing a list of rule violations, each with a list of non-conforming instances.
Here are a few examples of the things we check:
Instances should be in the correct security groups so that they are reachable by other services and our monitoring and deployment tools.
Instances shouldn’t have been running for more than a given time, which depends on how often the service is deployed. Older instances could be running an out of date version of the service.
Instances should all have a valid health check URL so that our monitoring tools can know whether they are running properly.
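Those checks could be expressed as simple rule functions over instance metadata, roughly like this sketch; the rule thresholds, the required group name and the healthCheckUrl tag are assumptions:

```python
from datetime import datetime, timezone

# Each rule takes an instance description (as returned by the EC2 API)
# and returns a violation message, or None if the instance conforms.

def rule_security_groups(instance, required={"standard"}):
    names = {g["GroupName"] for g in instance.get("SecurityGroups", [])}
    if not required <= names:
        return f"missing standard security groups: {required - names}"

def rule_max_age(instance, max_days=30):
    age = datetime.now(timezone.utc) - instance["LaunchTime"]
    if age.days > max_days:
        return f"instance is {age.days} days old (limit {max_days})"

def rule_health_check(instance):
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    if "healthCheckUrl" not in tags:  # tag name is an assumption
        return "no health check URL configured"

def check_conformity(instances, rules=(rule_security_groups, rule_max_age,
                                       rule_health_check)):
    """Return {instance_id: [violations]} for every non-conforming instance,
    ready to be mailed to the service owners."""
    report = {}
    for inst in instances:
        violations = [msg for rule in rules if (msg := rule(inst))]
        if violations:
            report[inst["InstanceId"]] = violations
    return report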
I’d like to give an honourable mention to a few more Simian Army members.
Janitor Monkey looks for unused resources and deletes them. Examples are machine images or launch configurations that are no longer in use by any running instances.
Security Monkey is similar to Conformity Monkey, but specializes in tracking and evaluating security changes.
Howler Monkey warns us if we’re running out of AWS resources - for example, if we have nearly maxed out on the number of instances of a given type that are available in a particular region. It also looks for SSL certificates that are about to expire. We don’t seem to have a Netflix logo for this one, so I had to get clearance from a real howler monkey.
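A Howler-style limit check might look like the following sketch, which compares the running-instance count against the account’s EC2 instance limit; the warning threshold is invented:

```python
import boto3

def check_instance_limit(region="us-east-1", warn_at=0.8):
    """Warn when the running-instance count approaches the account limit.
    Sketch only; uses the legacy 'max-instances' account attribute."""
    ec2 = boto3.client("ec2", region_name=region)
    attr = ec2.describe_account_attributes(AttributeNames=["max-instances"])
    limit = int(attr["AccountAttributes"][0]["AttributeValues"][0]["AttributeValue"])
    running = 0
    for page in ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        running += sum(len(r["Instances"]) for r in page["Reservations"])
    if running >= warn_at * limit:
        print(f"HOWL! {running}/{limit} instances used in {region}")
    return running, limit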
There’s a lot more we want to do with the Monkeys.
We’d like to have a way to induce failures that are more chaotic than the individual instances that Chaos Monkey knocks out, but less impactful than having Chaos Gorilla take out an entire availability zone.
We’re constantly on the lookout for new failure modes that we can trigger - hopefully before they happen in the wild.
Eventually, we’d like to make chaos testing in large distributed systems as well understood and commonly practiced as regular regression testing.
After the mass instance reboot I mentioned earlier, Amazon themselves recommended that AWS customers use Chaos Monkey to test their resilience. High praise indeed!
You can contact me in one of these ways, or ask your question now.
Thanks again!