This session explains how Netflix uses the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and fallback logic for dependencies to maximize availability. This approach allows for rapid independent evolution of individual components to maximize the pace of innovation and A/B testing, and offers nearly unlimited scalability as the business grows. Learn how we manage change to (or subtraction from) the customer experience while aggressively scraping off barnacle features that add complexity for little value.
4. Assumptions
[Quadrant chart with axes Scale and Speed]
• Slowly changing, large scale: Telcos
• Rapid change, large scale: Web-scale
• Slowly changing, small scale: Enterprise IT
• Rapid change, small scale: Startups
Assumption labels on the chart: Everything is broken; Hardware will fail; Everything works; Software will fail
6. Performance
• Reduce session start by 1s
– Save 1 human lifetime per day!
– Win more moments of truth
• Suggest choices 1% better
– 500k hours/day additional value delivered
7. Scale
• 50% y/y traffic growth
• 50 countries, 3 continents
• Tens of thousands of instances at peak
• 4 AWS regions, 12 datacenters
• ~$.001 per start
8. Availability
• Aspire to four nines (99.99% of starts successful)
• Per quarter:
– Downtime: < 3 mins (peak time)
– Successful starts: 9.999B
– Failures: 1M (frustration, calls, lost business)
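The per-quarter figures above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two start counts from the slide; the variable names are illustrative.

```python
# Per-quarter figures taken from the slide.
successful_starts = 9_999_000_000   # 9.999B
failed_starts = 1_000_000           # 1M

total_starts = successful_starts + failed_starts

# Fraction of starts that succeed: the "four nines" target.
success_rate = successful_starts / total_starts

# Integer view of the same ratio: one failed start in every N attempts.
one_failure_in = total_starts // failed_starts

print(f"total starts per quarter: {total_starts:,}")
print(f"success rate: {success_rate:.4%}")
print(f"one failed start in every {one_failure_in:,}")
```

Even at four nines, the absolute number of failed starts stays large because the total is about ten billion per quarter.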
10. Availabilities Compound
To achieve 99.99% availability with 1000 components requires:
• 99.9999% availability for each dependency
or
• Isolation for independence
Without isolation: component failure leads to system failure
With isolation: component failure leads to degradation rather than system failure
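The compounding effect can be sketched numerically. The sketch below assumes a simple serial model that the slides do not spell out: every component fails independently, and any single component failure takes the whole system down. Under that model system availability is the per-component availability raised to the number of components, and the function names are illustrative.

```python
def system_availability(component_availability: float, n_components: int) -> float:
    """Availability of a chain where every component must be up (assumed serial model)."""
    return component_availability ** n_components


def required_component_availability(target: float, n_components: int) -> float:
    """Per-component availability needed to hit a system-level target."""
    return target ** (1.0 / n_components)


target = 0.9999  # four nines for the whole system

for n in (10, 100, 1000):
    needed = required_component_availability(target, n)
    print(f"{n:>5} chained components: each needs {needed:.7%}")

# How fast availability erodes when each of 1000 chained parts is "only" six nines.
eroded = system_availability(0.999999, 1000)
print(f"1000 components at 99.9999% each: system at {eroded:.4%}")
```

Under these assumptions the per-component requirement is roughly six nines for 100 chained dependencies and roughly seven nines for 1000, which is why isolation is the more practical route.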
12. Rapid Iteration – Rate of Change
• Running tests
• Rolling out tests
– Engineering the winning test experience for scale
• Adding features
• Scaling up
• Removing features, simplifying, minimizing
14. Rate of Change
• Change leads to bugs
– New features
– New configurations
– New types of inputs
– Scaling up
• Availability is in tension with rate of change
15. Availability / Rate of Change Tradeoff
[Chart: Availability (99% to 99.999%) plotted against Rate of Change (1 to 1000), showing the frontier of availability/change]
16. Availability / Rate of Change Tradeoff
[Chart: Availability (99% to 99.999%) plotted against Rate of Change (1 to 1000), showing the frontier of availability/change]
18. Shifting the Curve
• Must break the chained dependencies that compound into cascading system failure
• Subsystem isolation:
– Failure in one component should never result in cascading system failure
19. Isolating Subsystems
Redundant systems with timeout & failover
• Failure of instance
• Failure of network
• Latency monkey to test
[Diagram: Dependent System → (timeout) → Dependence]
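A minimal sketch of timeout and failover across redundant instances follows. It is an illustration, not Netflix's implementation: `call_with_failover` and the three replica functions are hypothetical, and the slow replica simply sleeps to mimic the kind of latency a latency monkey would inject.

```python
import concurrent.futures
import time


def call_with_failover(replicas, timeout_s):
    """Try each redundant replica in turn, abandoning any that exceeds timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        for replica in replicas:
            future = pool.submit(replica)
            try:
                return future.result(timeout=timeout_s)
            except concurrent.futures.TimeoutError:
                continue  # too slow: treat as failed and fail over
            except Exception:
                continue  # instance or network failure: fail over
        raise RuntimeError("all replicas failed or timed out")
    finally:
        # Do not block the caller on a replica that is still hanging.
        pool.shutdown(wait=False)


def slow_replica():
    time.sleep(0.3)  # simulated injected latency
    return "slow"


def broken_replica():
    raise ConnectionError("instance down")


def healthy_replica():
    return "ok"


result = call_with_failover(
    [slow_replica, broken_replica, healthy_replica], timeout_s=0.05
)
print(result)
```

The caller waits at most `timeout_s` per replica, so a hung dependency costs bounded time instead of stalling the dependent system.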
20. Isolating Subsystems
Redundant systems with timeout & failover
• Failure of instance
• Failure of network
• Latency monkey to test
[Diagram: Higher Tier System → (longer timeout) → Dependent System → (short timeout) → Dependence]
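One way to read the longer and shorter timeouts in this slide is as a layered budget: the higher tier should wait long enough for the dependent system to exhaust its own short timeouts and fail over. The check below is a sketch under that assumption; the function names and the retry count are hypothetical, not taken from the talk.

```python
def worst_case_latency(short_timeout_s: float, attempts: int) -> float:
    """Longest the dependent system can spend before it gives up on its dependence."""
    return short_timeout_s * attempts


def timeouts_are_layered(
    higher_tier_timeout_s: float, short_timeout_s: float, attempts: int
) -> bool:
    """True when the higher tier outlasts every attempt made by the tier below it."""
    return higher_tier_timeout_s > worst_case_latency(short_timeout_s, attempts)


# Dependent system tries up to 3 replicas with a 250 ms timeout each.
print(timeouts_are_layered(1.0, 0.25, 3))  # higher tier waits 1 s
print(timeouts_are_layered(0.5, 0.25, 3))  # higher tier waits only 500 ms
```

If the check fails, the higher tier gives up before the tier below has finished failing over, so the redundancy never gets a chance to help.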
24. Isolating Subsystems
Standby Blue system
• Independent implementation
• Simplified logic
[Diagram: Dependent System → Dependence V2.3; fail to static version → static reference implementation]
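Failing to a static version can be sketched as a wrapper that serves simple, locally held reference data whenever the primary dependency errors out. Everything named here (`fetch_titles`, `STATIC_TOP_TITLES`, the two stand-in dependencies) is hypothetical and only illustrates the pattern.

```python
import logging

logger = logging.getLogger("fallback_sketch")

# Hypothetical static reference data, held by the caller so it needs no network call.
STATIC_TOP_TITLES = ["title-1", "title-2", "title-3"]


def fetch_titles(user_id, primary):
    """Return (titles, source); degrade to the static list if the primary fails."""
    try:
        titles = primary(user_id)
        if not titles:
            raise ValueError("empty response")
        return titles, "primary"
    except Exception as exc:
        logger.warning("primary failed (%s); serving static version", exc)
        return list(STATIC_TOP_TITLES), "static"


def personalized_ok(user_id):
    # Stand-in for a healthy dependency.
    return [f"pick-for-{user_id}-a", f"pick-for-{user_id}-b"]


def personalized_down(user_id):
    # Stand-in for a dependency that is unavailable.
    raise TimeoutError("dependency unavailable")


ok_titles, ok_source = fetch_titles("u1", personalized_ok)
fallback_titles, fallback_source = fetch_titles("u1", personalized_down)
print(ok_source, fallback_source)
```

The fallback path is deliberately simpler than the primary and shares no code with it, so a bug in the primary is unlikely to break both.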
26. Isolating Subsystems
Region isolation
• Infrastructure software bugs (e.g. load balancer failure)
• Chaos Kong
[Diagram: DNS routes to Region E and Region W; each region has a Load Balancer in front of Zone A and Zone B; each zone runs a Dependent System with its own Dependence]
27. Isolating Subsystems
Dependency Mode | Isolating Technique
Instance failure, network failure | Redundant systems with failover and timeout
Network failure | Timeout with default response
Software bug | Canary push; red-black deployment; blue systems
Infrastructure failure | Zone isolation
Cross-zone software bugs | Region isolation
28. Trying Harder Won’t Cut It
• Trying harder gets a linear return on an exponential problem
• Need to be great at execution AND have the right architecture
• What architectural features are you using to ensure availability, scale, performance, & rapid rate of change?
29. Please give us your feedback on this presentation (DMG206)