Netflix has evolved to rely heavily on cloud infrastructure from AWS. It uses microservices across multiple availability zones to provide highly available and scalable streaming globally. Netflix has open sourced many of the tools it has developed to operate in the cloud under the NetflixOSS project. It continues to migrate more systems like billing, payments, and big data analytics to the cloud.
2. @atseitlin
About Netflix
Netflix is the world’s leading Internet television network with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]
[1] http://ir.netflix.com/
6. @atseitlin
How Netflix Streaming Works
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Browse
Play
Watch
8. @atseitlin
Web Server Dependencies Flow
Home page business transaction
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie group chooser
Each icon is three to a few hundred instances across three AWS zones
10. @atseitlin
Three Balanced Availability Zones
Test with Chaos Gorilla
Zone A: Cassandra and EVCache replicas
Zone B: Cassandra and EVCache replicas
Zone C: Cassandra and EVCache replicas
Load Balancers
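The capacity implication of running on three balanced zones can be worked through with a little arithmetic. The sketch below is illustrative only: the function names and the request rate are assumptions, not figures from the talk. It shows how much extra load each surviving zone absorbs when one zone is taken out, which is the situation a Chaos Gorilla test simulates.

```python
def load_multiplier_after_zone_loss(zones: int, failed: int = 1) -> float:
    """Factor by which traffic per surviving zone grows when `failed` zones drop out."""
    surviving = zones - failed
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return zones / surviving


def per_zone_load(total_requests_per_s: float, zones: int, failed: int = 0) -> float:
    """Traffic each surviving zone carries, assuming an even split across zones."""
    surviving = zones - failed
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return total_requests_per_s / surviving


# Hypothetical total of 9000 requests/s spread over three balanced zones.
print(per_zone_load(9000, zones=3))            # normal operation
print(per_zone_load(9000, zones=3, failed=1))  # one zone lost
print(load_multiplier_after_zone_loss(3))      # growth factor per surviving zone
```

Under these assumptions each zone needs roughly 50% headroom (or fast autoscaling) for the service to keep running on two out of three zones.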
11. @atseitlin
Triple Replicated Persistence
Cassandra maintenance affects individual replicas
Zone A: Cassandra and EVCache replicas
Zone B: Cassandra and EVCache replicas
Zone C: Cassandra and EVCache replicas
Load Balancers
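Why maintenance on an individual replica need not interrupt service can be illustrated with quorum arithmetic. This is a minimal sketch that assumes quorum-level reads and writes; the slide does not state which consistency level is used, and the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: a strict majority."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerable_replica_outages(replication_factor: int) -> int:
    """Replicas that can be offline while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


# With one replica in each of three zones (replication factor 3):
print(quorum(3))                     # replicas needed per request
print(tolerable_replica_outages(3))  # replicas that may be down
```

With a replication factor of 3, a quorum is 2 replicas, so one replica (or one whole zone) can be down for maintenance while requests still succeed.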
12. @atseitlin
Isolated Regions
Will someday test with Chaos Kong
US-East Load Balancers
Zone A: Cassandra replicas
Zone B: Cassandra replicas
Zone C: Cassandra replicas
EU-West Load Balancers
Zone A: Cassandra replicas
Zone B: Cassandra replicas
Zone C: Cassandra replicas
13. @atseitlin
Failure Modes and Effects
Failure Mode          Probability   Current Mitigation Plan
Application Failure   High          Automatic degraded response
AWS Region Failure    Low           Wait for region to recover
AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
Datacenter Failure    Medium        Migrate more functions to cloud
Data store failure    Low           Restore from S3 backups
S3 failure            Low           Restore from remote archive
Until we got really good at mitigating high- and medium-probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
15. @atseitlin
Run What You Wrote
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure it’s not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up
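The "short timeouts, fail fast" bullet can be sketched as a call wrapper that gives up quickly and returns a degraded response instead of waiting. This is an illustration using the Python standard library only, not Netflix's implementation; `call_with_timeout` and the fallback values are hypothetical names.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() in a worker thread; return `fallback` if it is slow or fails."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Fail fast: stop waiting so callers upstream do not stack up behind us.
        return fallback
    except Exception:
        # The dependency raised an error: degrade rather than propagate it.
        return fallback
    finally:
        # Do not block on the worker; a slow call finishes in the background.
        pool.shutdown(wait=False)


def slow_dependency():
    time.sleep(0.3)  # simulates a dependency that has become slow
    return "personalized rows"


print(call_with_timeout(slow_dependency, timeout_s=0.05, fallback="default rows"))
```

Each layer should use a timeout shorter than its caller's, otherwise the waits accumulate up the call chain.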
16. @atseitlin
Rapid Detection
• If your pilot had no instrument panel, would you ever fly on a plane?
– Never run your service blind
• Monitor services, not instances
– Make instance failure a non-event
• Don’t pay people to watch screens
– Instead pay them to build alerting
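"Monitor services, not instances" can be illustrated by alerting on an aggregate error rate rather than on any single machine. The sketch below is an assumption-laden example: the data layout, instance IDs, and the 5% threshold are made up for illustration.

```python
def service_error_rate(instance_stats):
    """Aggregate error rate across all instances of one service.

    instance_stats maps an instance id to a (errors, requests) tuple.
    """
    total_errors = sum(errors for errors, _ in instance_stats.values())
    total_requests = sum(requests for _, requests in instance_stats.values())
    return total_errors / total_requests if total_requests else 0.0


def should_alert(instance_stats, threshold=0.05):
    """Alert on the health of the service as a whole, not on one instance."""
    return service_error_rate(instance_stats) > threshold


# One instance is failing every request, but the service as a whole is healthy.
stats = {"i-a": (100, 100), "i-b": (0, 1000), "i-c": (0, 1000)}
print(should_alert(stats))
```

In this example the broken instance does not page anyone; it only matters once the service-level rate crosses the threshold.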
17. @atseitlin
Rapid Rollback
• Use a new Auto Scaling Group (ASG) to push code
• Leave existing ASG in place, switch traffic
• If OK, auto-delete old ASG a few hours later
• If “whoops”, switch traffic back in seconds
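The four steps above can be modeled as a small in-memory state machine. This is a sketch only: `RedBlackDeployer` and the group names are hypothetical, and no real AWS API is called. It shows why rollback is fast: the old group is still running, so switching back is just a traffic change.

```python
class RedBlackDeployer:
    """Tracks which server group receives traffic during a push."""

    def __init__(self, current_group):
        self.groups = {current_group}  # groups that currently exist
        self.live = current_group      # group receiving traffic
        self.previous = None           # old group kept around for rollback

    def push(self, new_group):
        """Bring up a new group and switch traffic to it, keeping the old one."""
        self.groups.add(new_group)
        self.previous = self.live
        self.live = new_group

    def rollback(self):
        """Switch traffic back to the old group, which is still running."""
        if self.previous is None:
            raise RuntimeError("no previous group to roll back to")
        self.live, self.previous = self.previous, None

    def finalize(self):
        """Delete the old group once the new one has proven itself."""
        if self.previous is not None:
            self.groups.discard(self.previous)
            self.previous = None


deployer = RedBlackDeployer("app-v001")
deployer.push("app-v002")
deployer.rollback()
print(deployer.live)
```

In this model the failed group is left in place after a rollback so it can be inspected; whether to delete it immediately is a policy choice the slide does not specify.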
22. @atseitlin
Efficiency
• ~10x peak-to-trough ratio; fill the trough with batch workloads
• Optimize machine class for each service
• Highly available red/black deployments
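The trough-filling idea can be made concrete with a small calculation. The demand figures below are invented for illustration, and the sketch assumes capacity is held at the peak level (for example, reserved capacity), which the slide does not state.

```python
def idle_capacity(hourly_demand, provisioned=None):
    """Unused capacity in each period, available for batch workloads.

    If `provisioned` is not given, capacity is assumed to equal peak demand.
    """
    capacity = max(hourly_demand) if provisioned is None else provisioned
    return [capacity - demand for demand in hourly_demand]


def peak_to_trough(hourly_demand):
    """Ratio between the busiest and the quietest period."""
    return max(hourly_demand) / min(hourly_demand)


# Hypothetical demand in instance-hours for four periods of a day.
demand = [10, 40, 100, 60]
print(peak_to_trough(demand))
print(idle_capacity(demand))
print(sum(idle_capacity(demand)))  # instance-hours left over for batch jobs
```

With a 10x ratio, most of the provisioned capacity sits idle in the quietest period, which is what makes scheduling batch work there attractive.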
25. @atseitlin
Big Data & Analytics
• On deck for cloud migration
• ETL already in cloud with EMR (Hadoop)
• Many cloud alternatives, but not yet as mature as the old guard
26. @atseitlin
Corporate systems moving to SaaS
• Email (Exchange->Google Apps)
• Expense Management (Concur->Workday)
• Document sharing (File Servers->Box)
• Goal is 100% SaaS
28. @atseitlin
Open Source Projects
Legend: GitHub / Techblog; Apache Contributions; Techblog Post; Coming Soon
Priam: Cassandra as a Service
Astyanax: Cassandra client for Java
CassJMeter: Cassandra test suite
Cassandra: Multi-region EC2 datastore support
Aegisthus: Hadoop ETL for Cassandra
Ice: Spend analytics
Governator: Library lifecycle and dependency injection
Odin: Cloud orchestration
Blitz4j: Async logging
Exhibitor: ZooKeeper as a Service
Curator: ZooKeeper Patterns
EVCache: Memcached as a Service
Eureka / Discovery: Service Directory
Archaius: Dynamic Properties Service
Edda: Config state with history
Denominator
Ribbon: REST Client + mid-tier LB
Karyon: Instrumented REST Base Server
Servo and Autoscaling Scripts
Genie: Hadoop PaaS
Hystrix: Robust service pattern
RxJava: Reactive Patterns
Asgard: AutoScaleGroup based AWS console
Chaos Monkey: Robustness verification
Latency Monkey
Janitor Monkey
Bakeries / Aminator
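Chaos Monkey, listed among the projects above, is generally described as randomly terminating running instances to verify that a service tolerates the loss. The sketch below only models that idea in memory; it is not the Chaos Monkey code or its API, and the function name and instance IDs are hypothetical.

```python
import random


def terminate_random_instance(instances, rng=None):
    """Pick one instance at random and return (victim, survivors)."""
    if not instances:
        raise ValueError("no instances to terminate")
    if rng is None:
        rng = random.Random()
    victim = rng.choice(list(instances))
    survivors = [instance for instance in instances if instance != victim]
    return victim, survivors


# A seeded generator makes the experiment repeatable.
victim, survivors = terminate_random_instance(
    ["i-001", "i-002", "i-003"], rng=random.Random(42)
)
print(victim, survivors)
```

A real robustness check would follow the termination with a service-level health check, so that instance loss is confirmed to be a non-event.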
31. @atseitlin
We’re hiring!
• Simian Army
• Cloud Tools
• NetflixOSS
• Cloud Operations
• Reliability Engineering
• Many, many more
jobs.netflix.com
32. @atseitlin
Takeaways
Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS)
The Cloud enables elasticity, efficiency, and fine-grained control via APIs
Credit cards, Big Data, and the rest of the corporate systems are next to move to the Cloud
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/atseitlin
@atseitlin @NetflixOSS