Resiliency through failure
Netflix's Approach to Extreme Availability in the Cloud

Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin
  
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/netflix-resiliency-failure-cloud
Presented at QCon New York
www.qconnewyork.com
  
About Netflix

Netflix is the world's leading Internet television network with more than 36 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]

[1] http://ir.netflix.com/
  
A complex distributed system

How Netflix Streaming Works
[Diagram. Components: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding. Layers: Consumer Electronics; AWS Cloud Services; CDN Edge Locations. Steps: Browse; Play; Watch.]
  
Our goal is availability
• Members can stream Netflix whenever they want
• New users can explore and sign up for the service
• New members can activate their service and add new devices
  
Failure is all around us
• Disks fail
• Power goes out. And your generator fails.
• Software bugs introduced
• People make mistakes

Failure is unavoidable
  
We design around failure
• Exception handling
• Clusters
• Redundancy
• Fault tolerance
• Fall-back or degraded experience (Hystrix)
• All to insulate our users from failure

Is that enough?
  
It's not enough
• How do we know if we've succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we prevent drifting into failure?

The typical answer is…
  
More testing!
• Unit testing
• Integration testing
• Stress testing
• Exhaustive test suites to simulate and test all failure modes

Can we effectively simulate a large-scale distributed system?
  
Building distributed systems is hard
Testing them exhaustively is even harder
• Massive data sets and changing shape
• Internet-scale traffic
• Complex interaction and information flow
• Asynchronous nature
• 3rd party services
• All while innovating and building features

Prohibitively expensive, if not impossible, for most large-scale systems
  
What if we could reduce variability of failures?

There is another way
• Cause failure to validate resiliency
• Test design assumptions by stressing them
• Don't wait for random failure. Remove its uncertainty by forcing it periodically
  
And that's exactly what we did

Instances fail

Chaos Monkey taught us…
• State is bad
• Clusters are good
• Surviving single instance failure is not enough
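
A minimal sketch of the idea behind Chaos Monkey, assuming a hypothetical in-memory `Cluster` class rather than any real cloud API. One randomly chosen instance per cluster is terminated, and the cluster is then asked whether it still serves. The names `Cluster` and `unleash_monkey` are illustrative only and are not taken from the Netflix tool.

```python
import random


class Cluster:
    """Hypothetical stand-in for a group of stateless instances."""

    def __init__(self, name, size):
        self.name = name
        self.instances = {f"{name}-{i}" for i in range(size)}

    def terminate(self, instance_id):
        self.instances.discard(instance_id)

    def is_serving(self):
        # A stateless cluster keeps serving while at least one instance is alive.
        return len(self.instances) > 0


def unleash_monkey(clusters, rng, probability=1.0):
    """Terminate at most one randomly chosen instance per cluster.

    Returns the terminated instance ids so the run can be audited.
    """
    killed = []
    for cluster in clusters:
        if cluster.instances and rng.random() < probability:
            victim = rng.choice(sorted(cluster.instances))
            cluster.terminate(victim)
            killed.append(victim)
    return killed


if __name__ == "__main__":
    rng = random.Random(42)
    api = Cluster("api", 3)    # redundant: should survive
    solo = Cluster("solo", 1)  # single instance: should not
    print(unleash_monkey([api, solo], rng))
    print(api.is_serving(), solo.is_serving())
```

In this toy model the three-instance cluster survives the loss and the single-instance one does not, which is the "clusters are good" lesson in miniature.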
  
Lots of instances fail

Chaos Gorilla

Chaos Gorilla taught us…
• Hidden assumptions on deployment topology
• Infrastructure control plane can be a bottleneck
• Large scale events are hard to simulate
• Rapidly shifting traffic is error prone
• Smooth recovery is a challenge
• Cassandra works as expected
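
One way to surface a hidden topology assumption before a zone-level exercise is a simple capacity check. The sketch below assumes every instance has the same capacity and that traffic can be shifted freely between zones. The function name, the zone names and the numbers are hypothetical.

```python
def survives_zone_loss(instances_per_zone, capacity_per_instance, peak_load):
    """Return {zone: bool}: can the remaining zones carry peak_load
    if that zone disappears entirely?"""
    total = sum(instances_per_zone.values())
    verdict = {}
    for zone, count in instances_per_zone.items():
        remaining_capacity = (total - count) * capacity_per_instance
        verdict[zone] = remaining_capacity >= peak_load
    return verdict


if __name__ == "__main__":
    # Evenly spread deployment: losing any one zone leaves enough capacity.
    print(survives_zone_loss({"a": 4, "b": 4, "c": 4}, 100, 800))
    # Skewed deployment: losing zone "a" removes most of the capacity.
    print(survives_zone_loss({"a": 8, "b": 2, "c": 2}, 100, 800))
```

A check like this only covers raw capacity. The other lessons on the slide (control-plane bottlenecks, traffic shifting, smooth recovery) show up only when the failure is actually exercised.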
  
What about larger catastrophes?
Anyone remember Sandy?

Chaos Kong (*some day soon*)

The Sick and Wounded

Latency Monkey

Hystrix, RxJava
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
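
The fall-back pattern that Hystrix is named for earlier in the deck can be illustrated with a toy circuit breaker. This is a sketch under stated assumptions, not Hystrix's API: the class `CircuitBreaker`, its parameters and its thresholds are invented for illustration. After a number of consecutive failures the dependency is skipped for a cool-down period and the fallback is served instead.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: after max_failures consecutive errors, skip the
    dependency for reset_after seconds and serve the fallback instead."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # circuit open: do not touch the dependency
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # success closes the circuit fully
        return result


if __name__ == "__main__":
    def broken_dependency():
        raise RuntimeError("dependency down")

    breaker = CircuitBreaker(max_failures=2, reset_after=5.0)
    for _ in range(3):
        print(breaker.call(broken_dependency, lambda: "degraded experience"))
```

Passing the clock in as a parameter keeps the breaker testable without real waiting.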
  
Latency Monkey taught us
• Startup resiliency is often missed
• An ongoing unified approach to runtime dependency management is important (visibility & transparency get missed otherwise)
• Know thy neighbor (unknown dependencies)
• Fall backs can fail too
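
A rough sketch of latency injection, assuming a plain Python callable stands in for a remote dependency. `with_injected_latency` plays the role of the monkey by delaying each call, and `call_with_timeout` shows the caller-side protection that the exercise is meant to verify. Both names are hypothetical.

```python
import concurrent.futures
import time


def with_injected_latency(func, delay_seconds):
    """Wrap func so every call first sleeps for delay_seconds."""
    def slowed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slowed


def call_with_timeout(func, timeout_seconds, fallback):
    """Run func in a worker thread; return fallback() if it overruns."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(func)
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # An exception raised by fallback() propagates to the caller:
        # fall backs can fail too.
        return fallback()
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the slow call


if __name__ == "__main__":
    def recommendations():
        return ["personalized"]

    slow = with_injected_latency(recommendations, 0.3)
    print(call_with_timeout(recommendations, 0.5, lambda: ["popular"]))
    print(call_with_timeout(slow, 0.05, lambda: ["popular"]))
```

The timed-out worker thread still runs to completion in the background, so a real service would also need to bound the number of such threads.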
  
Entropy

Clutter accumulates
• Complexity
• Cruft
• Vulnerabilities
• Cost

Janitor Monkey

Janitor Monkey taught us…
• Label everything
• Clutter builds up
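
"Label everything" matters because cleanup rules can only act on what they can attribute. The sketch below assumes resources are plain dictionaries with `id`, `tags` and `last_used` fields and that ownership is recorded in an `owner` tag. These field names and the 30-day idle limit are assumptions made for illustration, not the Janitor Monkey rule set.

```python
from datetime import datetime, timedelta


def find_clutter(resources, now, max_idle=timedelta(days=30)):
    """Return (resource id, reason) pairs for cleanup candidates."""
    candidates = []
    for res in resources:
        if "owner" not in res["tags"]:
            # Unlabeled resources cannot be attributed to anyone.
            candidates.append((res["id"], "no owner tag"))
        elif now - res["last_used"] > max_idle:
            candidates.append((res["id"], "idle too long"))
    return candidates


if __name__ == "__main__":
    now = datetime(2013, 6, 1)
    resources = [
        {"id": "vol-1", "tags": {"owner": "team-a"}, "last_used": datetime(2013, 5, 30)},
        {"id": "vol-2", "tags": {}, "last_used": datetime(2013, 5, 30)},
        {"id": "vol-3", "tags": {"owner": "team-b"}, "last_used": datetime(2013, 1, 1)},
    ]
    for resource_id, reason in find_clutter(resources, now):
        print(resource_id, reason)
```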
  
Ranks of the Simian Army
• Chaos Monkey
• Chaos Gorilla
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Circus Monkey
• Doctor Monkey
• Howler Monkey
• Security Monkey
• Chaos Kong
• Efficiency Monkey
  
Observability is key
• Don't exacerbate real customer issues with failure exercises
• Deep system visibility is key to root-cause failures and understand the system
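
The first point can be turned into a guard that runs before any failure exercise. This is a sketch that assumes recent error-rate samples are already available as fractions between 0.0 and 1.0. The function name and the 1% threshold are hypothetical.

```python
def safe_to_inject(error_rates, threshold=0.01):
    """Return True only if every recent error-rate sample is below threshold.

    An empty list means there is no visibility, so the exercise is refused.
    """
    if not error_rates:
        return False
    return max(error_rates) < threshold


if __name__ == "__main__":
    print(safe_to_inject([0.001, 0.002]))  # healthy: exercise may proceed
    print(safe_to_inject([0.001, 0.05]))   # customers already affected: skip
    print(safe_to_inject([]))              # no metrics: skip
```

Refusing to run without data is a deliberate choice here: without visibility there is no way to tell an induced failure from a real one.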
  
Organizational elements
• Every engineer is an operator of the service
• Each failure is an opportunity to learn
• Blameless culture

Goal is to create a learning organization

Assembling the Puzzle
  
Open Source Projects
Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon

• Priam: Cassandra as a Service
• Astyanax: Cassandra client for Java
• CassJMeter: Cassandra test suite
• Cassandra: Multi-region EC2 datastore support
• Aegisthus: Hadoop ETL for Cassandra
• AWS Usage: Spend analytics
• Governator: Library lifecycle and dependency injection
• Odin: Cloud orchestration
• Blitz4j: Async logging
• Exhibitor: Zookeeper as a Service
• Curator: Zookeeper Patterns
• EVCache: Memcached as a Service
• Eureka / Discovery: Service Directory
• Archaius: Dynamic Properties Service
• Edda: Config state with history
• Denominator
• Ribbon: REST Client + mid-tier LB
• Karyon: Instrumented REST Base Server
• Servo and Autoscaling Scripts
• Genie: Hadoop PaaS
• Hystrix: Robust service pattern
• RxJava: Reactive Patterns
• Asgard: AutoScaleGroup based AWS console
• Chaos Monkey: Robustness verification
• Latency Monkey
• Janitor Monkey
• Bakeries / Aminator
  
How does it all fit together?

Our Current Catalog of Releases
Free code available at http://netflix.github.com
  
Takeaways

Regularly inducing failure in your production environment validates resiliency and increases availability

Use the NetflixOSS platform to handle the heavy lifting for building large-scale distributed cloud-native applications
  
Thank you!
Any questions?

Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin
  

More Related Content

Viewers also liked

Psychology of Website Design - Dr. Pamela Rutledge
Psychology of Website Design - Dr. Pamela RutledgePsychology of Website Design - Dr. Pamela Rutledge
Psychology of Website Design - Dr. Pamela RutledgePamela Rutledge
 
Sitemap Templates by Creately
Sitemap Templates by CreatelySitemap Templates by Creately
Sitemap Templates by CreatelyCreately
 
Website Architecture Presentation from Web Strategy Workshops
Website Architecture Presentation from Web Strategy WorkshopsWebsite Architecture Presentation from Web Strategy Workshops
Website Architecture Presentation from Web Strategy WorkshopsCharles Edmunds
 
Website Layout and Structure
Website Layout and StructureWebsite Layout and Structure
Website Layout and StructureMichael Zinniger
 
Islamic Architecture History
Islamic Architecture HistoryIslamic Architecture History
Islamic Architecture HistoryAira Altovar
 
Creating a WordPress Website that Works from the Start
Creating a WordPress Website that Works from the StartCreating a WordPress Website that Works from the Start
Creating a WordPress Website that Works from the StartNile Flores
 

Viewers also liked (9)

Psychology of Website Design - Dr. Pamela Rutledge
Psychology of Website Design - Dr. Pamela RutledgePsychology of Website Design - Dr. Pamela Rutledge
Psychology of Website Design - Dr. Pamela Rutledge
 
Sitemap Templates by Creately
Sitemap Templates by CreatelySitemap Templates by Creately
Sitemap Templates by Creately
 
Creating a Website: Design and Layout
Creating a Website: Design and LayoutCreating a Website: Design and Layout
Creating a Website: Design and Layout
 
Website Architecture Presentation from Web Strategy Workshops
Website Architecture Presentation from Web Strategy WorkshopsWebsite Architecture Presentation from Web Strategy Workshops
Website Architecture Presentation from Web Strategy Workshops
 
Websites that work
Websites that workWebsites that work
Websites that work
 
Website Layout and Structure
Website Layout and StructureWebsite Layout and Structure
Website Layout and Structure
 
Website Design Basics
Website Design BasicsWebsite Design Basics
Website Design Basics
 
Islamic Architecture History
Islamic Architecture HistoryIslamic Architecture History
Islamic Architecture History
 
Creating a WordPress Website that Works from the Start
Creating a WordPress Website that Works from the StartCreating a WordPress Website that Works from the Start
Creating a WordPress Website that Works from the Start
 

More from C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

More from C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Recently uploaded

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Resiliency through Failure - Netflix's Approach to Extreme Availability in the Cloud

  • 1. @atseitlin   Resiliency  through  failure     Ne3lix's  Approach  to  Extreme  Availability  in  the  Cloud     Ariel  Tseitlin   h.p://www.linkedin.com/in/atseitlin   @atseitlin    
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /netflix-resiliency-failure-cloud
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. @atseitlin   About  Ne<lix   Ne#lix  is  the  world’s   leading  Internet   television  network  with   more  than  36  million   members  in  40   countries  enjoying  more   than  one  billion  hours   of  TV  shows  and  movies   per  month,  including   original  series[1]   [1]  h.p://ir.ne<lix.com/  
  • 5. @atseitlin   A  complex  distributed  system  
  • 6. @atseitlin   How  Ne<lix  Streaming  Works   Customer  Device   (PC,  PS3,  TV…)   Web  Site  or   Discovery  API   User  Data   PersonalizaSon   Streaming  API   DRM   QoS  Logging   OpenConnect   CDN  Boxes   CDN   Management  and   Steering   Content  Encoding   Consumer   Electronics   AWS  Cloud   Services   CDN  Edge   LocaSons   Browse   Play   Watch  
  • 9. @atseitlin   Our  goal  is  availability   •  Members  can  stream  Ne<lix  whenever  they   want   •  New  users  can  explore  and  sign  up  for  the   service   •  New  members  can  acSvate  their  service  and   add  new  devices  
  • 10. @atseitlin   Failure  is  all  around  us   •  Disks  fail   •  Power  goes  out.  And  your  generator  fails.   •  So]ware  bugs  introduced   •  People  make  mistakes     Failure  is  unavoidable  
  • 11. @atseitlin   We  design  around  failure   •  ExcepSon  handling   •  Clusters   •  Redundancy   •  Fault  tolerance     •  Fall-­‐back  or  degraded  experience  (Hystrix)   •  All  to  insulate  our  users  from  failure   Is  that  enough?    
  • 12. @atseitlin   It’s  not  enough   •  How  do  we  know  if  we’ve  succeeded?   •  Does  the  system  work  as  designed?   •  Is  it  as  resilient  as  we  believe?   •  How  do  we  prevent  dri]ing  into  failure?     The  typical  answer  is…  
  • 13. @atseitlin   More  tesSng!   •  Unit  tesSng   •  IntegraSon  tesSng   •  Stress  tesSng   •  ExhausSve  test  suites  to  simulate  and  test  all   failure  mode   Can  we  effec<vely  simulate  a  large-­‐ scale  distributed  system?    
  • 14. Building distributed systems is hard. Testing them exhaustively is even harder • Massive data sets and changing shape • Internet-scale traffic • Complex interaction and information flow • Asynchronous nature • 3rd party services • All while innovating and building features Prohibitively expensive, if not impossible, for most large-scale systems
  • 15. What if we could reduce variability of failures?
  • 16. There is another way • Cause failure to validate resiliency • Test design assumptions by stressing them • Don’t wait for random failure. Remove its uncertainty by forcing it periodically
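The idea of forcing failure periodically can be sketched as follows. This is a hypothetical illustration, not the actual Chaos Monkey implementation: the group map, the `probability` parameter, and the injected `terminate` callable are all assumptions made for the example. A scheduler would call `run_once` on whatever cadence is chosen.

```python
import random
from typing import Callable, Dict, List, Optional


def pick_victims(groups: Dict[str, List[str]], probability: float,
                 rng: random.Random) -> List[str]:
    """Pick at most one instance per group, each group with the given probability."""
    victims = []
    for name in sorted(groups):
        instances = groups[name]
        if instances and rng.random() < probability:
            victims.append(rng.choice(instances))
    return victims


def run_once(groups: Dict[str, List[str]], probability: float,
             terminate: Callable[[str], None],
             rng: Optional[random.Random] = None) -> List[str]:
    """One chaos run: choose victims and hand each to the terminate callable."""
    if rng is None:
        rng = random.Random()
    victims = pick_victims(groups, probability, rng)
    for instance_id in victims:
        terminate(instance_id)
    return victims


# Usage with a stand-in terminate function that only records the ids.
groups = {"api": ["i-1", "i-2"], "web": ["i-3"], "empty": []}
terminated: List[str] = []
victims = run_once(groups, 1.0, terminated.append, random.Random(0))
```

Passing `terminate` in as a parameter keeps the selection logic testable without touching real infrastructure.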
  • 17. And that’s exactly what we did
  • 20. Chaos Monkey taught us… • State is bad • Clusters are good • Surviving single instance failure is not enough
  • 21. Lots of instances fail
  • 23. Chaos Gorilla taught us… • Hidden assumptions on deployment topology • Infrastructure control plane can be a bottleneck • Large-scale events are hard to simulate • Rapidly shifting traffic is error prone • Smooth recovery is a challenge • Cassandra works as expected
  • 24. What about larger catastrophes? Anyone remember Sandy?
  • 25. Chaos Kong (*some day soon*)
  • 26. The Sick and Wounded
  • 29. Hystrix, RxJava http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
  • 30. Latency Monkey taught us • Startup resiliency is often missed • An ongoing unified approach to runtime dependency management is important (visibility & transparency get missed otherwise) • Know thy neighbor (unknown dependencies) • Fallbacks can fail too
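A rough sketch of latency injection, assuming a dependency call wrapped with an artificial delay and a caller that enforces a timeout. The helper names (`with_injected_latency`, `call_with_timeout`) are hypothetical and this is not how Latency Monkey itself is built. The last lines show the "fallbacks can fail too" case: a broken fallback surfaces its own error.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


def with_injected_latency(func: Callable[[], T], delay_s: float) -> Callable[[], T]:
    """Wrap func so that every call is delayed by delay_s seconds."""
    def slowed() -> T:
        time.sleep(delay_s)
        return func()
    return slowed


def call_with_timeout(func: Callable[[], T], timeout_s: float,
                      fallback: Callable[[], T]) -> T:
    """Run func in a worker thread; use fallback if it exceeds timeout_s.

    Exceptions raised by func or by fallback propagate to the caller.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback()
    finally:
        # Do not block on the slow call; let the worker finish in the background.
        pool.shutdown(wait=False)


slow_call = with_injected_latency(lambda: "fresh", 0.3)
degraded = call_with_timeout(slow_call, 0.05, lambda: "cached")
healthy = call_with_timeout(lambda: "fresh", 1.0, lambda: "cached")


def broken_fallback() -> str:
    # Simulates a fallback path that was never exercised and does not work.
    raise RuntimeError("fallback cache is empty")


try:
    call_with_timeout(slow_call, 0.05, broken_fallback)
    fallback_failed = False
except RuntimeError:
    fallback_failed = True
```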
  • 32. Clutter accumulates • Complexity • Cruft • Vulnerabilities • Cost
  • 34. Janitor Monkey taught us… • Label everything • Clutter builds up
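One way to read "label everything" is that cleanup tooling can only decide what is safe to remove if resources carry ownership labels. The sketch below is a hypothetical rule, not Janitor Monkey's actual logic: the `Resource` type, the `owner` tag name, and the age threshold are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Resource:
    resource_id: str
    created: datetime
    tags: Dict[str, str] = field(default_factory=dict)


def cleanup_candidates(resources: List[Resource], now: datetime,
                       max_age: timedelta, required_tag: str = "owner") -> List[str]:
    """Return ids of resources that lack the required tag and are older than max_age."""
    return [
        r.resource_id
        for r in resources
        if required_tag not in r.tags and now - r.created > max_age
    ]


now = datetime(2013, 6, 1)
inventory = [
    Resource("vol-1", datetime(2013, 1, 1)),                      # old, unlabeled
    Resource("vol-2", datetime(2013, 1, 1), {"owner": "team-a"}),  # old, labeled
    Resource("vol-3", datetime(2013, 5, 30)),                     # new, unlabeled
]
candidates = cleanup_candidates(inventory, now, timedelta(days=30))
```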
  • 35. Ranks of the Simian Army • Chaos Monkey • Chaos Gorilla • Latency Monkey • Janitor Monkey • Conformity Monkey • Circus Monkey • Doctor Monkey • Howler Monkey • Security Monkey • Chaos Kong • Efficiency Monkey
  • 36. Observability is key • Don’t exacerbate real customer issues with failure exercises • Deep system visibility is key to root-cause failures and understand the system
  • 37. Organizational elements • Every engineer is an operator of the service • Each failure is an opportunity to learn • Blameless culture Goal is to create a learning organization
  • 39. Open Source Projects (legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon) Priam: Cassandra as a Service; Astyanax: Cassandra client for Java; CassJMeter: Cassandra test suite; Cassandra: Multi-region EC2 datastore support; Aegisthus: Hadoop ETL for Cassandra; AWS Usage: Spend analytics; Governator: Library lifecycle and dependency injection; Odin: Cloud orchestration; Blitz4j: Async logging; Exhibitor: Zookeeper as a Service; Curator: Zookeeper Patterns; EVCache: Memcached as a Service; Eureka / Discovery: Service Directory; Archaius: Dynamics Properties Service; Edda: Config state with history; Denominator; Ribbon: REST Client + mid-tier LB; Karyon: Instrumented REST Base Serve; Servo and Autoscaling Scripts; Genie: Hadoop PaaS; Hystrix: Robust service pattern; RxJava: Reactive Patterns; Asgard: AutoScaleGroup based AWS console; Chaos Monkey: Robustness verification; Latency Monkey; Janitor Monkey; Bakeries / Aminotor
  • 40. How does it all fit together?
  • 42. Our Current Catalog of Releases. Free code available at http://netflix.github.com
  • 43. Takeaways: Regularly inducing failure in your production environment validates resiliency and increases availability. Use the NetflixOSS platform to handle the heavy lifting for building large-scale distributed cloud-native applications.
  • 44. Thank you! Any questions? Ariel Tseitlin http://www.linkedin.com/in/atseitlin @atseitlin
  • 45. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/netflix-resiliency-failure-cloud