Resiliency through Failure - Netflix's Approach to Extreme Availability in the Cloud

@atseitlin

Resiliency through failure
Netflix's Approach to Extreme Availability in the Cloud

Ariel Tseitlin
http://www.linkedin.com/in/atseitlin

@atseitlin

Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/netflix-resiliency-failure-cloud

Presented at QCon New York
www.qconnewyork.com

@atseitlin

About Netflix

Netflix is the world’s leading Internet television network with more than 36 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]

[1] http://ir.netflix.com/

@atseitlin

A complex distributed system

@atseitlin

How Netflix Streaming Works

[Diagram; labels as they appear on the slide]
• Customer Device (PC, PS3, TV…)
• Web Site or Discovery API, User Data, Personalization
• Streaming API, DRM, QoS Logging
• OpenConnect CDN Boxes, CDN Management and Steering, Content Encoding
• Consumer Electronics, AWS Cloud Services, CDN Edge Locations
• Browse, Play, Watch

@atseitlin

Our goal is availability

• Members can stream Netflix whenever they want
• New users can explore and sign up for the service
• New members can activate their service and add new devices

@atseitlin

Failure is all around us

• Disks fail
• Power goes out. And your generator fails.
• Software bugs introduced
• People make mistakes

Failure is unavoidable

@atseitlin

We design around failure

• Exception handling
• Clusters
• Redundancy
• Fault tolerance
• Fall-back or degraded experience (Hystrix)
• All to insulate our users from failure

Is that enough?
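
The fall-back bullet above can be illustrated with a small circuit breaker. This is only a sketch in Python, not the Hystrix API (Hystrix is a Java library); the class name, the parameters `max_failures` and `reset_after`, and the `personalized_rows` / `popular_rows` functions are all hypothetical names chosen for the example.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: after too many consecutive failures, skip the
    primary call for a cool-down period and serve the fallback instead."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # circuit open: leave the dependency alone
            self.opened_at = None  # cool-down over: allow one trial call
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def personalized_rows():
    # Hypothetical dependency that is currently failing.
    raise TimeoutError("personalization service unavailable")


def popular_rows():
    # Hypothetical degraded experience: a static, non-personalized list.
    return ["Popular titles"]


breaker = CircuitBreaker(max_failures=2)
rows = breaker.call(personalized_rows, popular_rows)
```

The user still gets a page, just a less personalized one. Note that the fallback itself is not protected here; if it can fail, it needs its own handling.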

@atseitlin

It’s not enough

• How do we know if we’ve succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we prevent drifting into failure?

The typical answer is…

@atseitlin

More testing!

• Unit testing
• Integration testing
• Stress testing
• Exhaustive test suites to simulate and test all failure modes

Can we effectively simulate a large-scale distributed system?

@atseitlin

Building distributed systems is hard
Testing them exhaustively is even harder

• Massive data sets and changing shape
• Internet-scale traffic
• Complex interaction and information flow
• Asynchronous nature
• 3rd party services
• All while innovating and building features

Prohibitively expensive, if not impossible, for most large-scale systems

@atseitlin

What if we could reduce variability of failures?

@atseitlin

There is another way

• Cause failure to validate resiliency
• Test design assumptions by stressing them
• Don’t wait for random failure. Remove its uncertainty by forcing it periodically
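
As a rough illustration of forcing failure on a schedule, the sketch below picks at most one instance per cluster and hands it to a termination callback. It is not the actual Chaos Monkey code; `pick_victims`, `run_chaos`, the `probability` parameter and the `terminate` callback are hypothetical names, and a real run would call a cloud API where this example only records the choice.

```python
import random


def pick_victims(clusters, probability=1.0, rng=None):
    """Choose at most one instance per cluster to terminate.

    clusters: dict mapping cluster name -> list of instance ids.
    probability: chance that a given cluster is hit in this run.
    """
    rng = rng or random.Random()
    victims = {}
    for name, instances in sorted(clusters.items()):
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def run_chaos(clusters, terminate, probability=1.0, rng=None):
    """Terminate the chosen instances through the supplied callback."""
    victims = pick_victims(clusters, probability, rng)
    for name, instance_id in victims.items():
        terminate(name, instance_id)
    return victims
```

Run from a scheduler during working hours, a loop like this turns instance loss from a rare surprise into a routine event that every service has to tolerate.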

@atseitlin

And that’s exactly what we did

@atseitlin

Instances fail

@atseitlin

Chaos Monkey taught us…

• State is bad
• Clusters are good
• Surviving single instance failure is not enough

@atseitlin

Lots of instances fail

@atseitlin

Chaos Gorilla

@atseitlin

Chaos Gorilla taught us…

• Hidden assumptions on deployment topology
• Infrastructure control plane can be a bottleneck
• Large scale events are hard to simulate
• Rapidly shifting traffic is error prone
• Smooth recovery is a challenge
• Cassandra works as expected
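
One hidden topology assumption is that instances are spread evenly across zones. A minimal sketch of a check for that, assuming you can list each instance with its availability zone: the function name `zones_at_risk` and the `required` capacity threshold are hypothetical, and this is a static check, not a replacement for actually taking a zone down.

```python
from collections import Counter


def zones_at_risk(instance_zones, required):
    """Return zones whose loss would leave fewer than `required` instances.

    instance_zones: dict mapping instance id -> availability zone name.
    """
    per_zone = Counter(instance_zones.values())
    total = sum(per_zone.values())
    return sorted(z for z, n in per_zone.items() if total - n < required)
```

A cluster with three instances in one zone and one in another would be flagged for the first zone when two instances are required to carry the load.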

@atseitlin

What about larger catastrophes?

Anyone remember Sandy?

@atseitlin

Chaos Kong (*some day soon*)

@atseitlin

The Sick and Wounded

@atseitlin

Latency Monkey

@atseitlin

Hystrix, RxJava

http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

@atseitlin

Latency Monkey taught us

• Startup resiliency is often missed
• An ongoing unified approach to runtime dependency management is important (visibility & transparency get missed otherwise)
• Know thy neighbor (unknown dependencies)
• Fall-backs can fail too
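
To make the idea of latency injection concrete, here is a small sketch: one helper slows a dependency down artificially, the other calls it with a timeout and falls back when it is too slow. It uses only the Python standard library and is not how Latency Monkey or Hystrix are implemented; `with_latency` and `call_with_timeout` are hypothetical helper names.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def with_latency(func, delay):
    """Wrap func so every call is delayed by `delay` seconds."""
    def slow(*args, **kwargs):
        time.sleep(delay)
        return func(*args, **kwargs)
    return slow


def call_with_timeout(func, timeout, fallback):
    """Run func in a worker thread; return fallback() if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(func).result(timeout=timeout)
    except TimeoutError:
        return fallback()
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        pool.shutdown(wait=False)


slow_lookup = with_latency(lambda: "fresh", delay=0.3)
result = call_with_timeout(slow_lookup, timeout=0.05, fallback=lambda: "cached")
```

In this sketch an exception raised by the fallback propagates to the caller, which is one way the last bullet above shows up in practice.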

@atseitlin

Clutter accumulates

• Complexity
• Cruft
• Vulnerabilities
• Cost

@atseitlin

Janitor Monkey

@atseitlin

Janitor Monkey taught us…

• Label everything
• Clutter builds up
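
A sketch of why labels matter for cleanup: without an owner label there is nobody to ask before deleting, and without a last-used time there is no way to tell clutter from a quiet but needed resource. The record layout (`id`, `tags`, `last_used`), the `owner` tag and the 30-day threshold are assumptions made for this example, not Janitor Monkey's actual rules.

```python
from datetime import datetime, timedelta


def find_clutter(resources, now, max_idle=timedelta(days=30)):
    """Return (resource id, reason) pairs for cleanup candidates.

    resources: iterable of dicts with 'id', 'tags' (dict) and 'last_used'
    (datetime). The field names are assumptions made for this sketch.
    """
    candidates = []
    for res in resources:
        if "owner" not in res.get("tags", {}):
            candidates.append((res["id"], "no owner label"))
        elif now - res["last_used"] > max_idle:
            candidates.append((res["id"], "idle too long"))
    return candidates
```

A real janitor would notify the owner and wait before deleting; this function only produces the candidate list.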

@atseitlin

Ranks of the Simian Army

• Chaos Monkey
• Chaos Gorilla
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Circus Monkey
• Doctor Monkey
• Howler Monkey
• Security Monkey
• Chaos Kong
• Efficiency Monkey

@atseitlin

Observability is key

• Don’t exacerbate real customer issues with failure exercises
• Deep system visibility is key to root-cause failures and understand the system
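
The first bullet suggests a guard in front of any failure exercise: check current health and skip the exercise if customers are already seeing problems. A minimal sketch, assuming error and request counts for a recent window are available from monitoring; `safe_to_inject` and its thresholds are hypothetical.

```python
def safe_to_inject(errors, requests, max_error_rate=0.01, min_requests=100):
    """Allow a failure exercise only when recent traffic looks healthy.

    Too little traffic counts as unknown health, so the answer is no.
    """
    if requests < min_requests:
        return False
    return errors / requests <= max_error_rate
```

With these defaults, 1 error in 1,000 requests allows the exercise, while 50 errors in 1,000 requests blocks it.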

@atseitlin

Organizational elements

• Every engineer is an operator of the service
• Each failure is an opportunity to learn
• Blameless culture

Goal is to create a learning organization

@atseitlin

Assembling the Puzzle

@atseitlin

Open Source Projects

Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon

• Priam: Cassandra as a Service
• Astyanax: Cassandra client for Java
• CassJMeter: Cassandra test suite
• Cassandra: Multi-region EC2 datastore support
• Aegisthus: Hadoop ETL for Cassandra
• AWS Usage: Spend analytics
• Governator: Library lifecycle and dependency injection
• Odin: Cloud orchestration
• Blitz4j: Async logging
• Exhibitor: Zookeeper as a Service
• Curator: Zookeeper Patterns
• EVCache: Memcached as a Service
• Eureka / Discovery: Service Directory
• Archaius: Dynamic Properties Service
• Edda: Config state with history
• Denominator
• Ribbon: REST Client + mid-tier LB
• Karyon: Instrumented REST Base Server
• Servo and Autoscaling Scripts
• Genie: Hadoop PaaS
• Hystrix: Robust service pattern
• RxJava: Reactive Patterns
• Asgard: AutoScaleGroup based AWS console
• Chaos Monkey: Robustness verification
• Latency Monkey
• Janitor Monkey
• Bakeries / Aminator

@atseitlin

How does it all fit together?

@atseitlin

Our Current Catalog of Releases

Free code available at http://netflix.github.com

@atseitlin

Takeaways

Regularly inducing failure in your production environment validates resiliency and increases availability

Use the NetflixOSS platform to handle the heavy lifting for building large-scale distributed cloud-native applications

@atseitlin

Thank you!

Any questions?

Ariel Tseitlin
http://www.linkedin.com/in/atseitlin

@atseitlin
