Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/16BEthL.
Ariel Tseitlin discusses Netflix's suite of tools, collectively called the Simian Army, used to improve resiliency and maintain the cloud environment. The tools simulate failure in order to see how the system reacts to it. Filmed at qconnewyork.com.
Ariel Tseitlin manages the Netflix Cloud and is interested in all things cloudy. At Netflix, he is Director of Cloud Solutions, helping Netflix be successful in the Cloud, including cloud tooling, monitoring, performance and scalability, and cloud operations and reliability engineering.
Resiliency through Failure - Netflix's Approach to Extreme Availability in the Cloud
1. Resiliency through failure
Netflix's Approach to Extreme Availability in the Cloud
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/netflix-resiliency-failure-cloud
3. Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. About Netflix
Netflix is the world’s leading Internet television network with more than 36 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]
[1] http://ir.netflix.com/
6. How Netflix Streaming Works
[Diagram. Labels recovered from the slide, in extraction order:]
Customer Device (PC, PS3, TV…)
Web Site or Discovery API; User Data; Personalization
Streaming API; DRM; QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering; Content Encoding
Consumer Electronics; AWS Cloud Services; CDN Edge Locations
Browse; Play; Watch
9. Our goal is availability
• Members can stream Netflix whenever they want
• New users can explore and sign up for the service
• New members can activate their service and add new devices
10. Failure is all around us
• Disks fail
• Power goes out. And your generator fails.
• Software bugs introduced
• People make mistakes
Failure is unavoidable
11. We design around failure
• Exception handling
• Clusters
• Redundancy
• Fault tolerance
• Fall-back or degraded experience (Hystrix)
• All to insulate our users from failure
Is that enough?
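The fall-back bullet on this slide can be illustrated with a minimal sketch. It is not Hystrix's API (Hystrix is a Java library); the class `FallbackGuard`, its parameters, and the two row functions are hypothetical names chosen only to show the idea of serving a degraded response and skipping a dependency that keeps failing.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class FallbackGuard:
    """Hypothetical helper: run a call, serve a fallback when it fails.

    After `max_failures` consecutive failures the primary call is skipped
    entirely (a very simplified circuit breaker with no reset timer).
    """

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def run(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open:
            return fallback()  # stop calling a dependency that keeps failing
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            return fallback()  # degraded experience instead of an error
        self.consecutive_failures = 0
        return result


def personalized_rows() -> list:
    # Simulated outage of an assumed personalization dependency.
    raise ConnectionError("personalization service unavailable")


def popular_rows() -> list:
    # Static, non-personalized content that needs no remote call.
    return ["Popular on Netflix"]


guard = FallbackGuard(max_failures=2)
rows = [guard.run(personalized_rows, popular_rows) for _ in range(3)]
```

In this sketch all three requests get the static rows; the third one never touches the failing dependency because the guard has already opened. A production breaker would also need a way to close again once the dependency recovers.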
12. It’s not enough
• How do we know if we’ve succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we prevent drifting into failure?
The typical answer is…
13. More testing!
• Unit testing
• Integration testing
• Stress testing
• Exhaustive test suites to simulate and test all failure modes
Can we effectively simulate a large-scale distributed system?
14. Building distributed systems is hard
Testing them exhaustively is even harder
• Massive data sets and changing shape
• Internet-scale traffic
• Complex interaction and information flow
• Asynchronous nature
• 3rd party services
• All while innovating and building features
Prohibitively expensive, if not impossible, for most large-scale systems
16. There is another way
• Cause failure to validate resiliency
• Test design assumptions by stressing them
• Don’t wait for random failure. Remove its uncertainty by forcing it periodically
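A minimal sketch of what "forcing failure periodically" could look like, in the spirit of Chaos Monkey but not taken from its code. The function names, the group layout, and the `terminate` callback are assumptions made for illustration; a real run would call a cloud API where this sketch appends to a list.

```python
import random
from typing import Callable, Dict, List


def pick_victims(
    groups: Dict[str, List[str]],
    probability: float,
    rng: random.Random,
) -> Dict[str, str]:
    """Choose at most one instance to terminate per group.

    Groups with a single instance are skipped: killing it would cause an
    outage rather than test redundancy.
    """
    victims: Dict[str, str] = {}
    for name, instances in sorted(groups.items()):
        if len(instances) < 2:
            continue
        if rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def run_exercise(
    groups: Dict[str, List[str]],
    probability: float,
    rng: random.Random,
    terminate: Callable[[str], None],
) -> Dict[str, str]:
    """Force failure now instead of waiting for it to happen by chance."""
    victims = pick_victims(groups, probability, rng)
    for instance_id in victims.values():
        terminate(instance_id)
    return victims


# Hypothetical groups and instance ids.
groups = {"api": ["i-a1", "i-a2", "i-a3"], "billing": ["i-b1"]}
killed: List[str] = []
victims = run_exercise(groups, 1.0, random.Random(7), killed.append)
```

Calling `run_exercise` from a scheduler during working hours turns a rare, surprising event into a routine one that the team sees while it is available to respond.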
23. Chaos Gorilla taught us…
• Hidden assumptions on deployment topology
• Infrastructure control plane can be a bottleneck
• Large scale events are hard to simulate
• Rapidly shifting traffic is error prone
• Smooth recovery is a challenge
• Cassandra works as expected
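Chaos Gorilla, as Netflix has described it publicly, simulates the loss of an entire availability zone. The sketch below shows one way a hidden topology assumption might be checked on paper before such an exercise. The function name, zone names, and capacity figures are invented for illustration, and the model (every instance serves the same request rate) is a deliberate simplification.

```python
from typing import Dict


def survives_zone_loss(
    instances_per_zone: Dict[str, int],
    peak_rps: float,
    rps_per_instance: float,
) -> Dict[str, bool]:
    """For each zone, report whether the remaining zones could carry
    peak load if that zone disappeared."""
    total = sum(instances_per_zone.values())
    result: Dict[str, bool] = {}
    for zone, count in instances_per_zone.items():
        remaining_capacity = (total - count) * rps_per_instance
        result[zone] = remaining_capacity >= peak_rps
    return result


# Hypothetical, unevenly balanced deployment.
report = survives_zone_loss(
    {"zone-a": 10, "zone-b": 10, "zone-c": 4},
    peak_rps=1500,
    rps_per_instance=100,
)
```

Here losing `zone-c` is survivable, but losing `zone-a` or `zone-b` leaves 1400 requests per second of capacity against a 1500 peak. A check like this only covers steady-state capacity; the slide's other lessons (control plane limits, traffic shifting, recovery) show up only when the failure is actually exercised.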
30. Latency Monkey taught us
• Startup resiliency is often missed
• An ongoing unified approach to runtime dependency management is important (visibility & transparency gets missed otherwise)
• Know thy neighbor (unknown dependencies)
• Fall backs can fail too
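The last bullet can be made concrete with a small sketch of latency injection combined with a fallback chain. This is not Latency Monkey's implementation; `inject_latency`, `call_with_timeout`, and the example functions are hypothetical. The point it shows is that a fallback may be slow or broken too, so the final answer is a plain value that needs no call at all.

```python
import concurrent.futures
import time
from typing import Any, Callable, Sequence

# Shared worker pool so a slow call does not block the caller.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def inject_latency(call: Callable[[], Any], delay_s: float) -> Callable[[], Any]:
    """Return a version of `call` that sleeps `delay_s` before running."""
    def slowed() -> Any:
        time.sleep(delay_s)
        return call()
    return slowed


def call_with_timeout(
    call: Callable[[], Any],
    timeout_s: float,
    fallbacks: Sequence[Callable[[], Any]],
    last_resort: Any,
) -> Any:
    """Try `call`, then each fallback in order, each under `timeout_s`.

    Any attempt that raises or exceeds the timeout is treated as failed.
    `last_resort` is returned when every attempt has failed.
    """
    for attempt in (call, *fallbacks):
        future = _POOL.submit(attempt)
        try:
            return future.result(timeout=timeout_s)
        except Exception:
            # Covers both the timeout and errors raised by the attempt.
            continue
    return last_resort


def recommendations() -> list:
    return ["personalized"]


def cached_recommendations() -> list:
    # Simulated broken fallback.
    raise RuntimeError("cache unavailable")


slow_recommendations = inject_latency(recommendations, 0.3)
fast_result = call_with_timeout(recommendations, 0.1, [], ["popular"])
degraded_result = call_with_timeout(
    slow_recommendations, 0.05, [cached_recommendations], ["popular"]
)
```

With the injected 0.3 second delay the primary call misses its 0.05 second budget, the cache fallback raises, and the caller still receives the static list. Note that the slow call keeps occupying a worker thread until it finishes, which is one reason unbounded latency in a dependency is dangerous.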
35. Ranks of the Simian Army
• Chaos Monkey
• Chaos Gorilla
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Circus Monkey
• Doctor Monkey
• Howler Monkey
• Security Monkey
• Chaos Kong
• Efficiency Monkey
36. Observability is key
• Don’t exacerbate real customer issues with failure exercises
• Deep system visibility is key to root-cause failures and understand the system
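One way to read the first bullet is as a precondition on every failure exercise: check current health before injecting anything. The sketch below assumes a monitoring system can supply an error rate and an incident count; `HealthSnapshot`, `safe_to_inject`, and the 1% threshold are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    error_rate: float     # fraction of failed requests, 0.0 to 1.0
    open_incidents: int   # incidents currently being worked


def safe_to_inject(snapshot: HealthSnapshot, max_error_rate: float = 0.01) -> bool:
    """Allow a failure exercise only while the service looks healthy."""
    return snapshot.open_incidents == 0 and snapshot.error_rate <= max_error_rate


healthy = HealthSnapshot(error_rate=0.002, open_incidents=0)
degraded = HealthSnapshot(error_rate=0.05, open_incidents=1)
decisions = (safe_to_inject(healthy), safe_to_inject(degraded))
```

The gate is only as good as the signals behind it, which is the slide's second point: without deep visibility there is no reliable way to tell an exercise from a real customer-facing problem.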
37. Organizational elements
• Every engineer is an operator of the service
• Each failure is an opportunity to learn
• Blameless culture
Goal is to create a learning organization
39. Open Source Projects
Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
• Priam: Cassandra as a Service
• Astyanax: Cassandra client for Java
• CassJMeter: Cassandra test suite
• Cassandra: Multi-region EC2 datastore support
• Aegisthus: Hadoop ETL for Cassandra
• AWS Usage: Spend analytics
• Governator: Library lifecycle and dependency injection
• Odin: Cloud orchestration
• Blitz4j: Async logging
• Exhibitor: Zookeeper as a Service
• Curator: Zookeeper Patterns
• EVCache: Memcached as a Service
• Eureka / Discovery: Service Directory
• Archaius: Dynamic Properties Service
• Edda: Config state with history
• Denominator
• Ribbon: REST Client + mid-tier LB
• Karyon: Instrumented REST Base Serve
• Servo and Autoscaling Scripts
• Genie: Hadoop PaaS
• Hystrix: Robust service pattern
• RxJava: Reactive Patterns
• Asgard: AutoScaleGroup based AWS console
• Chaos Monkey: Robustness verification
• Latency Monkey
• Janitor Monkey
• Bakeries / Aminator
42. Our Current Catalog of Releases
Free code available at http://netflix.github.com
43. Takeaways
Regularly inducing failure in your production environment validates resiliency and increases availability
Use the NetflixOSS platform to handle the heavy lifting for building large-scale distributed cloud-native applications
44. Thank you!
Any questions?
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin