Surviving Black Friday - A resilience engineering tale - Omri Fima - Codemotion Rome 2018

Surviving Black Friday
Omri Fima
ROME - APRIL 13/14 2018

SURVIVING BLACK FRIDAY
A STORY ABOUT SHOPPING & SOFTWARE ENGINEERING

I’m not going to talk
about that kind of
Surviving
Black Friday

Omri Fima
• Technical Lead at
Data & ML Group
• Resilience engineer
• Maker

Social E-Commerce
We give our members a
place to connect, discover,
and shop with people who
share their passion.

Typical E-Commerce Architecture

It’s probably more like this

[Diagram labels: Load Balancer, cluster, cluster, CDN. Failure points: Long GC, Table lock, Slow API, Server in Japan, Limited network]

So let's see what happens
when a server fails

But even this is not what
happens most of the time.

Under load: Choking, Choking, Choking, Choking

No load (yay!): Dead, Dead, Dead, Dead

Sad Truth #1
A bazillion servers in the cloud will not make it
better.
Under certain scenarios it might even make it
worse.

Sad Truth #2
It won’t happen only on Black Friday.
Other likely scenarios:
a good marketing campaign, bad code, API
load, failure in external services, hurricanes.

Sad Truth #3
Even 99.99% reliability on 30 servers
is 2 hours of downtime per month
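One way to arrive at the "2 hours" figure is sketched below. The assumptions are mine, not the slides': a 30-day month, independent failures, and a request that needs all 30 servers to be up at once.

```python
# Assumptions (not from the slides): 30-day month, independent failures,
# and every request depends on all 30 servers being available.
HOURS_PER_MONTH = 30 * 24  # 720


def combined_availability(per_server: float, servers: int) -> float:
    """Probability that every server is up at the same moment."""
    return per_server ** servers


def monthly_downtime_hours(availability: float) -> float:
    """Expected hours per month during which at least one server is down."""
    return (1.0 - availability) * HOURS_PER_MONTH


overall = combined_availability(0.9999, 30)
downtime = monthly_downtime_hours(overall)
print(f"combined availability: {overall:.4%}")
print(f"expected downtime: {downtime:.2f} hours/month")
```

Under these assumptions the combined availability drops to roughly 99.7%, which is a little over two hours of a 720-hour month.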

2 Hours on Black Friday
can be worth millions of $

Our job is to stop
failure from cascading
throughout the system

OK, nice but where do I start?

First you need to understand your SLA
What are our most important flows?
What does it mean when they fail?
How much can we allow them to fail?

Your “most essential” page

We can agree on an SLA?
GREAT!
Because this is usually the hardest part.

You don’t know if a service is fault
tolerant if you don’t test for faults

Simulate and test
Hydra - simulates service latency and errors.
JMeter - tests our system under load.
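To make "simulate service latency and error" concrete, here is a minimal fault-injection sketch. It does not use Hydra's API; `with_faults`, `InjectedFault`, and `fetch_price` are hypothetical names invented for this illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency error (hypothetical name)."""


def with_faults(call: Callable[[], T], *, error_rate: float, latency_rate: float,
                extra_latency_s: float, rng: random.Random) -> Callable[[], T]:
    """Wrap a dependency call so it is sometimes slow and sometimes fails."""
    def wrapped() -> T:
        if rng.random() < latency_rate:
            time.sleep(extra_latency_s)  # simulated slow response
        if rng.random() < error_rate:
            raise InjectedFault("simulated dependency error")
        return call()
    return wrapped


def fetch_price() -> int:
    """Stand-in for a real dependency call (hypothetical)."""
    return 42


flaky = with_faults(fetch_price, error_rate=0.3, latency_rate=0.5,
                    extra_latency_s=0.001, rng=random.Random(7))
outcomes = []
for _ in range(200):
    try:
        outcomes.append(flaky())
    except InjectedFault:
        outcomes.append(None)
failures = outcomes.count(None)
print(f"{failures} of {len(outcomes)} calls failed")
```

Running a load test against a service wrapped like this shows how callers behave when the dependency is slow or erroring, rather than simply gone.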

Why not just kill machines?
Dead machines are
the least interesting.

Why under load?
Failure is a game of
percentiles.

Why under load?
Latency failures
usually cascade
under stress.

Now we know what to solve first,
and can easily re-test it!

Preventing latency
cascade through the
system.

[Diagram: user requests calling Dependency A, Dependency B, Dependency C, and Dependency D. First 3 user requests, then 16 piling up]

What is a good
number for a timeout?

I’m OK with
0.05% of my
users getting
timeouts
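A timeout budget like this can be turned into a number by reading it off measured latencies. The sketch below assumes you have latency samples for the dependency; `timeout_for_budget` and the sample data are hypothetical. The budget is expressed per 10,000 requests (0.05% is 5 per 10,000) to keep the arithmetic in integers.

```python
from typing import Sequence


def timeout_for_budget(latencies_ms: Sequence[float], allowed_per_10k: int) -> float:
    """Smallest observed latency that at most the allowed share of samples exceed."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if not 0 <= allowed_per_10k < 10_000:
        raise ValueError("allowed_per_10k must be in [0, 10000)")
    ordered = sorted(latencies_ms)
    # How many samples are allowed to be slower than the timeout.
    may_exceed = len(ordered) * allowed_per_10k // 10_000
    return ordered[len(ordered) - 1 - may_exceed]


# Made-up samples: mostly fast, a few slow, a few very slow.
samples = [20.0] * 9990 + [80.0] * 5 + [900.0] * 5
timeout_ms = timeout_for_budget(samples, allowed_per_10k=5)
print(f"timeout: {timeout_ms} ms")
```

With these made-up samples the five 900 ms requests are the ones allowed to time out, so the timeout lands on the next-slowest value.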

Bulkhead pattern
[Diagram: user requests go through Command A, Command B, Command C, and Command D, each with its own thread pool. Pool sizes shown: 20, 10, 15, and 5 threads]
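The idea behind the bulkhead can be sketched as a cap on concurrent calls per dependency. This version swaps the thread pools from the slide for a semaphore, which is simpler to show; `Bulkhead`, `BulkheadFull`, and the limit of 1 are hypothetical choices for the demo.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFull(Exception):
    """Raised when a dependency already has its maximum number of calls in flight."""


class Bulkhead:
    """Caps concurrent calls to one dependency (semaphore-based, not a thread pool)."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, call: Callable[[], T]) -> T:
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name}: no free slot")
        try:
            return call()
        finally:
            self._slots.release()


pool_d = Bulkhead("command_d", max_concurrent=1)  # hypothetical limit


def call_while_busy() -> str:
    """Attempt a second call while the only slot is still taken."""
    try:
        return pool_d.run(lambda: "second call ran")
    except BulkheadFull:
        return "second call rejected"


busy_result = pool_d.run(call_while_busy)
idle_result = pool_d.run(lambda: "ran")
print(busy_result, "/", idle_result)
```

A slow dependency can then exhaust only its own slots, while calls to other dependencies keep their capacity.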

Ok,
so now we are protecting our servers
while annoying our users.

A little bit too aggressive…

Degrade to less
accurate service

Failover to another
experience
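Both ideas, degrading to a less accurate service and failing over to another experience, can be sketched as an ordered list of fallbacks. All function names below are hypothetical stand-ins.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def first_success(options: Sequence[Callable[[], T]]) -> T:
    """Try each option in order and return the first result that does not raise."""
    last_error = None
    for option in options:
        try:
            return option()
        except Exception as error:  # any failure moves us to the next option
            last_error = error
    raise RuntimeError("all options failed") from last_error


def personalized_recommendations() -> List[str]:
    """Most accurate source, failing here to simulate an outage (hypothetical)."""
    raise TimeoutError("recommendation service too slow")


def cached_bestsellers() -> List[str]:
    """Less accurate but cheap and local (hypothetical)."""
    return ["bestseller-1", "bestseller-2"]


def static_banner() -> List[str]:
    """A different experience altogether (hypothetical)."""
    return ["seasonal-banner"]


shown = first_success([personalized_recommendations, cached_bestsellers, static_banner])
print(shown)
```

The user still sees something useful, and the failing service is not in the critical path of the page.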

And if I cannot fail silently?

What counts as not failing gracefully?

Feature Toggles
• Have the option to kill your features when
needed.
• Have the option to bring your feature back
to life gradually.
0%
50%
100%
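A kill switch with gradual return can be sketched as a percentage toggle. The hashing scheme here is an assumption for illustration, not the speaker's implementation; `is_enabled` and the feature name are hypothetical.

```python
import hashlib


def is_enabled(feature: str, user_id: str, rollout_percent: int) -> bool:
    """Decide whether a feature is on for a user at the given rollout percentage."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    # Hash feature and user together so each user lands in a stable bucket 0..99.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


users = [f"user-{n}" for n in range(1000)]
for percent in (0, 50, 100):
    enabled = sum(is_enabled("recommendations", user, percent) for user in users)
    print(f"{percent}% rollout: feature on for {enabled} of {len(users)} users")
```

Because the bucket is stable per user, raising the percentage only ever adds users, so a feature can come back from 0% in steps without flickering on and off for the same person.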

Added bonus:
Gradual rollout
A/B testing
Targeting
Smart bulkheads

OK, so we have a timeout of 100 ms.
The day we were all afraid of has
come.
And it works!
Nobody waits more than 100 ms!
But everybody waits 100 ms.
And fails.

You knew you were going to fail!
So why waste everybody's
time trying?

Break if you are
likely to
“burn the house”
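This is the circuit breaker idea: once a dependency is clearly failing, stop calling it and fail immediately. The sketch below is a simplified, single-threaded version that opens after a number of consecutive failures; libraries such as Hystrix use richer statistics. `CircuitBreaker`, `CircuitOpen`, and the thresholds are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when the breaker fails fast without calling the dependency."""


class CircuitBreaker:
    """Opens after N consecutive failures, retries one call after a cool-down."""

    def __init__(self, failure_threshold: int, reset_after_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpen("failing fast, dependency not called")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._consecutive_failures += 1
            was_trial = self._opened_at is not None
            if was_trial or self._consecutive_failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._consecutive_failures = 0
        self._opened_at = None  # success closes the circuit
        return result


now = [0.0]  # fake clock so the demo does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])
attempts = []


def slow_dependency() -> str:
    attempts.append(now[0])
    raise TimeoutError("dependency timed out")


log = []
for _ in range(4):
    try:
        breaker.call(slow_dependency)
    except CircuitOpen:
        log.append("rejected")
    except TimeoutError:
        log.append("timed out")
now[0] = 11.0  # cool-down has passed
log.append(breaker.call(lambda: "ok"))
print(log, "dependency attempts:", len(attempts))
```

After the second timeout, further callers are rejected immediately instead of each waiting out the full timeout.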

Tools
• Netflix Hystrix (Java)
• Polly & TPL (.NET)
• HystrixJS (Node)
• CircuitBreakerJS (Node)
• Dyno (Python)

Measure everything
• when you have failed.
• latency of dependencies.
• when fallbacks were triggered.
• when fallbacks were “almost” triggered.
• when a service failure has “leaked” to the
user.

Tools we are using
• StatsD – statistical analysis on metrics
• Graphite – graphing metrics
• Grafana – graphing just about anything
• Skyline – alerts on anomalies in graphs.
• Oculus – finds relations between graphs.
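As a rough illustration of "measure everything", the events from the previous slide can be emitted as StatsD lines over UDP. The metric names and helper functions are hypothetical; the `name:value|c` and `name:value|ms` formats are the standard StatsD counter and timer lines, and 8125 is the usual StatsD port.

```python
import socket
from typing import List


def format_counter(name: str, value: int = 1) -> str:
    """StatsD counter line, e.g. 'checkout.failed:1|c'."""
    return f"{name}:{value}|c"


def format_timing(name: str, millis: float) -> str:
    """StatsD timer line in whole milliseconds, e.g. 'inventory.latency:87|ms'."""
    return f"{name}:{round(millis)}|ms"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; metrics must never break the request path."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(line.encode("ascii"), (host, port))
    except OSError:
        pass  # no StatsD daemon or no network: drop the metric silently


# Hypothetical metric names, one per item on the "Measure everything" slide.
lines: List[str] = [
    format_counter("checkout.failed"),
    format_timing("inventory.latency", 87.4),
    format_counter("recommendations.fallback.triggered"),
    format_counter("recommendations.fallback.almost_triggered"),
    format_counter("recommendations.failure.leaked_to_user"),
]
for line in lines:
    send_metric(line)
    print(line)
```

One possible reading of "almost triggered" is a call that finished close to its timeout; counting those separately gives early warning before fallbacks actually fire.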

Not so much.
Measuring everything is easy;
clearing the noise is hard.

But guess what?
Clearing the noise makes the
system more fault tolerant!

• Know your priorities
• Simulate, Test, Fix, Repeat.
• Fail Fast
• Fail Silent
• Turn broken stuff off
• Turn broken stuff off (automatically)
• Measure everything

THANK YOU
MORE ABOUT US @ SEARS.CO.IL
