The ‘Black Friday fail’ is the greatest fear of every major online retailer. Downtime equals lost money, and on Black Friday it means quite a lot of money. But the sad truth is that service failure is inevitable, and not only on Black Friday. So how can we survive when a service inevitably fails? In this lecture I will share our approach to SRE: why we all have misconceptions about how a major website failure unfolds, and how to use tools like chaos testing, gradual rollout, circuit breakers, and automatic fallback to protect your system.
39. A bazillion servers in the cloud will not make it better.
Under certain scenarios it might even make it worse.
Sad Truth #1
40. It won’t happen only on Black Friday.
Other good scenarios: a good marketing campaign, bad code, API load, failure in external services, hurricanes.
Sad Truth #2
82. Bulkhead pattern
[Diagram: user requests pass through Command A, Command B, Command C, and Command D to reach Dependency A, Dependency B, Dependency C, and Dependency D. Each command has its own thread pool; pools of 20, 10, 15, and 5 threads are shown.]
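A minimal sketch of the bulkhead idea in Python. It caps concurrent calls per dependency with a semaphore rather than giving each dependency a dedicated thread pool as the slide shows, so it is a simplification. The names `Bulkhead` and `BulkheadFullError` are hypothetical, and the limits are illustrative only.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency has no free slot left (hypothetical name)."""


class Bulkhead:
    """Caps the number of concurrent calls to one dependency."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject at once instead of queueing,
        # so a slow dependency cannot absorb every request thread.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(self.name)
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Illustrative limits only; one bulkhead per dependency.
bulkhead_a = Bulkhead("dependency-a", max_concurrent=20)
bulkhead_b = Bulkhead("dependency-b", max_concurrent=10)
```

When Dependency A is saturated, calls to it are rejected immediately, while calls through `bulkhead_b` are unaffected.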
83. Ok, so now we are protecting our servers while annoying our users.
105. Ok, so we have a timeout of 100ms.
The day we were all afraid of has come.
And it works!
Nobody waits more than 100ms!
But everybody waits 100ms.
And fails.
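A minimal sketch of a call with a timeout and a fallback, in Python. `call_with_timeout` and the shared pool are hypothetical names, and the pool size is an assumption. When the dependency is down, every caller still waits the full timeout before getting the fallback, which is the situation this slide describes.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical shared pool for dependency calls; the size is an assumption.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, fallback):
    """Run func in the pool; return fallback if it takes longer than timeout_s."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the task has not started yet; a running
        # call keeps its worker thread busy until it finishes on its own.
        future.cancel()
        return fallback
```

Usage: `call_with_timeout(fetch_recommendations, 0.1, [])`, where `fetch_recommendations` stands in for any dependency call.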
106. You knew you were going to fail!
So why waste everybody’s time on trying?
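One way to stop wasting everybody's time is a circuit breaker: after repeated failures, skip the call and return the fallback at once. The sketch below is a simplified illustration in Python, not the implementation used in the talk. The class name, thresholds, and injectable `clock` parameter are assumptions.

```python
import time


class CircuitBreaker:
    """Fails fast once a dependency has failed too many times in a row."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: do not even try, answer with the fallback now.
                return fallback
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            return fallback
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, users get the fallback without waiting for a timeout. After the cool-down, a single trial call decides whether to close the circuit again.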
112. Measure everything
• when you have failed.
• latency of dependencies.
• when fallbacks were triggered.
• when fallbacks were “almost” triggered.
• when a service failure has “leaked” to the user.
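The measurements above could be emitted roughly as follows, sketched in Python using the StatsD plain-text line format (`name:value|type`, with `c` for counters and `ms` for timers). The metric names, the `Metrics` and `record_call` helpers, and the 80% threshold used for "almost triggered" are all assumptions for illustration.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line, e.g. 'search.latency:85|ms'."""
    return f"{name}:{value}|{metric_type}"


class Metrics:
    """Thin wrapper; `send` is any callable that accepts one line of text."""

    def __init__(self, send):
        self._send = send

    def count(self, name, value=1):
        self._send(format_metric(name, value, "c"))

    def timing(self, name, millis):
        self._send(format_metric(name, int(millis), "ms"))


def udp_sender(host="127.0.0.1", port=8125):
    """Return a send function that writes to StatsD over UDP (8125 is its default port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(line):
        sock.sendto(line.encode("ascii"), (host, port))

    return send


def record_call(metrics, dependency, latency_ms, timeout_ms, used_fallback, near_ratio=0.8):
    """Record latency and whether the fallback was, or almost was, triggered."""
    metrics.timing(f"{dependency}.latency", latency_ms)
    if used_fallback:
        metrics.count(f"{dependency}.fallback.triggered")
    elif latency_ms >= near_ratio * timeout_ms:
        # Assumption: "almost" means the call used 80% or more of its timeout.
        metrics.count(f"{dependency}.fallback.almost")
```

Usage: `record_call(Metrics(udp_sender()), "search", 85, 100, used_fallback=False)` sends a latency timer and an "almost triggered" counter.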
114. Tools we are using
• StatsD – statistical analysis on metrics
• Graphite – graphing metrics
• Grafana – graphing just about anything
• Skyline – alert on anomalies on graphs.
• Oculus – find relations between graphs.