SlideShare una empresa de Scribd logo
1 de 123
Descargar para leer sin conexión
Surviving Black Friday
Omri Fima
ROME - APRIL 13/14 2018
SURVIVING
BLACK FRIDAYA STORY ABOUT SHOPPING & SOFTWARE ENGINEERING
I’m not going to talk
about that kind of
Surviving
Black Friday
#BlackFridayFail
Omri Fima
• Technical Lead at
Data & ML Group
• Resilience engineer
• Maker
Social E-Commerce
We give our members a
place to connect, discover,
and shop with people who
share their passion.
My Story
Typical E-Commerce Architecture
Well, not really
It’s probably more like this
or more accurately,
like this
Long
GC
Load Balancer
cluster cluster
Table
lockSlow
API
Server
in japan
Limited network
CDN
So lets see what happens
when a server fails
Or in real life
But even this is not what
happens most of times.
Usually it’s more like this
Under
Load
Under
Load
Choking Choking Choking
Under
Load
Choking
Dead Dead Dead
Yay!
No
Load
Dead
Oy Vey…
Bazillion servers on the cloud will not make it
better.
Under certain scenarios it might even make it
worse.
Sad Truth #1
It won’t happen only in black Friday.
Other good scenarios
Good marketing campaign, Bad code, API
load, Failure in external services, Hurricanes.
Sad Truth #2
Even 99.99% reliability on 30 servers
is 2HR of downtime per month
Sad Truth #3
2 Hours on Black Friday
can be worth millions of $
You can accept it
or...
Resilience
Engineering
our job is to stop
failure from cascading
throughout the system
OK, nice but where do I start?
Just for you I have …
First you need to understand your SLA
What are our most important flows?
What it means when they fail?
How much can we allow them to fail?
Your page
Your “essential” page
Your “most essential” page
We can agree on SLA?
GREAT!
Because this is usually the hardest part.
You don’t know if a service is fault
tolerant if you don’t test for faults
Simulate and test
Hydra – simulate service latency and error
JMETER - for testing our system under load.
Why not to just kill machines?
Dead machines are
The least interesting.
Why under load?
Failure is a game of
percentiles.
Why under load?
Latency failures
usually cascade
under stress.
Hydra + JMETER = Great fun
Now we know what to solve first,
and can easily re-test it!
Let’s pause for a second
Focus
Preventing latency
cascade through the
system.
Latency is
UX killer
Latency is
Server killer
Dependency A Dependency B Dependency C Dependency D
UserRequest
UserRequest
UserRequest
Dependency A Dependency B Dependency C Dependency D
UserRequest
UserRequest
UserRequest
Dependency A Dependency B Dependency C Dependency D
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
UserRequest
Aggressive
Timeouts
What is a good
number for a timeout?
it depends
I’m OK with
0.05% of my
users getting
timeouts
Limit queues
Bulkhead pattern
20 Threads 10 Threads 15 Threads
UserRequest
UserRequest
UserRequest
Dependency A Dependency B Dependency C Dependency D
5 Threads
Command A Command B Command C Command D
Ok,
so now we are protecting our servers
while annoying our users.
Remember this?
Fail Silently
A little bit too aggressive…
In the real world
Degrade to less
accurate service
Failover to another
experience
Use stale data
from cache
Return Empty data
Defer writes
And if I cannot fail silently?
At least fail gracefully
What is not failing gracefully?
What is not failing gracefully?
What is not failing gracefully?
What is not failing gracefully?
Feature Toggles
• Have the option to kill your features when
needed.
• Have the option to return your feature back
to life gradually.
0%
50%
100%
Gradual rollout
AB Testing
Targeting
Smart bulkheads.
Added Bonus
Ok, so we have timeout of 100ms.
The day we were all afraid of has
come.
And it works!
Nobody waits more than 100ms!
But everybody waits 100ms.
And fail
You knew you going to fail!
So why wasting everybody
time on trying?
Break if you are
likely to
“burn the house”
Tools
• Netflix Hystrix (Java)
• Polly & TPL (.Net)
• HystrixJS (Node)
• CircuitBreakerJS (Node)
• Dyno (Python)
Measure everything
• when you have failed.
• latency of dependencies.
• when fallbacks were triggered.
• when fallbacks were “almost” triggered.
• when a service failure has “leaked” to the
user.
Visualize everything
Tools we are using
• StatsD – statistical analysis on metrics
• Graphite – graphing metrics
• Graphana – graphing just about anything
• Skyline – alert on anomalies on grpahs.
• Oculus - find relations between graphs.
Sounds trivial
Not so much.
Measuring everything is easy
clearing the noise is hard.
But guess what?
clearing the noise makes the
system more fault tolerant!
• Know your priorities
• Simulate, Test, Fix, Repeat.
• Fail Fast
• Fail Silent
• Turn broken stuff off
• Turn broken stuff off (automatically)
• Measure everything
THANK YOU
MORE ABOUT US @ SEARS.CO.IL

Más contenido relacionado

Similar a Surviving Black Friday - A resilience engineering tale - Omri Fima - Codemotion Rome 2018

Similar a Surviving Black Friday - A resilience engineering tale - Omri Fima - Codemotion Rome 2018 (20)

Applying Chaos Engineering to Build Resilient Serverless Applications
Applying Chaos Engineering to Build Resilient Serverless Applications Applying Chaos Engineering to Build Resilient Serverless Applications
Applying Chaos Engineering to Build Resilient Serverless Applications
 
The cyber security hype cycle is upon us
The cyber security hype cycle is upon usThe cyber security hype cycle is upon us
The cyber security hype cycle is upon us
 
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
 
Antifragile, Microservices and DevOps - A Study
Antifragile, Microservices and DevOps - A StudyAntifragile, Microservices and DevOps - A Study
Antifragile, Microservices and DevOps - A Study
 
Cloud adoption fails - 5 ways deployments go wrong and 5 solutions
Cloud adoption fails - 5 ways deployments go wrong and 5 solutionsCloud adoption fails - 5 ways deployments go wrong and 5 solutions
Cloud adoption fails - 5 ways deployments go wrong and 5 solutions
 
Practical resource monitoring with munin (English editon)
Practical resource monitoring with munin  (English editon)Practical resource monitoring with munin  (English editon)
Practical resource monitoring with munin (English editon)
 
Load testing, Lessons learnt and Loadzen - Martin Buhr at DevTank - 31st Janu...
Load testing, Lessons learnt and Loadzen - Martin Buhr at DevTank - 31st Janu...Load testing, Lessons learnt and Loadzen - Martin Buhr at DevTank - 31st Janu...
Load testing, Lessons learnt and Loadzen - Martin Buhr at DevTank - 31st Janu...
 
AI & AWS DeepComposer
AI & AWS DeepComposerAI & AWS DeepComposer
AI & AWS DeepComposer
 
A Technical Dive into Defensive Trickery
A Technical Dive into Defensive TrickeryA Technical Dive into Defensive Trickery
A Technical Dive into Defensive Trickery
 
You wouldn't build a toast, would you?
You wouldn't build a toast, would you?You wouldn't build a toast, would you?
You wouldn't build a toast, would you?
 
IMC Summit 2016 Breakout - Aleksandar Seovic - The Illusion of Statelessness
IMC Summit 2016 Breakout - Aleksandar Seovic - The Illusion of StatelessnessIMC Summit 2016 Breakout - Aleksandar Seovic - The Illusion of Statelessness
IMC Summit 2016 Breakout - Aleksandar Seovic - The Illusion of Statelessness
 
AWS Summit Singapore - Automation & Augmentation Driven by AI, Enabling Self-...
AWS Summit Singapore - Automation & Augmentation Driven by AI, Enabling Self-...AWS Summit Singapore - Automation & Augmentation Driven by AI, Enabling Self-...
AWS Summit Singapore - Automation & Augmentation Driven by AI, Enabling Self-...
 
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
 
Monitoring of OpenNebula installations
Monitoring of OpenNebula installationsMonitoring of OpenNebula installations
Monitoring of OpenNebula installations
 
How to build observability into Serverless (O'Reilly Velocity 2018)
How to build observability into Serverless (O'Reilly Velocity 2018)How to build observability into Serverless (O'Reilly Velocity 2018)
How to build observability into Serverless (O'Reilly Velocity 2018)
 
Javaland 2017: "You´ll do microservices now". Now what?
Javaland 2017: "You´ll do microservices now". Now what?Javaland 2017: "You´ll do microservices now". Now what?
Javaland 2017: "You´ll do microservices now". Now what?
 
Breaking down monolithic applications into microservices - DEM06-S - Chicago ...
Breaking down monolithic applications into microservices - DEM06-S - Chicago ...Breaking down monolithic applications into microservices - DEM06-S - Chicago ...
Breaking down monolithic applications into microservices - DEM06-S - Chicago ...
 
Apply best parts of microservices to serverless
Apply best parts of microservices to serverlessApply best parts of microservices to serverless
Apply best parts of microservices to serverless
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
Surviving SOA - delivering (somewhat) continuously on a hostile planet
Surviving SOA - delivering (somewhat) continuously on a hostile planetSurviving SOA - delivering (somewhat) continuously on a hostile planet
Surviving SOA - delivering (somewhat) continuously on a hostile planet
 

Más de Codemotion

Más de Codemotion (20)

Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
 
Pompili - From hero to_zero: The FatalNoise neverending story
Pompili - From hero to_zero: The FatalNoise neverending storyPompili - From hero to_zero: The FatalNoise neverending story
Pompili - From hero to_zero: The FatalNoise neverending story
 
Pastore - Commodore 65 - La storia
Pastore - Commodore 65 - La storiaPastore - Commodore 65 - La storia
Pastore - Commodore 65 - La storia
 
Pennisi - Essere Richard Altwasser
Pennisi - Essere Richard AltwasserPennisi - Essere Richard Altwasser
Pennisi - Essere Richard Altwasser
 
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
 
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
 
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
 
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 -
Francesco Baldassarri  - Deliver Data at Scale - Codemotion Amsterdam 2019 - Francesco Baldassarri  - Deliver Data at Scale - Codemotion Amsterdam 2019 -
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 -
 
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
 
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
 
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...
 
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...
 
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019
 
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
 
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
 
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...
 
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...
 
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019
 
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019
 
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

Surviving Black Friday - A resilience engineering tale - Omri Fima - Codemotion Rome 2018