SlideShare una empresa de Scribd logo
1 de 41
Descargar para leer sin conexión
Learning from Learnings:
Anatomy of Three Incidents
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
Background
@randyshoup
App Engine Outage
Oct 2012
http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
App Engine Outage
Oct 2012
App Engine
Reliability Fixit
• Step 1: Identify the Problem
o All team leads and senior engineers met in a room with a whiteboard
o Enumerated all known and suspected reliability issues
o Too much technical debt had accumulated
o Reliability issues had not been prioritized
o Identify 8-10 themes
• Step 2: Understand the Problem
o Each theme assigned to a senior engineer to investigate
o Timeboxed for 1 week
o After 1 week, all leads came back with
• Detailed list of issues
• Recommended steps to address them
• Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
App Engine
Reliability Fixit
• Step 3: Consensus and Prioritization
o Leads discussed themes and prioritized work
o Assigned engineers to tasks
App Engine
Reliability Fixit
• Step 4: Implementation and Follow-up
o Engineers worked on assigned tasks
o Simple spreadsheet of task status, which engineers updated weekly
o Minimal effort from management (~1 hour / week) to summarize progress at
weekly team meeting
App Engine
Reliability Fixit
•  Results
o 10x reduction in reliability issues
o Improved team cohesion and camaraderie
o Broader participation and ownership of the future health of the platform
o Still remembered several years later
App Engine
Reliability Fixit
Stitch Fix Outages
Oct / Nov 2016
• (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database]
• (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database]
• (10/25/2016) All systems unavailable for ~5 minutes [Shared Database]
• (10/24/2016) All systems unavailable for ~5 minutes [Shared Database]
• (10/21/2016) All systems unavailable for ~3 ½ hours [DDOS attack]
• (10/18/2016) All systems unavailable for ~3 minutes [Shared Database]
• (10/17/2016) All systems unavailable for ~20 minutes [Shared Database]
• (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage]
• (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage]
• (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage]
• (10/10/2016) All systems unavailable for ~10 minutes [Shared Database]
Database Stability
Problems
• 1. Applications contended on common tables
• 2. Scalability limited by database connections
• 3. One application could take down entire company
Stitch Fix
Stability Retrospective
• Step 1: Identify the Problem
• Step 2: Understand the Problem
• Step 3: Consensus and Prioritization
• Step 4: Implementation and Follow-Up
•  Results
@randyshoup
Stability
Solutions
• 1. Focus on expensive queries
o Log
o Eliminate
o Rewrite
o Reduce
• 2. Manage database connections via connection
concentrator
• 3. Stability and Scalability Program
o Ongoing 25% investment in services migration
@randyshoup
WeWork
Login Issues
• Problem: Some members unable to log in
• Inconsistent representations across different services in
the system
• Over time, simple system interactions grew increasingly
complex and convoluted
• Not enough graceful degradation or automated repair
@randyshoup
WeWork
Login Retrospective
• Step 1: Identify the Problem
• Step 2: Understand the Problem
• Step 3: Consensus and Prioritization
• Step 4: Implementation and Follow-Up
@randyshoup
Common
Elements
• Unintentional, long-term accumulation of small,
individually reasonable decisions
• “Compelling event” catalyzes long-term change
• Blameless culture makes learning and improvement
possible
• Structured post-incident approach
@randyshoup
If you don’t end up regretting
your early technology
decisions, you probably over-
engineered.
Lessons
•Leadership and Collaboration
•Quality and Discipline
•Driving Change
Lessons
•Leadership and Collaboration
•Quality and Discipline
•Driving Change
Blameless
Post-Mortems
• Open and Honest Discussion
o Detect
o Diagnose
o Mitigate
o Remediate
o Prevent
Engineers compete to take personal responsibility (!)
@randyshoup linkedin.com/in/randyshoup
“Finally we can prioritize
fixing that broken system!”
Cross-Functional
Collaboration
• Best decisions made through partnership
• Shared context
o Tradeoffs and implications
o Given common context, well-meaning people generally agree
• “Disagree and Commit”
@randyshoup
15 Million
“Never let a
good crisis go
to waste.”
Lessons
•Leadership and Collaboration
•Quality and Discipline
•Driving Change
Quality and reliability are
business concerns
@randyshoup
“Do you have time to do it
twice?”
“We don’t have time to do it
right!”
@randyshoup
The more constrained you are
on time or resources, the more
important it is to build it right
the first time.
@randyshoup
“Do not try to
do everything.
Do one thing
well.”
@randyshoup
Definition of
Done
• Feature complete
• Automated Tests
• Released to Production
• Feature gate turned on
• Monitored
Vicious Cycle
of Technical Debt
Technical
Debt
“No time
to do it
right”
Quick-
and-dirty
@randyshoup
Virtuous Cycle
of Investment
Solid
Foundation
Confidence
Faster and
Better
Quality
Investment
@randyshoup
Lessons
•Leadership and Collaboration
•Quality and Discipline
•Driving Change
Top-down
and
Bottom-up
Funding the
Improvement Program
• Approach 1: Standard project
o Prioritize and fund like any other project
o Works best when project is within one team
• Approach 2: Explicit global investment
o Agree on an up-front investment
(e.g., 25%, 30% of engineering efforts)
o Works best when program is across many teams
Finance
An unexpected ally …
@randyshoup
Lessons
•Leadership and Collaboration
•Quality and Discipline
•Driving Change
“If you can’t change your
organization,
change your organization.”
-- Martin Fowler
@randyshoup
We are Hiring!
700 software engineers
globally, in
• New York
• Tel Aviv
• San Francisco
• Seattle
• Shanghai
• Singapore
@randyshoup

Más contenido relacionado

La actualidad más candente

Minimum Viable Architecture - Good Enough is Good Enough
Minimum Viable Architecture - Good Enough is Good EnoughMinimum Viable Architecture - Good Enough is Good Enough
Minimum Viable Architecture - Good Enough is Good EnoughRandy Shoup
 
A CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling OrganizationsA CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling OrganizationsRandy Shoup
 
DevOps - It's About How We Work
DevOps - It's About How We WorkDevOps - It's About How We Work
DevOps - It's About How We WorkRandy Shoup
 
Managing Data in Microservices
Managing Data in MicroservicesManaging Data in Microservices
Managing Data in MicroservicesRandy Shoup
 
Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015Randy Shoup
 
Monoliths, Migrations, and Microservices
Monoliths, Migrations, and MicroservicesMonoliths, Migrations, and Microservices
Monoliths, Migrations, and MicroservicesRandy Shoup
 
Moving Fast at Scale
Moving Fast at ScaleMoving Fast at Scale
Moving Fast at ScaleRandy Shoup
 
Pragmatic Microservices
Pragmatic MicroservicesPragmatic Microservices
Pragmatic MicroservicesRandy Shoup
 
Managing Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and EventsManaging Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and EventsRandy Shoup
 
Effective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldEffective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldRandy Shoup
 
An Agile Approach to Machine Learning
An Agile Approach to Machine LearningAn Agile Approach to Machine Learning
An Agile Approach to Machine LearningRandy Shoup
 
Minimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a StartupMinimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a StartupRandy Shoup
 
Scaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsScaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsRandy Shoup
 
Microservices - Scaling Development and Service
Microservices - Scaling Development and ServiceMicroservices - Scaling Development and Service
Microservices - Scaling Development and ServicePaulo Gaspar
 
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...DevOpsDays Tel Aviv
 
Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014Chris Riley ☁
 
Being Elastic -- Evolving Programming for the Cloud
Being Elastic -- Evolving Programming for the CloudBeing Elastic -- Evolving Programming for the Cloud
Being Elastic -- Evolving Programming for the CloudRandy Shoup
 
You build it - Cyber Chicago Keynote
You build it -  Cyber Chicago KeynoteYou build it -  Cyber Chicago Keynote
You build it - Cyber Chicago KeynoteJohn Willis
 
Art of the Possible - Serverless Conference NYC 2017
Art of the Possible - Serverless Conference NYC 2017 Art of the Possible - Serverless Conference NYC 2017
Art of the Possible - Serverless Conference NYC 2017 John Willis
 
The agile elephant in the room
The agile elephant in the roomThe agile elephant in the room
The agile elephant in the roomAgileDenver
 

La actualidad más candente (20)

Minimum Viable Architecture - Good Enough is Good Enough
Minimum Viable Architecture - Good Enough is Good EnoughMinimum Viable Architecture - Good Enough is Good Enough
Minimum Viable Architecture - Good Enough is Good Enough
 
A CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling OrganizationsA CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling Organizations
 
DevOps - It's About How We Work
DevOps - It's About How We WorkDevOps - It's About How We Work
DevOps - It's About How We Work
 
Managing Data in Microservices
Managing Data in MicroservicesManaging Data in Microservices
Managing Data in Microservices
 
Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015
 
Monoliths, Migrations, and Microservices
Monoliths, Migrations, and MicroservicesMonoliths, Migrations, and Microservices
Monoliths, Migrations, and Microservices
 
Moving Fast at Scale
Moving Fast at ScaleMoving Fast at Scale
Moving Fast at Scale
 
Pragmatic Microservices
Pragmatic MicroservicesPragmatic Microservices
Pragmatic Microservices
 
Managing Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and EventsManaging Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and Events
 
Effective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldEffective Microservices In a Data-centric World
Effective Microservices In a Data-centric World
 
An Agile Approach to Machine Learning
An Agile Approach to Machine LearningAn Agile Approach to Machine Learning
An Agile Approach to Machine Learning
 
Minimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a StartupMinimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a Startup
 
Scaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsScaling Your Architecture with Services and Events
Scaling Your Architecture with Services and Events
 
Microservices - Scaling Development and Service
Microservices - Scaling Development and ServiceMicroservices - Scaling Development and Service
Microservices - Scaling Development and Service
 
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
 
Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014
 
Being Elastic -- Evolving Programming for the Cloud
Being Elastic -- Evolving Programming for the CloudBeing Elastic -- Evolving Programming for the Cloud
Being Elastic -- Evolving Programming for the Cloud
 
You build it - Cyber Chicago Keynote
You build it -  Cyber Chicago KeynoteYou build it -  Cyber Chicago Keynote
You build it - Cyber Chicago Keynote
 
Art of the Possible - Serverless Conference NYC 2017
Art of the Possible - Serverless Conference NYC 2017 Art of the Possible - Serverless Conference NYC 2017
Art of the Possible - Serverless Conference NYC 2017
 
The agile elephant in the room
The agile elephant in the roomThe agile elephant in the room
The agile elephant in the room
 

Similar a Learning from Learnings: Anatomy of Three Incidents

Hanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFuHanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFuDevConFu
 
Refactoring Into Microservices 2016-11-06
Refactoring Into Microservices 2016-11-06Refactoring Into Microservices 2016-11-06
Refactoring Into Microservices 2016-11-06Derek Ashmore
 
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityLarge Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityRandy Shoup
 
Turning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational CapitalTurning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational CapitalJohn Willis
 
ConFoo: Moving web performance testing to the left
ConFoo: Moving web performance testing to the leftConFoo: Moving web performance testing to the left
ConFoo: Moving web performance testing to the leftTom Chavez
 
Refactoring Into Microservices 2016-11-08
Refactoring Into Microservices 2016-11-08Refactoring Into Microservices 2016-11-08
Refactoring Into Microservices 2016-11-08Derek Ashmore
 
From Technical Debt to Technical Health
From Technical Debt to Technical HealthFrom Technical Debt to Technical Health
From Technical Debt to Technical HealthDeclan Whelan
 
A modern architecturereview–usingcodereviewtools-ver-3.5
A modern architecturereview–usingcodereviewtools-ver-3.5A modern architecturereview–usingcodereviewtools-ver-3.5
A modern architecturereview–usingcodereviewtools-ver-3.5SSW
 
Microservices for architects los angeles-2016-07-16
Microservices for architects los angeles-2016-07-16Microservices for architects los angeles-2016-07-16
Microservices for architects los angeles-2016-07-16Derek Ashmore
 
Software Project Management
Software Project ManagementSoftware Project Management
Software Project ManagementDeepak Kumar
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineRobert Dempsey
 
Scaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON TutorialScaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON Tutorialduleepa
 
Did your ____ car come with satellite radio as a trial? Did they keep it?
Did your ____ car come with satellite radio as a trial? Did they keep it? Did your ____ car come with satellite radio as a trial? Did they keep it?
Did your ____ car come with satellite radio as a trial? Did they keep it? c9busera
 
How does your _____ car handle in the rain?
How does your _____ car handle in the rain?How does your _____ car handle in the rain?
How does your _____ car handle in the rain?c9busera
 
Mark Andersen DFW DevOps Days 2017
Mark Andersen DFW DevOps Days 2017Mark Andersen DFW DevOps Days 2017
Mark Andersen DFW DevOps Days 2017Mark Andersen
 
From Project Manager to Scrum Master
From Project Manager to Scrum MasterFrom Project Manager to Scrum Master
From Project Manager to Scrum MasterLitheSpeed
 
DNN-Connect 2019: DNN Horror Stories
DNN-Connect 2019: DNN Horror StoriesDNN-Connect 2019: DNN Horror Stories
DNN-Connect 2019: DNN Horror StoriesWill Strohl
 

Similar a Learning from Learnings: Anatomy of Three Incidents (20)

Hanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFuHanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFu
 
CMS Crash Course!
CMS Crash Course!CMS Crash Course!
CMS Crash Course!
 
Refactoring Into Microservices 2016-11-06
Refactoring Into Microservices 2016-11-06Refactoring Into Microservices 2016-11-06
Refactoring Into Microservices 2016-11-06
 
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityLarge Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
 
Turning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational CapitalTurning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational Capital
 
ConFoo: Moving web performance testing to the left
ConFoo: Moving web performance testing to the leftConFoo: Moving web performance testing to the left
ConFoo: Moving web performance testing to the left
 
Refactoring Into Microservices 2016-11-08
Refactoring Into Microservices 2016-11-08Refactoring Into Microservices 2016-11-08
Refactoring Into Microservices 2016-11-08
 
From Technical Debt to Technical Health
From Technical Debt to Technical HealthFrom Technical Debt to Technical Health
From Technical Debt to Technical Health
 
A modern architecturereview–usingcodereviewtools-ver-3.5
A modern architecturereview–usingcodereviewtools-ver-3.5A modern architecturereview–usingcodereviewtools-ver-3.5
A modern architecturereview–usingcodereviewtools-ver-3.5
 
Microservices for architects los angeles-2016-07-16
Microservices for architects los angeles-2016-07-16Microservices for architects los angeles-2016-07-16
Microservices for architects los angeles-2016-07-16
 
Software Project Management
Software Project ManagementSoftware Project Management
Software Project Management
 
8D Problem Solving (Oshkosh).pdf
8D Problem Solving (Oshkosh).pdf8D Problem Solving (Oshkosh).pdf
8D Problem Solving (Oshkosh).pdf
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning Pipeline
 
Scaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON TutorialScaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON Tutorial
 
Did your ____ car come with satellite radio as a trial? Did they keep it?
Did your ____ car come with satellite radio as a trial? Did they keep it? Did your ____ car come with satellite radio as a trial? Did they keep it?
Did your ____ car come with satellite radio as a trial? Did they keep it?
 
How does your _____ car handle in the rain?
How does your _____ car handle in the rain?How does your _____ car handle in the rain?
How does your _____ car handle in the rain?
 
Mark Andersen DFW DevOps Days 2017
Mark Andersen DFW DevOps Days 2017Mark Andersen DFW DevOps Days 2017
Mark Andersen DFW DevOps Days 2017
 
Effective Scrum
Effective ScrumEffective Scrum
Effective Scrum
 
From Project Manager to Scrum Master
From Project Manager to Scrum MasterFrom Project Manager to Scrum Master
From Project Manager to Scrum Master
 
DNN-Connect 2019: DNN Horror Stories
DNN-Connect 2019: DNN Horror StoriesDNN-Connect 2019: DNN Horror Stories
DNN-Connect 2019: DNN Horror Stories
 

Más de Randy Shoup

Breaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building TeamsBreaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building TeamsRandy Shoup
 
Ten Lessons of the DevOps Transition
Ten Lessons of the DevOps TransitionTen Lessons of the DevOps Transition
Ten Lessons of the DevOps TransitionRandy Shoup
 
From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015Randy Shoup
 
Concurrency at Scale: Evolution to Micro-Services
Concurrency at Scale:  Evolution to Micro-ServicesConcurrency at Scale:  Evolution to Micro-Services
Concurrency at Scale: Evolution to Micro-ServicesRandy Shoup
 
Why Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the CloudWhy Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the CloudRandy Shoup
 
DevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of OperationsDevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of OperationsRandy Shoup
 
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYEQCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYERandy Shoup
 
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...Randy Shoup
 
The Importance of Culture: Building and Sustaining Effective Engineering Org...
The Importance of Culture:  Building and Sustaining Effective Engineering Org...The Importance of Culture:  Building and Sustaining Effective Engineering Org...
The Importance of Culture: Building and Sustaining Effective Engineering Org...Randy Shoup
 

Más de Randy Shoup (9)

Breaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building TeamsBreaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building Teams
 
Ten Lessons of the DevOps Transition
Ten Lessons of the DevOps TransitionTen Lessons of the DevOps Transition
Ten Lessons of the DevOps Transition
 
From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015
 
Concurrency at Scale: Evolution to Micro-Services
Concurrency at Scale:  Evolution to Micro-ServicesConcurrency at Scale:  Evolution to Micro-Services
Concurrency at Scale: Evolution to Micro-Services
 
Why Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the CloudWhy Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the Cloud
 
DevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of OperationsDevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of Operations
 
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYEQCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
 
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
 
The Importance of Culture: Building and Sustaining Effective Engineering Org...
The Importance of Culture:  Building and Sustaining Effective Engineering Org...The Importance of Culture:  Building and Sustaining Effective Engineering Org...
The Importance of Culture: Building and Sustaining Effective Engineering Org...
 

Último

JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxAS Design & AST.
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...kalichargn70th171
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 

Último (20)

JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptx
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 

Learning from Learnings: Anatomy of Three Incidents

  • 1. Learning from Learnings: Anatomy of Three Incidents Randy Shoup @randyshoup linkedin.com/in/randyshoup
  • 3.
  • 6. App Engine Reliability Fixit • Step 1: Identify the Problem o All team leads and senior engineers met in a room with a whiteboard o Enumerated all known and suspected reliability issues o Too much technical debt had accumulated o Reliability issues had not been prioritized o Identify 8-10 themes
  • 7. • Step 2: Understand the Problem o Each theme assigned to a senior engineer to investigate o Timeboxed for 1 week o After 1 week, all leads came back with • Detailed list of issues • Recommended steps to address them • Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.) App Engine Reliability Fixit
  • 8. • Step 3: Consensus and Prioritization o Leads discussed themes and prioritized work o Assigned engineers to tasks App Engine Reliability Fixit
  • 9. • Step 4: Implementation and Follow-up o Engineers worked on assigned tasks o Simple spreadsheet of task status, which engineers updated weekly o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting App Engine Reliability Fixit
  • 10. •  Results o 10x reduction in reliability issues o Improved team cohesion and camaraderie o Broader participation and ownership of the future health of the platform o Still remembered several years later App Engine Reliability Fixit
  • 11.
  • 12. Stitch Fix Outages Oct / Nov 2016 • (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database] • (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database] • (10/25/2016) All systems unavailable for ~5 minutes [Shared Database] • (10/24/2016) All systems unavailable for ~5 minutes [Shared Database] • (10/21/2016) All systems unavailable for ~3 ½ hours [DDOS attack] • (10/18/2016) All systems unavailable for ~3 minutes [Shared Database] • (10/17/2016) All systems unavailable for ~20 minutes [Shared Database] • (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage] • (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage] • (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage] • (10/10/2016) All systems unavailable for ~10 minutes [Shared Database]
  • 13. Database Stability Problems • 1. Applications contended on common tables • 2. Scalability limited by database connections • 3. One application could take down entire company
  • 14. Stitch Fix Stability Retrospective • Step 1: Identify the Problem • Step 2: Understand the Problem • Step 3: Consensus and Prioritization • Step 4: Implementation and Follow-Up •  Results @randyshoup
  • 15. Stability Solutions • 1. Focus on expensive queries o Log o Eliminate o Rewrite o Reduce • 2. Manage database connections via connection concentrator • 3. Stability and Scalability Program o Ongoing 25% investment in services migration
  • 17. WeWork Login Issues • Problem: Some members unable to log in • Inconsistent representations across different services in the system • Over time, simple system interactions grew increasingly complex and convoluted • Not enough graceful degradation or automated repair @randyshoup
  • 18. WeWork Login Retrospective • Step 1: Identify the Problem • Step 2: Understand the Problem • Step 3: Consensus and Prioritization • Step 4: Implementation and Follow-Up @randyshoup
  • 19. Common Elements • Unintentional, long-term accumulation of small, individually reasonable decisions • “Compelling event” catalyzes long-term change • Blameless culture makes learning and improvement possible • Structured post-incident approach @randyshoup
  • 20. If you don’t end up regretting your early technology decisions, you probably over- engineered.
  • 21. Lessons •Leadership and Collaboration •Quality and Discipline •Driving Change
  • 22. Lessons •Leadership and Collaboration •Quality and Discipline •Driving Change
  • 23. Blameless Post-Mortems • Open and Honest Discussion o Detect o Diagnose o Mitigate o Remediate o Prevent Engineers compete to take personal responsibility (!) @randyshoup linkedin.com/in/randyshoup
  • 24. “Finally we can prioritize fixing that broken system!”
  • 25. Cross-Functional Collaboration • Best decisions made through partnership • Shared context o Tradeoffs and implications o Given common context, well-meaning people generally agree • “Disagree and Commit” @randyshoup
  • 26. 15 Million “Never let a good crisis go to waste.”
  • 27. Lessons •Leadership and Collaboration •Quality and Discipline •Driving Change
  • 28. Quality and reliability are business concerns @randyshoup
  • 29. “Do you have time to do it twice?” “We don’t have time to do it right!” @randyshoup
  • 30. The more constrained you are on time or resources, the more important it is to build it right the first time. @randyshoup
  • 31. “Do not try to do everything. Do one thing well.” @randyshoup
  • 32. Definition of Done • Feature complete • Automated Tests • Released to Production • Feature gate turned on • Monitored
  • 33. Vicious Cycle of Technical Debt Technical Debt “No time to do it right” Quick- and-dirty @randyshoup
  • 34. Virtuous Cycle of Investment Solid Foundation Confidence Faster and Better Quality Investment @randyshoup
  • 35. Lessons •Leadership and Collaboration •Quality and Discipline •Driving Change
  • 37. Funding the Improvement Program • Approach 1: Standard project o Prioritize and fund like any other project o Works best when project is within one team • Approach 2: Explicit global investment o Agree on an up-front investment (e.g., 25%, 30% of engineering efforts) o Works best when program is across many teams
  • 38. Finance An unexpected ally … @randyshoup
  • 39. Lessons •Leadership and Collaboration •Quality and Discipline •Driving Change
  • 40. “If you can’t change your organization, change your organization.” -- Martin Fowler @randyshoup
  • 41. We are Hiring! 700 software engineers globally, in • New York • Tel Aviv • San Francisco • Seattle • Shanghai • Singapore @randyshoup