Building A Successful Organization By Mastering Failure
John Goulah (@johngoulah)
Etsy
Marketplace
• $1.93B Annual GMS 2014
• 1.4M active sellers
• 20M+ active buyers
• 30% international GMS
• 57%+ mobile visits
Infrastructure
• over 5500 MySQL databases
• 750K graphite metrics/min
• 1.3GB logs written/min
• 50M - 75M gearman jobs / day
• 30-50 deploys / day
Company
• Headquartered in Brooklyn
• Over 700 employees
• 7 offices around the world
• 80+ dogs / 80+ cats
Values
Learning Org
a company that facilitates the learning of its members and
continuously transforms itself
Five Disciplines
Systems Thinking
process of understanding how people, structure, and
processes influence one another within a larger system
Personal Mastery
an individual holds great importance in a learning organization
Mental Models
the assumptions held by individuals and organizations
Shared Vision
creates a common identity that provides focus and energy for
learning
Team Learning
the problem solving capacity of the organization is improved
through better access to knowledge and expertise
Learning About Failure
• architecture reviews
• operability reviews
• blameless post mortems
failure and success
come from the same
source
context
can study the system
at any time
inflection points
• architecture reviews
• early feedback and discussion
• operability reviews
• held before launching
• blameless post mortems
• held after a failure
Architecture Reviews
understand the costs and benefits of a proposed solution, and
discuss alternatives
Etsy Tech Axioms
• we use a small number of well known tools
• all technology decisions come with trade offs
• with new technology, many of those trade offs are
unknown
• we’re growing. things change
with new technology
many of those tradeoffs are unknown
Departures
a departure is when new technologies or patterns are
introduced that deviate from the current known methods of
operating the system and maintaining the software
How do I know I need an
architecture review?
when there is a perceived departure from current technology
choices or patterns
How early do you hold them?
early enough to be able to bail out or make major course
corrections
Who should come?
• the people presenting the change
• key stakeholders (sr. engineers, or arch review working
group)
• everyone else that wants to learn about the proposed
changes to the system
Architecture Review
Meeting Format
Preparation
• a proposal is written in a shared document and circulated
• comments are added, discussed, and potentially resolved in
advance
• initial questions for the meeting are collected in a tool such
as Google Moderator
Some General Questions
• Do we understand the costs of this departure?
• Have we asked hard questions about trade-offs?
• What will this prohibit us from doing in the future?
Some General Questions (cont)
• Are we impacting visibility, measurability, debuggability and
other operability concerns?
• Are we impacting testability, security, translatability,
performance and other product quality concerns?
• Does it make sense?
The Arch Review
• proposal is presented to the group
• discuss questions and concerns
• decide if we are moving forward or need further discussion
you're saying my
project might not
move forward?
Why might this end a project?
• we learned through this discussion that an alternative is
better
• we find goals overlap with other projects that are in
progress
• we discover that it isn't worth the costs now that we have a
better idea what they are
At the end we should have
• detailed notes from the conversation
• documented agreement on tricky components
• a compilation of learnings and questions
• a decision of whether to keep going with the project, stop
and rethink, or gather more information
Operability
Reviews
Operability Reviews
understand how the system could break, how we will know,
and how we will react
When do we do operability
reviews?
• after architecture reviews in the product lifecycle, generally
right before launch
• when we need to gain increased confidence for launch due
to the technology, product, or communication choices
being risky
• if there's a chance you'd surprise teams that operate the
software
Who comes to the operability
review?
representatives from:
• Product
• Development
• Operations
• Community/Support
• QA
Some Questions
• Has the feature been tested enough to deploy to
production?
• Does everyone know when it will go live, and who will push
the feature?
• Is there communication about the feature ready to go out
with the feature?
• Is it possible to turn up this feature on a percentage basis,
dark launch, or gameday it?
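
The percentage-based turn-up asked about above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Etsy's actual feature-flag code: the helper name `is_enabled`, the flag name `new_checkout`, and the hash-bucketing scheme are all hypothetical.

```python
import hashlib


def is_enabled(flag_name: str, user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hypothetical helper: hashes the flag name and user id into a stable
    bucket in [0, 100), so a user who sees the feature at 10% still sees
    it when the percentage is raised to 50%.
    """
    if percent <= 0:
        return False
    if percent >= 100:
        return True
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100.0  # 0.00 .. 99.99
    return bucket < percent


if __name__ == "__main__":
    # Hypothetical flag and user ids, for illustration only.
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(is_enabled("new_checkout", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users see the feature at 10%")
```

Bucketing on a hash rather than a random draw keeps each user's experience consistent between requests, which makes a gradual ramp easier to reason about during an operability review.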
Some Questions (cont)
• Does the launch involve any new production infrastructure?
• If so, are those pieces in monitoring or metrics collection?
• If so, is there a deployment pipeline in place?
• If so, is there a development environment set up to make
it work in dev?
• If so, are there tests that can be and are run on CI?
Contingency
Checklist
Contingency Checklist
a list of things that could possibly go "wrong" with a new
feature, and what we could do about them
Issue
What could possibly go wrong with the feature launched in
production?
Likelihood
What is the likelihood of each item going wrong?
Comments
Any comments about the item?
Impact
This is just a measure of how impactful this will be if it does
actually turn out to be a concern.
Engineering
What do we do to mitigate the issue with the item (e.g. can we
gracefully degrade)?
Onsite Messaging
What is the messaging to the user in the forums, blog, and
social media if this needs graceful degradation?
PR
Is PR needed for the contingency (e.g. a larger-scale failure)?
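
The checklist columns above (Issue, Likelihood, Comments, Impact, Engineering, Onsite Messaging, PR) could be captured as a small record type. This is only a sketch: the `Level` scale, the `by_risk` ordering (likelihood multiplied by impact), and the example rows are assumptions for illustration and do not come from the talk.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-step scale for likelihood and impact."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class ContingencyItem:
    issue: str                # what could go wrong in production
    likelihood: Level         # how likely it is to go wrong
    impact: Level             # how much it matters if it does
    engineering: str          # mitigation, e.g. graceful degradation
    onsite_messaging: str     # what users are told (forums, blog, social)
    pr_needed: bool = False   # whether PR is needed for this contingency
    comments: str = ""        # any other notes about the item


def by_risk(items):
    """Order items so the likeliest, highest-impact issues come first.

    The likelihood * impact score is an assumed heuristic.
    """
    return sorted(items, key=lambda i: i.likelihood * i.impact, reverse=True)


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    checklist = [
        ContingencyItem(
            issue="Search index falls behind",
            likelihood=Level.MEDIUM,
            impact=Level.LOW,
            engineering="Serve slightly stale results",
            onsite_messaging="None needed",
        ),
        ContingencyItem(
            issue="Checkout errors spike",
            likelihood=Level.LOW,
            impact=Level.HIGH,
            engineering="Turn the feature off via config",
            onsite_messaging="Status post in the forums",
            pr_needed=True,
        ),
    ]
    for item in by_risk(checklist):
        print(item.issue, item.likelihood.name, item.impact.name)
```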
Blameless
Post Mortems
What is a post mortem?
a postmortem is a facilitated meeting during which people
involved in, interested in, or close to an accident or incident debrief
together on how we think the event came about
What does it cover?
• walking through a timeline of events
• learning how things are expected to work "normally",
adding the context of everyone’s perspective
• exploring what we might do to improve things for the future
Local Rationality
we want to know how it made sense for someone to do what
they did at the time
searching for second stories
instead of human error
• asking why leads to who is responsible
• asking how leads to what
Avoiding Human Error
Human error points directly to individuals in a complex
system. But, in complex systems, system behaviour is driven
fundamentally by the goals of the system and the system
structure. People just provide the flexibility to make it work.
Avoiding Human Error (cont)
Human error implies deviation from "normal" or "ideal", but in
complex situations and tasks there is often no normal ideal that
can be precisely and exactly described. Many variable,
interconnected touchpoints influence the decisions that are made.
Recognizing Human Error
• be aware of other terms for it: slip, lapse, distraction,
mistake, deviation, carelessness, malpractice, recklessness,
violation, misjudgement, etc
• don’t point to individuals when you really want to
understand the system itself and the work
• how do you feel when something goes wrong?
• is it to find who did it / who screwed up, or to find how it
happened?
Other Things to Avoid
Root Cause
• it leads to a simplistic and linear explanation of how events
transpired
• linear mental models of causality don’t capture what is
needed to improve the safety of a system
• ignores the complexity of an event, which is what should be
explored if we are going to learn
• leads directly to blaming things on human error
Nietzschean anxiety
when situations appear both threatening and ambiguous we
seem to demand a clear causal agency; because if we cannot
establish this agency then the "problem" is potentially
irresolvable
Hindsight Bias
inclination, after an event has occurred, to see the event as
having been predictable, despite there having been little or no
objective basis for predicting it
Counterfactuals
the human tendency to create possible alternatives to life
events that have already occurred; something that is contrary
to what actually happened
Morgue
https://github.com/etsy/morgue
Post Mortem
Meeting Format
Meeting Format
• Timeline
• Discussion
• Remediation Items
Timeline
• a rough timeline scaffolding is required
• talk about facts that were known at the time, even if
hindsight reveals misunderstandings in what we knew
• look out for knowledge that some people were aware of,
that others were not, and dig into that
• no judgement about actions or knowledge (counterfactuals)
• tell people to hold that thought if they jump to remediation
items at this point
Timeline (cont)
• continually ask "What are we missing?" until those involved
feel it's complete
• continually ask "Does everyone agree this is the order in
which events took place?"
• make sure to include important times for events that
happened (alerts, discoveries)
• reach a consensus on the timeline and move on to the
discussion
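
The timeline scaffolding described above might be kept as a list of timestamped events that the group reorders and amends until everyone agrees. The sketch below is an assumption-laden illustration: `TimelineEvent`, its fields, and the sample events are hypothetical, and it does not reflect how Morgue stores timelines.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime            # when it happened (alert fired, discovery made)
    kind: str               # e.g. "alert", "discovery", "action"
    description: str        # what was known or done at the time, no judgement
    known_by: tuple = ()    # who was aware of it at the time


def build_timeline(events):
    """Return events in the order they took place."""
    return sorted(events, key=lambda e: e.at)


def render(events):
    """Format the ordered timeline as plain-text lines for review."""
    return [
        f"{e.at:%H:%M} [{e.kind}] {e.description}"
        for e in build_timeline(events)
    ]


if __name__ == "__main__":
    # Hypothetical events, for illustration only.
    events = [
        TimelineEvent(datetime(2015, 1, 1, 14, 5), "discovery",
                      "Error rate traced to the new deploy", ("on-call",)),
        TimelineEvent(datetime(2015, 1, 1, 14, 0), "alert",
                      "Error-rate alert fired", ("on-call", "support")),
    ]
    for line in render(events):
        print(line)
```

Recording who knew what at each point makes it easier to spot knowledge that some people had and others did not, which is one of the things the timeline is meant to surface.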
Discussion
• When an action or decision was taken in the timeline, ask
the person: "Think back to what you knew at the time, why
did that action make sense to you at the time?"
• Did we clean up anything after we were stable? If so, how long
did it take?
• Was there any troubleshooting fatigue?
Discussion (cont)
• Did we do a good job with communication (site status,
support, forums, etc)?
• Were all tools on hand and working, ready to use when we
needed them during the issue? Were there tools we would
have liked to have?
• Did we have enough metrics visibility to diagnose the issue?
• Was there collaborative and thoughtful communication
during the issue?
Remediation
• Remediation items should have tickets associated with them
to follow up on
• There can be further post meeting discussion on these but
tasks should not linger
Remediation questions
• What things could we do to prevent this exact thing from
happening in the future?
• What things could we do to make troubleshooting similar
incidents in the future easier?
In Summary
We Can Learn Before
and After Failure
Before
• Architecture reviews for new technology
• Operability reviews to gain launch confidence
After
• Postmortems are done soon after a failure
• avoid human error, counterfactuals, hindsight bias, and
root cause
Questions?
John Goulah (@johngoulah)
Etsy

Más contenido relacionado

La actualidad más candente

Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7NUS-ISS
 
Microservice Architecture 101
Microservice Architecture 101Microservice Architecture 101
Microservice Architecture 101Kochih Wu
 
Transforming Organizations with CI/CD
Transforming Organizations with CI/CDTransforming Organizations with CI/CD
Transforming Organizations with CI/CDCprime
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Rundeck
 
Principles and Practices in Continuous Deployment at Etsy
Principles and Practices in Continuous Deployment at EtsyPrinciples and Practices in Continuous Deployment at Etsy
Principles and Practices in Continuous Deployment at EtsyMike Brittain
 
Gap Survey, Assessment and Analysis for DevSecOps
Gap Survey, Assessment and Analysis for DevSecOpsGap Survey, Assessment and Analysis for DevSecOps
Gap Survey, Assessment and Analysis for DevSecOpsMarc Hornbeek
 
Customer case - Dynatrace Monitoring Redefined
Customer case - Dynatrace Monitoring RedefinedCustomer case - Dynatrace Monitoring Redefined
Customer case - Dynatrace Monitoring RedefinedMichel Duruel
 
Chaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWSChaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWSBilal Aybar
 
Improve monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss toolsImprove monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss toolsNilesh Gule
 
Shift Left - Approach and practices with IBM
Shift Left - Approach and practices with IBMShift Left - Approach and practices with IBM
Shift Left - Approach and practices with IBMIBM UrbanCode Products
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityAcquia
 
DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)Ravi Tadwalkar
 
Demystifying observability
Demystifying observability Demystifying observability
Demystifying observability Abigail Bangser
 
Developer Experience
Developer ExperienceDeveloper Experience
Developer ExperienceThoughtworks
 
Everything You Need to Know About the 2019 DORA Accelerate State of DevOps Re...
Everything You Need to Know About the 2019 DORA Accelerate State of DevOps Re...Everything You Need to Know About the 2019 DORA Accelerate State of DevOps Re...
Everything You Need to Know About the 2019 DORA Accelerate State of DevOps Re...Red Gate Software
 
Taming technical debt
Taming technical debt Taming technical debt
Taming technical debt Panji Gautama
 

La actualidad más candente (20)

Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
 
Microservice Architecture 101
Microservice Architecture 101Microservice Architecture 101
Microservice Architecture 101
 
Transforming Organizations with CI/CD
Transforming Organizations with CI/CDTransforming Organizations with CI/CD
Transforming Organizations with CI/CD
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
Principles and Practices in Continuous Deployment at Etsy
Principles and Practices in Continuous Deployment at EtsyPrinciples and Practices in Continuous Deployment at Etsy
Principles and Practices in Continuous Deployment at Etsy
 
Gap Survey, Assessment and Analysis for DevSecOps
Gap Survey, Assessment and Analysis for DevSecOpsGap Survey, Assessment and Analysis for DevSecOps
Gap Survey, Assessment and Analysis for DevSecOps
 
Customer case - Dynatrace Monitoring Redefined
Customer case - Dynatrace Monitoring RedefinedCustomer case - Dynatrace Monitoring Redefined
Customer case - Dynatrace Monitoring Redefined
 
Chaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWSChaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWS
 
Improve monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss toolsImprove monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss tools
 
Shift Left - Approach and practices with IBM
Shift Left - Approach and practices with IBMShift Left - Approach and practices with IBM
Shift Left - Approach and practices with IBM
 
SRE From Scratch
SRE From ScratchSRE From Scratch
SRE From Scratch
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
 
DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)
 
Demystifying observability
Demystifying observability Demystifying observability
Demystifying observability
 
Developer Experience
Developer ExperienceDeveloper Experience
Developer Experience
 
Everything You Need to Know About the 2019 DORA Accelerate State of DevOps Re...
Everything You Need to Know About the 2019 DORA Accelerate State of DevOps Re...Everything You Need to Know About the 2019 DORA Accelerate State of DevOps Re...
Everything You Need to Know About the 2019 DORA Accelerate State of DevOps Re...
 
DevOps
DevOpsDevOps
DevOps
 
Introduction to DevOps
Introduction to DevOpsIntroduction to DevOps
Introduction to DevOps
 
Taming technical debt
Taming technical debt Taming technical debt
Taming technical debt
 
Introduction to DevSecOps
Introduction to DevSecOpsIntroduction to DevSecOps
Introduction to DevSecOps
 

Destacado

The Tester Role & Scrum
The Tester Role & ScrumThe Tester Role & Scrum
The Tester Role & ScrumJohan Hoberg
 
Scaling Management without Sacrificing Culture - Velocity Europe 2014
Scaling Management without Sacrificing Culture - Velocity Europe 2014Scaling Management without Sacrificing Culture - Velocity Europe 2014
Scaling Management without Sacrificing Culture - Velocity Europe 2014Patrick McDonnell
 
Go or No-Go: Operability and Contingency Planning at Etsy.com
Go or No-Go: Operability and Contingency Planning at Etsy.comGo or No-Go: Operability and Contingency Planning at Etsy.com
Go or No-Go: Operability and Contingency Planning at Etsy.comJohn Allspaw
 
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data HubsWhat Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data HubsCloudera, Inc.
 
Impact Analysis - LoopConf
Impact Analysis - LoopConfImpact Analysis - LoopConf
Impact Analysis - LoopConfChris Lema
 
Removing Barriers to Going Fast
Removing Barriers to Going FastRemoving Barriers to Going Fast
Removing Barriers to Going Fastjgoulah
 
Development is Production Too
Development is Production TooDevelopment is Production Too
Development is Production Toojgoulah
 
Personal mastery (chppd)
Personal mastery (chppd)Personal mastery (chppd)
Personal mastery (chppd)medenison
 
Netflix Billing System
Netflix Billing SystemNetflix Billing System
Netflix Billing SystemNirmalSrini
 
Resilient Response In Complex Systems
Resilient Response In Complex SystemsResilient Response In Complex Systems
Resilient Response In Complex SystemsJohn Allspaw
 
Types of Production and Manufacturing
Types of Production and ManufacturingTypes of Production and Manufacturing
Types of Production and ManufacturingCasey Robertson
 
Intro to social psychology [1]
Intro to social psychology [1]Intro to social psychology [1]
Intro to social psychology [1]elmakrufi
 
SOCIAL PSYCH INTRO (Psych 201 - Chapter 1 - Spring 2014)
SOCIAL PSYCH INTRO (Psych 201 - Chapter 1 - Spring 2014)SOCIAL PSYCH INTRO (Psych 201 - Chapter 1 - Spring 2014)
SOCIAL PSYCH INTRO (Psych 201 - Chapter 1 - Spring 2014)Melanie Tannenbaum
 
METHODS (Psych 201 - Chapter 2 - Spring 2014)
METHODS (Psych 201 - Chapter 2 - Spring 2014)METHODS (Psych 201 - Chapter 2 - Spring 2014)
METHODS (Psych 201 - Chapter 2 - Spring 2014)Melanie Tannenbaum
 
Confirmation bias
Confirmation bias Confirmation bias
Confirmation bias yongseenyee
 
Writing Code That Lasts - Joomla!Dagen 2015
Writing Code That Lasts - Joomla!Dagen 2015Writing Code That Lasts - Joomla!Dagen 2015
Writing Code That Lasts - Joomla!Dagen 2015Rafael Dohms
 
Appraiser : How Airbnb Generates Complex Models in Spark for Demand Prediction
Appraiser : How Airbnb Generates Complex Models in Spark for Demand PredictionAppraiser : How Airbnb Generates Complex Models in Spark for Demand Prediction
Appraiser : How Airbnb Generates Complex Models in Spark for Demand PredictionYang Li Hector Yee
 

Destacado (20)

The Tester Role & Scrum
The Tester Role & ScrumThe Tester Role & Scrum
The Tester Role & Scrum
 
Scaling Management without Sacrificing Culture - Velocity Europe 2014
Scaling Management without Sacrificing Culture - Velocity Europe 2014Scaling Management without Sacrificing Culture - Velocity Europe 2014
Scaling Management without Sacrificing Culture - Velocity Europe 2014
 
Go or No-Go: Operability and Contingency Planning at Etsy.com
Go or No-Go: Operability and Contingency Planning at Etsy.comGo or No-Go: Operability and Contingency Planning at Etsy.com
Go or No-Go: Operability and Contingency Planning at Etsy.com
 
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data HubsWhat Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
 
Impact Analysis - LoopConf
Impact Analysis - LoopConfImpact Analysis - LoopConf
Impact Analysis - LoopConf
 
Removing Barriers to Going Fast
Removing Barriers to Going FastRemoving Barriers to Going Fast
Removing Barriers to Going Fast
 
Development is Production Too
Development is Production TooDevelopment is Production Too
Development is Production Too
 
Personal mastery recap
Personal mastery recapPersonal mastery recap
Personal mastery recap
 
Personal mastery (chppd)
Personal mastery (chppd)Personal mastery (chppd)
Personal mastery (chppd)
 
Personal Mastery
Personal MasteryPersonal Mastery
Personal Mastery
 
Netflix Billing System
Netflix Billing SystemNetflix Billing System
Netflix Billing System
 
Resilient Response In Complex Systems
Resilient Response In Complex SystemsResilient Response In Complex Systems
Resilient Response In Complex Systems
 
Types of Production and Manufacturing
Types of Production and ManufacturingTypes of Production and Manufacturing
Types of Production and Manufacturing
 
Avr lecture8
Avr lecture8Avr lecture8
Avr lecture8
 
Intro to social psychology [1]
Intro to social psychology [1]Intro to social psychology [1]
Intro to social psychology [1]
 
SOCIAL PSYCH INTRO (Psych 201 - Chapter 1 - Spring 2014)
SOCIAL PSYCH INTRO (Psych 201 - Chapter 1 - Spring 2014)SOCIAL PSYCH INTRO (Psych 201 - Chapter 1 - Spring 2014)
SOCIAL PSYCH INTRO (Psych 201 - Chapter 1 - Spring 2014)
 
METHODS (Psych 201 - Chapter 2 - Spring 2014)
METHODS (Psych 201 - Chapter 2 - Spring 2014)METHODS (Psych 201 - Chapter 2 - Spring 2014)
METHODS (Psych 201 - Chapter 2 - Spring 2014)
 
Confirmation bias
Confirmation bias Confirmation bias
Confirmation bias
 
Writing Code That Lasts - Joomla!Dagen 2015
Writing Code That Lasts - Joomla!Dagen 2015Writing Code That Lasts - Joomla!Dagen 2015
Writing Code That Lasts - Joomla!Dagen 2015
 
Appraiser : How Airbnb Generates Complex Models in Spark for Demand Prediction
Appraiser : How Airbnb Generates Complex Models in Spark for Demand PredictionAppraiser : How Airbnb Generates Complex Models in Spark for Demand Prediction
Appraiser : How Airbnb Generates Complex Models in Spark for Demand Prediction
 

Similar a Building a Successful Organization By Mastering Failure

Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialArchitecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialWill Gallego
 
vBrownBag Presentation
vBrownBag PresentationvBrownBag Presentation
vBrownBag PresentationJon Hildebrand
 
Online Participation 101 In Five Minutes (Gasp!)
Online Participation 101 In Five Minutes (Gasp!)Online Participation 101 In Five Minutes (Gasp!)
Online Participation 101 In Five Minutes (Gasp!)Intellitics, Inc.
 
How to Effectively Lead Focus Groups: Presented at ProductTank Toronto
How to Effectively Lead Focus Groups: Presented at ProductTank TorontoHow to Effectively Lead Focus Groups: Presented at ProductTank Toronto
How to Effectively Lead Focus Groups: Presented at ProductTank TorontoTremis Skeete
 
Team building insights from artificial intelligence
Team building insights from artificial intelligenceTeam building insights from artificial intelligence
Team building insights from artificial intelligenceRobert Roan
 
Erkki Poyhonen - Software Testing - A Users Guide
Erkki Poyhonen - Software Testing - A Users GuideErkki Poyhonen - Software Testing - A Users Guide
Erkki Poyhonen - Software Testing - A Users GuideTEST Huddle
 
Usability Evaluation
Usability EvaluationUsability Evaluation
Usability EvaluationSaqib Shehzad
 
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
"Startups, comment gérer une équipe de développeurs" par Laurent CerveauTheFamily
 
How To Drive Data Driven Change In A Legacy Organization
How To Drive Data Driven Change In A Legacy OrganizationHow To Drive Data Driven Change In A Legacy Organization
How To Drive Data Driven Change In A Legacy OrganizationJovi Pinon
 
Systemic Design Toolkit - Systems Innovation Barcelona
Systemic Design Toolkit - Systems Innovation BarcelonaSystemic Design Toolkit - Systems Innovation Barcelona
Systemic Design Toolkit - Systems Innovation BarcelonaPeter Jones
 
Stakeholder engagement
Stakeholder engagement Stakeholder engagement
Stakeholder engagement Rohela Raouf
 
Cognitive walkthroughs - CSUN 2018
Cognitive walkthroughs - CSUN 2018Cognitive walkthroughs - CSUN 2018
Cognitive walkthroughs - CSUN 2018Intopia
 
Using cognitive walkthroughs for a task-oriented accessibility review
Using cognitive walkthroughs for a task-oriented accessibility reviewUsing cognitive walkthroughs for a task-oriented accessibility review
Using cognitive walkthroughs for a task-oriented accessibility reviewIntopia
 
The Best from the UX Summit in Chicago
The Best from the UX Summit in ChicagoThe Best from the UX Summit in Chicago
The Best from the UX Summit in ChicagoLina Angel
 
VMUG UserCon Presentation for 2018
VMUG UserCon Presentation for 2018VMUG UserCon Presentation for 2018
VMUG UserCon Presentation for 2018Jon Hildebrand
 
Bit by Bit: Effective Use of People, Processes and Computer Technology in the...
Bit by Bit: Effective Use of People, Processes and Computer Technology in the...Bit by Bit: Effective Use of People, Processes and Computer Technology in the...
Bit by Bit: Effective Use of People, Processes and Computer Technology in the...Jack Pringle
 
Managing Knowledge and Change
Managing Knowledge and ChangeManaging Knowledge and Change
Managing Knowledge and ChangePeter Bjellerup
 

Similar a Building a Successful Organization By Mastering Failure (20)

Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialArchitecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
 
vBrownBag Presentation
vBrownBag PresentationvBrownBag Presentation
vBrownBag Presentation
 
Online Participation 101 In Five Minutes (Gasp!)
Online Participation 101 In Five Minutes (Gasp!)Online Participation 101 In Five Minutes (Gasp!)
Online Participation 101 In Five Minutes (Gasp!)
 
How to Effectively Lead Focus Groups: Presented at ProductTank Toronto
How to Effectively Lead Focus Groups: Presented at ProductTank TorontoHow to Effectively Lead Focus Groups: Presented at ProductTank Toronto
How to Effectively Lead Focus Groups: Presented at ProductTank Toronto
 
Team building insights from artificial intelligence
Team building insights from artificial intelligenceTeam building insights from artificial intelligence
Team building insights from artificial intelligence
 
Modeling and Measuring DevOps Culture
Modeling and Measuring DevOps CultureModeling and Measuring DevOps Culture
Modeling and Measuring DevOps Culture
 
Erkki Poyhonen - Software Testing - A Users Guide
Erkki Poyhonen - Software Testing - A Users GuideErkki Poyhonen - Software Testing - A Users Guide
Erkki Poyhonen - Software Testing - A Users Guide
 
Usability Evaluation
Usability EvaluationUsability Evaluation
Usability Evaluation
 
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
 
How To Drive Data Driven Change In A Legacy Organization
How To Drive Data Driven Change In A Legacy OrganizationHow To Drive Data Driven Change In A Legacy Organization
How To Drive Data Driven Change In A Legacy Organization
 
Binary crosswords
Binary crosswordsBinary crosswords
Binary crosswords
 
Systemic Design Toolkit - Systems Innovation Barcelona
Systemic Design Toolkit - Systems Innovation BarcelonaSystemic Design Toolkit - Systems Innovation Barcelona
Systemic Design Toolkit - Systems Innovation Barcelona
 
Stakeholder engagement
Stakeholder engagement Stakeholder engagement
Stakeholder engagement
 
Cognitive walkthroughs - CSUN 2018
Cognitive walkthroughs - CSUN 2018Cognitive walkthroughs - CSUN 2018
Cognitive walkthroughs - CSUN 2018
 
hci Evaluation Techniques.pptx
 hci Evaluation Techniques.pptx hci Evaluation Techniques.pptx
hci Evaluation Techniques.pptx
 
Using cognitive walkthroughs for a task-oriented accessibility review
Using cognitive walkthroughs for a task-oriented accessibility reviewUsing cognitive walkthroughs for a task-oriented accessibility review
Using cognitive walkthroughs for a task-oriented accessibility review
 
The Best from the UX Summit in Chicago
The Best from the UX Summit in ChicagoThe Best from the UX Summit in Chicago
The Best from the UX Summit in Chicago
 
VMUG UserCon Presentation for 2018
VMUG UserCon Presentation for 2018VMUG UserCon Presentation for 2018
VMUG UserCon Presentation for 2018
 
Bit by Bit: Effective Use of People, Processes and Computer Technology in the...
Bit by Bit: Effective Use of People, Processes and Computer Technology in the...Bit by Bit: Effective Use of People, Processes and Computer Technology in the...
Bit by Bit: Effective Use of People, Processes and Computer Technology in the...
 
Managing Knowledge and Change
Managing Knowledge and ChangeManaging Knowledge and Change
Managing Knowledge and Change
 


Building a Successful Organization By Mastering Failure

  • 1. Building A Successful Organization By Mastering Failure John Goulah (@johngoulah) Etsy
  • 3. Marketplace • $1.93B Annual GMS 2014 • 1.4M active sellers • 20M+ active buyers • 30% international GMS • 57%+ mobile visits
  • 4. Infrastructure • over 5500 MySQL databases • 750K graphite metrics/min • 1.3GB logs written/min • 50M - 75M gearman jobs / day • 30-50 deploys / day
  • 5. Company • Headquartered in Brooklyn • Over 700 employees • 7 offices around the world • 80+ dogs / 80+ cats
  • 8. Learning Org a company that facilitates the learning of its members and continuously transforms itself
  • 10. Systems Thinking process of understanding how people, structure, and processes influence one another within a larger system
  • 11. Personal Mastery an individual holds great importance in a learning organization
  • 12. Mental Models the assumptions held by individuals and organizations
  • 13. Shared Vision creates a common identity that provides focus and energy for learning
  • 14. Team Learning the problem solving capacity of the organization is improved through better access to knowledge and expertise
  • 15. Learning About Failure • architecture reviews • operability reviews • blameless post mortems
  • 16. failure and success come from the same source
  • 18. can study the system at any time
  • 19. inflection points • architecture reviews • early feedback and discussion • operability reviews • held before launching • blameless post mortems • held after a failure
  • 21. Architecture Reviews understand the costs and benefits of a proposed solution, and discuss alternatives
  • 22. Etsy Tech Axioms • we use a small number of well known tools • all technology decisions come with trade offs • with new technology, many of those trade offs are unknown • we’re growing. things change
  • 23. with new technology many of those tradeoffs are unknown
  • 24. Departures a departure is when new technologies or patterns are introduced that deviate from the current known methods of operating the system and maintaining the software
  • 25. How do I know I need an architecture review? when there is a perceived departure from current technology choices or patterns
  • 26. How early do you hold them? early enough to be able to bail out or make major course corrections
  • 27. Who should come? • the people presenting the change • key stakeholders (sr. engineers, or arch review working group) • everyone else that wants to learn about the proposed changes to the system
  • 29. Preparation • a proposal is written in a shared document and circulated • comments are added, discussed, and potentially resolved in advance • initial questions for the meeting are collected in a tool such as google moderator
  • 30. Some General Questions • Do we understand the costs of this departure? • Have we asked hard questions about trade-offs? • What will this prohibit us from doing in the future?
  • 31. Some General Questions (cont) • Are we impacting visibility, measurability, debuggability and other operability concerns? • Are we impacting testability, security, translatability, performance and other product quality concerns? • Does it make sense?
  • 32. The Arch Review • proposal is presented to the group • discuss questions and concerns • decide if we are moving forward or need further discussion
  • 33. you're saying my project might not move forward?
  • 34. Why might this end a project? • we learned through this discussion that an alternative is better • we find goals overlap with other projects that are in progress • we discover that it isn't worth the costs now that we have a better idea what they are
  • 35. At the end we should have • detailed notes from the conversation • agreement on the tricky components, with those components documented • a compilation of learnings and questions • a decision of whether to keep going with the project, stop and rethink, or gather more information
  • 37. Operability Reviews understand how the system could break, how we will know, and how we will react
  • 38. When do we do operability reviews? • after architecture reviews in the product lifecycle, generally right before launch • when we need to gain increased confidence for launch due to the technology, product, or communication choices being risky • if there's a chance you'd surprise teams that operate the software
  • 39. Who comes to the operability review? representatives from: • Product • Development • Operations • Community/Support • QA
  • 40. Some Questions • Has the feature been tested enough to deploy to production? • Does everyone know when it will go live, and who will push the feature? • Is there communication about the feature ready to go out with the feature? • Is it possible to turn up this feature on a percentage basis, dark launch, or gameday it?
  • 41. Some Questions (cont) • Does the launch involve any new production infrastructure? • If so, are those pieces in monitoring or metrics collection? • If so, is there a deployment pipeline in place? • If so, is there a development environment set up to make it work in dev? • If so, are there tests that can be and are run on CI?
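The question about turning a feature up on a percentage basis can be illustrated with a minimal sketch. This is not Etsy's implementation; the function name `in_rollout`, the hashing scheme, and the feature and user identifiers are all assumptions made for illustration. The idea is that each user is assigned a stable bucket per feature, so raising the percentage only ever adds users and never flips someone back off.

```python
import hashlib


def in_rollout(feature: str, user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the enabled percentage (0-100).

    Hypothetical helper: hashes the feature name together with the user id
    so that every feature gets its own independent, stable bucketing.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash onto 10,000 buckets (0.01% steps).
    bucket = int(digest[:8], 16) % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    # Hypothetical feature name and user ids, ramped to 25%.
    enabled = sum(in_rollout("new_checkout", f"user{i}", 25) for i in range(10_000))
    print(f"{enabled} of 10000 users see the feature at 25%")
```

Because the bucket depends only on the feature and the user, a user who sees the feature at 10% still sees it at 50%, which keeps the experience consistent while the ramp continues.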
  • 43. Contingency Checklist a list of things that could possibly go "wrong" with a new feature, and what we could do about it
  • 44. Issue What could possibly go wrong with the feature launched in production?
  • 45. Likelihood What is the likelihood of each item going wrong?
  • 47. Impact This is just a measure of how impactful this will be if it does actually turn out to be a concern.
  • 48. Engineering What do we do to mitigate the issue with the item (i.e. can we gracefully degrade?)
  • 49. Onsite Messaging What is the messaging to the user in the forums, blog, and social media if this needs graceful degradation?
  • 50. PR Is PR needed for the contingency (i.e. larger scale failure)?
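The contingency checklist columns above (issue, likelihood, impact, engineering, onsite messaging, PR) map naturally onto a small record type. The sketch below is one possible shape, not a format taken from the talk: the three-level `Level` scale, the `by_priority` ordering (likelihood times impact), and the example rows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    # Assumed three-point scale; the slides do not specify one.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Contingency:
    issue: str             # what could go wrong in production
    likelihood: Level      # how likely it is to go wrong
    impact: Level          # how much it matters if it does
    engineering: str       # mitigation, e.g. graceful degradation
    onsite_messaging: str  # what users are told (forums, blog, social media)
    pr_needed: bool        # whether PR is needed for a larger-scale failure


def by_priority(items):
    """Order rows so the likeliest, most impactful issues come first.

    Assumption: likelihood * impact is used as a simple ranking score.
    """
    return sorted(items, key=lambda c: c.likelihood * c.impact, reverse=True)


if __name__ == "__main__":
    # Hypothetical rows for a hypothetical feature launch.
    checklist = [
        Contingency("search index lags behind", Level.MEDIUM, Level.LOW,
                    "serve slightly stale results", "none", False),
        Contingency("new service overloads the database", Level.LOW, Level.HIGH,
                    "turn the feature off via config", "status page note", True),
        Contingency("emails sent twice", Level.HIGH, Level.MEDIUM,
                    "pause the job queue", "forum post", False),
    ]
    for row in by_priority(checklist):
        print(row.likelihood.name, row.impact.name, row.issue)
```

Keeping the checklist as structured data makes it easy to review the riskiest rows first during the operability review, while the free-text fields keep room for the discussion the checklist is meant to prompt.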
  • 52. What is a post mortem? a postmortem is a facilitated meeting during which people involved in, interested in, or close to an accident or incident debrief together on how we think the event came about
  • 53. What does it cover? • walking through a timeline of events • learning how things are expected to work "normally", adding the context of everyone’s perspective • exploring what we might do to improve things for the future
  • 54. Local Rationality we want to know how it made sense for someone to do what they did at the time
  • 55. searching for second stories instead of human error • asking why leads to who is responsible • asking how leads to what
  • 56. Avoiding Human Error Human error points directly to individuals in a complex system. But, in complex systems, system behaviour is driven fundamentally by the goals of the system and the system structure. People just provide the flexibility to make it work.
  • 57. Avoiding Human Error (cont) Human error implies deviation from "normal" or "ideal", but in complex situations and tasks there is often no normal or ideal that can be precisely and exactly described; many variable, interconnected touchpoints influence the decisions that are made
  • 58. Recognizing Human Error • be aware of other terms for it: slip, lapse, distraction, mistake, deviation, carelessness, malpractice, recklessness, violation, misjudgement, etc • don’t point to individuals when you really want to understand the system itself and the work • how do you feel when something goes wrong? • is it to find who did it / who screwed up, or to find how it happened?
  • 60. Root Cause • it leads to a simplistic and linear explanation of how events transpired • linear mental models of causality don’t capture what is needed to improve the safety of a system • ignores the complexity of an event, which is what should be explored if we are going to learn • leads directly to blaming things on human error
  • 61. Nietzschean anxiety when situations appear both threatening and ambiguous we seem to demand a clear causal agency; because if we cannot establish this agency then the "problem" is potentially irresolvable
  • 62. Hindsight Bias inclination, after an event has occurred, to see the event as having been predictable, despite there having been little or no objective basis for predicting it
  • 63. Counterfactuals the human tendency to create possible alternatives to life events that have already occurred; something that is contrary to what actually happened
  • 66. Meeting Format • Timeline • Discussion • Remediation Items
  • 67. Timeline • a rough timeline scaffolding is required • talk about facts that were known at the time, even if hindsight reveals misunderstandings in what we knew • look out for knowledge that some people were aware of, that others were not, and dig into that • no judgement about actions or knowledge (counterfactuals) • tell people to hold that thought if they jump to remediation items at this point
  • 68. Timeline (cont) • continually ask "What are we missing?" until those involved feel it's complete • continually ask "Does everyone agree this is the order in which events took place?" • make sure to include important times for events that happened (alerts, discoveries) • reach a consensus on the timeline and move on to the discussion
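The timeline guidance above (order the events, record important times, and look for knowledge some people had that others did not) can be sketched as a small data structure. This is an illustration only; the `TimelineEvent` fields, the `knowledge_gaps` helper, and the sample incident are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEvent:
    at: datetime                                # when it happened (alert, discovery, action)
    what: str                                   # the fact as it was known at the time
    known_by: set = field(default_factory=set)  # who was aware of it then


def ordered(events):
    """Return events in the order they took place."""
    return sorted(events, key=lambda e: e.at)


def knowledge_gaps(events, participants):
    """List (event, people unaware of it) pairs worth digging into."""
    gaps = []
    for event in ordered(events):
        unaware = participants - event.known_by
        if unaware:
            gaps.append((event, unaware))
    return gaps


if __name__ == "__main__":
    # Hypothetical incident with two responders.
    people = {"alice", "bob"}
    events = [
        TimelineEvent(datetime(2015, 3, 1, 14, 5), "disk alert fired", {"alice"}),
        TimelineEvent(datetime(2015, 3, 1, 14, 0), "deploy went out", {"alice", "bob"}),
    ]
    for event, unaware in knowledge_gaps(events, people):
        print(event.at.time(), event.what, "not known to", sorted(unaware))
```

The structure deliberately records only what was known and by whom, with no field for judgement about the action, which matches the advice to keep counterfactuals out of the timeline.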
  • 69. Discussion • When an action or decision was taken in the timeline, ask the person: "Think back to what you knew at the time, why did that action make sense to you at the time?" • Did we clean up anything after we were stable, how long did it take? • Was there any troubleshooting fatigue?
  • 70. Discussion (cont) • Did we do a good job with communication (site status, support, forums, etc)? • Were all tools on hand and working, ready to use when we needed them during the issue? Were there tools we would have liked to have? • Did we have enough metrics visibility to diagnose the issue? • Was there collaborative and thoughtful communication during the issue?
  • 71. Remediation • Remediation items should have tickets associated with them to follow up on • There can be further post meeting discussion on these but tasks should not linger
  • 72. Remediation questions • What things could we do to prevent this exact thing from happening in the future? • What things could we do to make troubleshooting similar incidents in the future easier?
  • 74. We Can Learn Before and After Failure
  • 75. Before • Architecture reviews for new technology • Operability reviews to gain launch confidence
  • 76. After • Postmortems are done soon after a failure • avoid human error, counterfactuals, hindsight bias, and root cause