The Etsy organization has grown significantly over the last five years. As a company grows, more thought must be put into the techniques it uses to communicate and deal with failures. This talk will cover several techniques that have helped foster a Just Culture, one in which an effort is made to balance both safety and accountability.
19. inflection points
• architecture reviews: early feedback and discussion
• operability reviews: held before launching
• blameless post mortems: held after a failure
22. Etsy Tech Axioms
• we use a small number of well known tools
• all technology decisions come with trade-offs
• with new technology, many of those trade-offs are
unknown
• we’re growing. things change
24. Departures
a departure is when new technologies or patterns are
introduced that deviate from the current known methods of
operating the system and maintaining the software
25. How do I know I need an architecture review?
when there is a perceived departure from current technology
choices or patterns
26. How early do you hold them?
early enough to be able to bail out or make major course
corrections
27. Who should come?
• the people presenting the change
• key stakeholders (sr. engineers, or arch review working
group)
• everyone else that wants to learn about the proposed
changes to the system
29. Preparation
• a proposal is written in a shared document and circulated
• comments are added, discussed, and potentially resolved in
advance
• initial questions for the meeting are collected in a tool such
as Google Moderator
30. Some General Questions
• Do we understand the costs of this departure?
• Have we asked hard questions about trade-offs?
• What will this prohibit us from doing in the future?
31. Some General Questions (cont)
• Are we impacting visibility, measurability, debuggability and
other operability concerns?
• Are we impacting testability, security, translatability,
performance and other product quality concerns?
• Does it make sense?
32. The Arch Review
• proposal is presented to the group
• discuss questions and concerns
• decide if we are moving forward or need further discussion
34. Why might this end a project?
• we learned through this discussion that an alternative is
better
• we find goals overlap with other projects that are in
progress
• we discover that it isn't worth the costs now that we have a
better idea what they are
35. At the end we should have
• detailed notes from the conversation
• agreement on tricky components, documented
• a compilation of learnings and questions
• a decision of whether to keep going with the project, stop
and rethink, or gather more information
38. When do we do operability reviews?
• after architecture reviews in the product lifecycle, generally
right before launch
• when we need to gain increased confidence for launch due
to the technology, product, or communication choices
being risky
• if there's a chance you'd surprise teams that operate the
software
39. Who comes to the operability review?
representatives from:
• Product
• Development
• Operations
• Community/Support
• QA
40. Some Questions
• Has the feature been tested enough to deploy to
production?
• Does everyone know when it will go live, and who will push
the feature?
• Is there communication about the feature ready to go out
with the feature?
• Is it possible to turn up this feature on a percentage basis,
dark launch, or gameday it?
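One way to make a percentage-based turn-up possible is deterministic bucketing: each user is hashed into one of 100 stable buckets, and the feature is on for users whose bucket falls below the configured percentage. The sketch below is illustrative only and is not taken from the talk; the names `bucket_for` and `is_enabled`, and the choice of SHA-256, are assumptions.

```python
import hashlib


def bucket_for(feature: str, user_id: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0..99.

    Hypothetical helper: including the feature name in the hash means
    different features ramp up over different subsets of users.
    """
    key = f"{feature}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return bucket_for(feature, user_id) < percent
```

Because the bucket for a given user never changes, raising the percentage only adds users; anyone who already had the feature keeps it, which keeps the ramp-up predictable for the teams operating it.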
41. Some Questions (cont)
• Does the launch involve any new production infrastructure?
• If so, are those pieces in monitoring or metrics collection?
• If so, is there a deployment pipeline in place?
• If so, is there a development environment set up to make
it work in dev?
• If so, are there tests that can be and are run on CI?
52. What is a post mortem?
a postmortem is a facilitated meeting during which people
involved in, interested in, or close to an accident or incident
debrief together on how they think the event came about
53. What does it cover?
• walking through a timeline of events
• learning how things are expected to work "normally",
adding the context of everyone’s perspective
• exploring what we might do to improve things for the future
55. searching for second stories instead of human error
• asking "why" leads to "who is responsible"
• asking "how" leads to "what"
56. Avoiding Human Error
Human error points directly to individuals in a complex
system. But, in complex systems, system behaviour is driven
fundamentally by the goals of the system and the system
structure. People just provide the flexibility to make it work.
57. Avoiding Human Error (cont)
Human error implies deviation from "normal" or "ideal", but in
complex situations and tasks there is often no normal or ideal
that can be precisely described. Many variable, interconnected
touchpoints influence the decisions that are made.
58. Recognizing Human Error
• be aware of other terms for it: slip, lapse, distraction,
mistake, deviation, carelessness, malpractice, recklessness,
violation, misjudgement, etc
• don’t point to individuals when you really want to
understand the system itself and the work
• how do you feel when something goes wrong?
• is your instinct to find who did it / who screwed up, or to
find how it happened?
60. Root Cause
• it leads to a simplistic and linear explanation of how events
transpired
• linear mental models of causality don’t capture what is
needed to improve the safety of a system
• ignores the complexity of an event, which is what should be
explored if we are going to learn
• leads directly to blaming things on human error
61. Nietzschean anxiety
when situations appear both threatening and ambiguous we
seem to demand a clear causal agency; because if we cannot
establish this agency then the "problem" is potentially
irresolvable
62. Hindsight Bias
inclination, after an event has occurred, to see the event as
having been predictable, despite there having been little or no
objective basis for predicting it
63. Counterfactuals
the human tendency to create possible alternatives to life
events that have already occurred; something that is contrary
to what actually happened
67. Timeline
• a rough timeline scaffolding is required
• talk about facts that were known at the time, even if
hindsight reveals misunderstandings in what we knew
• look out for knowledge that some people were aware of,
that others were not, and dig into that
• no judgement about actions or knowledge (counterfactuals)
• tell people to hold that thought if they jump to remediation
items at this point
68. Timeline (cont)
• continually ask "What are we missing?" until those involved
feel it's complete
• continually ask "Does everyone agree this is the order in
which events took place?"
• make sure to include important times for events that
happened (alerts, discoveries)
• reach a consensus on the timeline and move on to the
discussion
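If the timeline scaffolding is kept as structured data before the meeting, ordering events and spotting knowledge that only some people had becomes mechanical. The following is a minimal sketch under that assumption; the `TimelineEvent` fields and the helper names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime          # when the event happened
    kind: str             # e.g. "alert", "discovery", "action"
    description: str      # what was observed or done, stated without judgement
    known_by: tuple = ()  # who was aware of this at the time


def build_timeline(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda e: e.at)


def knowledge_gaps(events, participants):
    """Return (event, unaware participants) pairs worth digging into."""
    gaps = []
    for event in build_timeline(events):
        unaware = [p for p in participants if p not in event.known_by]
        # Only flag events that some people knew about and others did not.
        if event.known_by and unaware:
            gaps.append((event, unaware))
    return gaps
```

The output is only a starting point for the conversation: the group still has to agree that the order is right and keep asking what is missing.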
69. Discussion
• When an action or decision was taken in the timeline, ask
the person: "Thinking back to what you knew at the time,
why did that action make sense to you?"
• Did we clean up anything after we were stable? How long
did it take?
• Was there any troubleshooting fatigue?
70. Discussion (cont)
• Did we do a good job with communication (site status,
support, forums, etc)?
• Were all tools on hand and working, ready to use when we
needed them during the issue? Were there tools we would
have liked to have?
• Did we have enough metrics visibility to diagnose the issue?
• Was there collaborative and thoughtful communication
during the issue?
71. Remediation
• Remediation items should have tickets associated with them
to follow up on
• There can be further post-meeting discussion on these, but
tasks should not linger
72. Remediation questions
• What things could we do to prevent this exact thing from
happening in the future?
• What things could we do to make troubleshooting similar
incidents in the future easier?