SlideShare una empresa de Scribd logo
1 de 49
Amplifying Sources of
Resilience
What the Research Says
John Allspaw (@allspaw)
Adaptive Capacity Labs (@adaptiveclabs)
InfoQ.com: News & Community Site
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
resilience-thinking-paradigm/
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
me
2009 Velocity Conf
Consortium for Resilient
Internet-Facing Business IT
Adaptive
Capacity
Labs
Resilience Engineering
Is a Field of Study
• Emerged from Cognitive Systems Engineering
• Early 2000s, largely in response to NASA events in 1999 and 2000
• 7 symposia over 12 years
Resilience Engineering
is a Community
is largely made up of practitioners and researchers from….
Cybernetics
Engineering*
Ecology
Safety Science
Biology
Control Systems
Human Factors & Ergonomics
Cognitive Systems Engineering
Complexity Science
Cognitive Psychology
Sociology
Operations Research
working in domains such as…
Rail Maritime
Surgery
Intelligence AgenciesLaw Enforcement
Aviation/ATM
Space
Mining
Construction
Explosives
FirefightingAnesthesia
Pediatrics
Power Grid & Distribution
Military Agencies
Software Engineering
Resilience Engineering
is a Community
Some of the cast of characters
David Woods
CSEL/OSU
Shawna Perry
Univ of Florida
Emergency Medicine
Dr. Richard Cook
Anesthesiologist
Researcher
Ivonne Andrade Herrera
SINTEF
Erik Hollnagel
Univ of S. Denmark
Anne-Sophie Nyssen
University de Liege
Johan Bergström
Lund University
Sidney Dekker
Griffith University
Asher Balkin
CSEL/OSU
Laura Maguire
CSEL/OSU
Some of the cast of characters
David Woods
CSEL/OSU
Shawna Perry
Univ of Florida
Emergency Medicine
Dr. Richard Cook
Anesthesiologist
Researcher
Ivonne Andrade Herrera
SINTEF
Erik Hollnagel
Univ of S. Denmark
Anne-Sophie Nyssen
University de Liege Johan Bergström
Lund University
Sidney Dekker
Griffith University
Asher Balkin
CSEL/OSU
Laura Maguire
CSEL/OSU
J. Paul ReedJessica DeVitaCasey RosenthalNora Jones (me)
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
macro
descriptions
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
systemsystem framing
doing
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
deploy organization/
“monitoring”
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
code deploy
organization/
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
The Work Is Done
Here
Your Product Or
Service
The Stuff You Build and
Maintain With
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
Copyright © 2016 by R.I. Cook for ACL, all rights reserved
ack: Michael Angeles http://konigi.com/tools/
What matters. Why what matters matters.
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
above
the line
below
the line
Why is it doing that?
What needs to change?
What does it mean?
How should this work?
What’s it doing?
What does it mean?
What is happening?
What should be happening
What does it mean?
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
goals
purposes
risks
cognition
actions
interactions
speech
gestures
clicks
signals
representations
artifacts
the line of
representation
individuals have
unique models
of the “system”
Copyright © 2016 by R.I. Cook for ACL, all rights reserved
ack: Michael Angeles http://konigi.com/tools/
What matters. Why what matters matters.
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
above
the line
below
the line
Why is it doing that?
What needs to change?
What does it mean?
How should this work?
What’s it doing?
What does it mean?
What is happening?
What should be happening
What does it mean?
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
goals
purposes
risks
cognition
actions
interactions
speech
gestures
clicks
signals
representations
artifacts
the line of
representation
individuals have
unique models
of the “system”
observing
inferring
anticipating
planning
troubleshooting
diagnosing
correcting
modifying
reacting
Copyright © 2016 by R.I. Cook for ACL, all rights reserved
ack: Michael Angeles http://konigi.com/tools/
What matters. Why what matters matters.
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
above
the line
below
the line
Why is it doing that?
What needs to change?
What does it mean?
How should this work?
What’s it doing?
What does it mean?
What is happening?
What should be happening
What does it mean?
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
goals
purposes
risks
cognition
actions
interactions
speech
gestures
clicks
signals
representations
artifacts
the line of
representation
individuals have
unique models
of the “system”
Copyright © 2016 by R.I. Cook for ACL, all rights reserved
ack: Michael Angeles http://konigi.com/tools/
What matters. Why what matters matters.
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
above
the line
below
the line
Why is it doing that?
What needs to change?
What does it mean?
How should this work?
What’s it doing?
What does it mean?
What is happening?
What should be happening
What does it mean?
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
goals
purposes
risks
cognition
actions
interactions
speech
gestures
clicks
signals
representations
artifacts
the line of
representation
individuals have
unique models
of the “system”
observing
inferring
anticipating
planning
troubleshooting
diagnosing
correcting
modifying
reacting
What matters. Why what matters matters.
code deploy
organization/
encapsulation “monitoring”
Why is it doing that?
hat needs to change?
What does it mean?
How should this work?
What’s it doing?
What does it mean?
What is happening?
What should be happening
What does it mean?
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
go
purp
ris
cogn
act
intera
spe
ges
cli
sig
represe
What matters. Why what matters matters.
code deploy
organization/
encapsulation “monitoring”
Why is it doing that?
hat needs to change?
What does it mean?
How should this work?
What’s it doing?
What does it mean?
What is happening?
What should be happening
What does it mean?
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
go
purp
ris
cogn
act
intera
spe
ges
cli
sig
represe
observing
inferring
anticipating
planning
troubleshooting
diagnosing
correcting
modifying
reacting
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
Adding stuff
to the running
system
Getting stuff
ready to be part
of the running
system
architectural
& structural
framing
keeping track
of what “the
system” is
doing
code repositories
macro
descriptions
testing/validation
suites
code
code stuff
meta
rules
scripts,
rules, etc.
test cases
code
generating
tools
testing
tools
deploy
tools
organization/
encapsulation
tools
“monitoring”
tools
externally sourced
code (e.g. DB)
results
the using
world
delivery
technology
stack
internally sourced code
results
Time
…and things are
changing here
things are
changing
here…
Amplifying Sources of
Resilience
What the Research Says
John Allspaw
Adaptive Capacity Labs
clarifications about
what this actually is
can’t amplify
something if you
can’t find it first
what we find when we look closely
at incidents in software
logs
time of year
day of year
time of day
observations and
hypotheses others
share
what has been
investigated thus far
what’s been happening
in the world
(news, service provider
outages, etc.)
time-series
data
alerts
tracing/
observability tools
recent
changes in
existing tech
new
dependencies
who is on
vacation, at a
conference,
traveling, etc.
status of other
ongoing work
- tenure
- domain expertise
- past experience with details
multiple perspectives on
• what “it” is that is happening
• what can — and what cannot — be done
to “stem the bleeding” or “reduce the
blast radius”
• who has authority to take certain
actions
• what shouldn’t be tried to mitigate or
repair
- problem detection
- generating hypotheses
- diagnostic actions
- therapeutic actions
- sacrifice decisions
- coordinating
- (re) planning
- preparing for potential escalation/cascades
multiple threads of activity
some productive
some unproductive
time pressure
high consequences
this is not
“debugging”
“troubleshooting”
therefore…“resilience”?
software development may incur future
liability in order to achieve short-term goals
1992
Ward Cunningham THANK YOU, WARD!
resilience is not:
• preventative design
• fault-tolerance
• redundancy
• Chaos Engineering
• stuff about software or hardware
• a property that a system has
unforeseen
unanticipated
unexpected
fundamentally surprising
–Scott Sagan “The Limits of Safety”
“things that have never happened before
happen all the time”
resilience is:
• proactive activities aimed at preparing to be unprepared
— without an ability to justify it economically!
• sustaining the potential for future adaptive action when
conditions change
• something that a system does, not what it has
sustained
adaptive capacity
sustained
adaptive capacity
Poised To Adapt
1. Knowing what the platform is supposed to do
2. Knowing how the platform works
3. What the platform’s behavior means
4. Being able to devise a change that addresses 1, 2, & 3
5. Being able to predict the effects of that change
6. Being able to force the platform to change in that way
7. Being prepared to deal with the consequences
Dr. Richard Cook, Velocity Conf 2016 Santa Clara, CA
Finding sources of resilience means finding and
understanding cognitive work.
observing
inferring
anticipating
planning
troubleshooting
diagnosing
correcting
modifying
reacting
all incidents can be worse
what are things (people, maneuvers, knowledge, etc.) that went into
preventing it from being worse?
How can I find this “adaptive capacity”?
Find incidents that have:
• high degree of surprise
• whose consequences were not severe
• and look closely at the details about what went into
making it not nearly as bad as it could have been
• protect and acknowledge explicitly the sources you find
indications of surprise and novelty
<murphy> wtf happened here
<steve> I have no idea what is going on
<laurie> well that's terrifying
indications about contrasting mental models
<laurie> so you want to rebuild {server01} first?
<laurie> neither box has been touched yet
<laurie> and im a tad nervous to do both at once
<lisa> wait wait, i thought the X table was small
<jeremy> I'm still a bit confused why B and A are different
if A got to 0 and B is still at 3099
<tim>: oh I see.. the retry interval is pretty aggressive
why not look at incidents with
severe consequences?
• scrutiny from stakeholders with face-saving agenda tend to block deep
inquiry
• with “medium-severe” incidents the cost of getting details/descriptions
of people’s perspectives is low relative to the potential gain
• “Goldilocks” incidents are the ideal
some (contextual) sources
• esoteric knowledge and expertise in the organization

• flexible and dynamic staffing for novel situations

• authority that is expected to migrate across roles

• a “constant sense of unease” that drives explorations of “normal” work

• capture and dissemination of near-misses
Summary!
• Resilience is something a system does, not what a system has.
• Creating and sustaining adaptive capacity in an organization
while being unable to justify doing it specifically = resilient action.
• How people (the flexible elements of The System™) cope with
surprise is the path to finding sources of resilience
Resilience is the story of the
outage that didn’t happen.
Thank You!
@allspaw
Resilience Is A Verb (Woods, 2018)
http://bit.ly/ResilienceIsAVerb
Stella Report
http://stella.report
https://www.adaptivecapacitylabs.com/blog
@AdaptiveCLabs
SRE Cognitive Work
(chapter in Seeking SRE, O’Reilly Media)
http://bit.ly/SRECognitiveWork
How Complex Systems Fail (Cook, 1998)
http://bit.ly/ComplexSystemsFailure
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
resilience-thinking-paradigm/

Más contenido relacionado

Más de C4Media

Más de C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Último

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Último (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Amplifying Sources of Resilience: What Research Says

  • 1. Amplifying Sources of Resilience What the Research Says John Allspaw (@allspaw) Adaptive Capacity Labs (@adaptiveclabs)
  • 2. InfoQ.com: News & Community Site Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ resilience-thinking-paradigm/ • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  • 4. me 2009 Velocity Conf Consortium for Resilient Internet-Facing Business IT Adaptive Capacity Labs
  • 5. Resilience Engineering Is a Field of Study • Emerged from Cognitive Systems Engineering • Early 2000s, largely in response to NASA events in 1999 and 2000 • 7 symposia over 12 years
  • 6. Resilience Engineering is a Community is largely made up of practitioners and researchers from…. Cybernetics Engineering* Ecology Safety Science Biology Control Systems Human Factors & Ergonomics Cognitive Systems Engineering Complexity Science Cognitive Psychology Sociology Operations Research
  • 7. working in domains such as… Rail Maritime Surgery Intelligence AgenciesLaw Enforcement Aviation/ATM Space Mining Construction Explosives FirefightingAnesthesia Pediatrics Power Grid & Distribution Military Agencies Software Engineering Resilience Engineering is a Community
  • 8. Some of the cast of characters David Woods CSEL/OSU Shawna Perry Univ of Florida Emergency Medicine Dr. Richard Cook Anesthesiologist Researcher Ivonne Andrade Herrera SINTEF Erik Hollnagel Univ of S. Denmark Anne-Sophie Nyssen University de Liege Johan Bergström Lund University Sidney Dekker Griffith University Asher Balkin CSEL/OSU Laura Maguire CSEL/OSU
  • 9. Some of the cast of characters David Woods CSEL/OSU Shawna Perry Univ of Florida Emergency Medicine Dr. Richard Cook Anesthesiologist Researcher Ivonne Andrade Herrera SINTEF Erik Hollnagel Univ of S. Denmark Anne-Sophie Nyssen University de Liege Johan Bergström Lund University Sidney Dekker Griffith University Asher Balkin CSEL/OSU Laura Maguire CSEL/OSU J. Paul ReedJessica DeVitaCasey RosenthalNora Jones (me)
  • 10.
  • 11. externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results
  • 12. externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results macro descriptions externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results
  • 13. code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results
  • 14. code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools systemsystem framing doing code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results
  • 15. deploy organization/ “monitoring” Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing code deploy organization/
  • 16. code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results The Work Is Done Here Your Product Or Service The Stuff You Build and Maintain With
  • 17. code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results
  • 18. code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results Copyright © 2016 by R.I. Cook for ACL, all rights reserved ack: Michael Angeles http://konigi.com/tools/ What matters. Why what matters matters. code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools above the line below the line Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What does it mean? What is happening? What should be happening What does it mean? Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results goals purposes risks cognition actions interactions speech gestures clicks signals representations artifacts the line of representation individuals have unique models of the “system” Copyright © 2016 by R.I. Cook for ACL, all rights reserved ack: Michael Angeles http://konigi.com/tools/ What matters. Why what matters matters. code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools above the line below the line Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What does it mean? What is happening? What should be happening What does it mean? Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results goals purposes risks cognition actions interactions speech gestures clicks signals representations artifacts the line of representation individuals have unique models of the “system” observing inferring anticipating planning troubleshooting diagnosing correcting modifying reacting
  • 19. Copyright © 2016 by R.I. Cook for ACL, all rights reserved ack: Michael Angeles http://konigi.com/tools/ What matters. Why what matters matters. code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools above the line below the line Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What does it mean? What is happening? What should be happening What does it mean? Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results goals purposes risks cognition actions interactions speech gestures clicks signals representations artifacts the line of representation individuals have unique models of the “system” Copyright © 2016 by R.I. Cook for ACL, all rights reserved ack: Michael Angeles http://konigi.com/tools/ What matters. Why what matters matters. code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools above the line below the line Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What does it mean? What is happening? What should be happening What does it mean? Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results goals purposes risks cognition actions interactions speech gestures clicks signals representations artifacts the line of representation individuals have unique models of the “system” observing inferring anticipating planning troubleshooting diagnosing correcting modifying reacting
  • 20. What matters. Why what matters matters. code deploy organization/ encapsulation “monitoring” Why is it doing that? hat needs to change? What does it mean? How should this work? What’s it doing? What does it mean? What is happening? What should be happening What does it mean? Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing go purp ris cogn act intera spe ges cli sig represe What matters. Why what matters matters. code deploy organization/ encapsulation “monitoring” Why is it doing that? hat needs to change? What does it mean? How should this work? What’s it doing? What does it mean? What is happening? What should be happening What does it mean? Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing go purp ris cogn act intera spe ges cli sig represe observing inferring anticipating planning troubleshooting diagnosing correcting modifying reacting
  • 21. code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools Adding stuff to the running system Getting stuff ready to be part of the running system architectural & structural framing keeping track of what “the system” is doing code repositories macro descriptions testing/validation suites code code stuff meta rules scripts, rules, etc. test cases code generating tools testing tools deploy tools organization/ encapsulation tools “monitoring” tools externally sourced code (e.g. DB) results the using world delivery technology stack internally sourced code results Time …and things are changing here things are changing here…
  • 22. Amplifying Sources of Resilience What the Research Says John Allspaw Adaptive Capacity Labs clarifications about what this actually is can’t amplify something if you can’t find it first
  • 23. what we find when we look closely at incidents in software
  • 24. logs time of year day of year time of day observations and hypotheses others share what has been investigated thus far what’s been happening in the world (news, service provider outages, etc.) time-series data alerts tracing/ observability tools recent changes in existing tech new dependencies who is on vacation, at a conference, traveling, etc. status of other ongoing work
  • 25. - tenure - domain expertise - past experience with details
  • 26. multiple perspectives on • what “it” is that is happening • what can — and what cannot — be done to “stem the bleeding” or “reduce the blast radius” • who has authority to take certain actions • what shouldn’t be tried to mitigate or repair
  • 27. - problem detection - generating hypotheses - diagnostic actions - therapeutic actions - sacrifice decisions - coordinating - (re) planning - preparing for potential escalation/cascades multiple threads of activity some productive some unproductive
  • 31. software development may incur future liability in order to achieve short-term goals 1992 Ward Cunningham THANK YOU, WARD!
  • 32. resilience is not: • preventative design • fault-tolerance • redundancy • Chaos Engineering • stuff about software or hardware • a property that a system has
  • 34. –Scott Sagan “The Limits of Safety” “things that have never happened before happen all the time”
  • 35. resilience is: • proactive activities aimed at preparing to be unprepared — without an ability to justify it economically! • sustaining the potential for future adaptive action when conditions change • something that a system does, not what it has
  • 38. Poised To Adapt 1. Knowing what the platform is supposed to do 2. Knowing how the platform works 3. What the platform’s behavior means 4. Being able to devise a change that addresses 1, 2, & 3 5. Being able to predict the effects of that change 6. Being able to force the platform to change in that way 7. Being prepared to deal with the consequences Dr. Richard Cook, Velocity Conf 2016 Santa Clara, CA
  • 39. Finding sources of resilience means finding and understanding cognitive work. observing inferring anticipating planning troubleshooting diagnosing correcting modifying reacting
  • 40. all incidents can be worse what are things (people, maneuvers, knowledge, etc.) that went into preventing it from being worse?
  • 41. How can I find this “adaptive capacity”? Find incidents that have: • high degree of surprise • whose consequences were not severe • and look closely at the details about what went into making it not nearly as bad as it could have been • protect and acknowledge explicitly the sources you find
  • 42. indications of surprise and novelty <murphy> wtf happened here <steve> I have no idea what is going on <laurie> well that's terrifying
  • 43. indications about contrasting mental models <laurie> so you want to rebuild {server01} first? <laurie> neither box has been touched yet <laurie> and im a tad nervous to do both at once <lisa> wait wait, i thought the X table was small <jeremy> I'm still a bit confused why B and A are different if A got to 0 and B is still at 3099 <tim>: oh I see.. the retry interval is pretty aggressive
  • 44. why not look at incidents with severe consequences? • scrutiny from stakeholders with face-saving agenda tend to block deep inquiry • with “medium-severe” incidents the cost of getting details/descriptions of people’s perspectives is low relative to the potential gain • “Goldilocks” incidents are the ideal
  • 45. some (contextual) sources • esoteric knowledge and expertise in the organization • flexible and dynamic staffing for novel situations • authority that is expected to migrate across roles • a “constant sense of unease” that drives explorations of “normal” work • capture and dissemination of near-misses
  • 46. Summary! • Resilience is something a system does, not what a system has. • Creating and sustaining adaptive capacity in an organization while being unable to justify doing it specifically = resilient action. • How people (the flexible elements of The System™) cope with surprise is the path to finding sources of resilience
  • 47. Resilience is the story of the outage that didn’t happen.
  • 48. Thank You! @allspaw Resilience Is A Verb (Woods, 2018) http://bit.ly/ResilienceIsAVerb Stella Report http://stella.report https://www.adaptivecapacitylabs.com/blog @AdaptiveCLabs SRE Cognitive Work (chapter in Seeking SRE, O’Reilly Media) http://bit.ly/SRECognitiveWork How Complex Systems Fail (Cook, 1998) http://bit.ly/ComplexSystemsFailure
  • 49. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ resilience-thinking-paradigm/