SlideShare una empresa de Scribd logo
1 de 95
Descargar para leer sin conexión
Failure Happens:
Improving Incident Response In Enterprises
Damon Edwards
@damonedwards
Damon
Edwards
Ops Improvement
DevOps Consulting
Ops Tools
Community
1. This is not a talk about monitoring
2. This talk is about the enterprise
Please note….
Let’s look at an (unfortunately) typical incident…
[Get PDF here: https://rundeck.co/incident_lisa2017]
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Let’s condense that…
Take a DevOps approach to improvement…
1. Use Lean thinking and analyze like any process
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
M
M
M
M
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TSM
M
M
M
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
W
W
W W
W
M
M
M
M
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
W
W
W W
W
EP
EP
M
M
M
M
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
W
W
W W
W
EP
EP
H HH
M
M
M
M
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
PD
PD
PD
W
W
W W
W
EP
EP
H HH
M
M
M
M
Tip: Try techniques like Value Stream Mapping
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Detect?
MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Detect? Diagnose!
MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Detect? Diagnose!
Diagnose > Detect
MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Restore?
MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Restore?
Repair!
DEV
MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Restore?
Restore is good. Repair (fix) is ultimate measure.
Repair!
DEV
Cost?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
Cost?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
Context:
2
Context:
10
Context:
6
Context:
7
Context:
10
Context:
12
Context:
13
Context:
13
Cost?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Don’t underestimate the expense of context switching!
HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
Context:
2
Context:
10
Context:
6
Context:
7
Context:
10
Context:
12
Context:
13
Context:
13
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Shift-Left
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Solve here or here
Shift-Left
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Solve here or here NOT HERE
Shift-Left
Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Push the ability to take action this direction
Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Push the ability to take action this direction
But what gets
in the way?
Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Push the ability to take action this direction
SilosBut what gets
in the way?
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
Silos ruin everything
Backlog Context
I need X
Backlog
I do X
Requests
for X
Silo A
Priorities
Context
Priorities
Silo B
Tools Tools
Silos ruin everything
Backlog Context
I need X
Backlog
I do X
Requests
for X
Silo A
Priorities
Context
Priorities
Silo B
Tools Tools
Ticket-Driven Request Queues Are Best Indicator of Silos
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Silo Builder
Ticket-Driven Request Queues Are Best Indicator of Silos
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Silo Builder Snowflake Maker
Ticket-Driven Request Queues Are Best Indicator of Silos
Popular: Replace Silos with Cross Functional Team
Dev/Test Release OperatePlanning
Popular: Replace Silos with Cross Functional Team
Dev/Test Release OperatePlanning
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
Dev/Test Release OperatePlanning
Popular: Replace Silos with Cross Functional Team
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
EnvironmentsDBAs Network Security NOC
Dev/Test Release OperatePlanning
Popular: Replace Silos with Cross Functional Team
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
EnvironmentsDBAs Network Security NOC
Dev/Test Release OperatePlanning
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Popular: Replace Silos with Cross Functional Team
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
5. Establish Operations as a Service
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Silo Builder
Get rid of as many ticket-driven request queues as possible
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Silo Builder Snowflake Maker
Get rid of as many ticket-driven request queues as possible
… replace with Operations as a Service design pattern
Team A
(Dev)
Team B
(Ops)Ticket
System
Operations
as a
Service
Execute
On Demand
Define
Procedures
Vet
Procedures
Define
Policies
Actual Exceptions
Execute
On Demand
Changes how your organization thinks
about automated procedures…
Automated procedures are comprised of three parts
Definition of the automated procedure
Execution of the automated procedure
Governance of the automated procedure
Define
Execute
Govern
Automated procedures are comprised of three parts
Definition of the automated procedure
Execution of the automated procedure
Governance of the automated procedure
Define
Execute
Govern
(security, oversight, compliance, etc.)
Traditional Ops Silo
Define
Execute
Govern
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
Rigid Self-Service
Define
Execute
Govern
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
Define
Execute
Govern
Execute
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
Rigid Self-Service (limited)
High-Velocity Handoffs
Define
Govern
Execute
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
Self-Service Operations
Define
Govern
Execute
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
Self-Service Operations
Define
Govern
Execute
Govern
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
fdfd
Operations as a Service
Operations
as a
Service
ED G
Team B
(Ops)
Vet
Procedures
Define
Policies
Execute
On Demand
Team A
(Dev)
Define
Procedures
Execute
On Demand
fdfd
Operations as a Service
Move definition, execution, and governance to where you
get the most effective use of labor and best flow of work
Operations
as a
Service
ED G
Team B
(Ops)
Vet
Procedures
Define
Policies
Execute
On Demand
Team A
(Dev)
Define
Procedures
Execute
On Demand
Rundeck: Open Source Platform For Operations as a Service
#! ! "# $
Scripts APIs Tools Cloud VMs Containers
Orchestration &
Scheduling of Workflows
Collect and
Process Output
Infrastructure
details and state
from multiple
sources
Config.
Man.
CMDB
Monitor.
Metrics
Cloud
Corp
Directory
Authentication
and roles
ITSM Tickets, work
status, approvals
>_
Create workflows ● Define ACL policies ● Execute workflows
Web GUI API CLI
Common implementation pattern
for Operations as a Service…
Step 1: Establish a Secure Ops Hub
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Execute
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
Step 2: Establish a SDLC for Ops Procedures
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Execute
Source Code
Repo
if (($state==wait))
then
kill -9 $PID
fi
Change
Product Engineers
produce automated
procedures and health
checks.
RISKY
Automated Procedures
and Health Checks
FIX
Code review
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
Step 3: Connect with Enterprise Management Systems
Service Desk
CustomersOps Support get
visibility and audit trail
updated by support tools
Service Ticket
Execute
Software
Supply Chain
Ops integrate
with artifact
flow
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Source Code
Repo
if (($state==wait))
then
kill -9 $PID
fi
Change
Product Engineers
produce automated
procedures and health
checks.
RISKY
Automated Procedures
and Health Checks
FIX
Code review
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
Step 4: Make Compliance Really Happy
Service Desk
CustomersOps Support get
visibility and audit trail
updated by support tools
Service Ticket
Execute
Software
Supply Chain
Ops integrate
with artifact
flow
Who reviewed it? Who ran it? When? Where? Approval trail?
Who created the procedure?
Who created the policy?
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Source Code
Repo
if (($state==wait))
then
kill -9 $PID
fi
Change
Product Engineers
produce automated
procedures and health
checks.
RISKY
Automated Procedures
and Health Checks
FIX
Code review
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
5. Establish Operations as a Service
6. Encourage Developers to think “Operable First”
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Make any handoff between teams “verification-driven”
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Make any handoff between teams “verification-driven”
Create immutable versioned artifacts and use standard packaging
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Make any handoff between teams “verification-driven”
Create immutable versioned artifacts and use standard packaging
Integration tests over unit tests
Encourage Developers to think “Operable First”
Let’s look at a company improving it’s
incident response capability…
Mark
Maun
Jody
Mulkey
Justin
Dean
90% Reduction in MTTR
50% Reduction in escalations
55% Reduction of overall support costs
1. New org, support, and escalation model
escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
EMT ER Trauma
Surgeon
Specialist
Surgeon
TOC (NOC) SRE
Production Eng.
Scrum Teams
Data Services
Platform Eng.
Global Network
> 15 min > 30 min > 60 min
2. Key: Push the ability to take action closest to the problem
escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
EMT ER Trauma
Surgeon
Specialist
Surgeon
TOC (NOC) SRE
Production Eng.
Scrum Teams
Data Services
Platform Eng.
Global Network
> 15 min > 30 min > 60 min
2. Key: Push the ability to take action closest to the problem
escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
EMT ER Trauma
Surgeon
Specialist
Surgeon
TOC (NOC) SRE
Production Eng.
Scrum Teams
Data Services
Platform Eng.
Global Network
> 15 min > 30 min > 60 min
3. Longterm investment in operability
(deployment, configuration, monitoring, automated runbooks)
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
• Empowered NOC and other support
teams with self-service ops tasks
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
• Empowered NOC and other support
teams with self-service ops tasks
• Empowered developers with limited
self-service operations
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
• Empowered NOC and other support
teams with self-service ops tasks
• Empowered developers with limited
self-service operations
• Combined with new incident response
and escalation model
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
5. Establish Operations as a Service
6. Encourage Developers to think “Operable First”
Recap!
Let’s talk…
@damonedwards
damon@rundeck.com

Más contenido relacionado

La actualidad más candente

SysAdmin to SRE: Solving the Last Mile Problem
SysAdmin to SRE: Solving the Last Mile ProblemSysAdmin to SRE: Solving the Last Mile Problem
SysAdmin to SRE: Solving the Last Mile ProblemRundeck
 
Self-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still HappensSelf-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still HappensRundeck
 
Operations: The Last Mile
Operations: The Last Mile Operations: The Last Mile
Operations: The Last Mile Rundeck
 
The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management Rundeck
 
Helping Ops Help You: Development’s Role in Enabling Self-Service Operations
Helping Ops Help You:  Development’s Role in Enabling Self-Service OperationsHelping Ops Help You:  Development’s Role in Enabling Self-Service Operations
Helping Ops Help You: Development’s Role in Enabling Self-Service OperationsRundeck
 
Tickets Make Operations Work Unnecessarily Miserable
Tickets Make Operations Work Unnecessarily MiserableTickets Make Operations Work Unnecessarily Miserable
Tickets Make Operations Work Unnecessarily MiserableRundeck
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Rundeck
 
Clearing the Way For SRE In the Enterprise
Clearing the Way For SRE In the Enterprise Clearing the Way For SRE In the Enterprise
Clearing the Way For SRE In the Enterprise Rundeck
 
Making Tomorrow Better than Today - Unlocking the Full Potential of Operations
Making Tomorrow Better than Today - Unlocking the Full Potential of OperationsMaking Tomorrow Better than Today - Unlocking the Full Potential of Operations
Making Tomorrow Better than Today - Unlocking the Full Potential of OperationsRundeck
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Rundeck
 
Operations: The Last Mile
Operations: The Last Mile Operations: The Last Mile
Operations: The Last Mile Rundeck
 
SRE Lessons for the Enterprise
SRE Lessons for the Enterprise SRE Lessons for the Enterprise
SRE Lessons for the Enterprise Rundeck
 
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today  SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today Rundeck
 
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationEmpower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationRundeck
 
Operations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOpsOperations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOpsRundeck
 
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012Patrick McDonnell
 
Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09Andrew Shafer
 
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
All daydevops   2016 - Turning Human Capital into High Performance Organizati...All daydevops   2016 - Turning Human Capital into High Performance Organizati...
All daydevops 2016 - Turning Human Capital into High Performance Organizati...John Willis
 
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsC4Media
 

La actualidad más candente (20)

SysAdmin to SRE: Solving the Last Mile Problem
SysAdmin to SRE: Solving the Last Mile ProblemSysAdmin to SRE: Solving the Last Mile Problem
SysAdmin to SRE: Solving the Last Mile Problem
 
Self-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still HappensSelf-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still Happens
 
Operations: The Last Mile
Operations: The Last Mile Operations: The Last Mile
Operations: The Last Mile
 
The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management
 
Helping Ops Help You: Development’s Role in Enabling Self-Service Operations
Helping Ops Help You:  Development’s Role in Enabling Self-Service OperationsHelping Ops Help You:  Development’s Role in Enabling Self-Service Operations
Helping Ops Help You: Development’s Role in Enabling Self-Service Operations
 
Tickets Make Operations Work Unnecessarily Miserable
Tickets Make Operations Work Unnecessarily MiserableTickets Make Operations Work Unnecessarily Miserable
Tickets Make Operations Work Unnecessarily Miserable
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
Clearing the Way For SRE In the Enterprise
Clearing the Way For SRE In the Enterprise Clearing the Way For SRE In the Enterprise
Clearing the Way For SRE In the Enterprise
 
Making Tomorrow Better than Today - Unlocking the Full Potential of Operations
Making Tomorrow Better than Today - Unlocking the Full Potential of OperationsMaking Tomorrow Better than Today - Unlocking the Full Potential of Operations
Making Tomorrow Better than Today - Unlocking the Full Potential of Operations
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
Operations: The Last Mile
Operations: The Last Mile Operations: The Last Mile
Operations: The Last Mile
 
SRE Lessons for the Enterprise
SRE Lessons for the Enterprise SRE Lessons for the Enterprise
SRE Lessons for the Enterprise
 
SRE From Scratch
SRE From ScratchSRE From Scratch
SRE From Scratch
 
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today  SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
 
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationEmpower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
 
Operations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOpsOperations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOps
 
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
 
Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09
 
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
All daydevops   2016 - Turning Human Capital into High Performance Organizati...All daydevops   2016 - Turning Human Capital into High Performance Organizati...
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
 
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
 

Similar a Failure Happens: Improving Incident Response In Enterprises

DevOps Training - Ho Chi Minh City
DevOps Training - Ho Chi Minh CityDevOps Training - Ho Chi Minh City
DevOps Training - Ho Chi Minh CityChristian Trabold
 
Devops is (not ) a buzzword
Devops is (not ) a buzzwordDevops is (not ) a buzzword
Devops is (not ) a buzzwordMiguel Fonseca
 
Lead Developer: A Day in the Life
Lead Developer: A Day in the LifeLead Developer: A Day in the Life
Lead Developer: A Day in the LifeJohn Valentino
 
BizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
BizOps Done Right: Breaking DevOps Silos to Deliver Great User ExperiencesBizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
BizOps Done Right: Breaking DevOps Silos to Deliver Great User ExperiencesKlaus Enzenhofer
 
Chicago Docker Meetup Presentation - Mediafly
Chicago Docker Meetup Presentation - MediaflyChicago Docker Meetup Presentation - Mediafly
Chicago Docker Meetup Presentation - MediaflyMediafly
 
SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
SaltConf 2015: Salt stack at web scale: Better, Stronger, FasterSaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
SaltConf 2015: Salt stack at web scale: Better, Stronger, FasterThomas Jackson
 
The DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
The DevOps Pay Raise: Quantifying Your Value to Move Up the LadderThe DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
The DevOps Pay Raise: Quantifying Your Value to Move Up the Laddertlevey
 
Continuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon OttoContinuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon OttoPeter Bittner
 
Build It and They Will Come-Pliant
Build It and They Will Come-PliantBuild It and They Will Come-Pliant
Build It and They Will Come-PliantJulie Tsai
 
Atlassian - Software For Every Team
Atlassian - Software For Every TeamAtlassian - Software For Every Team
Atlassian - Software For Every TeamSven Peters
 
⛳️ Votre API passe-t-elle le contrôle technique ?
⛳️ Votre API passe-t-elle le contrôle technique ?⛳️ Votre API passe-t-elle le contrôle technique ?
⛳️ Votre API passe-t-elle le contrôle technique ?François-Guillaume Ribreau
 
Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點William Yeh
 
Continuous Delivery and Automated Operations on k8s with keptn
Continuous Delivery and Automated Operations on k8s with keptnContinuous Delivery and Automated Operations on k8s with keptn
Continuous Delivery and Automated Operations on k8s with keptnAndreas Grabner
 
Realising the true value of DevOps
Realising the true value of DevOpsRealising the true value of DevOps
Realising the true value of DevOpstlevey
 
DevOps with Serverless
DevOps with ServerlessDevOps with Serverless
DevOps with ServerlessYan Cui
 
Challenges and best practices of database continuous delivery
Challenges and best practices of database continuous deliveryChallenges and best practices of database continuous delivery
Challenges and best practices of database continuous deliveryDBmaestro - Database DevOps
 
Serverless 101 in Montreal
Serverless 101 in MontrealServerless 101 in Montreal
Serverless 101 in MontrealAaron Williams
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0Joakim Lindbom
 

Similar a Failure Happens: Improving Incident Response In Enterprises (20)

DevOps Training - Ho Chi Minh City
DevOps Training - Ho Chi Minh CityDevOps Training - Ho Chi Minh City
DevOps Training - Ho Chi Minh City
 
Devops is (not ) a buzzword
Devops is (not ) a buzzwordDevops is (not ) a buzzword
Devops is (not ) a buzzword
 
Lead Developer: A Day in the Life
Lead Developer: A Day in the LifeLead Developer: A Day in the Life
Lead Developer: A Day in the Life
 
BizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
BizOps Done Right: Breaking DevOps Silos to Deliver Great User ExperiencesBizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
BizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
 
Chicago Docker Meetup Presentation - Mediafly
Chicago Docker Meetup Presentation - MediaflyChicago Docker Meetup Presentation - Mediafly
Chicago Docker Meetup Presentation - Mediafly
 
SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
SaltConf 2015: Salt stack at web scale: Better, Stronger, FasterSaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
 
The DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
The DevOps Pay Raise: Quantifying Your Value to Move Up the LadderThe DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
The DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
 
Continuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon OttoContinuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon Otto
 
Build It and They Will Come-Pliant
Build It and They Will Come-PliantBuild It and They Will Come-Pliant
Build It and They Will Come-Pliant
 
Atlassian - Software For Every Team
Atlassian - Software For Every TeamAtlassian - Software For Every Team
Atlassian - Software For Every Team
 
⛳️ Votre API passe-t-elle le contrôle technique ?
⛳️ Votre API passe-t-elle le contrôle technique ?⛳️ Votre API passe-t-elle le contrôle technique ?
⛳️ Votre API passe-t-elle le contrôle technique ?
 
Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點
 
Continuous Delivery and Automated Operations on k8s with keptn
Continuous Delivery and Automated Operations on k8s with keptnContinuous Delivery and Automated Operations on k8s with keptn
Continuous Delivery and Automated Operations on k8s with keptn
 
Realising the true value of DevOps
Realising the true value of DevOpsRealising the true value of DevOps
Realising the true value of DevOps
 
DevOps with Serverless
DevOps with ServerlessDevOps with Serverless
DevOps with Serverless
 
Node fips
Node fipsNode fips
Node fips
 
Challenges and best practices of database continuous delivery
Challenges and best practices of database continuous deliveryChallenges and best practices of database continuous delivery
Challenges and best practices of database continuous delivery
 
Serverless 101 in Montreal
Serverless 101 in MontrealServerless 101 in Montreal
Serverless 101 in Montreal
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
 
Dash presentation
Dash presentationDash presentation
Dash presentation
 

Más de Rundeck

Rundeck Community Office Hours: Using Variables with Job Steps
Rundeck Community Office Hours:  Using Variables with Job Steps Rundeck Community Office Hours:  Using Variables with Job Steps
Rundeck Community Office Hours: Using Variables with Job Steps Rundeck
 
Introducing PagerDuty Process Automation
Introducing PagerDuty Process AutomationIntroducing PagerDuty Process Automation
Introducing PagerDuty Process AutomationRundeck
 
How to Build a Custom Plugin in Rundeck
How to Build a Custom Plugin in RundeckHow to Build a Custom Plugin in Rundeck
How to Build a Custom Plugin in RundeckRundeck
 
Lunch and learn: Getting started with Rundeck & Ansible
Lunch and learn:  Getting started with Rundeck & AnsibleLunch and learn:  Getting started with Rundeck & Ansible
Lunch and learn: Getting started with Rundeck & AnsibleRundeck
 
Self Service Cloud Operations: Safely Delegate the Management of your Cloud ...
Self Service Cloud Operations:  Safely Delegate the Management of your Cloud ...Self Service Cloud Operations:  Safely Delegate the Management of your Cloud ...
Self Service Cloud Operations: Safely Delegate the Management of your Cloud ...Rundeck
 
Rundeck Office Hours: Best Practices Access Control Policies
Rundeck Office Hours:  Best Practices Access Control PoliciesRundeck Office Hours:  Best Practices Access Control Policies
Rundeck Office Hours: Best Practices Access Control PoliciesRundeck
 
Mastering Secrets Management in Rundeck
Mastering Secrets Management in RundeckMastering Secrets Management in Rundeck
Mastering Secrets Management in RundeckRundeck
 
What's New in Rundeck 3.4
What's New in Rundeck 3.4   What's New in Rundeck 3.4
What's New in Rundeck 3.4 Rundeck
 
Automate Yourself Out of a Job: Safely Delegate the Management of your Azure...
Automate Yourself Out of a Job:  Safely Delegate the Management of your Azure...Automate Yourself Out of a Job:  Safely Delegate the Management of your Azure...
Automate Yourself Out of a Job: Safely Delegate the Management of your Azure...Rundeck
 
Super-Charge Your Site Reliability Practices with Runbook Automation
Super-Charge Your Site Reliability Practices with Runbook Automation Super-Charge Your Site Reliability Practices with Runbook Automation
Super-Charge Your Site Reliability Practices with Runbook Automation Rundeck
 
Introduction to Rundeck
Introduction to Rundeck Introduction to Rundeck
Introduction to Rundeck Rundeck
 
Automated Remediation with Rundeck + Sensu
Automated Remediation with Rundeck + SensuAutomated Remediation with Rundeck + Sensu
Automated Remediation with Rundeck + SensuRundeck
 
Modernizing Incident Response
Modernizing Incident Response Modernizing Incident Response
Modernizing Incident Response Rundeck
 
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]Rundeck
 
Datadog + Rundeck at DASH 2020
Datadog + Rundeck at DASH 2020Datadog + Rundeck at DASH 2020
Datadog + Rundeck at DASH 2020Rundeck
 
Rundeck Overview
Rundeck OverviewRundeck Overview
Rundeck OverviewRundeck
 
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationEmpower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationRundeck
 
Advanced Cluster Settings
Advanced Cluster Settings Advanced Cluster Settings
Advanced Cluster Settings Rundeck
 
Maximizing Your Rundeck Migration
Maximizing Your Rundeck Migration Maximizing Your Rundeck Migration
Maximizing Your Rundeck Migration Rundeck
 
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...Rundeck
 

Más de Rundeck (20)

Rundeck Community Office Hours: Using Variables with Job Steps
Rundeck Community Office Hours:  Using Variables with Job Steps Rundeck Community Office Hours:  Using Variables with Job Steps
Rundeck Community Office Hours: Using Variables with Job Steps
 
Introducing PagerDuty Process Automation
Introducing PagerDuty Process AutomationIntroducing PagerDuty Process Automation
Introducing PagerDuty Process Automation
 
How to Build a Custom Plugin in Rundeck
How to Build a Custom Plugin in RundeckHow to Build a Custom Plugin in Rundeck
How to Build a Custom Plugin in Rundeck
 
Lunch and learn: Getting started with Rundeck & Ansible
Lunch and learn:  Getting started with Rundeck & AnsibleLunch and learn:  Getting started with Rundeck & Ansible
Lunch and learn: Getting started with Rundeck & Ansible
 
Self Service Cloud Operations: Safely Delegate the Management of your Cloud ...
Self Service Cloud Operations:  Safely Delegate the Management of your Cloud ...Self Service Cloud Operations:  Safely Delegate the Management of your Cloud ...
Self Service Cloud Operations: Safely Delegate the Management of your Cloud ...
 
Rundeck Office Hours: Best Practices Access Control Policies
Rundeck Office Hours:  Best Practices Access Control PoliciesRundeck Office Hours:  Best Practices Access Control Policies
Rundeck Office Hours: Best Practices Access Control Policies
 
Mastering Secrets Management in Rundeck
Mastering Secrets Management in RundeckMastering Secrets Management in Rundeck
Mastering Secrets Management in Rundeck
 
What's New in Rundeck 3.4
What's New in Rundeck 3.4   What's New in Rundeck 3.4
What's New in Rundeck 3.4
 
Automate Yourself Out of a Job: Safely Delegate the Management of your Azure...
Automate Yourself Out of a Job:  Safely Delegate the Management of your Azure...Automate Yourself Out of a Job:  Safely Delegate the Management of your Azure...
Automate Yourself Out of a Job: Safely Delegate the Management of your Azure...
 
Super-Charge Your Site Reliability Practices with Runbook Automation
Super-Charge Your Site Reliability Practices with Runbook Automation Super-Charge Your Site Reliability Practices with Runbook Automation
Super-Charge Your Site Reliability Practices with Runbook Automation
 
Introduction to Rundeck
Introduction to Rundeck Introduction to Rundeck
Introduction to Rundeck
 
Automated Remediation with Rundeck + Sensu
Automated Remediation with Rundeck + SensuAutomated Remediation with Rundeck + Sensu
Automated Remediation with Rundeck + Sensu
 
Modernizing Incident Response
Modernizing Incident Response Modernizing Incident Response
Modernizing Incident Response
 
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
 
Datadog + Rundeck at DASH 2020
Datadog + Rundeck at DASH 2020Datadog + Rundeck at DASH 2020
Datadog + Rundeck at DASH 2020
 
Rundeck Overview
Rundeck OverviewRundeck Overview
Rundeck Overview
 
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationEmpower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
 
Advanced Cluster Settings
Advanced Cluster Settings Advanced Cluster Settings
Advanced Cluster Settings
 
Maximizing Your Rundeck Migration
Maximizing Your Rundeck Migration Maximizing Your Rundeck Migration
Maximizing Your Rundeck Migration
 
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
 

Último

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Último (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Failure Happens: Improving Incident Response In Enterprises

  • 1. Failure Happens: Improving Incident Response In Enterprises Damon Edwards @damonedwards
  • 3. 1. This is not a talk about monitoring 2. This talk is about the enterprise Please note….
  • 4. Let’s look at an (unfortunately) typical incident… [Get PDF here: https://rundeck.co/incident_lisa2017]
  • 5. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Let’s condense that…
  • 6. Take a DevOps approach to improvement…
  • 7. 1. Use Lean thinking and analyze like any process
  • 8. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile:
  • 9. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: M M M M
  • 10. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: TS TS TS TS TS TS TSM M M M
  • 11. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: TS TS TS TS TS TS TS W W W W W M M M M
  • 12. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: TS TS TS TS TS TS TS W W W W W EP EP M M M M
  • 13. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: TS TS TS TS TS TS TS W W W W W EP EP H HH M M M M
  • 14. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: TS TS TS TS TS TS TS PD PD PD W W W W W EP EP H HH M M M M
  • 15. Tip: Try techniques like Value Stream Mapping
  • 16. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things
  • 17. MTTD? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha)
  • 18. MTTD? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Detect?
  • 19. MTTD? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Detect? Diagnose!
  • 20. MTTD? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Detect? Diagnose! Diagnose > Detect
  • 21. MTTR? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha)
  • 22. MTTR? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Restore?
  • 23. MTTR? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Restore? Repair! DEV
  • 24. MTTR? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Restore? Restore is good. Repair (fix) is ultimate measure. Repair! DEV
  • 25. Cost? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
  • 26. Cost? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5 Context: 2 Context: 10 Context: 6 Context: 7 Context: 10 Context: 12 Context: 13 Context: 13
  • 27. Cost? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Don’t underestimate the expense of context switching! HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5 Context: 2 Context: 10 Context: 6 Context: 7 Context: 10 Context: 12 Context: 13 Context: 13
  • 28. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things 3. “Shift Left” control and decision making
  • 29. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Shift-Left
  • 30. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Solve here or here Shift-Left
  • 31. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Solve here or here NOT HERE Shift-Left
  • 32. Empower those closest to the issue escalate escalate 1° 2° 3° escalate 4°
  • 33. Empower those closest to the issue escalate escalate 1° 2° 3° escalate 4° Push the ability to take action this direction
  • 34. Empower those closest to the issue escalate escalate 1° 2° 3° escalate 4° Push the ability to take action this direction But what gets in the way?
  • 35. Empower those closest to the issue escalate escalate 1° 2° 3° escalate 4° Push the ability to take action this direction SilosBut what gets in the way?
  • 36. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things 3. “Shift Left” control and decision making 4. Get rid of silos
  • 37. Silos ruin everything Backlog Context I need X Backlog I do X Requests for X Silo A Priorities Context Priorities Silo B Tools Tools
  • 38. Silos ruin everything Backlog Context I need X Backlog I do X Requests for X Silo A Priorities Context Priorities Silo B Tools Tools
  • 39. Ticket-Driven Request Queues Are Best Indicator of Silos Team A (Dev) Team B (Ops) Ticket System ??
  • 40. Team A (Dev) Team B (Ops) Ticket System ?? Silo Builder Ticket-Driven Request Queues Are Best Indicator of Silos
  • 41. Team A (Dev) Team B (Ops) Ticket System ?? Silo Builder Snowflake Maker Ticket-Driven Request Queues Are Best Indicator of Silos
  • 42. Popular: Replace Silos with Cross Functional Team Dev/Test Release OperatePlanning
  • 43. Popular: Replace Silos with Cross Functional Team Dev/Test Release OperatePlanning Cross-Functional Teams Cross-Functional Teams Cross-Functional Teams
  • 44. Dev/Test Release OperatePlanning Popular: Replace Silos with Cross Functional Team Cross-Functional Teams Cross-Functional Teams Cross-Functional Teams
  • 45. EnvironmentsDBAs Network Security NOC Dev/Test Release OperatePlanning Popular: Replace Silos with Cross Functional Team Cross-Functional Teams Cross-Functional Teams Cross-Functional Teams
  • 46. EnvironmentsDBAs Network Security NOC Dev/Test Release OperatePlanning Team A (Dev) Team B (Ops) Ticket System ?? Popular: Replace Silos with Cross Functional Team Cross-Functional Teams Cross-Functional Teams Cross-Functional Teams
  • 47. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things 3. “Shift Left” control and decision making 4. Get rid of silos 5. Establish Operations as a Service
  • 48. Team A (Dev) Team B (Ops) Ticket System ?? Silo Builder Get rid of as many ticket-driven request queues as possible
  • 49. Team A (Dev) Team B (Ops) Ticket System ?? Silo Builder Snowflake Maker Get rid of as many ticket-driven request queues as possible
  • 50. … replace with Operations as a Service design pattern Team A (Dev) Team B (Ops)Ticket System Operations as a Service Execute On Demand Define Procedures Vet Procedures Define Policies Actual Exceptions Execute On Demand
  • 51. Changes how your organization thinks about automated procedures…
  • 52. Automated procedures are comprised of three parts Definition of the automated procedure Execution of the automated procedure Governance of the automated procedure Define Execute Govern
  • 53. Automated procedures are comprised of three parts Definition of the automated procedure Execution of the automated procedure Governance of the automated procedure Define Execute Govern (security, oversight, compliance, etc.)
  • 54. Traditional Ops Silo Define Execute Govern “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops
  • 55. Rigid Self-Service Define Execute Govern “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops
  • 56. Define Execute Govern Execute “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops Rigid Self-Service (limited)
  • 57. High-Velocity Handoffs Define Govern Execute “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops
  • 58. Self-Service Operations Define Govern Execute “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops
  • 59. Self-Service Operations Define Govern Execute Govern “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops
  • 60. fdfd Operations as a Service Operations as a Service ED G Team B (Ops) Vet Procedures Define Policies Execute On Demand Team A (Dev) Define Procedures Execute On Demand
  • 61. fdfd Operations as a Service Move definition, execution, and governance to where you get the most effective use of labor and best flow of work Operations as a Service ED G Team B (Ops) Vet Procedures Define Policies Execute On Demand Team A (Dev) Define Procedures Execute On Demand
  • 62. Rundeck: Open Source Platform For Operations as a Service #! ! "# $ Scripts APIs Tools Cloud VMs Containers Orchestration & Scheduling of Workflows Collect and Process Output Infrastructure details and state from multiple sources Config. Man. CMDB Monitor. Metrics Cloud Corp Directory Authentication and roles ITSM Tickets, work status, approvals >_ Create workflows ● Define ACL policies ● Execute workflows Web GUI API CLI
  • 63. Common implementation pattern for Operations as a Service…
  • 64. Step 1: Establish a Secure Ops Hub Operations as a Service Engineers get visibility and controlled self-service Secrets Ops Procedures “Status” “Firewall Change” "Restart" deny allow Identity Audit Logs Infrastructure view Service health System metrics Ops Support use for remediation procedures Inventory and Health Execute + Monitoring Tools Security and Ops manages access, configuration, and compliance
  • 65. Step 2: Establish a SDLC for Ops Procedures Operations as a Service Engineers get visibility and controlled self-service Secrets Ops Procedures “Status” “Firewall Change” "Restart" deny allow Identity Audit Logs Infrastructure view Service health System metrics Ops Support use for remediation procedures Inventory and Health Execute Source Code Repo if (($state==wait)) then kill -9 $PID fi Change Product Engineers produce automated procedures and health checks. RISKY Automated Procedures and Health Checks FIX Code review + Monitoring Tools Security and Ops manages access, configuration, and compliance
  • 66. Step 3: Connect with Enterprise Management Systems Service Desk CustomersOps Support get visibility and audit trail updated by support tools Service Ticket Execute Software Supply Chain Ops integrate with artifact flow Operations as a Service Engineers get visibility and controlled self-service Secrets Ops Procedures “Status” “Firewall Change” "Restart" deny allow Identity Audit Logs Infrastructure view Service health System metrics Ops Support use for remediation procedures Inventory and Health Source Code Repo if (($state==wait)) then kill -9 $PID fi Change Product Engineers produce automated procedures and health checks. RISKY Automated Procedures and Health Checks FIX Code review + Monitoring Tools Security and Ops manages access, configuration, and compliance
  • 67. Step 4: Make Compliance Really Happy Service Desk CustomersOps Support get visibility and audit trail updated by support tools Service Ticket Execute Software Supply Chain Ops integrate with artifact flow Who reviewed it? Who ran it? When? Where? Approval trail? Who created the procedure? Who created the policy? Operations as a Service Engineers get visibility and controlled self-service Secrets Ops Procedures “Status” “Firewall Change” "Restart" deny allow Identity Audit Logs Infrastructure view Service health System metrics Ops Support use for remediation procedures Inventory and Health Source Code Repo if (($state==wait)) then kill -9 $PID fi Change Product Engineers produce automated procedures and health checks. RISKY Automated Procedures and Health Checks FIX Code review + Monitoring Tools Security and Ops manages access, configuration, and compliance
  • 68. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things 3. “Shift Left” control and decision making 4. Get rid of silos 5. Establish Operations as a Service 6. Encourage Developers to think “Operable First”
  • 69. Encourage Developers to think “Operable First”
  • 70. Operations is the business (unless you literally sell packaged software) Encourage Developers to think “Operable First”
  • 71. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Encourage Developers to think “Operable First”
  • 72. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Build configurability into the service, don’t externalize it Encourage Developers to think “Operable First”
  • 73. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Build configurability into the service, don’t externalize it Demand “prod-like” environments everywhere Encourage Developers to think “Operable First”
  • 74. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Build configurability into the service, don’t externalize it Demand “prod-like” environments everywhere Make any handoff between teams “verification-driven” Encourage Developers to think “Operable First”
  • 75. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Build configurability into the service, don’t externalize it Demand “prod-like” environments everywhere Make any handoff between teams “verification-driven” Create immutable versioned artifacts and use standard packaging Encourage Developers to think “Operable First”
  • 76. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Build configurability into the service, don’t externalize it Demand “prod-like” environments everywhere Make any handoff between teams “verification-driven” Create immutable versioned artifacts and use standard packaging Integration tests over unit tests Encourage Developers to think “Operable First”
  • 77. Let’s look at a company improving it’s incident response capability…
  • 78.
  • 79.
  • 81. 90% Reduction in MTTR 50% Reduction in escalations 55% Reduction of overall support costs
  • 82.
  • 83. 1. New org, support, and escalation model
  • 84. escalate 1° 2° 3° 4° escalate escalate 1. New org, support, and escalation model
  • 85. escalate 1° 2° 3° 4° escalate escalate 1. New org, support, and escalation model EMT ER Trauma Surgeon Specialist Surgeon TOC (NOC) SRE Production Eng. Scrum Teams Data Services Platform Eng. Global Network > 15 min > 30 min > 60 min
  • 86. 2. Key: Push the ability to take action closest to the problem escalate 1° 2° 3° 4° escalate escalate 1. New org, support, and escalation model EMT ER Trauma Surgeon Specialist Surgeon TOC (NOC) SRE Production Eng. Scrum Teams Data Services Platform Eng. Global Network > 15 min > 30 min > 60 min
  • 87. 2. Key: Push the ability to take action closest to the problem escalate 1° 2° 3° 4° escalate escalate 1. New org, support, and escalation model EMT ER Trauma Surgeon Specialist Surgeon TOC (NOC) SRE Production Eng. Scrum Teams Data Services Platform Eng. Global Network > 15 min > 30 min > 60 min 3. Longterm investment in operability (deployment, configuration, monitoring, automated runbooks)
  • 88. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html
  • 89. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html • Automated Ops procedures written/ vetted by the delivery teams
  • 90. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html • Automated Ops procedures written/ vetted by the delivery teams • Ops remained in full control of what can run and security policy
  • 91. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html • Automated Ops procedures written/ vetted by the delivery teams • Ops remained in full control of what can run and security policy • Empowered NOC and other support teams with self-service ops tasks
  • 92. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html • Automated Ops procedures written/ vetted by the delivery teams • Ops remained in full control of what can run and security policy • Empowered NOC and other support teams with self-service ops tasks • Empowered developers with limited self-service operations
  • 93. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html • Automated Ops procedures written/ vetted by the delivery teams • Ops remained in full control of what can run and security policy • Empowered NOC and other support teams with self-service ops tasks • Empowered developers with limited self-service operations • Combined with new incident response and escalation model
  • 94. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things 3. “Shift Left” control and decision making 4. Get rid of silos 5. Establish Operations as a Service 6. Encourage Developers to think “Operable First” Recap!