Presentation by Damon Edwards, co-founder of Rundeck at USENIX LISA in San Francisco, CA on November 3, 2017
See a Demo of Rundeck Enterprise :
https://www.rundeck.com/see-demo
--or--
Download Rundeck Open Source here:
https://rundeck.com/open-source
Connect:
Stack Overflow community: https://stackoverflow.com/questions/tagged/rundeck
Github: https://github.com/rundeck/rundeck/issues
Twitter: https://twitter.com/Rundeck
Facebook: https://www.facebook.com/RundeckInc/
LinkedIn: www.linkedin.com › company › rundeck-inc
7. 1. Use Lean thinking and analyze like any process
8. 1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
9. 1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
M
M
M
M
10. 1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TSM
M
M
M
11. 1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
W
W
W W
W
M
M
M
M
12. 1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
W
W
W W
W
EP
EP
M
M
M
M
13. 1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
W
W
W W
W
EP
EP
H HH
M
M
M
M
14. 1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
PD
PD
PD
W
W
W W
W
EP
EP
H HH
M
M
M
M
16. 1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
17. MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
18. MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Detect?
19. MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Detect? Diagnose!
20. MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Detect? Diagnose!
Diagnose > Detect
21. MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
22. MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Restore?
23. MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Restore?
Repair!
DEV
24. MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Restore?
Restore is good. Repair (fix) is ultimate measure.
Repair!
DEV
25. Cost?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
26. Cost?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
Context:
2
Context:
10
Context:
6
Context:
7
Context:
10
Context:
12
Context:
13
Context:
13
27. Cost?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Don’t underestimate the expense of context switching!
HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
Context:
2
Context:
10
Context:
6
Context:
7
Context:
10
Context:
12
Context:
13
Context:
13
28. 1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
29. 1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Shift-Left
30. 1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Solve here or here
Shift-Left
31. 1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Solve here or here NOT HERE
Shift-Left
33. Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Push the ability to take action this direction
34. Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Push the ability to take action this direction
But what gets
in the way?
35. Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Push the ability to take action this direction
SilosBut what gets
in the way?
36. 1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
37. Silos ruin everything
Backlog Context
I need X
Backlog
I do X
Requests
for X
Silo A
Priorities
Context
Priorities
Silo B
Tools Tools
38. Silos ruin everything
Backlog Context
I need X
Backlog
I do X
Requests
for X
Silo A
Priorities
Context
Priorities
Silo B
Tools Tools
43. Popular: Replace Silos with Cross Functional Team
Dev/Test Release OperatePlanning
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
45. EnvironmentsDBAs Network Security NOC
Dev/Test Release OperatePlanning
Popular: Replace Silos with Cross Functional Team
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
46. EnvironmentsDBAs Network Security NOC
Dev/Test Release OperatePlanning
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Popular: Replace Silos with Cross Functional Team
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
47. 1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
5. Establish Operations as a Service
50. … replace with Operations as a Service design pattern
Team A
(Dev)
Team B
(Ops)Ticket
System
Operations
as a
Service
Execute
On Demand
Define
Procedures
Vet
Procedures
Define
Policies
Actual Exceptions
Execute
On Demand
51. Changes how your organization thinks
about automated procedures…
52. Automated procedures are comprised of three parts
Definition of the automated procedure
Execution of the automated procedure
Governance of the automated procedure
Define
Execute
Govern
53. Automated procedures are comprised of three parts
Definition of the automated procedure
Execution of the automated procedure
Governance of the automated procedure
Define
Execute
Govern
(security, oversight, compliance, etc.)
60. fdfd
Operations as a Service
Operations
as a
Service
ED G
Team B
(Ops)
Vet
Procedures
Define
Policies
Execute
On Demand
Team A
(Dev)
Define
Procedures
Execute
On Demand
61. fdfd
Operations as a Service
Move definition, execution, and governance to where you
get the most effective use of labor and best flow of work
Operations
as a
Service
ED G
Team B
(Ops)
Vet
Procedures
Define
Policies
Execute
On Demand
Team A
(Dev)
Define
Procedures
Execute
On Demand
62. Rundeck: Open Source Platform For Operations as a Service
#! ! "# $
Scripts APIs Tools Cloud VMs Containers
Orchestration &
Scheduling of Workflows
Collect and
Process Output
Infrastructure
details and state
from multiple
sources
Config.
Man.
CMDB
Monitor.
Metrics
Cloud
Corp
Directory
Authentication
and roles
ITSM Tickets, work
status, approvals
>_
Create workflows ● Define ACL policies ● Execute workflows
Web GUI API CLI
64. Step 1: Establish a Secure Ops Hub
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Execute
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
65. Step 2: Establish a SDLC for Ops Procedures
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Execute
Source Code
Repo
if (($state==wait))
then
kill -9 $PID
fi
Change
Product Engineers
produce automated
procedures and health
checks.
RISKY
Automated Procedures
and Health Checks
FIX
Code review
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
66. Step 3: Connect with Enterprise Management Systems
Service Desk
CustomersOps Support get
visibility and audit trail
updated by support tools
Service Ticket
Execute
Software
Supply Chain
Ops integrate
with artifact
flow
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Source Code
Repo
if (($state==wait))
then
kill -9 $PID
fi
Change
Product Engineers
produce automated
procedures and health
checks.
RISKY
Automated Procedures
and Health Checks
FIX
Code review
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
67. Step 4: Make Compliance Really Happy
Service Desk
CustomersOps Support get
visibility and audit trail
updated by support tools
Service Ticket
Execute
Software
Supply Chain
Ops integrate
with artifact
flow
Who reviewed it? Who ran it? When? Where? Approval trail?
Who created the procedure?
Who created the policy?
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Source Code
Repo
if (($state==wait))
then
kill -9 $PID
fi
Change
Product Engineers
produce automated
procedures and health
checks.
RISKY
Automated Procedures
and Health Checks
FIX
Code review
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
68. 1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
5. Establish Operations as a Service
6. Encourage Developers to think “Operable First”
70. Operations is the business (unless you literally sell packaged software)
Encourage Developers to think “Operable First”
71. Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Encourage Developers to think “Operable First”
72. Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Encourage Developers to think “Operable First”
73. Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Encourage Developers to think “Operable First”
74. Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Make any handoff between teams “verification-driven”
Encourage Developers to think “Operable First”
75. Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Make any handoff between teams “verification-driven”
Create immutable versioned artifacts and use standard packaging
Encourage Developers to think “Operable First”
76. Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Make any handoff between teams “verification-driven”
Create immutable versioned artifacts and use standard packaging
Integration tests over unit tests
Encourage Developers to think “Operable First”
77. Let’s look at a company improving it’s
incident response capability…
84. escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
85. escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
EMT ER Trauma
Surgeon
Specialist
Surgeon
TOC (NOC) SRE
Production Eng.
Scrum Teams
Data Services
Platform Eng.
Global Network
> 15 min > 30 min > 60 min
86. 2. Key: Push the ability to take action closest to the problem
escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
EMT ER Trauma
Surgeon
Specialist
Surgeon
TOC (NOC) SRE
Production Eng.
Scrum Teams
Data Services
Platform Eng.
Global Network
> 15 min > 30 min > 60 min
87. 2. Key: Push the ability to take action closest to the problem
escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
EMT ER Trauma
Surgeon
Specialist
Surgeon
TOC (NOC) SRE
Production Eng.
Scrum Teams
Data Services
Platform Eng.
Global Network
> 15 min > 30 min > 60 min
3. Longterm investment in operability
(deployment, configuration, monitoring, automated runbooks)
88. “Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
89. “Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
90. “Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
91. “Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
• Empowered NOC and other support
teams with self-service ops tasks
92. “Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
• Empowered NOC and other support
teams with self-service ops tasks
• Empowered developers with limited
self-service operations
93. “Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
• Empowered NOC and other support
teams with self-service ops tasks
• Empowered developers with limited
self-service operations
• Combined with new incident response
and escalation model
94. 1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
5. Establish Operations as a Service
6. Encourage Developers to think “Operable First”
Recap!