Making software work well in production (through good software operability) is one of the goals of DevOps. Collaboration between Dev and Ops on the 'run book' or operation manual is one way to open up communication channels between Dev and Ops, leading to improved software operability.
This is the slide deck I used at DevOps Summit, Bangalore, on 18th December 2013.
Software operability and run book collaboration - DevOps Summit, Bangalore
1. #unidevops
Software Operability,
Run Book Collaboration,
and DevOps
Matthew Skelton
18th December 2013
DevOps Summit,
Bangalore, India
www.devops-summit.org
@matthewpskelton
softwareoperability.com
2. • Software Operability
• Run Book Collaboration
• Making Operability Work
• Questions
Agenda
3. • Software systems since 1998
• Software build & deployment
specialist & DevOps enthusiast
• London Continuous Delivery
meetup group - londoncd.org.uk
• Experience DevOps workshops
Background
11. • David Copeland (@davetron5000):
“How your software runs in
production is all that matters. The
most amazing abstractions, cleanest
code, or beautiful algorithms are
meaningless if your code doesn’t run
well on production.”
• http://www.naildrivin5.com/blog/2013/06/16/production-is-all-that-matters.html
Software Operability
22. • DevOps is one way to address
poor operability
• Improved collaboration and
communication between Dev
teams and Ops teams
• Example: Run Book Collaboration
How DevOps can help
28. #unidevops
Example
• 1 Table of Contents
• 2 System Overview
– 2.1 Service Overview
– 2.2 Contributing Applications, Daemons, and Windows Services
– 2.3 Hours of Operation
– 2.4 Execution Design
– 2.5 Infrastructure and Network Design
– 2.6 Resilience, Fault Tolerance and High-Availability
– 2.7 Throttling and Partial Shutdown
– 2.8 Required Resources
– 2.9 Expected Traffic and Load
  – 2.9.1 Hot or Peak Periods
  – 2.9.2 Warm Periods
  – 2.9.3 Cool or Quiet Periods
– 2.10 Environmental Differences
– 2.11 Tools
• 3 Security and Access Control
• 4 System Configuration
– 4.1 Configuration Management
• 5 System Backup and Restore
– 5.1 Backup Requirements
  – 5.1.1 Special Files
– 5.2 Backup Procedures
– 5.3 Restore Procedures
• 6 Monitoring and Alerting
– 6.1 Error Messages
– 6.2 Events
– 6.3 Health Checks
– 6.4 Other Messages
• 7 Operational Tasks
– 7.1 Deployment
– 7.2 Batch Processing
– 7.3 Power Procedures
– 7.4 Routine Checks
  – 7.4.1 System Rebuilds
– 7.5 Troubleshooting
• 8 Maintenance Tasks
– 8.1 Maintenance Procedures
  – 8.1.1 Patching
    – 8.1.1.1 Normal Cycle
    – 8.1.1.2 Zero-Day Vulnerabilities
  – 8.1.2 GMT/BST time changes
  – 8.1.3 Cleardown Activities
    – 8.1.3.1 Log Rotation
– 8.2 Testing
  – 8.2.1 Technical Testing
  – 8.2.2 Post-Deployment
• 9 Failure and Recovery Procedures
– 9.1 Failover
– 9.2 Recovery
– 9.3 Troubleshooting Failover and Recovery
• 10 Contact Details
29. #unidevops
Example
• 1 Table of Contents
• 2 System Overview
– 2.1 Service Overview
– 2.2 Contributing Applications, Daemons, and Windows Services
– 2.3 Hours of Operation
– 2.4 Execution Design
– 2.5 Infrastructure and Network Design
– 2.6 Resilience, Fault Tolerance and High-Availability
– 2.7 Throttling and Partial Shutdown
– 2.8 Required Resources
– 2.9 Expected Traffic and Load
• 3 Security and Access Control
• 4 System Configuration
• 5 System Backup and Restore
• 6 Monitoring and Alerting
• 7 Operational Tasks
• 8 Maintenance Tasks
• 9 Failure and Recovery Procedures
• 10 Contact Details
30. #unidevops
Example
2.1 Service Overview
2.2 Contributing Applications, Daemons, and Windows Services
2.3 Hours of Operation
2.4 Execution Design
2.5 Infrastructure and Network Design
2.6 Resilience, Fault Tolerance and High-Availability
2.7 Throttling and Partial Shutdown
2.8 Required Resources
2.9 Expected Traffic and Load
34. • Focus on the collaboration
• Run book is a means, not an end
• Throw it away when complete (?)
• Aim to automate more over time
• See http://runbookcollab.info/
Run Book as Collaboration
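One small way to start automating around the run book is to check a draft for the top-level sections in the example template, so that Dev and Ops can see which conversations have not happened yet. The sketch below is only an illustration, not part of the talk: the function name `missing_sections`, the plain-text draft format, and the rule for recognising a heading are all assumptions.

```python
# Top-level run book sections, taken from the example template in the slides.
REQUIRED_SECTIONS = [
    "System Overview",
    "Security and Access Control",
    "System Configuration",
    "System Backup and Restore",
    "Monitoring and Alerting",
    "Operational Tasks",
    "Maintenance Tasks",
    "Failure and Recovery Procedures",
    "Contact Details",
]


def missing_sections(run_book_text, required=REQUIRED_SECTIONS):
    """Return the required section titles that have no heading in the draft.

    Assumption: a heading is a line consisting of the section title,
    optionally preceded by a section number such as "6" or "6.".
    The comparison ignores case.
    """
    seen = set()
    for line in run_book_text.splitlines():
        # Drop surrounding whitespace and any leading section number.
        title = line.strip().lstrip("0123456789. ").strip().lower()
        if title:
            seen.add(title)
    return [section for section in required if section.lower() not in seen]


# Hypothetical draft with only three sections started.
draft = """2 System Overview
To be agreed with the Ops team.
6 Monitoring and Alerting
10 Contact Details
"""

for section in missing_sections(draft):
    print("Not yet discussed:", section)
```

The list of gaps is a prompt for the next Dev and Ops conversation rather than a completeness score; the run book stays a means, not an end.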
43. • “I’ll need to ask my manager first”
• Lack of autonomy
• Remove reporting barriers to regular,
effective communication
• More at
http://bit.ly/DevOpsTopologies
Organisation changes
47. • Too much overtime pay
• Too little overtime pay
• Rota team too small
• No training in incident response
• No team ownership of product
• No team autonomy for changes
On-call Anti-Patterns
48. • Team members want to help
make things better
• Empowered to fix problems
• Reduce the times they are woken
up
On call - Goal
49. • Operational Features, not “NFRs”
• Sustainable collaboration
• Sensible, fair on-call rotas
• Over-compensate in time off
• Avoid burn-out
The operability of operability
55. • Patterns for
Performance and
Operability
– Ford, Gileadi, Purba,
Moerman
• http://whoownsmyoperability.com/
– Recommended reading lists
Further Reading
56. • Software Operability – How to make
software work well in Production
– Due early 2014
• Sign up at OperabilityBook.com
• Discount code for DevOps Summit
attendees
Operability Book
57. • A hands-on workshop for DevOps
culture
• Forthcoming dates:
– Bangalore: 19th December 2013
– London: February 2014 (tbc)
• http://experiencedevops.org/
Experience DevOps