6. Problem Report
● Expected behavior
● Actual behavior
● How to reproduce
● Store in a searchable location, e.g. a bug tracking system
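A minimal sketch of what such a report could look like as a record, assuming a hypothetical `ProblemReport` class and an in-memory list standing in for the bug tracking system. The field names and the sample report are illustrative only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProblemReport:
    """One report holding the three fields listed above (hypothetical type)."""
    expected: str
    actual: str
    how_to_reproduce: list[str]
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def matches(self, keyword: str) -> bool:
        """Case-insensitive keyword search across all text fields."""
        text = " ".join([self.expected, self.actual, *self.how_to_reproduce])
        return keyword.lower() in text.lower()


# In-memory stand-in for a bug tracking system (assumption, not a real tracker).
reports = [
    ProblemReport(
        expected="GET /health returns 200",
        actual="GET /health returns 503 after deploy",
        how_to_reproduce=["Deploy the new build", "curl the /health endpoint"],
    )
]

# Searching the store by keyword.
hits = [r for r in reports if r.matches("503")]
```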
7. Triage
● Measure the severity
Severity  Initial Response  Update Time  Resolution Time / SLO
P0        15 min            Real time    4 hours
P1        30 min            2 hours      8 hours
P2        1 hour            6 hours      36 hours
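The severity targets above can be turned into concrete deadlines. This is only a sketch: the `SLO` dictionary and the `deadlines` helper are hypothetical names, the values are copied from the table, and "Real time" updates are represented as `None`.

```python
from datetime import datetime, timedelta

# Targets copied from the table above; None marks "Real time" updates.
SLO = {
    "P0": {"initial_response": timedelta(minutes=15),
           "update": None,
           "resolution": timedelta(hours=4)},
    "P1": {"initial_response": timedelta(minutes=30),
           "update": timedelta(hours=2),
           "resolution": timedelta(hours=8)},
    "P2": {"initial_response": timedelta(hours=1),
           "update": timedelta(hours=6),
           "resolution": timedelta(hours=36)},
}


def deadlines(severity: str, opened_at: datetime) -> dict:
    """Return the response and resolution deadlines for an incident."""
    slo = SLO[severity]
    return {
        "respond_by": opened_at + slo["initial_response"],
        "resolve_by": opened_at + slo["resolution"],
    }


# Example: a P1 incident opened at 09:00 (arbitrary date).
opened = datetime(2024, 1, 1, 9, 0)
p1 = deadlines("P1", opened)
p2 = deadlines("P2", opened)
```

With these inputs, the P1 incident must get a first response by 09:30 and be resolved by 17:00 the same day.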
Triaging incidents at Trend Micro
10. Workaround
● Rollback
● Restart / Reboot
● Deploy new node
● Scale the instance up/out
● Failover database
● Shut down the service
My experience
11. Examine
● Graphing time-series metrics
● Logging
○ Structured binary format
○ Multiple verbosity levels, adjustable on the fly
○ Searchable
● Exposing current state
○ Endpoint to show error rate and latency
○ Configuration
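A rough sketch of two of the ideas above: changing log verbosity at run time, and collecting the error rate and latency that a status endpoint could return. `ServiceState`, `set_verbosity`, and the logger name `service` are hypothetical, and no web framework is assumed; `snapshot()` only builds the data such an endpoint might serve.

```python
import json
import logging

logger = logging.getLogger("service")  # hypothetical logger name
logger.setLevel(logging.INFO)


class ServiceState:
    """Tracks request outcomes so current state can be shown on demand."""

    def __init__(self):
        self.latencies_ms = []
        self.errors = 0

    def record(self, latency_ms, ok):
        """Record one request's latency and whether it succeeded."""
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def snapshot(self):
        """Data a status endpoint could return as JSON."""
        total = len(self.latencies_ms)
        return {
            "requests": total,
            "error_rate": self.errors / total if total else 0.0,
            "avg_latency_ms": sum(self.latencies_ms) / total if total else 0.0,
            "log_level": logging.getLevelName(logger.level),
        }


def set_verbosity(level_name):
    """Change log verbosity at run time, without a restart."""
    logger.setLevel(getattr(logging, level_name.upper()))


state = ServiceState()
state.record(120.0, ok=True)
state.record(80.0, ok=False)
set_verbosity("debug")  # raise verbosity while investigating
print(json.dumps(state.snapshot()))
```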
12. Diagnose
● Simplify and reduce
○ Divide and conquer
● Ask “what”, “where”, and “why”
● What touched it last
● Diagnostic tools
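Divide and conquer combines naturally with "what touched it last": given an ordered list of recent changes, a binary search finds the first one after which the system is unhealthy. The sketch below assumes the latest change is known to be bad and that the system stays bad once it breaks; `first_bad`, the change names, and the health check are all hypothetical.

```python
def first_bad(changes, is_healthy):
    """Return the first change after which the system is unhealthy.

    Assumes `changes` is ordered oldest to newest, the newest change is
    unhealthy, and health never recovers after the first bad change.
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy(changes[mid]):
            lo = mid + 1  # the culprit is somewhere after mid
        else:
            hi = mid      # mid is bad, so the culprit is mid or earlier
    return changes[lo]


# Hypothetical change history, oldest first.
changes = ["deploy-101", "config-102", "deploy-103", "schema-104", "deploy-105"]

# Stand-in health check: everything from deploy-103 onward is unhealthy.
bad_from = changes.index("deploy-103")
culprit = first_bad(changes, lambda c: changes.index(c) < bad_from)
```

Each health check halves the remaining candidates, so five changes need at most three checks instead of five.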
14. Test and Treat
● Prefer tests with mutually exclusive outcomes
● Consider the obvious first
● Experiments may give misleading results
● Active tests may have side effects
● Take clear notes before performing active testing
15. Cure
● Proving the root cause may be hard
○ Systems are complex; multiple factors may contribute
○ Reproducing the problem in live production may not be an option
● Postmortem is important
16. Postmortem
● Problem report
● Business impact
● Workaround / Fix
● Root cause
● Technical details
● Action items
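One way to keep postmortems complete is to check a draft against the section list above. This is a sketch under assumptions: the snake_case keys, the `missing_sections` helper, and the sample draft are hypothetical, not part of any existing tool.

```python
# Section keys mirroring the list above (key names are an assumption).
SECTIONS = [
    "problem_report",
    "business_impact",
    "workaround_fix",
    "root_cause",
    "technical_details",
    "action_items",
]


def missing_sections(draft):
    """Return the sections that are absent or empty in a postmortem draft."""
    return [s for s in SECTIONS if not draft.get(s, "").strip()]


# Hypothetical draft with only some sections filled in.
draft = {
    "problem_report": "Checkout returned errors after the evening deploy.",
    "business_impact": "Orders could not be placed during the incident.",
    "workaround_fix": "Rolled back to the previous build.",
    "root_cause": "   ",  # whitespace only, so it counts as missing
}

todo = missing_sections(draft)
```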
My experience