2. @diego_pacheco
❏ Cat's Father
❏ Head of Software Architecture
❏ Agile Coach
❏ SOA/Microservices Expert
❏ DevOps Practitioner
❏ Speaker
❏ Author
diegopacheco
http://diego-pacheco.blogspot.com.br/
About me...
https://diegopacheco.github.io/
tinyurl.com/diegopacheco
3. Things we will talk about...
01 DON'TS: Things you should avoid.
02 How to run an investigation: Science 101
03 Tools & Tricks: Observability, Debug Tricks, Linux Tools...
04 Closing the Case: Post Mortems, RCAs and Retrospectives
5. Don't: FREAK OUT
❏ Avoid extra pressure on yourself
❏ Avoid worrying about time.
❏ Avoid making comparisons like "this
should be done in 1h or so."
❏ As time passes, it is easy to freak out.
❏ Making progress is your friend; don't
make it only about solving the
mystery.
❏ Your eyes can trick you, use tools.
❏ Make sure you do things properly.
❏ Comparing Strings
❏ Read Errors Carefully.
❏ Don't forget to breathe.
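One way to let a tool do the string comparison instead of your eyes is sketched below. It is only an illustration: the function name `explain_difference` and the sample values are assumptions, not something from the talk. It uses `repr()` to expose invisible characters and the standard `difflib` module to list changed characters.

```python
import difflib


def explain_difference(expected: str, actual: str) -> list:
    """Return human-readable notes on how two strings differ (empty if equal)."""
    if expected == actual:
        return []
    # repr() makes trailing spaces, tabs and newlines visible.
    notes = [f"expected={expected!r}", f"actual={actual!r}"]
    if expected.strip() == actual.strip():
        notes.append("differs only in leading/trailing whitespace")
    elif expected.lower() == actual.lower():
        notes.append("differs only in letter case")
    else:
        # ndiff prefixes items only in `expected` with "- " and only in `actual` with "+ ".
        changed = [d for d in difflib.ndiff(expected, actual) if d[0] in "+-"]
        notes.append("character changes: " + " ".join(changed))
    return notes


# Hypothetical values: a config key that looks identical at a glance.
for note in explain_difference("user_id", "userid"):
    print(note)
```

The same idea applies to error messages: print them with `repr()` before trusting what you think you read.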
6. Don't: Do too many things at a time.
❏ Don't change 2 things at a time.
❏ Code
❏ Vars / Config
❏ Tries
❏ One hypothesis at a time
❏ Otherwise how do you know what
did what?
❏ You need to be:
❏ Methodical
❏ Boring
❏ Slow
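The "one change at a time" rule can be made mechanical. The sketch below assumes the thing under investigation can be expressed as a config dictionary plus a `check` function that reports whether the problem is gone. The names `try_changes_in_isolation`, `baseline`, `candidates` and the config keys are all hypothetical.

```python
def try_changes_in_isolation(baseline: dict, changes: dict, check) -> dict:
    """Apply each candidate change alone on top of the baseline and record the result."""
    results = {}
    for key, new_value in changes.items():
        trial = dict(baseline)   # fresh copy, so changes never stack up
        trial[key] = new_value   # exactly one difference from the baseline
        results[key] = check(trial)
    return results


# Hypothetical example: which single setting makes the check pass?
baseline = {"timeout_s": 1, "retries": 0, "pool_size": 5}
candidates = {"timeout_s": 10, "retries": 3, "pool_size": 50}


def check(cfg: dict) -> bool:
    # Stand-in for "run the failing scenario and see if it works now".
    return cfg["timeout_s"] >= 5


print(try_changes_in_isolation(baseline, candidates, check))
```

Because every trial starts from the same baseline, a passing result points at exactly one change.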
7. How to Run an Investigation
Science 101
Have Theories
Write Down Facts
Have a Partner (Pair Programming)
Minimize Investigation Efforts
11. Have Theories
❏ It's like a guessing game: how did the murder happen?
❏ Think of the most likely thing it could be.
❏ There are classic offenders like:
❏ NPE (lack of validation/tests)
❏ Code Change
❏ Config Change
❏ Credentials Change
❏ Typo somewhere
❏ Missing Security Groups
❏ Not enough Resources (memory, space)
❏ OOM Killer
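Some of these usual suspects can be ruled in or out in seconds. As a small illustration, the sketch below checks free disk space with the standard library's `shutil.disk_usage`. The function name `disk_report` and the 10% threshold are assumptions chosen for the example, not a recommendation from the talk.

```python
import shutil


def disk_report(path: str = ".", min_free_ratio: float = 0.10) -> dict:
    """Report free space for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_ratio = usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "free_gb": round(usage.free / 1e9, 2),
        "free_ratio": round(free_ratio, 3),
        "low": free_ratio < min_free_ratio,  # assumed threshold, tune as needed
    }


print(disk_report("."))
```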
12. Write down Facts
❏ Write down (paper, txt file, evernote, google docs)
❏ Write down important facts like:
❏ Class Names, Methods, Variables
❏ Make sure you don't get lost.
❏ It's easy to forget important pieces of information
when you are worried and want to fix the issue fast.
21. Write your own tools
❏ They help nail down problems faster.
❏ Simple utilities.
❏ Programs or scripts (Groovy, Python, Rust, Go).
❏ Quick diagnostics.
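A quick diagnostic can be very small. The sketch below is one hypothetical example of such a tool: a TCP reachability check built on `socket.create_connection` from the standard library. The function name `can_connect` and the host and port in the usage lines are placeholders.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


# Placeholder target: replace with the dependency you suspect.
print("reachable" if can_connect("127.0.0.1", 5432, timeout_s=0.5) else "not reachable")
```

A check like this separates "the service is down or blocked" from "the service is up but misbehaving" before any deeper debugging.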
22. Could you compare with something else?
❏ It's very likely the issue is a code change, so it was probably working
before. Some digging is needed.
❏ Not always possible (e.g. a new feature)
❏ Use diff tools.
❏ Compare previous versions that worked.
❏ Run previous tests that worked.
❏ Debug and write down differences.
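When no dedicated diff tool is at hand, the standard library's `difflib.unified_diff` gives the same kind of output. The sketch below compares a version that worked with one that does not. The function name `diff_text` and the config contents are made up for illustration.

```python
import difflib


def diff_text(old: str, new: str, old_name: str = "working", new_name: str = "broken") -> str:
    """Return a unified diff between two text blobs (empty string if identical)."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return "\n".join(lines)


# Hypothetical config files: last known good vs. current.
working = "host=db1\nport=5432\nssl=true\n"
broken = "host=db1\nport=5433\nssl=true\n"
print(diff_text(working, broken))
```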
23. Closing the Case
Fix, Share & Improve
❏ Adding More Tests
❏ Post Mortems
❏ RCAs
❏ Retrospectives
24. Adding more Tests
❏ Boy Scout rule.
❏ Closing the Case means you found the issue.
❏ So we need to create a test that reproduces the issue.
❏ Having the test will prove we don't suffer from it again.
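A regression test of this kind could look like the sketch below. Everything in it is hypothetical: it assumes the bug was a crash in a function called `parse_port` when the value was missing, and it pins the fixed behaviour with `unittest`.

```python
import unittest


def parse_port(value, default: int = 8080) -> int:
    """Hypothetical fixed function: the original version crashed when value was None."""
    if value is None or str(value).strip() == "":
        return default
    return int(value)


class ParsePortRegressionTest(unittest.TestCase):
    def test_missing_value_falls_back_to_default(self):
        # Reproduces the input that triggered the original failure.
        self.assertEqual(parse_port(None), 8080)

    def test_valid_value_still_parses(self):
        # Guards against the fix breaking the normal path.
        self.assertEqual(parse_port(" 9090 "), 9090)
```

Saved in a file, this can be run with `python -m unittest <file>`. The first test should fail against the old code and pass against the fix.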
26. Closing the Case: RCA
❏ Similar to Post Mortems.
❏ Lean tool. Can be used with 5 whys, fishbone and other
systems thinking tools.
❏ Excel file:
❏ Issue
❏ Classification
❏ Why it happened
❏ How we make sure it does not happen again
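The four columns above map directly onto a CSV file, which Excel can open. The sketch below is one possible way to keep the RCA log as CSV using the standard `csv` module. The column identifiers and the sample row are assumptions made for the example.

```python
import csv
import io

# Column names mirror the four fields listed above (identifiers are illustrative).
RCA_COLUMNS = ["issue", "classification", "why_it_happened", "prevention"]


def rca_to_csv(rows: list) -> str:
    """Serialize RCA records (dicts keyed by RCA_COLUMNS) to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=RCA_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical record, for illustration only.
sample = [{
    "issue": "Checkout returned HTTP 500",
    "classification": "Config change",
    "why_it_happened": "Database port changed without updating the service",
    "prevention": "Add a startup connectivity check and a regression test",
}]
print(rca_to_csv(sample))
```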