12. Real world
It works ; it does not work ; it kinda works ; it maybe
works ; no one uses it ; it is broken ; some things
are broken ; it should work but it does not ; where
are my users? help me...
13. The Technical bias
By looking at technical service, we miss the
actual point
Are we serving our users correctly?
Just looking at the traffic light will not tell you
about the traffic jams.
18. CPU usage is no money
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/nox_noctis_silentium/3960497840
19. What are business metrics?
how you fullfil your customers' requests
quality and level of business service
20. Where are we?
Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/hernanpc/6259950189
21. What do we have?
metrics that tell us if business works
DB, Frontends, balancers, queing systems...
They don't come from the troublesome
component!
24. High Availability Nowadays
Multiple workers
Health exposed by the app
Load balancer balances to healthy nodes
Unhealty nodes are restarted automatically
25. When HA is not enough...
the processes are not "really" crashing
the component that has issues does not really
know about them (metrics available from DB ,
frontends, clients..)
there is no HA in place... (but still need 24x7
availability)
27. Alerting
alert is fired = someone to take action
Runbooks to follow, depending on the alert
knowledge is built, then runbook == (ansible)
playbook
28. An ideal world
Creative Commons Attribution 2.0 https://www.flickr.com/photos/athomeinscottsdale/3247600886/
29. Ideally...
Memory leaks are fixed (quickly)
Multi master, redundant, in service discovery
You build it, you run it
Full control over 3rd parties (and their bugs..)
30. Ironically
Developers often just payed for features
Ops not working closely with devs
No "bugfix money" for stuff that do not happen
really often
Code base is 20y old and "it will be
decommisioned soon"
41. Webhook that call Ansible
How/where to get the credentials? Where to
run Ansible?
Duration of the ansible run?
Which server to act upon?
Concurrent playbooks?
46. Recap
In an ideal world we do not need this. Bugs are
fixed, techno is up to date, infra and apps are
reduntant.
47. Achievement
Metrics from 1, 2, X sources generate alerts that
are triggering automated resolution within minutes
towards different systems.
Common incidents get solved more quickly than
with people intervention.
People are woken up less often for known issue
with clear runbook.
48. Safeties
Needs monitoring to be up
Needs the last ansible run to be green
Whitelist upfrond
Discards "old alerts"
No concurrent execution
Alerts someone if not resolved in time
49. Use Cases
Must happen infrequently
Must not be predictable
Must not do more harm
Must impact daily work or on call