If you are interested in monitoring and have successfully set up a system (whether home-grown or commercial off-the-shelf) for your own use, there comes a moment when you go from monitoring only the systems you care about to monitoring systems that other people care about. Monitoring for yourself is all about getting the best data for the least effort. Monitoring for others? That's when your job becomes a game of "what just happened?" whack-a-mole.
4. Hello!
▧ Working in IT 30+ years
▧ 20+ years in monitoring
○ CASE $CompanySize
■ <=100
■ >100 && <1000
■ >1000 && <5000
■ >250,000
▧ Currently “Head Geek” at SolarWinds
○ Head Geek <> Developer
○ Head Geek != Marketing
○ Head Geek ≠ Sales
○ “Head Geek” LIKE “%Advocate%”
○ Head Geek == STORYTELLER
Leon Adato
5. Where To Find Me
Twitter: @LeonAdato
THWACK.com: AdatoLe
Web: www.AdatoSystems.com
Podcast: TechnicallyReligious.com
7. The Four Questions of Alerting
Why didn’t I get an alert?
Why did I get this alert?
What will alert on my system?
What’s being monitored on my system?
adatole
@LeonAdato
8. The Jewish Roots of… Questions
▧ “Du fregst a gutte kashe”
▧ Nobel Laureate in Physics Dr. Isidor Rabi
9. “My mother made me a scientist without ever intending to. Every other mother in Brooklyn would ask her child: ‘So? Did you learn anything today?’ But not my mother. ‘Izzy,’ she would say, ‘did you ask a good question today?’ That difference, asking good questions, made me become a scientist.”
10. The Jewish Roots of… Questions
▧ “Du fregst a gutte kashe”
▧ Nobel Laureate in Physics Dr. Isidor Rabi
▧ THE Four Questions
מַה נִּשְׁתַּנָּה הַלַּיְלָה הַזֶּה מִכָּל הַלֵּילוֹת
Why is this night different from all other nights?
11. What’s the Teretz?*
▧ We need the same openness to questions
▧ Relish the experience of asking, of discovery
▧ We don’t work in tech because “I already know that”
▧ We work in tech because we love “I’ll find out”
*Teretz = answer
12. Question #1: Why did I get that alert?
“Your system is down.”
☺
CPU on the Windows device owned by Accounting named Mnth_Reporting (IP: 10.2.3.4, DNS: MonRep.MyCorp.Net) has been over 80% for more than 15 minutes. CPU at 2:16am EST is 96%.
Device details: http://blahblah.
Acknowledge this alert: http://ackme
This message brought to you by the alert: CPU_CRIT_PROD and the polling engine Poller7
13. What’s the Teretz?
▧ Name of the system
▧ Specific component or sub-element
▧ Current statistic or status
▧ Time the event occurred
▧ Time the alert was sent
▧ Custom fields like location, owner, etc.
▧ OS type and version
▧ IP address
▧ DNS name or Sysname
▧ The threshold
▧ The duration
▧ A link to the device or metric
▧ The name of the alert
▧ The polling engine
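The fields listed above can be gathered into one message template. What follows is only a minimal sketch: the `AlertContext` class, the `render_alert` function, the example date, and the details URL are all hypothetical names and values made up for illustration, not part of any monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AlertContext:
    """Hypothetical container for the fields an alert message should carry."""
    device_name: str
    component: str
    current_value: str
    event_time: datetime
    sent_time: datetime
    owner: str
    os_type: str
    ip_address: str
    dns_name: str
    threshold: str
    duration: str
    details_url: str
    alert_name: str
    polling_engine: str


def render_alert(ctx: AlertContext) -> str:
    """Build a message that answers "why did I get this alert?" up front."""
    return "\n".join([
        f"{ctx.component} on the {ctx.os_type} device owned by {ctx.owner} "
        f"named {ctx.device_name} (IP: {ctx.ip_address}, DNS: {ctx.dns_name}) "
        f"has been over {ctx.threshold} for more than {ctx.duration}.",
        f"{ctx.component} at {ctx.event_time:%Y-%m-%d %H:%M} is {ctx.current_value}.",
        f"Alert sent: {ctx.sent_time:%Y-%m-%d %H:%M}",
        f"Device details: {ctx.details_url}",
        f"This message brought to you by the alert: {ctx.alert_name} "
        f"and the polling engine {ctx.polling_engine}",
    ])


# Example values; the date and the URL are invented for this sketch.
example = AlertContext(
    device_name="Mnth_Reporting", component="CPU", current_value="96%",
    event_time=datetime(2024, 1, 1, 2, 16), sent_time=datetime(2024, 1, 1, 2, 17),
    owner="Accounting", os_type="Windows", ip_address="10.2.3.4",
    dns_name="MonRep.MyCorp.Net", threshold="80%", duration="15 minutes",
    details_url="https://monitoring.example/devices/mnth-reporting",
    alert_name="CPU_CRIT_PROD", polling_engine="Poller7",
)
message = render_alert(example)
print(message)
```

Keeping the alert name and polling engine in the last line means the recipient can hand the message back to the monitoring team without any extra digging.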
14. Jewish Roots:
My Father Was a Wandering Aramean
▧ The story
… before the story
…… before the story
▧ Context matters!
▧ History matters!
▧ Not only “why did I get this alert”
▧ But “why do these alerts exist at all?”
16. Question #2: Why DIDN’T I Get That Alert?
▧ It was designed like that
○ Alert windows
○ Problem duration
○ Ticket not reset
○ Mute/unmanage/shut-up
○ Parent-child
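The "designed like that" reasons above lend themselves to a small explainer that lists every rule that suppressed an alert. This is a rough sketch under stated assumptions: `AlertRule`, `DeviceState`, and `why_no_alert` are hypothetical names, and the alert window is assumed not to cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class AlertRule:
    """Hypothetical rule: when alerts may fire and how long a problem must last."""
    window_start: time
    window_end: time
    min_duration: timedelta


@dataclass
class DeviceState:
    """Hypothetical snapshot of what the monitoring system knows about a device."""
    problem_started: datetime
    muted: bool = False
    parent_down: bool = False
    open_ticket: bool = False


def why_no_alert(rule: AlertRule, state: DeviceState, now: datetime) -> list[str]:
    """Return every by-design reason the alert was held back (empty if none)."""
    reasons = []
    # Assumes the window does not wrap past midnight.
    if not (rule.window_start <= now.time() <= rule.window_end):
        reasons.append("outside the alert window")
    if now - state.problem_started < rule.min_duration:
        reasons.append("problem has not lasted long enough")
    if state.open_ticket:
        reasons.append("earlier ticket was never reset")
    if state.muted:
        reasons.append("device is muted or unmanaged")
    if state.parent_down:
        reasons.append("suppressed by parent-child dependency")
    return reasons


rule = AlertRule(time(8, 0), time(18, 0), timedelta(minutes=15))
state = DeviceState(problem_started=datetime(2024, 1, 1, 2, 10), muted=True)
print(why_no_alert(rule, state, now=datetime(2024, 1, 1, 2, 16)))
```

Returning all matching reasons, rather than the first one, helps when several exceptions overlap on the same device.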
17. Question #2: Why DIDN’T I Get That Alert?
▧ Change Un-Control
○ Credential changed
○ Network changed
○ Custom Property
○ Element removed
○ Physical to Virtual
18. Question #2: Why DIDN’T I Get That Alert?
▧ Monitoring Failed
○ Polling stopped
○ Agent stopped
○ Data throttled
○ DB is out of sync
○ New code/image missing tracing
○ Monitoring “supply chain” failed (email)
○ Event correlation rules
19. What’s the Teretz?
▧ Understand (and communicate) exceptions
▧ Save your receipts
▧ Save other people’s receipts too, if you can
▧ Monitor your monitoring
▧ Test your notification delivery infrastructure
▧ Have validation steps ready
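One simple way to "monitor your monitoring" is to check how recently each polling engine reported in. The sketch below assumes you can obtain a last-heartbeat timestamp per poller from somewhere; the `stale_pollers` function, the ten-minute default, and the name `Poller3` are hypothetical.

```python
from datetime import datetime, timedelta


def stale_pollers(last_seen: dict[str, datetime], now: datetime,
                  max_age: timedelta = timedelta(minutes=10)) -> list[str]:
    """Return the names of pollers whose last heartbeat is older than max_age."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_age)


# Illustrative heartbeat times; in practice these would come from your own system.
heartbeats = {
    "Poller7": datetime(2024, 1, 1, 11, 58),
    "Poller3": datetime(2024, 1, 1, 11, 30),
}
print(stale_pollers(heartbeats, now=datetime(2024, 1, 1, 12, 0)))
```

The same pattern could be applied to agents or to a test message sent through the notification path, as long as each one leaves a timestamp behind.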
23. What’s the Teretz?
▧ One size fits… some?
▧ Skillcheck: SQL
▧ Skillcheck: Wireshark
▧ Look at the screens
24. Jewish Roots:
Burning Hail and Black Swans
▧ Why do we remember the plagues?
○ Visceral, unexpected, unique
▧ Let’s talk about “black swans”
▧ The plagues as black swan events
26. Question #4: What COULD alert for my systems?
▧ What *IS* an alert?
○ Emergency
○ Interruption
○ Unplanned Work
▧ What does alerting NEED to be
○ Timely
○ Meaningful
○ Actionable
27. Question #4: What COULD alert for my systems?
▧ Why does this matter?
○ # of systems
○ # of alerts that can trigger for those systems
○ # of staff hours to address those alerts
○ # of alerts that could trigger simultaneously
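The four counts above combine into a rough workload estimate. The arithmetic below is a back-of-the-envelope sketch: both function names and every number in the example are invented for illustration, and real figures will vary by environment.

```python
import math


def worst_case_alert_hours(systems: int, alerts_per_system: int,
                           hours_per_alert: float) -> float:
    """Staff hours needed if every possible alert fired once."""
    return systems * alerts_per_system * hours_per_alert


def staff_needed(simultaneous_alerts: int, hours_per_alert: float,
                 hours_available_per_person: float) -> int:
    """People required to work a burst of simultaneous alerts within one shift."""
    return math.ceil(simultaneous_alerts * hours_per_alert / hours_available_per_person)


# Illustrative inputs only.
print(worst_case_alert_hours(systems=200, alerts_per_system=5, hours_per_alert=0.5))
print(staff_needed(simultaneous_alerts=40, hours_per_alert=0.5,
                   hours_available_per_person=8))
```

Even a crude estimate like this makes the conversation about potential impact concrete.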
28. What’s the Teretz?
▧ This can be a VERY difficult question to answer
▧ But its difficulty is in proportion to its importance
▧ Speaks to potential impact to the company, workload, interruptions.
30. Jewish Roots:
Are You Ready For the Hard Questions?
▧ Scholar, Skeptic, Simple, & Silent
▧ Meet each user where they are
▧ Let’s talk about the Skeptic (“the wicked son”)
▧ Listen past the snark for the question
33. Jewish Roots:
Four or Five cups?
▧ Symbolism of wine as joy
▧ We need to remember to pause for joyful moments
▧ Despite rigorous Talmudic analysis, there are still questions without clear answers.
▧ BUT… that doesn’t mean we disengage.
▧ We return to these questions over and over, try new approaches.
A lot like IT problems.
34. OK, So That Fifth Question:
What Do You Monitor “Standard”?
▧ When you load up a box into monitoring, what do consumers automatically get?
▧ If you can’t describe this, how will anyone know what to ask for “extra”?
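One way to make the "standard" describable is to write it down as data. The sketch below is purely illustrative: the `STANDARD_MONITORING` table, its device types and check names, and the `extras_requested` helper are hypothetical and not taken from any particular tool.

```python
# Hypothetical baseline: what every device of a given type gets automatically.
STANDARD_MONITORING = {
    "windows": ["availability", "cpu", "memory", "disk"],
    "linux": ["availability", "cpu", "memory", "disk"],
    "network": ["availability", "interface_status", "interface_traffic"],
}


def extras_requested(device_type: str, requested: list[str]) -> list[str]:
    """Return the requested checks that go beyond the standard baseline."""
    standard = set(STANDARD_MONITORING.get(device_type, []))
    return sorted(set(requested) - standard)


print(extras_requested("windows", ["cpu", "iis_app_pool", "disk"]))
```

With the baseline written down, a request for "extra" monitoring becomes a simple difference between what was asked for and what is already provided.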
35. The Mostly Un-Necessary Summary
If you are prepared for the 4 (OK, 5) questions:
▧ Your monitoring will be (better) prepared for the stresses it will be exposed to.
▧ You will be (better) prepared as an advocate for monitoring
▧ You’ll spend less time answering repetitive questions and more time doing the work of a monitoring engineer (i.e., the GOOD stuff!)
36. If you still have questions…