You can't spell "monitoring" without "monoid"

I’M KEVIN
I WORK AT NEW RELIC
I LIKE MATH
I’m Kevin Scaldeferri. I work for New Relic as a Principal Engineer and distributed systems architect and I’m sort of a math geek, which leads to writing talk titles like

YOU CAN’T SPELL
MONITORING
WITHOUT
MONOID Kevin Scaldeferri
New Relic
Before jumping into the math, ﬁrst some motivation

I DON’T REALLY LIKE
“METRIC TIME SERIES”
I have a confession to make: I don’t really like metric time series. “But they’re so simple”

EASY IS NOT THE SAME AS SIMPLE
Rich Hickey
No, they are easy, not simple. Doing the easy thing often intertwines multiple concepts in way that complicate thinking about them.

STORY TIME
Let me tell you a Story. We’re having an incident, people are looking for the root cause.

“Aha! CPU on this DB is going up and to the right. Page the database team!”

DB Team: “nope, that’s normal, that metric’s not a gauge, it’s an accumulative counter, it’s always up and to the right”.

More unhelpful charts of accumulative counters. Why are all these instances different? Is there a real difference or were they just restarted at different times?

PDX -> SEA
10 ✈ @ 40 min
1000 🚙 @ 3.5 hr
How long does it take to get from portland to seattle on average?

PDX -> SEA
10 ✈ @ 40 min
1000 🚙 @ 3.5 hr
Avg = (10*40 + 1000*210) / 1010
= 208 min
Is this right? Not really.

PDX -> SEA
10 ✈ @ 40 min <— 120 people
1000 🚙 @ 3.5 hr <— 1 person
Avg = (1200*40 + 1000*210) / 2200
= 117 min
You need to weight the average correctly.

But what’s that got to do with metrics?

AVERAGE RESPONSE TIME
Host A: 10ms
Host B: 12ms
Host C: 80ms
What the average response time of this app, where one host is slow for some reason.

Host A: 10ms
Host B: 12ms
Host C: 80ms
Average: 34ms?

Host A: 10ms
Host B: 12ms
Host C: 80ms
Average: 34ms? NO!

Host A: 10ms
Host B: 12ms
Host C: 80ms <— LB sends less
Average: 34ms? NO!
Don’t average averages.

PERCENTILES
Everyone knows percentiles are better than averages anyway.

“MyResource.post-requests”: {
"p50": 0.001,
"p75": 0.002,
"p95": 0.006,
"p98": 0.007,
"p99": 0.008,
"p999": 0.018,
}
p99 per host is easy to come by, but my SL[IOA] is the p99 for the app overall.

Gil Tene
shameless appeal to authority

UNIQUE COUNTS
Businesses really care about unique counts. How many unique users are coming to the site? How many unique users have tried a new feature?

UNIQUE USERS
10 18 20 19 17 15 12
Weekly Unique Users?
But we’re in trouble if we have daily unique counts and try to get a weekly value.

UNIQUE USERS
10 18 20 19 17 15 12
20 ≤ Weekly Unique Users ≤ 111
Could be anywhere from 20 to 111, which isn’t very satisfying to your business owner.

WELL THIS IS SORT OF
DEPRESSING

A MONOID IS AN ALGEBRAIC STRUCTURE
WITH A SINGLE ASSOCIATIVE BINARY
OPERATION AND AN IDENTITY ELEMENT.
Wikipedia
What is a monoid? … what’s that mean?

“ALGEBRAIC STRUCTURE”
=
DATA TYPE

“ASSOCIATIVE BINARY
OPERATION”
=
SOMETHING LIKE ADDITION

“IDENTITY ELEMENT”
=
SOMETHING LIKE ZERO

interface Monoid<T> {
// (x + y) + z = x + (y + z)
add(x:T, y:T) : T
// 0 + x = x = x + 0
zero() : T
}
As an interface deﬁnition. But it’s not just addition. For example, multiplication or string concatenation satisfy these rules.

HOW DOES THIS HELP?
How does this simple concept help ﬁx the problems with our easy approach?

TEMPORAL AND
DIMENSIONAL AGGREGATION

AGGREGATION
1 2 3 4 5 6 7 8 9 10 11 12
host
A
10 14 15 19 17 15 12 11 12 15 14 17
host
B
9 13 12 15 16 17 16 14 9 11 12 15
host
C
10 15 13 16 13 19 15 16 13 13 12 14
host
D
10 13 13 17 14 20 13 15 12 12 13 15
10 second resolution is great for tactical debugging.

AGGREGATION
1-4 5-8 9-12
host
A
58 55 58
host
B
49 63 47
host
C
54 61 52
host
D
53 62 52
but for long term analysis it’s too expensive and we want time roll-ups.

AGGREGATION
1 2 3 4 5 6 7 8 9 10 11 12
host
A
10 14 15 19 17 15 12 11 12 15 14 17
host
B
9 13 12 15 16 17 16 14 9 11 12 15
host
C
10 15 13 16 13 19 15 16 13 13 12 14
host
D
10 13 13 17 14 20 13 15 12 12 13 15
Similarly we want all those high-cardinality dimensions to track down problems and answer ad-hoc question.

AGGREGATION
1 2 3 4 5 6 7 8 9 10 11 12
all
hosts 39 55 53 67 60 71 56 56 46 51 51 62
But you also need to measure SLIs. And a year from now you won’t care about that container ID.

ACCUMULATIVE
COUNTERS
replace accumulative counters with

12 REQUESTS TO THIS ENDPOINT WERE
RECEIVED BY THIS HOST DURING THIS
TIME INTERVAL
Useful Monitoring
Some of our sources of telemetry insist on giving us accumulators, but as quickly as possible we need to convert them to something like this.

(AND YOU SHOULD SUM THEM)
Useful Monitoring
And that measurement needs to tell us how to combine multiple data points.

A MONOID IS
BOTH THE DATA
AND THE OPERATION
There’s more than one monoid on longs and doubles, and we need to be clear about what’s sensible to do with a particular metric.

MIN / MAX
GAUGES
Don’t sum or average a max or min. Take the max of all your maxes and the min of all your mins.

THE MAX MEMORY USED BY THIS HOST
DURING THIS TIME INTERVAL WAS
1.2GB; AND AGGREGATE USING MAX
Useful Monitoring
This should be explicit, not something you have to extract from the metric name.

Host A: 10ms
Host B: 12ms
Host C: 80ms
Average: ???
How do we do this right?

Host A: 10s / 1000 reqs = 10ms avg
Host B: 10.8s / 900 reqs = 12ms avg
Host C: 9.6s / 120 reqs = 80ms avg
Average: ???
Break it into two sum monoids for the total time of requests and the total number of requests.

Host A: 10s / 1000 reqs = 10ms avg
Host B: 10.8s / 900 reqs = 12ms avg
Host C: 9.6s / 120 reqs = 80ms avg
Avg: 30.4s / 2020 reqs = 15ms avg
Now we can aggregate correctly and get the right answer.

The Prometheus histogram Bryan showed yesterday is a more complicated example where you have to know exactly how to combine all those individual lines together.
We can do better. Structured logs, why not structured metrics?

APPROXIMATION
WITH RIGOR
monoids tell us how to design approximate algorithms which are still mathematically sound

UNIQUE COUNTS
Let’s revisit our unique count example.

UNIQUE USERS
10 18 20 19 17 15 12
Weekly Unique Users?
We know that the unique counts for each day aren’t suﬃcient to let us calculate the unique users for the week, so what should we do? This is not at all obvious.

HYPERLOGLOG: THE ANALYSIS OF A
NEAR-OPTIMAL CARDINALITY
ESTIMATION ALGORITHM
Flajolet, et al
Lots of research, but at this point everyone pretty much agrees HyperLogLog is the way to go.

UNIQUE USERS - HYPERLOGLOG
1000110
0111101
01…
1010011
0010110
001…
1110010
0101101
00…
0001110
0001101
01…
1010010
0100101
00…
1110110
0001101
11…
1100100
0111010
111…
Weekly Unique Users = 25
Takes about 700 bytes so you don’t want to track a ton of these, but reasonable for high-value business metrics.

PERCENTILES
What about percentiles? The good news is that there’s lots of ways to approximate percentiles monoidally. But the bad news is also that there’s lots of ways to
approximate percentiles monoidally.

RE-AGGREGATABLE PERCENTILES
▸MomentSketch
▸Q-Digest
▸T-Digest
▸GK-Array
▸HDRHistogram
▸Spectator histogram
▸CLWY “Random”
Algorithm
▸DDSketch
This is not a complete list, this is just some of the most well known approaches. These all make tradeoﬀs in a multi-dimensional space of speed, size, and accuracy and
this is still an active area of research. Unlike unique counts, we don’t have a consensus about what approach all our monitoring tools should use. Hard to compare
across data from multiple systems.

Gratuitous dog photo in case you were getting overwhelmed by math about now.

METRIC TIME SERIES
▸ Can be misleading / surprising
▸ Accumulative Counters: please stop!
▸ Easy to do mathematical nonsense
▸ Accurate aggregation often impossible
Metric time series have been the easy and dominant paradigm for monitoring data over the last decade or so, but they present challenges in today’s environment.

MONOIDS
▸ Data that tells us what math makes sense
▸ Collect high-resolution, high-cardinality data
▸ Aggregate after the fact as needed
▸ Composable
▸ Guides the design of approximate algorithms
Monoids provide a simple framework which allows us to build mathematically sound monitoring systems.

CHALLENGES
▸ Self-describing data that includes how to aggregate
▸ Composite data types
▸ Universal support for HyperLogLogs
▸ Consensus on quantile estimation
If we’re adding units and descriptions for humans to our metrics (a la Open Census), why not richer type annotations?

Quantiles are hard, maybe Open Telemetry should tackle this.

THANK YOU
KEVIN SCALDEFERRI
@KSCALDEF

You can't spell "monitoring" without "monoid"

Recomendados

Recomendados

Más contenido relacionado

Similar a You can't spell "monitoring" without "monoid"

Similar a You can't spell "monitoring" without "monoid" (20)

Último

Último (20)

You can't spell "monitoring" without "monoid"