15. Aggregates are lossy compression
We must decide in advance how we’ll use the
metric
16. Aggregates
▪ Max, Min, Sum, Average, etc
▪ Last, random point
▪ Percentiles (quantiles)
▪ Histograms, reverse quantiles
▪ Each is suitable for a particular use case
18. Percentiles
p99 - The sampled value that is larger than 99% of the other samples
▪ O(n) memory complexity
▪ O(n*log n) computation complexity
▪ Some shortcuts for p50 (median), p100 (max), p0 (min)
Use when clients experience individual values
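A minimal sketch of why percentiles cost O(n) memory and O(n*log n) time: the straightforward approach keeps every sample and sorts them. It uses the nearest-rank definition, which is one of several common percentile definitions; the function name and sample data are illustrative assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(samples)  # keeps all n samples, sorts in O(n log n)
    if p == 0:
        return ordered[0]  # shortcut: p0 is the minimum
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies: 1 ms, 2 ms, ..., 100 ms
latencies_ms = list(range(1, 101))
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
p100 = percentile(latencies_ms, 100)  # same as max()
```

The shortcuts from the slide apply here too: p0 and p100 need only a running min and max, so they do not require storing the samples at all.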
20. Histograms
Visualizes the distribution of the samples
▪ Count of events in each bin
▪ Bins are usually evenly spaced
▪ Use logarithmically spaced bins for
long tails
▪ Additive
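The two properties above, logarithmic bins and additivity, can be sketched as follows. The bin scheme (powers of two) and the function names are assumptions chosen for illustration, not a schema prescribed by any particular tool.

```python
import math
from collections import Counter


def bin_index(value):
    """Hypothetical log-spaced scheme: bin k covers [2**k, 2**(k+1))."""
    if value <= 0:
        raise ValueError("log-spaced bins need positive values")
    return math.floor(math.log2(value))


def histogram(samples):
    """Count of events in each bin."""
    return Counter(bin_index(v) for v in samples)


def merge(a, b):
    """Histograms with the same bin schema are additive:
    merging is just summing the per-bin counts."""
    return a + b


# Hypothetical latencies (ms) from two hosts, each with a long-tail value
host_a = histogram([1, 2, 3, 900])
host_b = histogram([5, 6, 1000])
cluster = merge(host_a, host_b)
```

Merging the per-host histograms gives the same result as building one histogram over all samples, which is what makes histograms safe to aggregate across hosts and time windows, provided every source uses the same bin schema.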
21. Histograms :-(
So why aren’t we all using this?
▪ Storage
▪ Have to decide on bins schema
▪ Not many tools support this
22. Choosing the right aggregate
▪ Percentiles/histograms for latency
▪ Max/min for latency and sizes
▪ Histogram analysis for sizes and latency
▪ Sums/averages for capacity and money
▪ Aggregate per domain
▪ Look for deviations
23. Resolution
▪ Humans need ~5 data points to see a trend
▪ Hides faster changes
▪ Rollups/downsampling are hard
▪ Multi tier FTW!
24. “It ain’t what you don’t know that gets
you into trouble.
It’s what you know for sure that just
ain’t so.”
25. Peak Erasure/Spike erosion
■ When lowering resolution, data points are
aggregated
■ Default aggregation is average
■ Peaks are erased
■ This can happen in storage or visualization
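A small sketch of peak erasure, assuming a hypothetical CPU series with a single spike. The `downsample` helper is illustrative; real storage engines and dashboards each have their own rollup configuration.

```python
from statistics import mean


def downsample(points, factor, aggregate):
    """Replace every `factor` consecutive points with one aggregated point."""
    return [
        aggregate(points[i:i + factor])
        for i in range(0, len(points), factor)
    ]


# Hypothetical CPU % samples with one short spike to 100
cpu = [10, 10, 10, 10, 10, 100, 10, 10, 10, 10, 10, 10]

by_mean = downsample(cpu, 6, mean)  # [25, 10]: the spike is flattened to 25
by_max = downsample(cpu, 6, max)    # [100, 10]: the spike survives
```

With averaging, the 100% spike shows up as 25%; with max it is preserved. Which aggregate is right depends on what the metric is used for.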
26. Peak Erasure/Spike erosion
■ Storage backends downsample to save space
■ Aggregation function may be configurable
■ Metric collectors aggregate too
○ carbon-cache uses last value
○ StatsD - gauges, timers, counters
27. Counters vs Gauges
Behaviour in low res time window
■ Low res sampling erases fast changes
■ “Round numbers” syndrome
■ Counters smear changes, but don’t erase them
TLDR: use counters when possible
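To illustrate "smear but don't erase", here is a sketch under assumed data: a per-second request rate with a one-second burst, read once every 5 seconds, first as a gauge and then as a cumulative counter.

```python
from itertools import accumulate

# Hypothetical true request rate, one value per second, with a burst at t=3
requests_per_second = [5, 5, 5, 500, 5, 5, 5, 5, 5, 5]
interval = 5  # we only read the metric every 5 seconds

# Gauge: the instantaneous value at each read. The burst falls between reads.
gauge_samples = requests_per_second[interval - 1::interval]

# Counter: a monotonically increasing total, read at the same moments.
counter = list(accumulate(requests_per_second))
counter_samples = counter[interval - 1::interval]

# Differences between consecutive reads recover how much happened in between.
deltas = [
    later - earlier
    for earlier, later in zip([0] + counter_samples, counter_samples)
]
rates = [d / interval for d in deltas]
```

The gauge reads `[5, 5]` and never sees the burst. The counter deltas are `[520, 25]`: the burst is spread over its 5-second window as an average rate of 104 per second, but the total of 545 requests is fully accounted for.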
28. Mixed modes
Aggregating multiple modes reduces usability of aggregates
■ Different transaction types differ in latencies/sizes
■ Errors, successes have very different latencies/sizes
■ Makes your graphs weird
TLDR: use separate metrics for different things
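A sketch of the mixed-modes problem with made-up numbers, assuming successes take around 100 ms and errors fail fast at around 2 ms. The metric labels are hypothetical.

```python
from statistics import mean

# Hypothetical latencies in ms, recorded separately per outcome
latency_ms = {
    "success": [100, 110, 90, 105, 95],
    "error": [2, 3, 2, 1, 2],
}

# One mixed metric: everything in a single series
combined = latency_ms["success"] + latency_ms["error"]
mixed_mean = mean(combined)

# Separate metrics: one aggregate per mode
per_mode_mean = {label: mean(values) for label, values in latency_ms.items()}
```

The mixed mean is 51 ms, a latency that no request actually experienced, and it moves whenever the error ratio changes even if neither mode got slower. The per-mode means (100 ms and 2 ms) each describe something real.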
30. Visualization
■ Timeframe
■ No more than 3 series
■ Be wary of multiple Y scales, but scale if needed
■ Only related series on the same graph
■ Never mix X scales
■ Visual references: bounds, Y min/max values, legend
31. Metric design
■ Choose your aggregates wisely
■ Decide on a proper resolution, sampling rate, aggregation time
windows
■ Explore the distribution
■ Separate known modes to independent metrics
32. Separate signal from noise
■ Use low-pass filters to smooth
■ Trend changes
■ Timeshifts
■ Filter out outliers
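One simple low-pass filter is a trailing moving average. The sketch below is one possible implementation, not the only smoothing option (exponential smoothing is a common alternative); the function name and window size are assumptions.

```python
def moving_average(points, window):
    """Trailing moving average. The first window-1 outputs average
    over however many points are available so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(points):
        running += value
        if i >= window:
            running -= points[i - window]  # drop the point leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Hypothetical noisy series
noisy = [10, 12, 8, 11, 9, 10, 30, 10, 11, 9]
smooth = moving_average(noisy, 3)
```

Note the trade-off with the peak-erasure slides: the same averaging that removes noise also flattens real spikes, so smooth for trend reading and keep the raw or max series for spike detection.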
33. Working with clusters
■ Most-deviant/outliers
■ Max/Min
■ Sum (capacity)
■ Pre-aggregate percentiles
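A sketch of the first three cluster aggregates, assuming a hypothetical per-host latency reading. "Most deviant" is taken here to mean the host farthest from the cluster median, which is one reasonable interpretation among several.

```python
from statistics import median

# Hypothetical p99 latency (ms) reported by each host in a cluster
per_host = {"web1": 50, "web2": 52, "web3": 49, "web4": 120}

center = median(per_host.values())

# Most deviant: the host whose value is farthest from the cluster median
most_deviant = max(per_host, key=lambda host: abs(per_host[host] - center))

cluster_max = max(per_host.values())
cluster_min = min(per_host.values())
cluster_sum = sum(per_host.values())  # meaningful for capacity-style metrics
```

Here `most_deviant` is `"web4"`. Graphing only the most deviant hosts keeps a cluster dashboard within a few series while still surfacing the one machine that is misbehaving.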