4. Stephen Falken: Uh, uh, General, what you see on these screens
up here is a fantasy; a computer-enhanced hallucination. Those
blips are not real missiles. They're phantoms. (War Games, 1983)
9. “Alert if a request takes longer than 200 ms”
10,10,10,10,10,10,10,10,10,5000
Alerts on one outlier in 10
Threshold Alerting
10. “Alert if request average over one minute
is longer than 200 ms”
avg(10,10,210,210,210,210) = 143 (860/6)
Does not alert on multiple high samples
Threshold Alerting
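To make the two failure modes above concrete, here is a minimal Python sketch (mine, not from the talk). The 200 ms threshold and the sample sets come from slides 9 and 10; the function names are hypothetical.

```python
THRESHOLD_MS = 200

def alert_on_any_sample(samples):
    """Slide 9 style: fires on a single outlier."""
    return any(s > THRESHOLD_MS for s in samples)

def alert_on_average(samples):
    """Slide 10 style: misses sustained slowness when a few fast samples drag the mean down."""
    return sum(samples) / len(samples) > THRESHOLD_MS

outlier_minute  = [10] * 9 + [5000]             # one timeout among otherwise healthy requests
degraded_minute = [10, 10, 210, 210, 210, 210]  # most requests are slow

print(alert_on_any_sample(outlier_minute))   # True  -> false positive, wakes you up
print(alert_on_average(degraded_minute))     # False -> misses a real problem (mean ~143 ms)
```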
11. ‘average’ eq ‘arithmetic mean’
A=S/N
A = average
N = the number of samples
S = the sum of the samples in the set
Math Refresher
12. median = midpoint of data set
The 50th percentile is 555 - q(0.5)
Value      111  222  333  444  555  666  777  888  999
Sample #     1    2    3    4    5    6    7    8    9
Math Refresher
13. 90th percentile - 90% of samples below it
The 90th percentile is 1,000 - q(0.9)
Value      111  222  333  444  555  666  777  888  999  1,000  1,111
Sample #     1    2    3    4    5    6    7    8    9     10     11
Math Refresher
14. 100th Percentile - the maximum value
The 100th percentile is 1,111 - q(1)
Value      111  222  333  444  555  666  777  888  999  1,000  1,111
Sample #     1    2    3    4    5    6    7    8    9     10     11
Math Refresher
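The refresher values above can be checked with NumPy's quantile function (a sketch of my own; NumPy's default linear-interpolation method happens to land exactly on these samples):

```python
import numpy as np

nine   = [111, 222, 333, 444, 555, 666, 777, 888, 999]
eleven = nine + [1000, 1111]

print(np.quantile(nine, 0.5))    # 555.0  -> median of 9 samples, q(0.5)  (slide 12)
print(np.quantile(eleven, 0.9))  # 1000.0 -> 90th percentile of 11 samples, q(0.9)  (slide 13)
print(np.quantile(eleven, 1.0))  # 1111.0 -> maximum, q(1)  (slide 14)
```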
22. Request latency
“We keep hearing from people that the
website is slow. But it is fine when we test it,
and the request latency graph is constant”
You are only looking at part of the picture.
25. Practical Percentiles
Bandwidth usage is often billed at 95th percentile usage
Record data usage in 5-minute intervals
Sort the samples by value
Throw out the highest 5% of samples
Charge based on the remaining top sample,
e.g. 300 MB transferred over 5 minutes = 1 MB/s rate billing
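A rough sketch of that billing scheme (my own code; the month of interval data and the helper name are made up for illustration):

```python
def billable_rate_mb_per_s(mb_per_interval, interval_seconds=300):
    """Sort 5-minute usage samples, drop the top 5%, bill at the highest remaining sample."""
    ordered = sorted(mb_per_interval)
    keep = int(len(ordered) * 0.95)      # number of samples kept after dropping the top 5%
    billable_mb = ordered[keep - 1]      # the highest remaining sample
    return billable_mb / interval_seconds

# One 30-day month of 5-minute intervals: mostly 300 MB, plus a few 1,500 MB bursts.
month = [300] * 8500 + [1500] * 140
print(billable_rate_mb_per_s(month))     # 1.0 MB/s -> the bursts above the 95th percentile are free
```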
26. Practical Percentiles
If I measure the 95th percentile per 5-minute interval all month long,
I CANNOT calculate the 95th percentile over the month from those values.
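Here is a small, made-up demonstration of why: two windows with very different traffic volumes. Combining the per-window 95th percentiles (for example by averaging them) gives a number that has nothing to do with the 95th percentile of the pooled raw samples.

```python
import numpy as np

busy_window  = [10.0] * 1000   # 1,000 requests, all 10 ms
quiet_window = [500.0] * 10    # 10 requests, all 500 ms

per_window_p95 = [np.percentile(busy_window, 95), np.percentile(quiet_window, 95)]
print(np.mean(per_window_p95))                        # 255.0 ms  (wrong)
print(np.percentile(busy_window + quiet_window, 95))  # 10.0 ms   (actual 95th percentile of the raw data)
```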
29. “Alert me if request latency 90th percentile
over one minute exceeds our SLA”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,10,10,10,5000] == 10
Alert IS NOT triggered
Do you want to be woken up for this? NO!
30. “Alert me if request latency 90th percentile
over one minute exceeds our SLA”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,250,300] = ~270
Alert IS triggered
Do you want to be woken up for this? YES!
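A minimal sketch of that alert rule (my own code, using a nearest-rank quantile; an interpolating estimator gives roughly 265-270 ms for the second sample set, but the alert decision is the same). The 200 ms SLA value is carried over from the earlier threshold examples.

```python
import math

SLA_MS = 200

def q(samples, p):
    """Nearest-rank quantile: the smallest sample with at least a fraction p of the data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

def p90_alert(latencies_ms):
    """Alert only when the one-minute 90th percentile breaches the SLA."""
    return q(latencies_ms, 0.9) > SLA_MS

print(p90_alert([10] * 9 + [5000]))                    # False -> lone outlier, stay asleep
print(p90_alert([10, 10, 10, 10, 10, 10, 250, 300]))   # True  -> sustained slowness, wake up
```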
32. Who’s using this approach?
Google.com - in house monitoring systems
Circonus.com - hosted histogram monitoring
You? (I’ve written my own histograms but use
Circonus for production systems)
33. Questions?
Thanks to Circonus for tools and help with math
http://www.circonus.com/free-account/
Look for future monitoring talks here soon
http://meetup.com/monitorSF
Editor's notes
Who here has been on an on-call rotation? Who here has been woken up by monitoring false positives? What is the ratio of real alerts to false positives in your monitoring system? How many real alerts does your monitoring system fail to identify? These are common questions that we’ve all had to answer, whether we use Ganglia, Nagios, Zabbix, Graphite, DataDog, Website Pulse, Pingdom, Circonus, or something else. Waking up in the middle of the night is fine as long as something has actually gone wrong with our production system. Waking up in the middle of the night because something has gone wrong with our monitoring system is not OK.
What’s a synthetic? A synthetic is basically a bot check against your system. One of the benefits (perhaps the only benefit) of the synthetic is that it’s generally more highly available than the application you are monitoring. Some synthetics are as simple as doing a ping against your website and alerting you if it doesn’t get a response. More complicated synthetics are capable of logging into your website, following links, and determining if a result matches a value that you have preset. If a part of that doesn’t succeed, you get an alert. You can configure synthetics to access your site from multiple geographically distributed locations.
This is a time series graph of response times from synthetic login checks against a website. The results are remarkably consistent for the most part, as they should be.
It gives you the viewpoint of one user: a computer somewhere dispatches a request over the same network route to your server. It records several metrics about how your application responds: time to establish the SSL connection, time to first byte served, and so on.
The responses from synthetic requests don’t tell you anything meaningful about how actual users experience your application. They give you the experience of one bot located at what is likely a low-latency, high-bandwidth connection, making the same request over and over. Those metrics are not only useless (unless anyone here runs a service just for one user… in that case, kudos), they lie to you. These are LIES. They falsely represent the health of your application. It’s a binary: is the service up, or is the service down? That’s all you get.
Real users are different from synthetics. Your user base will likely have a distribution of ages, genders, devices, and network connections. Mike Brady has a 16-core Mac Pro at his office with a gigabit business line that has 5 ms latency to the nearest POP. Cindi has an iPad mini with several hundred apps over a 2-megabit 3G connection, since last month she used 20 gigs of bandwidth and had to be metered. Carol uses an iMac in the family room; the 2.4 GHz AirPort is in the living room, and every time the microwave in the kitchen comes on, she gets 25% packet loss. Greg has a Nexus 6P on the Google Fi network, so his connection speeds vary depending on whether he’s routing through Sprint or T-Mobile. Your real users see a different user experience than a bot hosted at a data center.
This is a view of average request time for real users. The synthetic check used an external user agent, but you can use collection tools like statsd or log analysis to record request times for real users. This is better than only using a synthetic check, but this technique still has a number of shortcomings. The first shortcoming is that collection data is averaged over an interval (generally 10 seconds to a minute). Different monitoring providers offer different resolutions, but the end result is similar when you are examining averages.
So if Cindi, Bobby, and Mike are all shopping at your website at the same time, you only see the average of their request times over a given interval. Mike might be having a great experience because his office network is a gigabit, but with Bobby on 10 meg and Cindi on 3G, the average lands near Bobby’s numbers, so an average view only ever shows you Bobby’s view of the website user experience.
The second shortcoming of a time series graph of average values is spike erosion, a side effect of downsampling. When a wide time range is rendered, samples are averaged into large buckets and spikes get flattened; as you zoom in, the data is averaged over intervals closer to the actual collection interval and the spikes come back. As you can see on this graph, when we zoom in to a 2-hour view of the graph we just looked at, the maximum value we see is now 2,000 milliseconds instead of 500 milliseconds: four times what the wide view showed.
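A hedged illustration of spike erosion (my own numbers, not the graph from the talk): the same series averaged into wider buckets loses most of the spike's height.

```python
import numpy as np

latency_ms = np.full(720, 100.0)   # two hours of 10-second samples
latency_ms[400] = 2000.0           # one 2,000 ms spike

def downsample(series, bucket):
    """Average consecutive samples into buckets of the given size (what wide graph views do)."""
    return series.reshape(-1, bucket).mean(axis=1)

print(latency_ms.max())                   # 2000.0 -> visible at full resolution
print(downsample(latency_ms, 30).max())   # ~163   -> the spike has eroded in a 5-minute view
```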
If you alert based on values you get from the graphs I’ve shown, you must choose a metric threshold to configure when you are alerted. An example of this would be to set 200 milliseconds as a request time threshold that would alert you if average request time exceeded that threshold.
What threshold do you choose? Spike erosion makes this very difficult. If you look at a wide time range to try to find a reasonable threshold, the spike erosion will hide potentially valuable data. If you zoom in on a narrow time range, you could miss high values that occur at certain regular intervals. As you’ve seen, avoiding false positives is impossible with threshold-based alerting. Anyone who has tried this can testify to playing whack-a-mole trying to set threshold alerting values.
If we alert on a single value, we’ll alert on outliers, but those aren’t that useful. Every system has outliers; they are part of normal operations. Here we can see that one request timed out at 5,000 milliseconds. That doesn’t mean there’s something wrong with the service. You can’t get rid of outliers, and in most cases you don’t really want to be alerted by them. If you have multiple values at 5,000 milliseconds, that’s not an outlier but a pattern indicating there’s probably something wrong with your service.
Let’s try alerting on averages over a time period, say one minute. 200 ms is too slow for our users, so we alert if one minute request average exceeds that.
In this example though, 66% of the sample population is over 200 milliseconds - something is clearly wrong. But the other two samples are normal enough that the average comes out to be under our alerting threshold. Whoops - using this threshold alert here causes us to not be alerted when we want to be.
Let’s go through a quick math refresher. First let’s look at averages, which are also known as arithmetic means. To calculate the average, we take the sum of samples and divide by the number of samples. We do this for a set time period, which in monitoring systems often varies between 10 seconds and 5 minutes.
The median is the midpoint of a data set. Here we see that the median of this data set is 555. There are four samples greater than 555, and four samples less than it. This is also known as the 50th percentile, and can be represented by q(0.5). q(0.5) is showing the 50th percentile in quantile notation. In this example, the 0th quantile is the first element, 111.
The 90th percentile is the sample where 90% of the samples are less than it. In this example, 90% of the samples are below 1,000. I used 11 samples here to explicitly demonstrate this point. If we had 9 samples here instead, the 90th percentile would be a weighted average between the 8th and 9th samples. I’ll skip the details of those exact calculations and leave them to more qualified statistics explanations.
The maximum value of the sample set is the 100th percentile, or q(1). There are also inverse percentiles (inverse quantiles), which go the other way: given a value, they tell you what fraction of the samples fall at or below it.
Let’s talk about histograms. A histogram is one of the seven basic tools of quality. The Y axis indicates the number of samples, and the X axis indicates the sample value. One use of a histogram that you may have seen is plotting human height versus the number of people who are that tall.
Human height follows what is called a normal distribution (also known as a Gaussian distribution). The majority of the population tends to group around one value, and the counts taper off at the high and low sample values. With a perfect normal distribution, the arithmetic mean (the average) and the median are one and the same.
The mode is also equal to the median. You’ve most likely heard the term standard deviation before. With a normal distribution, 68% of the values lie within one standard deviation on either side of the mean, 95% within 2 standard deviations, and 99.7% within 3 sigma. The smaller the standard deviation, the closer the data is to the mean; the larger one sigma is, the farther the data is from the mean. It is important to note that these metrics only make sense for a normal distribution, where there is a single mode. You’ve probably heard about six sigma in manufacturing processes. That’s six standard deviations: 99.99966% of samples will be within that range.
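As a quick empirical check of the 68/95/99.7 rule (my own example, using synthetic height data), you can draw from a normal distribution and count how many samples fall within one, two, and three standard deviations:

```python
import numpy as np

rng = np.random.default_rng(42)
heights = rng.normal(loc=170, scale=10, size=100_000)   # mean 170 cm, sigma 10 cm

mu, sigma = heights.mean(), heights.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(heights - mu) <= k * sigma)  # fraction of samples within k sigma
    print(f"within {k} sigma: {within:.3f}")             # ~0.683, ~0.954, ~0.997
```

For the skewed and multi-modal distributions on the next slides, these fractions no longer hold, which is the point.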
This is a non-normal distribution. In this example, there are large numbers of samples grouped at the highest and lowest sample values. Because there are two distinct peaks, this is called a bimodal distribution (or, more generally, a multi-modal distribution). In a multimodal distribution like this, standard deviation and multi-sigma values are useless; they don’t mean anything. Remember the percentiles and quantiles discussed earlier? Those are how you describe the distribution of values in a non-normal distribution.
This is another non-normal distribution. As you can see, it only has one mode, and is a skewed distribution. Standard deviation has little to no meaning here, nor do multiple sigmas. You’ll see histograms like this from manmade as well as natural phenomena. For example, distribution of elements in the universe. Lots of hydrogen - a lot less heavy metals.
Here is a histogram of web page request time. The higher the bar, the more users are affected. This is a highly skewed distribution: notice the spike at ~150 milliseconds and the long tail past it. There’s another, smaller spike at ~25 ms, so this is arguably a bimodal distribution as well.
In terms of website performance, people will generally get angry if request times take longer than 250 milliseconds. So what we see here is a bunch of users who are getting acceptable response times, and a long tail of pissed off users.
People on the left side are having a great experience; people on the right side are leaving the site. Note that this is for a single time slice, say 5 minutes. What does this look like if we integrate over time?
How many people here have been in this situation? You get support tickets for your application coming in saying that it’s slow. You try it, and it works fine. You have QA run their automations against it. Again, performance is acceptable, even great. What’s going on? Works for me, right?
The problem here is that like the previous slide, you’re the user on the left side of the red line. The users who are complaining are on the right side of the red line. They’re hitting codepaths that your synthetics don’t, and that most of your real users don’t. Often these can be large important customers who write big checks. They have thousands of users under one account, and when they use your application, it’s slooooow.
Heat maps are visual representations of histograms over time windows. They give you a visualization of how the data distribution changes over time.
With heat maps, you can add percentile overlays to show the 50th, 95th, or any other percentile of the distribution for each time slice.
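Conceptually, a heat map is just one latency histogram per time window stacked into a matrix, with percentile lines computed per window and drawn on top. A small sketch of my own, using synthetic latency data:

```python
import numpy as np

rng = np.random.default_rng(1)
windows = [rng.lognormal(mean=4, sigma=0.5, size=1000) for _ in range(60)]  # 60 one-minute windows
bins = np.linspace(0, 500, 51)                                              # latency buckets in ms

heatmap = np.array([np.histogram(w, bins=bins)[0] for w in windows])  # shape (60, 50): time x bucket counts
p95_overlay = [np.percentile(w, 95) for w in windows]                 # the percentile line drawn on top
```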
A percentile is a barrier: for the 95th percentile, 95% of the samples are to its left and the remaining 5% to its right. There is a caveat when the barrier falls in the middle of the data points. If you measure to the right and include the barrier, you get the samples that are >= the 95th-percentile value of the whole data set; if you measure to the left of the barrier, you get the samples that are <= it. If you have only two samples, any value between those two samples is a valid median. Samples that sit exactly on the barrier are counted twice when you divide the data set into two sets.
A few bespoke things you probably didn’t know about histograms (for the purpose of our examples, we’ll avoid these edge cases): if you see a histogram where the ⅓ quantile and the ⅔ quantile are equal in value, the two parts they define add up to more than 100% of the samples. A histogram of a single value is one example (everything is measured twice). Compare the sample sets 1,2 and 1,2,3: with two samples, any value between them is a median; with three, the median is exactly the middle sample.
Percentiles cannot be averaged. You have to calculate them from the raw usage data. There are several monitoring solutions out there that will let you average percentiles - this is flat out WRONG
What’s your SLA? If you set your SLA at a 95th percentile of 250 ms, and you meet that SLA, you’re still pissing off 5% of your users. They’re going to your competitor. Let’s try to calculate how many users you are screwing.
Take the number of requests outside your 95th percentile (the 5% inverse quantile), and integrate that over time to get a cumulative count of users that you’ve screwed. Multiply that by the dollar value of each lost request, and that’s how much money you’re losing.
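A back-of-the-envelope version of that calculation (my own sketch; the request counts, the 250 ms SLA carried over from the previous note, and the dollar value per lost request are all made up):

```python
SLA_MS = 250
VALUE_PER_LOST_REQUEST = 0.10   # dollars per lost request, hypothetical

def lost_dollars(latency_windows_ms):
    """Count requests slower than the SLA across every window and price them."""
    lost = sum(sum(1 for r in window if r > SLA_MS) for window in latency_windows_ms)
    return lost * VALUE_PER_LOST_REQUEST

hour = [[120, 180, 240, 300, 900], [110, 130, 260, 270, 5000]]   # two 5-minute windows of request latencies
print(lost_dollars(hour))   # 0.5 -> five requests over 250 ms at $0.10 each
```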
Here we set an alert if the 90th percentile of request latency over one minute exceeds our SLA. This means we only get woken up when the SLA is actually being violated.
Circonus.com allows you to set percentile based alerts, so that you’ll be alerted if users start getting pissed off. Here is a percentile based alert - you can expand that to alert based on number of users pissed off per hour. Or even translate that to a dollar value using CAQL (circonus analytics query language). So you can say ‘alert me if we are losing more than $500 worth of users per hour’. This is something you’ll never be able to do with threshold based alerting. Thus, you can set a limit that is essentially normalized to traffic loads, say holiday sale surges.