Better service monitoring through histograms
Fred Moyer - @phredmoyer
San Francisco Perl Mongers, 07-26-2016
Talk given to San Francisco Perl Mongers about service monitoring with histograms

Published in: Software

  1. Better service monitoring through histograms. Fred Moyer - @phredmoyer. San Francisco Perl Mongers, 07-26-2016
  2. Welcome: Systems break while we sleep. How often are you woken up for false alarms?
  3. Synthetics: Easy to set up, but not a real user
  4. Synthetics: Stephen Falken: "Uh, uh, General, what you see on these screens up here is a fantasy; a computer-enhanced hallucination. Those blips are not real missiles. They're phantoms." (WarGames, 1983)
  5. Real Users: These are your users, right?
  6. Real Users: Real data
  7. Spike Erosion: 500 ms is really 2,000 ms
  8. Threshold Alerting: What threshold do you choose?
  9. Threshold Alerting: “Alert me if requests take longer than 200 ms”. Samples: 10,10,10,10,10,10,10,10,10,5000. Alerts on one outlier in 10.
  10. Threshold Alerting: “Alert if request average over one minute is longer than 200 ms”. avg(10,10,210,210,210,210) = 143 (860/6). Does not alert on multiple high samples.
  11. Math Refresher: ‘average’ eq ‘arithmetic mean’. A = S/N, where A = average, N = the number of terms, S = the sum of the numbers in the set.
  12. Math Refresher: median = midpoint of data set. The 50th percentile, q(0.5), is 555. Values (samples 1 to 9): 111 222 333 444 555 666 777 888 999
  13. Math Refresher: 90th percentile - 90% of samples below it. The 90th percentile, q(0.9), is 1,000. Values (samples 1 to 11): 111 222 333 444 555 666 777 888 999 1,000 1,111
  14. Math Refresher: 100th percentile - the maximum value. The 100th percentile, q(1), is 1,111. Values (samples 1 to 11): 111 222 333 444 555 666 777 888 999 1,000 1,111
  15. Histogram (axes: sample value, number of samples)
  16. Normal Distribution (axes: sample value, number of samples)
  17. Normal Distribution: 34% of samples within one sigma (σ) on either side of the mean, so about 68% within ±1σ (axes: sample value, number of samples)
  18. Non-Normal Distribution (axes: sample value, number of samples)
  19. Non-Normal Distribution (axes: sample value, number of samples)
  20. Non-Normal Distribution: Operations data groups at different points
  21. Non-Normal Distribution: Users to the right of the red line are gone
  22. Request latency: “We keep hearing from people that the website is slow. But it is fine when we test it, and the request latency graph is constant.” You are only looking at part of the picture.
  23. Heat Map: Histograms over time windows
  24. Percentiles
  25. Practical Percentiles: Bandwidth usage is often billed at 95th percentile usage. Record 5 minute data usage intervals; sort samples by value; throw out the highest 5% of samples; charge usage based on the remaining top sample, e.g. 300 MB transferred over 5 minutes = 1 MB/s rate billing.
  26. Practical Percentiles: If I measure 95th percentile per 5 minutes all month long, I CANNOT calculate 95th percentile over the month.
  27. Angry users: How many users are you pissing off?
  28. Angry users
  29. Percentile based alerting: “Alert me if request latency 90th percentile over one minute is exceeded”. q(0.9)[10,10,10,10,10,10,10,10,5000] == 10. Alert IS NOT triggered. Do you want to be woken up for this? NO!
  30. Percentile based alerting: “Alert me if request latency 90th percentile over one minute is exceeded”. q(0.9)[10,10,10,10,10,10,250,300] = ~270. Alert IS triggered. Do you want to be woken up for this? YES!
  31. Percentile based alerting
  32. Who’s using this approach? Google.com, Circonus.com, You?
  33. Questions? Thanks to Circonus.com for the tools and help with the math. http://www.circonus.com/free-account/
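
The three alerting rules from the slides (any single sample, one-minute average, one-minute 90th percentile) can be compared on the slides' own sample windows. What follows is a minimal sketch, not the speaker's or Circonus's implementation. The 200 ms threshold comes from the threshold-alerting slides and is assumed to apply to the percentile rule as well. The function names are hypothetical, and `quantile` uses a simple lower-rank definition (the sorted sample at index floor(q * (n - 1))), which is an assumption. Other quantile definitions interpolate between samples and give different values; for the last window this definition gives 250 rather than the slide's ~270, which is above the threshold either way.

```python
import math

THRESHOLD_MS = 200  # assumed: the 200 ms threshold used in the slides' example rules


def quantile(samples, q):
    """Lower-rank quantile: the sorted sample at index floor(q * (n - 1))."""
    ordered = sorted(samples)
    return ordered[math.floor(q * (len(ordered) - 1))]


def any_sample_alert(samples):
    """Rule 1: alert if any single request exceeds the threshold."""
    return any(s > THRESHOLD_MS for s in samples)


def mean_alert(samples):
    """Rule 2: alert if the arithmetic mean of the window exceeds the threshold."""
    return sum(samples) / len(samples) > THRESHOLD_MS


def p90_alert(samples):
    """Rule 3: alert if the 90th percentile of the window exceeds the threshold."""
    return quantile(samples, 0.9) > THRESHOLD_MS


# One-minute windows of request latencies in ms, copied from the slides.
windows = {
    "one outlier in ten": [10] * 9 + [5000],
    "four slow requests": [10, 10, 210, 210, 210, 210],
    "one outlier in nine": [10] * 8 + [5000],
    "two slow requests": [10] * 6 + [250, 300],
}

for name, samples in windows.items():
    print(name, any_sample_alert(samples), mean_alert(samples), p90_alert(samples))
```

On these windows the single-sample rule fires every time, the average rule fires only when one extreme outlier drags the mean up, and the percentile rule fires only when a meaningful share of requests is slow.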

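The point that per-window percentiles cannot be combined into a percentile for a longer period, while histograms over time windows can, is easy to check numerically. The sketch below is illustrative only: the fixed 10 ms bin width, the sample data, and the helper names are all assumptions, and the talk does not show how Circonus actually stores its histograms.

```python
import math
from collections import Counter

BIN_MS = 10  # hypothetical fixed bin width; real systems may bin differently


def to_histogram(samples):
    """Count samples per bin; each bin is keyed by its lower edge in ms."""
    return Counter((s // BIN_MS) * BIN_MS for s in samples)


def histogram_quantile(hist, q):
    """Lower edge of the bin that holds the sample at rank floor(q * (n - 1))."""
    target = math.floor(q * (sum(hist.values()) - 1))
    seen = 0
    for edge in sorted(hist):
        seen += hist[edge]
        if seen > target:
            return edge
    raise ValueError("empty histogram")


# Two made-up time windows: a busy, fast one and a nearly idle, slow one.
busy_window = to_histogram([10] * 1000)
idle_window = to_histogram([10] * 5 + [900] * 5)

# Averaging the per-window percentiles ignores how many requests each window saw.
averaged_p95 = (histogram_quantile(busy_window, 0.95)
                + histogram_quantile(idle_window, 0.95)) / 2

# Histograms merge by adding bin counts, so the combined percentile is recoverable.
merged_p95 = histogram_quantile(busy_window + idle_window, 0.95)

print(averaged_p95, merged_p95)
```

Here the averaged figure is 455 ms, while the 95th percentile over all 1,010 requests is 10 ms, because only 5 of them were slow. Keeping the bin counts for each window preserves enough information to answer the question for any longer period.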