Better service monitoring
through histograms
Fred Moyer - @phredmoyer
Silicon Valley Perl, 09-01-2016
Who likes to wake up for false positives?
Synthetics
Easy to set up, but
not a real user
Stephen Falken: Uh, uh, General, what you see on these screens
up here is a fantasy; a computer-enhanced hallucination. Those
blips are not real missiles. They're phantoms. (War Games, 1983)
Real Users
Real Users
500 ms is really 2,000 ms
Spike Erosion
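Spike erosion can be reproduced in a few lines. This is a minimal sketch with made-up latencies: the `downsample` helper and the sample values are assumptions for illustration, not the data behind the graph.

```python
from statistics import mean

def downsample(samples, bucket_size):
    """Average consecutive samples in fixed-size buckets, as a zoomed-out graph does."""
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]

# Hypothetical per-interval latencies in ms: mostly fast, with one 2,000 ms spike.
raw = [100, 100, 100, 2000, 100, 100, 100, 100]

# Each plotted point now averages four raw samples.
zoomed_out = downsample(raw, 4)

print(max(raw))         # 2000
print(max(zoomed_out))  # 575
```

The spike is still in the data, but the wider the averaging bucket, the smaller it looks on the graph.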
Threshold Based Alerting
“Alert if a request takes longer than 200 ms”
10,10,10,10,10,10,10,10,10,5000
Alerts on one outlier in 10
Threshold Alerting
“Alert if request average over one minute
is longer than 200 ms”
avg(10,10,210,210,210,210) ≈ 143 (860/6)
Does not alert on multiple high samples
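The two threshold rules can be checked against the sample sets from the slides. This is a small sketch; the variable names are illustrative, and the 200 ms threshold is the one quoted in the alert rules.

```python
from statistics import mean

THRESHOLD_MS = 200

# Rule 1: alert if any single request is slower than the threshold.
with_outlier = [10, 10, 10, 10, 10, 10, 10, 10, 10, 5000]
single_value_alert = any(s > THRESHOLD_MS for s in with_outlier)

# Rule 2: alert if the one-minute average is slower than the threshold.
mostly_slow = [10, 10, 210, 210, 210, 210]
average = mean(mostly_slow)                       # 860 / 6, about 143.3
average_alert = average > THRESHOLD_MS

# Share of requests that were actually too slow in the second set.
slow_share = sum(s > THRESHOLD_MS for s in mostly_slow) / len(mostly_slow)

print(single_value_alert)  # True: fires on one outlier in ten
print(average_alert)       # False: silent although most requests are slow
```

The first rule pages on a single outlier, while the second stays quiet even though four of six requests exceeded 200 ms.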
Threshold Alerting
‘average’ eq ‘arithmetic mean’
A=S/N
A = average
N = the number of samples
S = the sum of the samples in the set
Math Refresher
median = midpoint of data set
The 50th percentile is 555 - q(0.5)
Sample #  1    2    3    4    5    6    7    8    9
Value     111  222  333  444  555  666  777  888  999
Math Refresher
90th percentile - 90% of samples below it
The 90th percentile is 1,000 - q(0.9)
Sample #  1    2    3    4    5    6    7    8    9    10     11
Value     111  222  333  444  555  666  777  888  999  1,000  1,111
Math Refresher
100th percentile - the maximum value
The 100th percentile is 1,111 - q(1)
Sample #  1    2    3    4    5    6    7    8    9    10     11
Value     111  222  333  444  555  666  777  888  999  1,000  1,111
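The three percentile slides can be reproduced with a short function. This sketch assumes the nearest-rank definition of a percentile; the slides do not say which estimator they use, and interpolating estimators can return slightly different values.

```python
import math

def q(p, samples):
    """Nearest-rank percentile: the smallest sample with at least a fraction p of the data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered)))  # 1-based rank into the sorted samples
    return ordered[rank - 1]

nine = [111, 222, 333, 444, 555, 666, 777, 888, 999]
eleven = nine + [1000, 1111]

print(q(0.5, nine))    # 555
print(q(0.9, eleven))  # 1000
print(q(1.0, eleven))  # 1111
```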
Math Refresher
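[Figure: Histogram. x-axis: sample value; y-axis: number of samples]
[Figure: Normal Distribution. x-axis: sample value; y-axis: number of samples]
[Figure: Normal Distribution, 68% of samples within one sigma (σ). x-axis: sample value; y-axis: number of samples]
[Figure: Non-Normal Distribution. x-axis: sample value; y-axis: number of samples]
[Figure: Non-Normal Distribution. x-axis: sample value; y-axis: number of samples]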
Non-Normal Distribution
Operations data groups at different points
Non-Normal Distribution
Users to the right of the red line are gone
Request latency
“We keep hearing from people that the
website is slow. But it is fine when we test it,
and the request latency graph is constant”
You are only looking at part of the picture.
Heat Map
Histograms over time windows
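A heat map is one histogram per time window, stacked side by side. The sketch below shows one possible way to build that grid; the `heat_map` function, the 60 second window, the 100 ms bin width, and the events are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def heat_map(events, window_s=60, bin_ms=100):
    """Group (timestamp_s, latency_ms) events into one latency histogram per time window."""
    windows = defaultdict(Counter)
    for timestamp_s, latency_ms in events:
        window_start = (timestamp_s // window_s) * window_s  # column of the heat map
        bin_start = (latency_ms // bin_ms) * bin_ms          # row of the heat map
        windows[window_start][bin_start] += 1                # cell intensity
    return windows

# Hypothetical requests: (seconds since start, latency in ms).
events = [(5, 40), (20, 80), (42, 950), (65, 60), (70, 130), (110, 120)]
grid = heat_map(events)

for window_start in sorted(grid):
    print(window_start, dict(sorted(grid[window_start].items())))
# 0 {0: 2, 900: 1}
# 60 {0: 1, 100: 2}
```

Each cell count would be drawn as a color intensity, so a slow cluster such as the 900 ms bin stays visible instead of being averaged away.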
Percentiles
Practical Percentiles
Bandwidth usage is often billed at 95th percentile usage
Record 5 minute data usage intervals
Sort samples by value of sample
Throw out the highest 5% of samples
Charge usage based on the remaining top sample, e.g. 300 MB transferred over 5 minutes = 1 MB/s rate billing
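The billing steps above can be sketched as follows. The function name and the usage figures are hypothetical; real providers may round the 5% cut differently.

```python
def billable_rate_mb_per_s(interval_mb, interval_s=300):
    """95th percentile billing: drop the top 5% of intervals, bill on the highest remaining one."""
    ordered = sorted(interval_mb)              # sort samples by value
    discard = len(ordered) * 5 // 100          # number of top samples thrown out
    billable_mb = ordered[len(ordered) - discard - 1]
    return billable_mb / interval_s

# Hypothetical usage: 20 five-minute intervals in MB, one of them a large burst.
usage_mb = [120] * 18 + [300, 900]

print(billable_rate_mb_per_s(usage_mb))  # 1.0
```

The 900 MB burst falls in the discarded top 5%, so the bill is based on the 300 MB interval: 300 MB over 300 seconds is 1 MB/s.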
Practical Percentiles
If I measure the 95th percentile per 5 minutes all month long,
I CANNOT calculate the 95th percentile over the month.
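A small example shows why stored percentiles cannot be combined, while histograms can. This is a sketch with made-up data and a nearest-rank percentile, both of which are assumptions; averaging is used here as one tempting but incorrect way to combine per-window percentiles.

```python
import math
from collections import Counter

def q(p, samples):
    """Nearest-rank percentile over raw samples."""
    ordered = sorted(samples)
    return ordered[max(1, math.ceil(p * len(ordered))) - 1]

# Two hypothetical windows with very different traffic volumes.
quiet = [10] * 20                # 20 requests, all fast
busy = [10] * 50 + [500] * 50    # 100 requests, half of them slow

per_window = [q(0.95, quiet), q(0.95, busy)]
averaged = sum(per_window) / len(per_window)   # combines percentiles, ignores volume
true_p95 = q(0.95, quiet + busy)               # needs all the raw samples

# Histograms merge by adding bucket counts, so the percentile survives aggregation.
merged = Counter(quiet) + Counter(busy)
from_histogram = q(0.95, list(merged.elements()))

print(per_window)      # [10, 500]
print(averaged)        # 255.0
print(true_p95)        # 500
print(from_histogram)  # 500
```

The averaged figure of 255 ms matches no request that was ever served, while the merged histogram gives the same answer as the raw data.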
Angry users
How many users are you pissing off?
Angry users
“Alert me if the 90th percentile of request latency
over one minute exceeds the threshold”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,10,10,10,5000] == 10
Alert IS NOT triggered
Do you want to be woken up for this? NO!
“Alert me if the 90th percentile of request latency
over one minute exceeds the threshold”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,250,300] = ~270
Alert IS triggered
Do you want to be woken up for this? YES!
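The two percentile alerting cases can be sketched as below. The assumptions are: a nearest-rank percentile, a 200 ms threshold borrowed from the earlier threshold slides, and for the first case the ten-sample set from those slides (nine 10 ms requests and one 5,000 ms outlier). With nearest-rank the second set yields 300 rather than the slide's ~270, which presumably comes from an interpolating estimator; both are above the threshold.

```python
import math

THRESHOLD_MS = 200  # assumed threshold, taken from the earlier threshold slides

def q(p, samples):
    """Nearest-rank percentile over raw samples."""
    ordered = sorted(samples)
    return ordered[max(1, math.ceil(p * len(ordered))) - 1]

def should_alert(samples, p=0.9, threshold_ms=THRESHOLD_MS):
    """Alert only when the p-th percentile of the window exceeds the threshold."""
    return q(p, samples) > threshold_ms

one_outlier = [10] * 9 + [5000]                      # a single slow request
several_slow = [10, 10, 10, 10, 10, 10, 250, 300]    # a pattern of slow requests

print(q(0.9, one_outlier), should_alert(one_outlier))    # 10 False
print(q(0.9, several_slow), should_alert(several_slow))  # 300 True
```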
Percentile based alerting
Who’s using this approach?
Google.com - in-house monitoring systems
Circonus.com - hosted histogram monitoring
You? (I’ve written my own histograms but use
Circonus for production systems)
Questions?
Thanks to Circonus for tools and help with math
http://www.circonus.com/free-account/
Look for future monitoring talks here soon
http://meetup.com/monitorSF


Editor's notes

  1. Who here has been on an on-call rotation? Who here has been woken up by a monitoring false positive? What is the ratio of real alerts to false positives in your monitoring system? How many real alerts does your monitoring system fail to identify? These are common questions that we’ve all had to answer, whether we use Ganglia, Nagios, Zabbix, Graphite, DataDog, Website Pulse, Pingdom, Circonus, or something else. Waking up in the middle of the night is fine as long as something has gone wrong with our production system. Waking up in the middle of the night when something has gone wrong with our monitoring system is not OK.
  2. What’s a synthetic? A synthetic is basically a bot check against your system. One of the benefits (perhaps the only benefit) of the synthetic is that it’s generally more highly available than the application you are monitoring. Some synthetics are as simple as doing a ping against your website and alerting you if it doesn’t get a response. More complicated synthetics are capable of logging into your website, following links, and determining if a result matches a value that you have preset. If a part of that doesn’t succeed, you get an alert. You can configure synthetics to access your site from multiple geographically distributed locations.
  3. This is a time series graph of response times from synthetic login checks against a website. The results are remarkably consistent for the most part, as they should be. It gives you the viewpoint of one user - a computer somewhere dispatches a request over the same network route to your server. It records several metrics about how your application responds: time to start the SSL connection, time to the first byte served, etc. The responses from synthetic requests don’t tell you anything meaningful about how actual users experience your application. They give you the experience of one bot located on what is likely a low latency, high bandwidth connection, making the same request over and over. Those metrics are not only useless (unless anyone here runs a service just for one user… in that case, kudos), they lie to you. These are LIES. They falsely represent the health of your application. It’s a binary - is the service up, or is the service down? That’s all you get.
  4. Real users are different from synthetics. Your user base will likely have a distribution of ages, genders, devices, and network connections. Mike Brady has a 16 core Mac Pro at his office with a gigabit business line that has a 5 ms latency to the nearest POP. Cindi has an iPad mini with several hundred apps on a 2 megabit 3G connection, since last month she used 20 gigs of bandwidth and had to be metered. Carol uses an iMac in the family room; the 2.4 GHz AirPort is in the living room, and every time the microwave in the kitchen comes on, she gets 25% packet loss. Greg has a Nexus 6P on the Google Fi network, so his connection speeds vary depending on whether he’s routing through Sprint or T-Mobile. Your real users see a different user experience than a bot hosted at a data center.
  5. This is a view of average request time for real users. The synthetic check used an external user agent, but you can use collection tools like statsd or log analysis to record request times for real users. This is better than only using a synthetic check, but this technique still has a number of shortcomings. The first shortcoming is that collection data is averaged over an interval (generally 10 seconds to a minute). Different monitoring providers offer different resolutions, but the end result is similar when you are examining averages. So if Cindi, Bobby, and Mike are all shopping at your website at the same time, you only see the average of their request times over a given interval. Mike might be having a great experience because his office network is 1 gigabit, while Bobby is on 10 megabit and Cindi is on 3G. If you look at an average view, you’ll only see Bobby’s view of the website user experience.
  6. The second shortcoming of a time series average value graph is spike erosion, a side effect of downsampling. Spike erosion is what you uncover when you zoom in on specific areas of a time series graph: zoomed out, each point is an average over a long interval, which flattens the peaks. As you zoom in, the data is averaged over intervals closer to the actual collection intervals. As you can see on this graph, when we zoom into a 2 hour view of the graph we just looked at, the maximum value we see now is 2,000 milliseconds instead of 500 milliseconds. That’s four times as high, a 300% increase!
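A minimal sketch of spike erosion, assuming a made-up series of latencies with a single spike. The `downsample` helper and the sample values are illustrative assumptions, not taken from any monitoring product; the point is only that averaging over wider windows shrinks the peak you see.

```python
from statistics import mean

def downsample(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]

# Hypothetical latencies in ms: steady 100 ms with one 2,000 ms spike.
latencies_ms = [100] * 60
latencies_ms[30] = 2000

zoomed_in = downsample(latencies_ms, 1)    # one graph point per sample
zoomed_out = downsample(latencies_ms, 20)  # one graph point per 20 samples

print("max when zoomed in: ", max(zoomed_in))
print("max when zoomed out:", max(zoomed_out))
```

Zoomed out, the spike is averaged together with 19 normal samples, so the highest point on the graph is far below the real maximum.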
  7. If you alert based on values you get from the graphs I’ve shown, you must choose a metric threshold that determines when you are alerted. An example of this would be to set 200 milliseconds as a request time threshold, so that you are alerted if average request time exceeds it. What threshold do you choose? Spike erosion makes this very difficult. If you look at a wide time range to try to find a reasonable threshold, the spike erosion will hide potentially valuable data. If you zoom in on a narrow time range, you could miss high values that occur at certain regular intervals. As you’ve seen, avoiding false positives is impossible with threshold-based alerting. Anyone who has tried this can testify to playing whack-a-mole with threshold alerting values.
  8. If we alert on a single value, we’ll alert on outliers, but those aren’t that useful. Every system has outliers; these are part of normal operations. Here we can see that one request timed out at 5,000 milliseconds. That doesn’t mean there’s something wrong with the service. You can’t get rid of outliers, and in most cases you don’t really want to be alerted by them. If you have multiple values here that are 5,000 milliseconds, now that’s not an outlier, but a pattern indicating there’s probably something wrong with your service.
  9. Let’s try alerting on averages over a time period, say one minute. 200 ms is too slow for our users, so we alert if the one minute request average exceeds that. In this example though, 66% of the sample population is over 200 milliseconds - something is clearly wrong. But the other two samples are low enough that the average comes out under our alerting threshold. Whoops - using this threshold alert here causes us to not be alerted when we want to be.
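The two failure modes can be sketched with the sample values from the slides. The function names and the `THRESHOLD_MS` constant are illustrative assumptions; this is a toy comparison, not how any particular alerting system is implemented.

```python
from statistics import mean

THRESHOLD_MS = 200  # "too slow" limit used in the example

def alert_on_any(samples_ms):
    """Fire if any single request exceeds the threshold."""
    return any(s > THRESHOLD_MS for s in samples_ms)

def alert_on_average(samples_ms):
    """Fire if the arithmetic mean exceeds the threshold."""
    return mean(samples_ms) > THRESHOLD_MS

def fraction_over(samples_ms):
    """Share of requests slower than the threshold."""
    return sum(1 for s in samples_ms if s > THRESHOLD_MS) / len(samples_ms)

one_outlier = [10] * 9 + [5000]             # one timeout among ten requests
mostly_slow = [10, 10, 210, 210, 210, 210]  # four of six requests are slow

print(alert_on_any(one_outlier), fraction_over(one_outlier))
print(alert_on_average(mostly_slow), fraction_over(mostly_slow))
```

The single-value rule fires on one outlier in ten, while the average rule stays silent even though most requests are over the limit.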
  10. Let’s go through a quick math refresher. First let’s look at averages, which are also known as arithmetic means. To calculate the average, we take the sum of samples and divide by the number of samples. We do this for a set time period, which in monitoring systems often varies between 10 seconds and 5 minutes.
  11. The median is the midpoint of a data set. Here we see that the median of this data set is 555. There are four samples greater than 555, and four samples less than it. This is also known as the 50th percentile, and can be represented by q(0.5), which is the 50th percentile in quantile notation. In this example, the 0th quantile is the first element, 111.
  12. The 90th percentile is the sample where 90% of the samples are less than it. In this example, 90% of the samples are below 1,000. I used 11 samples here to explicitly demonstrate this point. If we had 9 samples here instead, the 90th percentile would be a weighted average between the 8th and 9th samples. I’ll skip the details of those exact calculations and leave it to more qualified statistics explanations.
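For readers who do want the calculation, here is a minimal sketch using one common convention, linear interpolation between neighboring ranks. The `quantile` function is an illustrative assumption (other quantile definitions exist and give slightly different answers); the sample values are the ones on the slides.

```python
import math

def quantile(samples, q):
    """q-quantile (0 <= q <= 1) using linear interpolation between ranks."""
    ordered = sorted(samples)
    position = q * (len(ordered) - 1)  # fractional 0-based rank
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower          # how far we are toward the upper rank
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

samples_9 = [111 * i for i in range(1, 10)]  # 111, 222, ..., 999
samples_11 = samples_9 + [1000, 1111]

print(quantile(samples_9, 0.5))   # median of the 9-sample set
print(quantile(samples_11, 0.9))  # 90th percentile of the 11-sample set
print(quantile(samples_11, 1.0))  # 100th percentile, the maximum
print(quantile(samples_9, 0.9))   # falls between the 8th and 9th samples
```

With 9 samples, the 90th percentile lands at fractional rank 7.2, so the result is a weighted average of the 8th and 9th samples (888 and 999).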
  13. The maximum value of the sample set is the 100th percentile, or q(1). There are also inverse percen
  14. Let’s talk about histograms. A histogram is one of the seven basic tools of quality. The Y axis indicates the number of samples, while the X axis indicates the sample value. One use of a histogram that you may have seen is plotting human height vs the number of people who are that tall.
  15. Human height follows what is called a normal distribution (also known as a Gaussian distribution). The majority of the population tends to group around one value, and the counts taper off toward the high and low sample values. With a perfect normal distribution, the arithmetic mean (the average) and the median are one and the same.
  16. The mode is also equal to the median. You’ve most likely heard the term standard deviation before. With a normal distribution, 68% of the values lie within one standard deviation on either side of the mean, 95% within two standard deviations, and 99.7% within three sigma. The smaller the standard deviation, the closer the data is to the mean. The larger one sigma is, the farther the data is from the mean. It is important to note that these metrics only make sense for a normal distribution, where there is a single mode. You’ve probably heard about six sigma in manufacturing processes. That’s six standard deviations - by the Six Sigma convention, which allows for a 1.5 sigma shift in the process mean, 99.99966% of samples will be within spec.
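These coverage figures can be checked with Python's standard library. This is a small sketch assuming a standard normal distribution (mean 0, sigma 1); the `within` helper is an illustrative name. The 99.99966% figure quoted for six sigma matches a one-sided bound at 4.5 sigma, which is how the Six Sigma convention is usually explained.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {within(k):.4%}")

# One-sided coverage at 4.5 sigma (6 sigma minus the conventional 1.5 sigma shift).
print(f"below 4.5 sigma: {standard_normal.cdf(4.5):.5%}")
```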
  17. This is a non-normal distribution. In this example, there are large numbers of samples grouped at the highest and lowest sample values. Because there are two distinct peaks, this is called a bimodal distribution (a kind of multimodal distribution). In a multimodal distribution like this, standard deviation and multi-sigma values are useless. They don’t mean anything. Remember the percentiles and quantiles discussed earlier? Those are how you describe the values in a non-normal distribution.
  18. This is another non-normal distribution. As you can see, it only has one mode, and is a skewed distribution. Standard deviation has little to no meaning here, nor do multiple sigmas. You’ll see histograms like this from manmade as well as natural phenomena. For example, the distribution of elements in the universe. Lots of hydrogen - a lot less heavy metals.
  19. Here is a histogram of web page request time. The higher the bar, the more users are affected. This is a highly skewed distribution - notice the spike at ~150 milliseconds and the long tail past it. There’s another smaller spike at ~25 ms, so this is mostly a bimodal distribution.
  20. In terms of website performance, people will generally get angry if request times take longer than 250 milliseconds. So what we see here is a bunch of users who are getting acceptable response times, and a long tail of pissed off users. People on the left side are having a great experience, people on the right side are leaving the site. Note that this is for a time slice, say 5 minutes. What does this look like if we integrate over time?
  21. How many people here have been in this situation? You get support tickets for your application coming in saying that it’s slow. You try it - it works fine. You have QA run their automations against it. Again - performance is acceptable, even great. What’s going on? Works for me, right? The problem here is that, as on the previous slide, you’re the user on the left side of the red line. The users who are complaining are on the right side of the red line. They’re hitting codepaths that your synthetics don’t, and that most of your real users don’t. Often these can be large important customers who write big checks. They have thousands of users under one account, and when they use your application, it’s slooooow.
  22. Heat maps are visual representations of histograms over time windows. They give you a visualization of data distributions over time.
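The data behind a heat map can be sketched as one histogram per time window. The `histogram` and `heat_map` helpers, the 60 second window, the 50 ms bucket width, and the sample data are all illustrative assumptions; real monitoring systems choose their own bucket layouts.

```python
from collections import Counter

def histogram(latencies_ms, bucket_ms=50):
    """Count samples per latency bucket; keys are bucket lower bounds in ms."""
    return Counter((int(v) // bucket_ms) * bucket_ms for v in latencies_ms)

def heat_map(timed_samples, window_s=60, bucket_ms=50):
    """Group (timestamp_s, latency_ms) pairs into one histogram per time window."""
    windows = {}
    for timestamp_s, latency_ms in timed_samples:
        window_start = (int(timestamp_s) // window_s) * window_s
        windows.setdefault(window_start, []).append(latency_ms)
    return {start: histogram(vals, bucket_ms)
            for start, vals in sorted(windows.items())}

# Hypothetical data: (seconds since start, latency in ms)
samples = [(5, 20), (12, 30), (40, 160), (70, 140), (95, 155), (110, 900)]
grid = heat_map(samples)

for start, hist in grid.items():
    print(f"window starting at {start}s:", dict(sorted(hist.items())))
```

Each window is a column of the heat map, each bucket is a cell, and the count in a cell is what the color intensity would encode.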
  23. With heat maps, you can add percentile overlays to show the 50th, 95th, and any other percentile over time slices.
  24. A percentile is a barrier: for the 95th percentile, 95% of the samples are to the left and the remaining 5% are to the right. There is a caveat when the barrier lands in the middle of the data points. If you measure the right side including the barrier, you get the samples at or above the 95th percentile of the whole data set; if you measure to the left of the barrier, you get at most 95% of the samples. If you have two samples, the median is every value between those two samples. Samples on the barrier are counted twice, once in each of the two sets the barrier divides the data into. (Have a slide that says - bespoke things you probably didn’t know about histograms.) For the purpose of our examples, we’ll avoid these edge cases. If you see a histogram where the ⅓ quantile and ⅔ quantile are the same value, the two sides add up to more than 100%. A histogram of one value is one example (everything is measured twice). Compare the sample sets 1,2 and 1,2,3.
  25. Percentiles cannot be averaged. You have to calculate them from the raw data. There are several monitoring solutions out there that will let you average percentiles - this is flat out WRONG.
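A small sketch of why averaging percentiles goes wrong, assuming two made-up one-minute windows with very different request counts. The `percentile` helper uses the nearest-rank definition, chosen here only to keep the example short.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies (ms) from two one-minute windows.
quiet_minute = [10] * 10                # 10 requests, all fast
busy_minute = [10] * 80 + [2000] * 20   # 100 requests, 20 of them very slow

avg_of_p95 = (percentile(quiet_minute, 95) + percentile(busy_minute, 95)) / 2
true_p95 = percentile(quiet_minute + busy_minute, 95)

print("average of the two p95 values:", avg_of_p95)
print("p95 of the combined raw data: ", true_p95)
```

The averaged value gives the quiet minute the same weight as the busy one, so it understates what the slowest 5% of requests actually experienced.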
  26. What’s your SLA? If you set your 95th percentile at 250 ms, and you meet your SLA, you’re pissing off 5% of your users. They’re going to your competitor. Let’s try to calculate how many users you are screwing.
  27. Take the number of requests outside your 95th percentile (the 5% inverse quantile), and integrate that over time to get a cumulative number of users that you’ve screwed. Multiply that by the dollar value of each lost request - that’s how much money you’re losing.
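The calculation can be sketched as follows. The latencies, the three time slices, and the `VALUE_PER_REQUEST` figure are hypothetical; only the 250 ms limit comes from the SLA example.

```python
SLA_MS = 250              # latency limit from the SLA example
VALUE_PER_REQUEST = 0.10  # hypothetical dollar value of one lost request

def slow_requests(latencies_ms, limit_ms=SLA_MS):
    """Number of requests slower than the limit in one time slice."""
    return sum(1 for v in latencies_ms if v > limit_ms)

# Hypothetical latencies (ms) for three consecutive time slices.
time_slices = [
    [100, 120, 300, 90],
    [110, 400, 500, 95],
    [105, 100, 98, 260],
]

slow_per_slice = [slow_requests(s) for s in time_slices]
total_slow = sum(slow_per_slice)  # the "integrate over time" step
estimated_loss = total_slow * VALUE_PER_REQUEST

print("slow requests per slice:", slow_per_slice)
print(f"estimated loss: ${estimated_loss:.2f}")
```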
  28. Here we are setting an alert that fires if the 90th percentile of request latency over one minute exceeds our threshold. This allows us to be woken up if our SLA is exceeded.
  29. Circonus.com allows you to set percentile-based alerts, so that you’ll be alerted if users start getting pissed off. Here is a percentile-based alert - you can expand that to alert based on the number of users pissed off per hour. Or even translate that to a dollar value using CAQL (Circonus Analytics Query Language). So you can say ‘alert me if we are losing more than $500 worth of users per hour’. This is something you’ll never be able to do with threshold-based alerting. Thus, you can set a limit that is essentially normalized to traffic loads, say holiday sale surges.