© 2014 MapR Technologies 1 
Anomaly Detection 
How to Find What You Didn’t 
Know to Look For 
© MapR Technologies, confidential 
October 14, 2014
Anomaly Detection: 
How To Find What You Didn’t Know to Look For 
Ted Dunning, Chief Applications Architect MapR Technologies 
Email tdunning@mapr.com tdunning@apache.org 
Twitter @Ted_Dunning 
Ellen Friedman, Consultant and Commentator 
Email ellenf@apache.org 
Twitter @Ellen_Friedman
A New Look at Anomaly Detection 
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly) 
e-book available courtesy of MapR 
http://bit.ly/1jQ9QuL 
Practical Machine Learning series (O’Reilly)
• Machine learning is becoming mainstream
• Need pragmatic approaches that take real-world business settings into account:
– Time to value
– Limited resources
– Availability of data
– Expertise and cost of the team needed to develop and maintain the system
• Look for approaches with big benefits for the effort expended
Anomaly Detection
Who Needs Anomaly Detection?
• Utility providers using smart meters
• Feedback from manufacturing assembly lines
• Monitoring data traffic on communication networks
What is Anomaly Detection? 
• The goal is to discover rare events 
– especially those that shouldn’t have happened 
• Find a problem before other people see it 
– especially before it causes a problem for customers 
• Why is this a challenge? 
– I don’t know what an anomaly looks like (yet)
Spot the Anomaly
Looks pretty anomalous to me
Will the real anomaly please stand up?
Basic idea: 
Find “normal” first
Steps in Anomaly Detection
• Build a model: collect and process data for training
• Use the machine learning model to determine the normal pattern
• Decide how far from this normal pattern a point must be to count as anomalous
• Use the anomaly detection (AD) model to detect anomalies in new data
– Methods such as clustering can help with discovery
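The steps above can be sketched in a few lines. This is a minimal illustration, not the method from the talk: it assumes a one-dimensional signal whose normal behavior is summarized by a mean and standard deviation, and the function names and the threshold of 4 standard deviations are hypothetical choices.

```python
import statistics

def fit_normal_model(training):
    """Build the model: here 'normal' is just a mean and a standard deviation."""
    return statistics.fmean(training), statistics.stdev(training)

def anomaly_score(x, model):
    """Distance from the normal pattern, measured in standard deviations."""
    mean, std = model
    return abs(x - mean) / std

def detect(samples, model, threshold=4.0):
    """Flag new samples that lie farther from normal than the chosen threshold."""
    return [x for x in samples if anomaly_score(x, model) > threshold]

# Hypothetical training data that contains only normal behavior.
training = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
model = fit_normal_model(training)
flagged = detect([10.1, 9.9, 14.0, 10.2], model)
print(flagged)
```

A fixed threshold like this is the weak point of the sketch, since it has to be chosen by hand for each signal.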
How hard is it to set an alert for anomalies? 
Grey data is from normal events; x’s are anomalies. 
Where would you set the threshold? 
Basic idea: 
Set adaptive thresholds
What Are We Really Doing?
• We want action when something breaks (dies, falls over, or otherwise gets into trouble)
• But action is expensive
• So we don’t want too many false alarms
• And we don’t want too many false negatives (missed problems)
• What’s the right threshold to set for alerts?
– We need to trade off costs
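One way to make the cost trade-off concrete is to score every candidate threshold by the total cost of its false alarms and misses. The sketch below assumes you have some labeled historical scores and can put a number on each kind of mistake; the function name, the data, and the costs are all hypothetical.

```python
def best_threshold(normal_scores, anomaly_scores, cost_false_alarm, cost_miss):
    """Return the candidate threshold with the lowest total cost.

    An alert fires when score > threshold.
    """
    candidates = sorted(set(normal_scores) | set(anomaly_scores))

    def cost(t):
        false_alarms = sum(s > t for s in normal_scores)
        misses = sum(s <= t for s in anomaly_scores)
        return false_alarms * cost_false_alarm + misses * cost_miss

    return min(candidates, key=cost)

normal = [1, 2, 2, 3, 4, 6]   # scores seen during normal operation
anomalies = [5, 7, 9]         # scores seen during known problems

# Misses are expensive: accept one false alarm to catch every anomaly.
print(best_threshold(normal, anomalies, cost_false_alarm=1, cost_miss=10))
# False alarms are expensive: accept one miss to stay quiet.
print(best_threshold(normal, anomalies, cost_false_alarm=10, cost_miss=1))
```

In practice labeled anomalies are scarce, which is one reason to set the threshold from a percentile of the normal data instead.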
A Second Look
Threshold: 99.9%-ile
New algorithm: t-digest
How Hard Can It Be?
[Diagram: each input x feeds an online summarizer that tracks the 99.9%-ile as threshold t; if x > t, raise an alarm]
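The summarizer-plus-threshold loop can be sketched as follows. This is not t-digest: as a simple stand-in it keeps an exact sorted copy of a sliding window, which is an assumption made to keep the example self-contained. The class name, window size, and warm-up length are hypothetical.

```python
import bisect
import random
from collections import deque

class WindowQuantile:
    """Stand-in online summarizer: exact quantiles over the last `size` values."""

    def __init__(self, size=5000):
        self.size = size
        self.window = deque()   # values in arrival order
        self.ordered = []       # the same values, kept sorted

    def add(self, x):
        self.window.append(x)
        bisect.insort(self.ordered, x)
        if len(self.window) > self.size:
            old = self.window.popleft()
            del self.ordered[bisect.bisect_left(self.ordered, old)]

    def quantile(self, q):
        idx = min(len(self.ordered) - 1, int(q * len(self.ordered)))
        return self.ordered[idx]

def alarms(stream, q=0.999, warmup=1000):
    """Return indexes where x exceeds the current q-quantile threshold t."""
    summary = WindowQuantile()
    hits = []
    for i, x in enumerate(stream):
        if i >= warmup and x > summary.quantile(q):
            hits.append(i)
        summary.add(x)
    return hits

random.seed(1)
stream = [random.gauss(0, 1) for _ in range(3000)]
stream[2500] = 12.0   # injected anomaly
hits = alarms(stream)
print(hits)
```

Because the threshold follows the data, roughly one point in a thousand raises an alarm even when nothing is wrong; the percentile sets that alarm budget directly.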
Detecting Anomalies in Sporadic Events
[Plot: counts[order(centroids)] versus pnorm(centroids[order(centroids)])]
Using t-Digest
• Apache Mahout uses t-digest as an online percentile estimator
– very high accuracy for extreme tails
– new in Mahout v0.9
• t-digest is also available elsewhere
– in streamlib (open source library on GitHub)
– standalone (GitHub and Maven Central)
• What’s the big deal with anomaly detection?
• This looks like a solved problem
Already Done? Etsy Skyline?
What About This?
[Plot: signal composed of offset + noise + pulse1 + pulse2, with points A and B marked]
Model Delta Anomaly Detection
[Diagram: the model’s output is subtracted from the input to give δ; an online summarizer tracks the 99.9%-ile of δ as threshold t; if δ > t, raise an alarm]
The Real Inside Scoop
• The model-delta anomaly detector is really just a sum of random variables
– the model we know about already
– and a normally distributed error
• The output (delta) is (roughly) the log probability of the sum distribution (really $\delta^2$)
• Thinking about probability distributions is good
• But how do you handle AD in systems with sporadic events?
Spot the Anomaly
Anomaly?
Maybe not!
Where’s Waldo?
This is the real anomaly
Normal Isn’t Just Normal 
• What we want is a model of what is normal 
• What doesn’t fit the model is the anomaly 
• For simple signals, the model can be simple …
$x \sim N(0, \varepsilon)$
• The real world is rarely so accommodating
We Do Windows
[Figure sequence over 15 slides, all titled “We Do Windows”]
Windows on the World 
• The set of windowed signals is a nice model of our original signal 
• Clustering can find the prototypes 
– Fancier techniques available using sparse coding 
• The result is a dictionary of shapes 
• New signals can be encoded by shifting, scaling, and adding shapes from the dictionary
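The windowing-and-dictionary idea can be sketched with plain k-means. This is a simplified illustration under stated assumptions: windows are non-overlapping and aligned to the signal period, and a new window is encoded by its single nearest shape rather than by shifting, scaling, and adding several shapes. All function names and parameters are hypothetical.

```python
import math
import random

def windows(signal, width):
    """Cut the signal into consecutive, non-overlapping windows."""
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, width)]

def dist2(a, b):
    """Squared Euclidean distance between two equal-length windows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; the resulting centroids act as the dictionary of shapes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def reconstruction_error(window, dictionary):
    """Squared distance from a window to its closest dictionary shape."""
    return min(dist2(window, shape) for shape in dictionary)

# Training signal: a noisy periodic wave with a period of 32 samples.
rng = random.Random(42)
normal = [math.sin(2 * math.pi * i / 32) + rng.gauss(0, 0.05) for i in range(32 * 200)]
dictionary = kmeans(windows(normal, 32), k=4)

good = [math.sin(2 * math.pi * i / 32) for i in range(32)]
bad = list(good)
bad[10:16] = [0.0] * 6   # a flat spot the dictionary has never seen

print(reconstruction_error(good, dictionary))
print(reconstruction_error(bad, dictionary))
```

The reconstruction error is a one-dimensional signal, so a percentile threshold can be applied to it directly.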
Most Common Shapes (for EKG)
< 1 bit / sample
Reconstructed signal
[Plot panels: original signal, reconstructed signal, reconstruction error]
An Anomaly
The original technique for finding 1-d anomalies works against the reconstruction error
Close-up of anomaly
Not what you want your heart to do.
And not what the model expects it to do.
A Different Kind of Anomaly
Model Delta Anomaly Detection
[Diagram: the model’s output is subtracted from the input to give δ; an online summarizer tracks the 99.9%-ile of δ as threshold t; if δ > t, raise an alarm]
The Real Inside Scoop
• The model-delta anomaly detector is really just a sum of random variables
– the model we know about already
– and a normally distributed error
• The output (delta) is (roughly) the log probability of the sum distribution (really $\delta^2$)
• Thinking about probability distributions is good
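A short derivation shows why the squared delta behaves like a log probability. It assumes the error between signal and model is normally distributed with mean 0 and standard deviation $\sigma$; the symbol $\sigma$ is introduced here and is not in the slides.

$$
p(\delta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{\delta^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
-\log p(\delta) = \frac{\delta^2}{2\sigma^2} + \log\!\left(\sigma\sqrt{2\pi}\right)
$$

The second term is a constant, so under this assumption a threshold on $-\log p$ is equivalent to a threshold on $\delta^2$.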
Anomalies among sporadic events 
Sporadic Web Traffic to an e-Business Site
It’s important to know if traffic is stopped or delayed because of a problem…
But visits to the site normally come at varying intervals.
How long after the last event should you begin to worry?
And how do you let your CEO sleep through the night?
Basic idea:
Convert event times to the time interval between events, which is something useful you can measure
Sporadic Events: Finding Normal and Anomalous Patterns
• Time between events is much more usable than absolute times
• Counts don’t link as directly to probability models
• Time interval is log ρ
• This is a big deal
Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume a Poisson distribution
• The key question is whether the frequency has changed relative to expected values
– This shows up as a change in interval
• We want an alert as soon as possible
Converting Event Times to Anomaly
[Plot: thresholds marked at the 99.9%-ile and 99.99%-ile]
But in the real world, event rates often change
Time Intervals Are Key to Modeling Sporadic Events
Model-Scaled Intervals Solve the Problem
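Model Delta Anomaly Detection
[Diagram: the model’s output is subtracted from the input to give δ, labeled log p; an online summarizer tracks the 99.9%-ile of δ as threshold t; if δ > t, raise an alarm]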
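Detecting Anomalies in Sporadic Events
[Diagram: incoming events at times $t_i$ pass through a Δn stage that gives the interval $t_i - t_{i-n}$; a rate predictor fed by rate history supplies $\lambda$; the scaled interval $\delta = \lambda (t_i - t_{i-n})$ is compared with a threshold $t$ taken from a t-digest at the 99.97%-ile; if $\delta > t$, raise an alarm]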
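The scaled interval in this diagram, $\delta = \lambda (t_i - t_{i-n})$, can be computed as sketched below. The function name is hypothetical, and the rate predictor is reduced to a callable that returns a constant rate, which is an assumption for the sake of a short example. If the predicted rate is right and arrivals are Poisson, δ should hover around n.

```python
from collections import deque

def scaled_gaps(event_times, rate_fn, n=5):
    """Return (t_i, delta) pairs with delta = rate_fn(t_i) * (t_i - t_{i-n})."""
    recent = deque(maxlen=n + 1)   # holds t_{i-n} .. t_i
    out = []
    for t in event_times:
        recent.append(t)
        if len(recent) == n + 1:
            out.append((t, rate_fn(t) * (t - recent[0])))
    return out

# Hypothetical stream: one event every 0.5 s, then a long silence before the last event.
times = [0.5 * k for k in range(20)] + [20.0]
gaps = scaled_gaps(times, rate_fn=lambda t: 2.0, n=5)

print(gaps[0])    # steady traffic: delta equals n
print(gaps[-1])   # after the silence: delta is far above n
```

Feeding these δ values to a percentile-based threshold gives the alarm stage; scaling by the predicted rate is what keeps one threshold usable when the traffic level changes.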
Slipped Week: Simple Rate Predictor
[Plot: Main Page Traffic, hits (× 1000) by date, Nov 02 to Dec 02, with points A, B, C, and D marked]
Poisson Distribution
• Time between events is exponentially distributed
$\Delta t \sim \lambda e^{-\lambda t}$
• This means that long delays are exponentially rare
$P(\Delta t > T) = e^{-\lambda T}$
$-\log P(\Delta t > T) = \lambda T$
• If we know λ we can select a good threshold
– or we can pick a threshold empirically
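As a worked example of selecting a threshold from a known λ, solving $e^{-\lambda T} = p$ for $T$ gives $T = -\ln(p)/\lambda$. The function names and the numbers below are illustrative assumptions.

```python
import math

def interval_threshold(rate, tail_prob):
    """Gap length T such that P(dt > T) = tail_prob for a Poisson process."""
    return -math.log(tail_prob) / rate

def surprise(gap, rate):
    """-log P(dt > gap), which equals rate * gap."""
    return rate * gap

rate = 2.0                            # events per second, assumed known
T = interval_threshold(rate, 0.001)   # only 1 gap in 1000 should be longer
print(T)                              # about 3.45 seconds
print(surprise(T, rate))              # equals ln(1000), about 6.91
```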
Seasonality Poses a Challenge
[Plot: Christmas Traffic, hits / 1000 by date, Nov 17 to Dec 27]
Something more is needed …
[Plot: Christmas Traffic, hits / 1000 by date, Nov 17 to Dec 27]
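We need a better rate predictor…
[Diagram: the sporadic-event detector again: incoming events, Δn stage, rate predictor with rate history supplying $\lambda$, t-digest threshold $t$ at the 99.97%-ile, alarm when $\delta = \lambda (t_i - t_{i-n}) > t$]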
A New Rate Predictor for Sporadic Events
Improved Prediction with Adaptive Modeling
[Plot: Christmas Prediction, hits (× 1000) by date, Dec 17 to Dec 29]
Anomaly Detection + Classification: Useful Pair
• Use the AD model to detect anomalies in new data
– Methods such as clustering for discovery can be helpful
• Once you have well-defined models in your system, you may also want to use classification to tag those
• Continue to use the AD model to find new anomalies
Recap (out of order)
• Anomaly detection is best done with a probability model
• -log p is a good way to convert to an anomaly measure
• Adaptive quantile estimation (t-digest) works for auto-setting thresholds
Recap 
• Different systems require different models 
• Continuous time-series 
– sparse coding to build signal model 
• Events in time 
– rate model based on variable-rate Poisson
– segregated rate model 
• Events with labels 
– language modeling 
– hidden Markov models
Why Use Anomaly Detection?
Keep in mind…
• Model normal, then find anomalies
• t-digest for adaptive threshold
• Probabilistic models for complex patterns
[Plot: offset + noise + pulse1 + pulse2, with points A and B marked]
Keep in mind…
• Time intervals are key for sporadic events
• Complex time shift to predict rate with seasonality
• Sequence of events reveals phishing attack
[Plot: Christmas Prediction, hits (× 1000) by date, Dec 17 to Dec 29]
A New Look at Anomaly Detection 
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly) 
e-book available courtesy of MapR 
http://bit.ly/1jQ9QuL 
Coming in October: Time Series Databases 
by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly) 
Thank you for coming today! 
Sandbox

Anomaly Detection - New York Machine Learning

  • 1. © 2014 MapR Technologies 1 Anomaly Detection How to Find What You Didn’t Know to Look For © MapR Technologies, confidential October 14, 2014
  • 2. © 2014 MapR Technologies 2 Anomaly Detection: How To Find What You Didn’t Know to Look For Ted Dunning, Chief Applications Architect MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning Ellen Friedman, Consultant and Commentator Email ellenf@apache.org Twitter @Ellen_Friedman
  • 3. A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly) e-book available courtesy of MapR http://bit.ly/1jQ9QuL © 2014 MapR Technologies 3
  • 4. Practical Machine Learning series (O’Reilly) • Machine learning is becoming mainstream • Need pragmatic approaches that take into account real world business settings: – Time to value – Limited resources – Availability of data – Expertise and cost of team to develop and to maintain system • Look for approaches with big benefits for the effort expended © 2014 MapR Technologies 4
  • 5. © 2014 MapR Technologies 5 Anomaly Detection
  • 6. © 2014 MapR Technologies 6 Who Needs Anomaly Detection? Utility providers using smart meters
  • 7. © 2014 MapR Technologies 7 Who Needs Anomaly Detection? Feedback from manufacturing assembly lines
  • 8. © 2014 MapR Technologies 8 Who Needs Anomaly Detection? Monitoring data traffic on communication networks
  • 9. © 2014 MapR Technologies 9 What is Anomaly Detection? • The goal is to discover rare events – especially those that shouldn’t have happened • Find a problem before other people see it – especially before it causes a problem for customers • Why is this a challenge? – I don’t know what an anomaly looks like (yet)
  • 10. © 2014 MapR Technologies 10 Spot the Anomaly
  • 11. © 2014 MapR Technologies 11 Spot the Anomaly Looks pretty anomalous to me
  • 12. © 2014 MapR Technologies 12 Spot the Anomaly Will the real anomaly please stand up?
  • 13. © 2014 MapR Technologies 13 Basic idea: Find “normal” first
  • 14. © 2014 MapR Technologies 14 Steps in Anomaly Detection • Build a model: Collect and process data for training a model • Use the machine learning model to determine what is the normal pattern • Decide how far away from this normal pattern you’ll consider to be anomalous • Use the AD model to detect anomalies in new data – Methods such as clustering for discovery can be helpful
  • 15. How hard is it to set an alert for anomalies? Grey data is from normal events; x’s are anomalies. Where would you set the threshold? © 2014 MapR Technologies 15
  • 16. © 2014 MapR Technologies 16 Basic idea: Set adaptive thresholds
  • 17. © 2014 MapR Technologies 17 What Are We Really Doing • We want action when something breaks (dies/falls over/otherwise gets in trouble) • But action is expensive • So we don’t want too many false alarms • And we don’t want too many false negatives • What’s the right threshold to set for alerts? – We need to trade off costs
  • 18. © 2014 MapR Technologies 18 A Second Look
  • 19. © 2014 MapR Technologies 19 A Second Look 99.9%-ile
  • 20. New algorithm: t-digest © 2014 MapR Technologies 20
  • 21. © 2014 MapR Technologies 21 How Hard Can it Be? Online Summarizer x > t ? Alarm ! 99.9%-ile t x
  • 22. © 2014 MapR Technologies 22 Detecting Anomalies in Sporadic Events 0.0 0.2 0.4 0.6 0.8 1.0 0 5000 10000 15000 20000 pnorm(centroids[order(centroids)]) counts[order(centroids)]
  • 23. © 2014 MapR Technologies 23 Using t-Digest • Apache Mahout uses t-digest as an on-line percentile estimator – very high accuracy for extreme tails – new in version Mahout v 0.9 • t-digest also available elsewhere – in streamlib (open source library on github) – standalone (github and Maven Central) • What’s the big deal with anomaly detection? • This looks like a solved problem
  • 24. © 2014 MapR Technologies 24 Already Done? Etsy Skyline?
  • 25. © 2014 MapR Technologies 25 What About This? [Plot: offset + noise + pulse1 + pulse2, with labels A and B]
  • 26. © 2014 MapR Technologies 26 Model Delta Anomaly Detection [Diagram: the model output is subtracted from the input to give δ; an online summarizer estimates the 99.9%-ile threshold t; if δ > t, raise an alarm]
  • 27. © 2014 MapR Technologies 27 The Real Inside Scoop • The model-delta anomaly detector is really just a sum of random variables – the model we know about already – and a normally distributed error • The output (delta) is (roughly) the log probability of the sum distribution (really δ²) • Thinking about probability distributions is good • But how do you handle AD in systems with sporadic events?
  • 28. © 2014 MapR Technologies 28 Spot the Anomaly Anomaly?
  • 29. © 2014 MapR Technologies 29 Maybe not!
  • 30. © 2014 MapR Technologies 30 Where’s Waldo? This is the real anomaly
  • 31. © 2014 MapR Technologies 31 Normal Isn’t Just Normal • What we want is a model of what is normal • What doesn’t fit the model is the anomaly • For simple signals, the model can be simple … x ~ N(0, ε) • The real world is rarely so accommodating
  • 32. © 2014 MapR Technologies 32 We Do Windows
  • 33. © 2014 MapR Technologies 33 We Do Windows
  • 34. © 2014 MapR Technologies 34 We Do Windows
  • 35. © 2014 MapR Technologies 35 We Do Windows
  • 36. © 2014 MapR Technologies 36 We Do Windows
  • 37. © 2014 MapR Technologies 37 We Do Windows
  • 38. © 2014 MapR Technologies 38 We Do Windows
  • 39. © 2014 MapR Technologies 39 We Do Windows
  • 40. © 2014 MapR Technologies 40 We Do Windows
  • 41. © 2014 MapR Technologies 41 We Do Windows
  • 42. © 2014 MapR Technologies 42 We Do Windows
  • 43. © 2014 MapR Technologies 43 We Do Windows
  • 44. © 2014 MapR Technologies 44 We Do Windows
  • 45. © 2014 MapR Technologies 45 We Do Windows
  • 46. © 2014 MapR Technologies 46 We Do Windows
  • 47. © 2014 MapR Technologies 47 Windows on the World • The set of windowed signals is a nice model of our original signal • Clustering can find the prototypes – Fancier techniques available using sparse coding • The result is a dictionary of shapes • New signals can be encoded by shifting, scaling and adding shapes from the dictionary
  • 48. © 2014 MapR Technologies 48 Most Common Shapes (for EKG)
  • 49. < 1 bit / sample © 2014 MapR Technologies 49 Reconstructed signal Original signal Reconstructed signal Reconstruction error
  • 50. © 2014 MapR Technologies 50 An Anomaly Original technique for finding 1-d anomaly works against reconstruction error
  • 51. © 2014 MapR Technologies 51 Close-up of anomaly Not what you want your heart to do. And not what the model expects it to do.
  • 52. © 2014 MapR Technologies 52 A Different Kind of Anomaly
  • 53. © 2014 MapR Technologies 53 Model Delta Anomaly Detection [Diagram: the model output is subtracted from the input to give δ; an online summarizer estimates the 99.9%-ile threshold t; if δ > t, raise an alarm]
  • 54. © 2014 MapR Technologies 54 The Real Inside Scoop • The model-delta anomaly detector is really just a sum of random variables – the model we know about already – and a normally distributed error • The output (delta) is (roughly) the log probability of the sum distribution (really δ²) • Thinking about probability distributions is good
  • 55. Anomalies among sporadic events © 2014 MapR Technologies 55
  • 56. © 2014 MapR Technologies 56 Sporadic Web Traffic to an e-Business Site • It’s important to know if traffic is stopped or delayed because of a problem… But visits to site normally come at varying intervals. How long after the last event should you begin to worry?
  • 57. © 2014 MapR Technologies 57 Sporadic Web Traffic to an e-Business Site • It’s important to know if traffic is stopped or delayed because of a problem… But visits to site normally come at varying intervals. And how do you let your CEO sleep through the night?
  • 58. © 2014 MapR Technologies 58 Basic idea: Time interval between events is how to convert to something useful you can measure
  • 59. © 2014 MapR Technologies 59 Sporadic Events: Finding Normal and Anomalous Patterns • Time between events is much more usable than absolute times • Counts don’t link as directly to probability models • Time interval is log ρ • This is a big deal
  • 60. © 2014 MapR Technologies 60 Event Stream (timing) • Events of various types arrive at irregular intervals – we can assume Poisson distribution • The key question is whether frequency has changed relative to expected values – This shows up as a change in interval • Want alert as soon as possible
  • 61. © 2014 MapR Technologies 61 Converting Event Times to Anomaly 99.9%-ile 99.99%-ile
  • 62. © 2014 MapR Technologies 62 But in the real world, event rates often change
  • 63. Time Intervals Are Key to Modeling Sporadic Events © 2014 MapR Technologies 63
  • 64. © 2014 MapR Technologies 64 Model-Scaled Intervals Solve the Problem
  • 65. © 2014 MapR Technologies 65 Model Delta Anomaly Detection [Diagram: the model scores each input as log p to give δ; an online summarizer estimates the 99.9%-ile threshold t; if δ > t, raise an alarm]
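  • 66. © 2014 MapR Technologies 66 Detecting Anomalies in Sporadic Events [Diagram: incoming event times t_i pass through an n-event delay Δn; a rate predictor fed by the rate history supplies λ; δ = λ(t_i − t_{i−n}) goes to a t-digest that tracks the 99.97%-ile threshold t; if δ > t, raise an alarm]
  • 67. © 2014 MapR Technologies 67 Detecting Anomalies in Sporadic Events [Diagram: incoming event times t_i pass through an n-event delay Δn; a rate predictor fed by the rate history supplies λ; δ = λ(t_i − t_{i−n}) goes to a t-digest that tracks the 99.97%-ile threshold t; if δ > t, raise an alarm]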
  • 68. © 2014 MapR Technologies 68 Slipped Week: Simple Rate Predictor [Plot: Main Page Traffic, hits (x 1000) by date, Nov 02 to Dec 02, with labels A, B, C and D]
  • 69. © 2014 MapR Technologies 69 Poisson Distribution • Time between events is exponentially distributed: $\Delta t \sim \lambda e^{-\lambda t}$ • This means that long delays are exponentially rare: $P(\Delta t > T) = e^{-\lambda T}$, so $-\log P(\Delta t > T) = \lambda T$ • If we know λ we can select a good threshold – or we can pick a threshold empirically
  • 70. © 2014 MapR Technologies 70 Seasonality Poses a Challenge [Plot: Christmas Traffic, hits / 1000 by date, Nov 17 to Dec 27]
  • 71. © 2014 MapR Technologies 71 Something more is needed … [Plot: Christmas Traffic, hits / 1000 by date, Nov 17 to Dec 27]
  • 72. © 2014 MapR Technologies 72 We need a better rate predictor… [Diagram: incoming event times t_i pass through an n-event delay Δn; a rate predictor fed by the rate history supplies λ; δ = λ(t_i − t_{i−n}) goes to a t-digest that tracks the 99.97%-ile threshold t; if δ > t, raise an alarm]
  • 73. © 2014 MapR Technologies 73 A New Rate Predictor for Sporadic Events
  • 74. © 2014 MapR Technologies 74 Improved Prediction with Adaptive Modeling [Plot: Christmas Prediction, hits (x 1000) by date, Dec 17 to Dec 29]
  • 75. © 2014 MapR Technologies 75 Anomaly Detection + Classification → Useful Pair • Use the AD model to detect anomalies in new data – Methods such as clustering for discovery can be helpful • Once you have well-defined models in your system, you may also want to use classification to tag those • Continue to use the AD model to find new anomalies
  • 76. © 2014 MapR Technologies 76 Recap (out of order) • Anomaly detection is best done with a probability model • -log p is a good way to convert to anomaly measure • Adaptive quantile estimation (t-digest) works for auto-setting thresholds
  • 77. © 2014 MapR Technologies 77 Recap • Different systems require different models • Continuous time-series – sparse coding to build signal model • Events in time – rate model based on variable rate Poisson – segregated rate model • Events with labels – language modeling – hidden Markov models
  • 78. © 2014 MapR Technologies 78 Why Use Anomaly Detection?
  • 79. © 2014 MapR Technologies 79 Keep in mind… • Model normal, then find anomalies • t-digest for adaptive threshold • Probabilistic models for complex patterns [Plot: offset + noise + pulse1 + pulse2, with labels A and B]
  • 80. © 2014 MapR Technologies 80 Keep in mind… • Time intervals are key for sporadic events • Complex time shift to predict rate with seasonality • Sequence of events reveals phishing attack [Plot: Christmas Prediction, hits (x 1000) by date, Dec 17 to Dec 29]
  • 81. A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly) e-book available courtesy of MapR http://bit.ly/1jQ9QuL © 2014 MapR Technologies 81
  • 82. Coming in October: Time Series Databases by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly) © 2014 MapR Technologies 82
  • 83. Thank you for coming today! © 2014 MapR Technologies 83
  • 84. © 2014 MapR Technologies 85 © MapR Technologies, confidential
  • 85. © 2014 MapR Technologies 86 Sandbox
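
The model-delta detector of slides 26 and 53 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `ExactQuantileSummarizer` is a hypothetical stand-in for t-digest that keeps every value in a sorted list, and the `warmup` parameter and the use of the absolute difference as δ are choices made here for the example.

```python
import bisect
import math


class ExactQuantileSummarizer:
    """Hypothetical stand-in for an online quantile sketch such as t-digest.

    It stores every value, so it is exact but uses O(n) memory; a real
    t-digest keeps a bounded summary instead.
    """

    def __init__(self):
        self._sorted = []

    def add(self, value):
        bisect.insort(self._sorted, value)

    def quantile(self, q):
        if not self._sorted:
            raise ValueError("no data yet")
        # Nearest-rank quantile: the smallest stored value that has at
        # least a fraction q of the data at or below it.
        rank = max(1, math.ceil(q * len(self._sorted)))
        return self._sorted[rank - 1]


def model_delta_alarms(signal, model, q=0.999, warmup=100):
    """Indices where |signal - model| exceeds the running q-quantile of past deltas."""
    summarizer = ExactQuantileSummarizer()
    alarms = []
    for i, (x, predicted) in enumerate(zip(signal, model)):
        delta = abs(x - predicted)
        # Compare against the threshold learned from earlier deltas only.
        if i >= warmup and delta > summarizer.quantile(q):
            alarms.append(i)
        summarizer.add(delta)
    return alarms


# The model is a sine wave; the signal adds a small wiggle and one large pulse.
model = [math.sin(i / 10.0) for i in range(2000)]
signal = [m + 0.05 * math.cos(7 * i) for i, m in enumerate(model)]
signal[1500] += 2.0  # injected anomaly
alarms = model_delta_alarms(signal, model)
print(1500 in alarms)  # True
```

Because the threshold is a percentile of the deltas seen so far, it adapts to the noise level without a hand-picked constant, at the price of a small, steady rate of alarms on ordinary data.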

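For sporadic events (slides 60 to 69), the interval between events is scaled by the predicted rate, δ = λ(t_i − t_{i−n}), and for a Poisson process −log P(Δt > T) = λT. The sketch below works that through for n = 1. The function names, the `rate_at` callback and the fixed tail probability of 0.0003 (the 99.97%-ile) are assumptions for illustration; the slides set the threshold empirically with t-digest rather than analytically.

```python
import math


def interval_threshold(rate, tail_probability):
    """Interval T with P(dt > T) = tail_probability for a Poisson process of the given rate."""
    return -math.log(tail_probability) / rate


def scaled_intervals(event_times, rate_at, n=1):
    """delta_i = rate * (t_i - t_{i-n}) for every event that has n predecessors.

    rate_at is a hypothetical rate predictor: a function mapping a time
    to the expected number of events per unit time.
    """
    return [
        rate_at(event_times[i]) * (event_times[i] - event_times[i - n])
        for i in range(n, len(event_times))
    ]


def late_event_indices(event_times, rate_at, tail_probability=0.0003):
    """Indices of events that arrived after an improbably long gap (n = 1 only)."""
    # For n = 1 the scaled gap is exactly -log P(dt > T) = lambda * T.
    limit = -math.log(tail_probability)
    deltas = scaled_intervals(event_times, rate_at, n=1)
    return [i + 1 for i, d in enumerate(deltas) if d > limit]


# One event per time unit on average, then a gap of 10 units.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 14.0, 15.0]
print(late_event_indices(times, rate_at=lambda t: 1.0))  # [5]
```

This version only notices the gap when the late event finally arrives. To alert as soon as possible, the same comparison can be made against the time elapsed since the last event.
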
Editor's notes

  1. Talk track: 2nd in series, first was on how to build a simple recommender. This one on anomaly detection is being sold by O’Reilly on Amazon, but for a limited time MapR is giving away the e-book for free. Here’s the link where you can register to get one.
  2. Talk track: ELLEN New ways to do it that take into account real world business goals, realistic resources, new types of data and best time to value…
  3. Talk track: mistakes affect huge numbers of people….
  4. Talk track: … even more so on automated assembly line
  5. STILL ELLEN
  6. Talk track: Say “Build a model”
  7. Talk track: Say “Build a model; model what is normal. Then determine what is not…”
  8. ELLEN/TRANSITION SLIDE
  9. TED
  10. Ellen talking point: Ted authored it and contributed to open-source; others are now contributing adjustments, used in several places
  11. Talk track: Now where do you put the threshold? Adaptive model is the solution…
  12. Ellen: Talk track: We talk about this in the book with the EKG example where the normal pattern is fairly regular but very complex shape..
  13. Ellen comment for transition: Talk track: How do we handle that, and in what situations does that matter?
  14. ELLEN: set up
  15. TO TED/ CEO story
  16. Talk track: This is what it looks like to have events, such as those on a website, that come in at randomized times (people come when they want to) while the underlying average rate is constant; in other words, a fairly steady stream of traffic. This looks a lot like the first signal we talked about: a randomized but even signal… We can use t-digest on it to set thresholds, and everything works just grand. (Like the clicks of a Geiger counter measuring radioactivity.)
  17. Talk track: (Describe figure) Horizontal axis is days, with noon in the middle of each day. The faint shadow shows the underlying rate of events. The vertical axis is the time interval between events. Notice that when the rate of events is high, the time interval between events is small, but when the rate of events slows down, the time between events is much larger. Ellen: For this reason, we cannot set a simple threshold: if we set it low, to suit the daytime, we get an alert every night even though we expect a longer interval then. If we set it too high, we miss the real problems when traffic really is abnormally delayed or stopped altogether. What can you do to solve this? Ted: We build a model and multiply the modeled rate by the interval; that gives a number we can threshold accurately.
  18. Talk track: (Description of graph) Shadow
  19. Ted: this was figure 5-2 in the book
  20. Talk track: You need a rate predictor Ellen: sometimes simple is good enough
  21. Ted: This was figure 5.3
  22. Talk track: This slide is here for reference when you download the slides
  23. Ted: This was figure 5.4
  24. Ted: This was figure 5.4
  25. Ted: this was figure 5-2 in the book
  26. We can look at yesterday and day before but need to look at the shape from previous days … but look at today for whether traffic is scaling
  27. Ted: This was figure 5.4
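
Note 26 describes a rate predictor that takes the daily shape from previous days and checks today's traffic to see whether it is scaling. One way to read that is sketched below. This is an assumption made for illustration, not the predictor from the book: `predict_rate`, the hourly buckets and the simple ratio used for scaling are all hypothetical.

```python
def predict_rate(previous_days, today_so_far):
    """Predict the event count for the next hour of today.

    previous_days: list of per-day lists of hourly counts, all the same length.
    today_so_far: hourly counts observed so far today (shorter than a full day).
    """
    hours = len(previous_days[0])
    h = len(today_so_far)
    if not 0 < h < hours:
        raise ValueError("need at least one observed hour and one hour left to predict")
    # Shape: the average count for each hour of the day over the previous days.
    shape = [
        sum(day[k] for day in previous_days) / len(previous_days)
        for k in range(hours)
    ]
    # Scale: how today's traffic so far compares with the same hours of the shape.
    expected_so_far = sum(shape[:h])
    scale = sum(today_so_far) / expected_so_far if expected_so_far > 0 else 1.0
    return scale * shape[h]


# Two identical previous days; today is running at twice the usual level.
history = [[10, 20, 40, 20], [10, 20, 40, 20]]
print(predict_rate(history, today_so_far=[20, 40]))  # 80.0
```

The predicted rate plays the role of λ when scaling intervals, so a busy season raises the expected rate instead of hiding a real slowdown.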