Anomaly detection is a useful machine learning technique for identifying interesting, valuable or unusual instances in data sets. Applications for anomaly detection are diverse, including: fraud and counterfeit detection; surveillance; network, security and process monitoring; data exploration and more.
In this presentation, I review the basic ideas behind outlier-based detectors and compare them with traditional classification. I highlight practical and advanced issues that affect performance. Finally, I present an application of anomaly detection to detecting seizures from intracranial EEG time series.
See the accompanying video, http://vimeo.com/71931374
4. data cleansing
3-5% mislabeled ground truth in MNIST database
[Figure: grid of handwritten digit images from the MNIST database]
stock price
Volkswagen (VOW.DE) short squeeze, 10/28/2008
5. transactions
video surveillance
email
Date: Sat, 12 Aug 2012 14:39:59 UTC
From: "Iglobal" <tryme@yourdomain.com>
To: "Mr. Foo1" <foo1@freemail.com>
Subject: Foo1, Please Confirm Your Position!
Hi Foo1,
Welcome To The $7 Plan. I Bring in 3 to 5
New Members In Every Day, I can show you
how easily. Its to much Fun.
Solution #1 It costs too much every month.
Not with the $7 Plan! The TOTAL cost is $7
per month. The $7.00 Plan is still holding
your position and we have people that are
waiting to place under you. That's right only
Credit Card Fraud
Campaign Response
Traffic
Persons of Interest
Spam
Intrusion / Malware
9. Nothing is more expensive than a missed opportunity.
– H. Jackson Brown, Jr.
Advantages
Data haystacks .01%
Unusual = interesting
Models $$$
Labels $$$
…
Disadvantages?
10. We sell healthy, green apples!
Bob ... knows apples
common (n=13)
rare (n=1)
Bob “The 8th Dwarf”
8 Dwarf Orchards, Inc.
… sells healthy apples
… studies data science
… does “Big Apple Data”
11. Goal: label instances
(green vs. red)
watercore
greens
green = +1
red = -1
Feature Space
Labels
mass density (g/cm3)
reds
Training
zi
Inputs
xi
zi
yi
f : X → Y
12. Test Examples
watercore
Test Examples – Results
Confusion Matrix
            Green (G)   not-green (NG)
Label G     13 (TP)     4 (FP)
Label NG    1 (FN)      1 (TN)
mass density (g/cm3)
13. Key idea: trade-off mislabeling each class (P vs. N)
Confusion matrix
            True Classes
            Green (G) = P   not-green (NG) = N
Label G     13 (TP)         4 (FP)
Label NG    1 (FN)          1 (TN)
Sensitivity: TPR = TP / (TP+FN) = 13/14 (errors on the “positive” class, Green)
Specificity: SPC = TN / (FP+TN) = 1/5 (errors on the “negative” class, not-green)
False Positive Rate: FPR = FP / (FP+TN) = 4/5 = 1 - SPC
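As a quick check of the arithmetic on this slide, the sketch below recomputes the three rates from the apple confusion matrix counts. It assumes the standard definitions (sensitivity = TP/(TP+FN), specificity = TN/(FP+TN), FPR = FP/(FP+TN)); the function name `rates` is made up for illustration.

```python
def rates(tp, fp, fn, tn):
    """Return (sensitivity, specificity, false positive rate)."""
    sensitivity = tp / (tp + fn)   # how well we do on the positive class
    specificity = tn / (fp + tn)   # how well we do on the negative class
    fpr = fp / (fp + tn)           # equals 1 - specificity
    return sensitivity, specificity, fpr

# Counts from the apple confusion matrix on the slide.
sens, spec, fpr = rates(tp=13, fp=4, fn=1, tn=1)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} FPR={fpr:.3f}")

# Degenerate detector that labels every apple green:
# all 14 greens are caught, all 5 non-greens are mislabeled.
all_green = rates(tp=14, fp=5, fn=0, tn=0)
print(all_green)
```

The second call illustrates the trade-off: labeling everything green gives perfect sensitivity but zero specificity.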
14. Idea: distance to “average” example
centroid based anomaly detection
examples
centroid
threshold
anomaly
watercore
mass density (g/cm3)
false positive
anomaly score
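The centroid idea on this slide can be sketched in a few lines: average the training examples, score each new example by its distance to that average, and flag scores above a threshold. The measurements, the 1.5 slack factor, and all function names below are hypothetical choices for illustration, not values from the talk.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def anomaly_score(point, center):
    """Euclidean distance to the centroid; larger means more unusual."""
    return math.dist(point, center)

# Hypothetical (watercore, mass density) measurements of healthy apples.
train = [(0.10, 0.80), (0.12, 0.82), (0.09, 0.79), (0.11, 0.81), (0.13, 0.83)]
center = centroid(train)

# Assumed thresholding rule: largest training distance times a slack factor.
threshold = 1.5 * max(anomaly_score(p, center) for p in train)

def is_anomaly(point):
    return anomaly_score(point, center) > threshold

print(is_anomaly((0.11, 0.81)))  # a typical apple
print(is_anomaly((0.40, 0.95)))  # far from the centroid
```

Because the score is a plain Euclidean distance, the region of support is a circle around the centroid, which is one source of the false positives noted on the slide when the data are not round.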
16. Goal: find densest regions in feature space
Standard deviation
mass density (g/cm3)
Tukey statistic (IQR)
watercore
Mahalanobis distance
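The three statistics named on this slide can be compared on toy data. The sketch below uses only the Python standard library; the density values, the 2-D pairs, and the cutoffs (k = 3 standard deviations, 1.5 × IQR) are illustrative assumptions, and the Mahalanobis helper is written out for the 2-D case only.

```python
import math
import statistics

def stddev_outliers(xs, k=3.0):
    """Flag values more than k sample standard deviations from the mean."""
    mu, sigma = statistics.mean(xs), statistics.stdev(xs)
    return [x for x in xs if abs(x - mu) > k * sigma]

def tukey_outliers(xs, k=1.5):
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - k * iqr or x > q3 + k * iqr]

def mahalanobis_2d(point, data):
    """Mahalanobis distance of a 2-D point from the sample mean of data."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2          # determinant of the covariance matrix
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form [dx dy] S^-1 [dx dy]^T with the 2x2 inverse written out.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical mass densities with one unusual apple (1.20).
densities = [0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.78, 0.83, 0.80, 1.20]
print(stddev_outliers(densities))
print(tukey_outliers(densities))

# Hypothetical correlated 2-D features.
pairs = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]
print(mahalanobis_2d((5, 6), pairs))  # lies along the correlation
print(mahalanobis_2d((5, 2), pairs))  # same Euclidean distance, against it
```

On this toy sample the 3-sigma rule flags nothing, because the 1.20 value inflates the standard deviation it is measured against, while the Tukey fences do flag it. The two 2-D test points are equally far from the mean in Euclidean terms, but the Mahalanobis distance is much larger for the one that cuts across the correlation.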
17. Goal: find densest regions in feature space
Flexible
Density based
Robust
watercore
Tunable
mass density (g/cm3)
How? the one-class support vector machine
18. Goal: find densest regions in feature space
x
xx
“Flood” graph
x
Pick fraction, e.g. 0.5
Mark waterlines
Note support
The One-class Support Vector Machine Does This
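A minimal sketch of the one-class SVM step, assuming scikit-learn and NumPy are available (the talk does not say which implementation was used). The synthetic data, the RBF kernel, and `gamma="scale"` are illustrative assumptions; `nu=0.5` mirrors the "pick fraction, e.g. 0.5" step, since `nu` upper-bounds the fraction of training points left outside the learned region.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical standardized 2-D features for "normal" examples only.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Fit on unlabeled normal data; nu is the fraction picked on the slide.
model = OneClassSVM(kernel="rbf", nu=0.5, gamma="scale").fit(normal)

test = np.array([
    [0.0, 0.0],   # in the densest part of the training data
    [6.0, 6.0],   # far from anything seen in training
])
labels = model.predict(test)             # +1 = inside region, -1 = outside
scores = model.decision_function(test)   # negative values fall outside

print(labels)
print(scores)
```

Lowering `nu` widens the region of support, which is the tunable knob the earlier slide refers to.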
19.
Outlier impact
Rich data
Graphs
Spatio-temporal
Text
Use labels
Online / latency
Features
Clustering & alternatives
You Are Here
20. APPROACHES
SAMPLE METHODS
Statistical methods
Distance based methods
Rule systems
Profiling Methods
Model based approaches
Kernel methods
PCA & subspace methods
OCNM & OCSVM
CUSUM
Nearest neighbors
Decision trees
Replicator Neural Networks
Clustering
V. Chandola, A. Banerjee and V. Kumar, “Anomaly Detection: A Survey.” (2009)
21.
Problem: Detect seizures in patients from IEEG
Solution: Use one-class SVM to train on 15 minutes of baseline
Performance: Improve state-of-the-art latency (5 secs) to -13 secs, auto channel selection, unsupervised technique, …
Reference: “One-Class Novelty Detection for Seizure Analysis from Intracranial EEG,” Journal of Machine Learning Research ’06
24. Traditional Model
Brain Electrical Activity
Novelty Model
Brain Electrical Activity
baseline
baseline
pre-seizure
seizure
other
(e.g., seizures, artifacts,
etc.)
25. Idea: Capture Spectral Changes
Sliding Windows
Spectrum
frequency
EEG
time
Teager Energy
Curve Length
Short-Term Energy
slide & compute
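The "slide & compute" step can be sketched as follows. The slide names the three features but not their formulas, so the definitions below are common textbook forms and should be read as assumptions: curve length as the sum of absolute first differences, short-term energy as the mean squared amplitude, and Teager energy as the mean of x[n]^2 - x[n-1]*x[n+1]. The synthetic signal, window width, and step are made up for illustration.

```python
import math

def curve_length(w):
    """Sum of absolute sample-to-sample differences."""
    return sum(abs(b - a) for a, b in zip(w, w[1:]))

def short_term_energy(w):
    """Mean squared amplitude of the window."""
    return sum(x * x for x in w) / len(w)

def teager_energy(w):
    """Mean of x[n]^2 - x[n-1]*x[n+1] over the interior samples."""
    vals = [w[n] ** 2 - w[n - 1] * w[n + 1] for n in range(1, len(w) - 1)]
    return sum(vals) / len(vals)

def sliding_features(signal, width, step):
    """One (start, curve length, energy, Teager energy) row per window."""
    rows = []
    for start in range(0, len(signal) - width + 1, step):
        w = signal[start:start + width]
        rows.append((start, curve_length(w), short_term_energy(w), teager_energy(w)))
    return rows

# Synthetic stand-in for EEG: a quiet 2 Hz baseline, then a stronger 10 Hz burst.
fs = 100  # assumed sampling rate in Hz
baseline = [0.5 * math.sin(2 * math.pi * 2 * n / fs) for n in range(300)]
burst = [2.0 * math.sin(2 * math.pi * 10 * n / fs) for n in range(300)]
features = sliding_features(baseline + burst, width=100, step=50)

for row in features:
    print(row)
```

All three features rise sharply once the windows reach the burst, which is the kind of distribution shift a novelty detector trained on baseline windows can pick up. Each row could serve as one feature vector for such a detector.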
29. Nothing is more expensive than a missed opportunity.
– H. Jackson Brown, Jr.
Advantages
Data haystacks .01%
Unusual = interesting
Models $$$
Labels $$$
…
Challenges
Features FTW
Normal = ?
Deviation = ?
False positives
Adaptation
…
(1:00) Thank organizers & attendees. My background, thesis. Invitation to connect.
(1:00) Anomaly detection is intuitive. Requires a context. Requires a measure.
(0:45) MNIST database of handwritten digits. Longstanding story about accuracy of the data set. Volkswagen share price went from 210 EUR to 1005 EUR. Porsche disclosed holdings, including options with which it intended to acquire the underlying shares. This was going to deplete the float, which caused a run by short sellers. (http://www.risk.net/risk-magazine/feature/1498381/the-volkswagen-squeeze) Anomalies focus our attention.
(0:45) Anomalies have intrinsic value: business, social and scientific value. Transactions, like insurance, purchases, returns, etc.: looking for unusual good and bad behavior. Canonical example is credit card fraud, for instance my recent “purchase” of wine in Spain. Video surveillance: directly examining people, vehicles, and scenes for gait, position, counts, etc. to determine unusual traffic, intent, direction. Email: canonical example is the spam scam. Anomalous to me individually by content, sender, etc. Anomalous to recipients of an ISP because of how widely it spreads. Malware: anomalous mailings by me.
(0:45) Often overlooked. Two axes: expensive to acquire examples; expensive to miss anomalies. Currency: Secret Service TV episode. Conditions: life safety, services, etc. Seizures.
Anomalies everywhere. Changing perspective.
Machine learning makes it happen. Ideal vs. real system. Alerts, because of intervention cost. Online is rare. Workflow is similar.
Data growth. Unusual events. Expensive to model. Labeled examples are rare, expensive. Prioritized focus.
Meet Bob. Red apples are “poison”, so build a healthy (green) apple detector.
RFA request for apples. Count all combinations of “what I said it was” x “what it actually was” -> confusion matrix. Note the unforeseen apple examples: rotten, yellow, etc. These unanticipated counter-examples are one reason why traditional classification “breaks”.
Confusion matrices are … confusing. Reduce to two statistics (sensitivity, specificity). FPR is related to specificity (FPR = 1 - specificity). Sensitivity: how well do we do on green apples? Specificity: how well do we do on the others? Example: can build a perfect green apple detector by labeling all apples green. That’s highly sensitive, but not specific.
Watercore is a real produce feature! This works pretty well for some problems, but there are issues as we will see…
Tukey = nonparametric, spherical region of support. Stddev = parametric, spherical region of support. Mahalanobis = elliptical, generalization of stddev, tighter bounds but more expensive to compute. In practice, Mahalanobis performs nicely.
Ideal case: find statistically significant “islands”. Curiously, outliers distort this task. The one-class SVM is the canonical, golden algorithm to achieve this. Oracle Data Mining implements one-class SVM. There are better variants now, like OCNM.
Outlier pruning before modeling can help. Rich data has representation challenges. How do you encode feature vectors? What is an anomaly? How do you define normal? Semi-supervised technique: do anomaly detection + use labels for classifying. If online system, concerned with latency. Features matter, even more so for anomaly detection. Clustering is an alternative and related problem. Many other related problems. Maybe worth considering.
Good survey paper. They create a taxonomy of techniques. Examples of AD techniques listed. Note familiar methods: lots of ML algorithms can be reworked as anomaly detection. Strategies: find a technique that works for your data; map your data so it works with your favorite technique; invent your own technique.
When non-controllable, looking at: surgical brain resection (gold standard); implantable device (experimental); alternative.
Real 20-min ictal EEG. Seizures not so obvious in raw time series form.
We pick simple but robust features from the speech and signal processing literature. Time series almost never useful in raw form. Use sliding window approaches. How to pick window width? What about multiscale phenomena?
Interictal (baseline) features vs. ictal (seizure). Notice that feature distributions shift during seizure = anomaly.
Data growth. Unusual events. Expensive to model. Labeled examples are rare, expensive. Prioritized focus.