5. #MLSEV 5
Anomaly Detection
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51
An unsupervised algorithm that looks for unusual instances in a dataset. Anomaly
detectors assign an anomaly score to each instance: the higher the score, the
more unusual the instance. Example (the Wed Bob "tech" transaction):
• Amount $2,459 is higher than all other transactions
• It is the only transaction in zip 21350
• It is the only transaction for the purchase class “tech”
8. #MLSEV 8
Isolation Forest
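Isolation Forest: Grow random decision trees, with random features and random
splits, until each instance is in its own leaf.
[Figure: a random tree with its depth marked. Instances that are “easy” to isolate
sit in shallow leaves; instances that are “hard” to isolate sit in deep leaves.]
Now repeat the process several times and use the average depth to compute an
anomaly score between 0 (similar) and 1 (dissimilar).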
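The procedure on this slide can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation behind the slides: it grows every tree on the full dataset (no subsampling or depth limit), and it normalizes the average depth with the formula from the Isolation Forest paper. The function names are made up for this example, and the data reuses the amount and zip columns from the transactions table, treating zip as a plain number purely for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def isolation_depths(points, rng):
    """Grow one random tree and return the depth at which each point ends up."""
    n_features = len(points[0])
    depths = [0] * len(points)

    def grow(indices, depth):
        # Only features whose values differ inside this node can split it.
        usable = []
        for f in range(n_features):
            values = [points[i][f] for i in indices]
            if min(values) < max(values):
                usable.append((f, min(values), max(values)))
        if len(indices) <= 1 or not usable:
            for i in indices:
                depths[i] = depth  # leaf reached
            return
        f, lo, hi = rng.choice(usable)         # random feature
        split = lo + (hi - lo) * rng.random()  # random split in [lo, hi)
        left = [i for i in indices if points[i][f] <= split]
        right = [i for i in indices if points[i][f] > split]
        if not right:  # guard against float rounding: retry this node
            grow(indices, depth)
            return
        grow(left, depth + 1)
        grow(right, depth + 1)

    grow(list(range(len(points))), 0)
    return depths

def average_path_length(n):
    """Normalizer c(n) from the Isolation Forest paper."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_scores(points, n_trees=200, seed=0):
    """Average the isolation depth over many trees and map it to (0, 1)."""
    rng = random.Random(seed)
    totals = [0] * len(points)
    for _ in range(n_trees):
        for i, d in enumerate(isolation_depths(points, rng)):
            totals[i] += d
    c = average_path_length(len(points))
    return [2.0 ** (-(t / n_trees) / c) for t in totals]

# (amount, zip) for the seven transactions in the table.
points = [(135, 46140), (401, 46140), (234, 12222), (94, 26339),
          (2459, 21350), (83, 46140), (51, 26339)]
scores = anomaly_scores(points)
for point, score in zip(points, scores):
    print(point, round(score, 3))
```

Points that are isolated after few splits get a small average depth and therefore a score closer to 1.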
9. #MLSEV 9
Isolation Forest Splits
https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
[Figure legend: Anomaly, Usual data point]
12. #MLSEV 12
Outliers
• Data points that differ significantly from other observations
• Outliers can cause serious problems in statistical analyses
• Examples:
Regression:
[Figure: Price (100k €) vs. Square Meters; the x-axis runs from 10 to 80 and then jumps to 900]
Classification:
[Figure: Price (100k €) vs. Square Meters; the x-axis runs from 10 to 90, points are marked Sold or Unsold]
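To make the regression case concrete, here is a small sketch of how a single outlier can distort a least-squares fit. The listings are hypothetical numbers chosen for illustration (they are not read off the slide's plot), and `fit_line` is a helper written for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical listings: square meters and price in units of 100k €.
square_meters = [20, 30, 40, 50, 60, 70, 80]
prices = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

slope_clean, intercept_clean = fit_line(square_meters, prices)

# One extra listing far outside the usual range (900 square meters).
slope_all, intercept_all = fit_line(square_meters + [900], prices + [4.5])

print("slope without the outlier:", slope_clean)
print("slope with the outlier:   ", slope_all)
```

With the extra point included, the fitted slope drops to a small fraction of the slope obtained from the other listings, so predictions for ordinary houses get worse.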
14. #MLSEV 14
Removing Outliers
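[Diagram: workflow for testing outlier removal]
ORIGINAL DATASET → TRAIN SET + TEST SET
TRAIN SET → ALL MODEL → ALL EVALUATION (on TEST SET)
TRAIN SET → ANOMALY DETECTOR → REJECT MOST ANOMALOUS → CLEAN DATASET → CLEAN MODEL → CLEAN EVALUATION (on TEST SET)
ALL EVALUATION + CLEAN EVALUATION → COMPARE EVALUATIONS
• Anomaly detectors can be used to remove outliers
• With this methodology, the effect of outlier removal can be tested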
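The workflow above can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the data is synthetic, the split is a fixed every-fourth-row holdout, the model is a least-squares line, and the anomaly detector is replaced by a simple stand-in score (mean distance to the 3 nearest neighbours after standardizing), not an isolation forest. For simplicity the outliers are injected into the training set only.

```python
import math

def fit_line(points):
    """Least squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, points):
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)

def knn_scores(points, k=3):
    """Stand-in anomaly score: mean distance to the k nearest other points,
    measured after standardizing each column."""
    columns = list(zip(*points))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(columns, means)]
    scaled = [[(v - m) / s for v, m, s in zip(p, means, stds)] for p in points]
    scores = []
    for i, p in enumerate(scaled):
        dists = sorted(math.dist(p, q) for j, q in enumerate(scaled) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Synthetic (square meters, price) data with a small deterministic wobble.
data = [(20 + 2 * i, 0.05 * (20 + 2 * i) + 0.1 * math.sin(i)) for i in range(36)]

# ORIGINAL DATASET -> TRAIN SET + TEST SET (every fourth row held out).
test_set = [p for i, p in enumerate(data) if i % 4 == 0]
train_set = [p for i, p in enumerate(data) if i % 4 != 0]
train_set += [(25, 4.5), (55, 6.0), (85, 0.5)]  # injected outliers

# ALL MODEL -> ALL EVALUATION
mae_all = mean_abs_error(fit_line(train_set), test_set)

# ANOMALY DETECTOR -> REJECT MOST ANOMALOUS -> CLEAN DATASET
scores = knn_scores(train_set)
ranked = sorted(range(len(train_set)), key=lambda i: scores[i], reverse=True)
rejected = set(ranked[:3])
clean_train = [p for i, p in enumerate(train_set) if i not in rejected]

# CLEAN MODEL -> CLEAN EVALUATION -> COMPARE EVALUATIONS
mae_clean = mean_abs_error(fit_line(clean_train), test_set)
print("MAE, all data:  ", round(mae_all, 3))
print("MAE, clean data:", round(mae_clean, 3))
```

Both models are evaluated on the same test set, which is what makes the comparison meaningful. How many instances to reject (3 here) is a choice to validate the same way.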
16. #MLSEV 16
Summary
• An anomaly detector improved a classifier's performance by removing the top
10 anomalies as outliers
• As a rule of thumb, removing anomalies with a score over 60% usually works
18. #MLSEV 18
Fraud Detection
[Diagram: fraud detection workflow]
HISTORIC NON-FRAUD TRANSACTIONS → ANOMALY DETECTOR
NEW TRANSACTION(S) → ANOMALY DETECTOR → ANOMALY SCORE → KEEP HIGH SCORES → SUSPICIOUS TRANSACTION(S) → FRAUD ANALYST
• Use Machine Learning to detect fraudulent financial transactions
• Fraudulent transactions, being unusual, can be detected with an anomaly
detector
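A minimal sketch of this pipeline follows, under loud assumptions: the detector is a one-feature stand-in that scores a transaction by how far its amount lies from the median of the historic non-fraud amounts (scaled by the median absolute deviation), not an isolation forest. The transaction ids, the new amounts 120 and 150, and the 0.8 threshold are hypothetical; the historic amounts are the non-anomalous ones from the transactions table.

```python
import statistics

def fit_detector(amounts):
    """Summarize historic non-fraud amounts by median and scaled MAD."""
    center = statistics.median(amounts)
    mad = statistics.median([abs(a - center) for a in amounts])
    return center, 1.4826 * mad

def anomaly_score(detector, amount):
    """Map the robust z-score to [0, 1): 0 is typical, near 1 is unusual."""
    center, scale = detector
    z = abs(amount - center) / scale
    return z / (1.0 + z)

# HISTORIC NON-FRAUD TRANSACTIONS -> ANOMALY DETECTOR
historic_amounts = [135, 401, 234, 94, 83, 51]
detector = fit_detector(historic_amounts)

# NEW TRANSACTION(S) -> ANOMALY SCORE -> KEEP HIGH SCORES
new_transactions = [
    {"id": "t1", "amount": 120},
    {"id": "t2", "amount": 2459},
    {"id": "t3", "amount": 150},
]
THRESHOLD = 0.8  # hypothetical cut-off; tune it on real data
suspicious = [t for t in new_transactions
              if anomaly_score(detector, t["amount"]) > THRESHOLD]

# SUSPICIOUS TRANSACTION(S) -> FRAUD ANALYST
for t in suspicious:
    print("send to analyst:", t)
```

The detector only ever sees non-fraud history, so no fraud labels are needed; the analyst reviews whatever scores above the threshold.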
20. #MLSEV 20
Summary
• Anomaly detectors can be an unsupervised alternative to classifiers
in extremely unbalanced datasets
• Fraud detection is an example. A similar approach can be used for other
use cases such as predictive maintenance or network intrusion
detection
• With this approach, the most challenging aspect is finding the features
that work
22. #MLSEV 22
Novel Categories
• A classification model's performance can degrade over time in production
as real data evolves
• Model degradation can be addressed by retraining with new data
• What if new data is not labeled?
• What if new data contains novel categories?
• Anomaly detectors can be used to spot model degradation and to
discover novel categories
23. #MLSEV 23
Novel Categories Discovery
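[Diagram: novel category discovery workflow]
ORIGINAL DATASET → CLASSIFICATION MODEL
ORIGINAL DATASET → ANOMALY DETECTOR
NEW INSTANCES → ANOMALY DETECTOR → ANOMALY SCORE
REJECT HIGH ANOMALY SCORES → SIMILAR INSTANCES → CLASSIFICATION MODEL → PREDICTION
HIGH SCORED INSTANCES, POTENTIAL NOVEL CATEGORIES → DATA ANALYST → LABEL/RETRAIN
CUMULATED ANOMALY SCORE → MODEL ALERT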
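The routing and alerting logic in this diagram might look like the sketch below. It assumes the anomaly scores are already computed, and it reads "cumulated anomaly score" as a moving average over the last few instances, which is an interpretation rather than something the slide specifies. The `monitor` function, the window size, both thresholds, and the placeholder classifier are all hypothetical.

```python
from collections import deque

def monitor(scored_instances, classify, reject_above=0.6,
            window=5, alert_level=0.5):
    """Route each instance by its anomaly score and watch the recent average."""
    to_analyst, predictions, alerts = [], [], []
    recent = deque(maxlen=window)
    for step, (instance, score) in enumerate(scored_instances):
        recent.append(score)
        if score > reject_above:
            # Potential novel category: hold back for labeling.
            to_analyst.append(instance)
        else:
            # Similar to the training data: safe to predict.
            predictions.append((instance, classify(instance)))
        if len(recent) == window and sum(recent) / window > alert_level:
            alerts.append(step)  # the model may be degrading
    return to_analyst, predictions, alerts

# Hypothetical stream: scores rise as the incoming data drifts.
stream_scores = [0.2, 0.3, 0.25, 0.3, 0.2, 0.7, 0.8, 0.75, 0.9, 0.85]
stream = [("instance_%d" % i, s) for i, s in enumerate(stream_scores)]

to_analyst, predictions, alerts = monitor(stream, classify=lambda x: "known_class")
print("held back for the analyst:", to_analyst)
print("steps with a model alert: ", alerts)
```

Instances held back for the analyst can be labeled and added to the training data before the model is retrained.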
25. #MLSEV 25
Summary
• Novel categories of plate faults could be spotted with this method
• Model degradation in general can be monitored with anomaly detectors