History and Present Developments in Machine Learning, by Tom Dietterich, Emeritus Professor of Computer Science at Oregon State University and Chief Scientist at BigML.
Machine Learning School in the Netherlands, 2022.
3. BigML, Inc #DutchMLSchool
• Anomaly Detection Use Cases
• Four Basic Methods for Anomaly Detection with Engineered Features
• Benchmarking Study
• Incorporating Feedback
• Deep Versions of the Four Basic Methods
• Classifier-Based Anomaly Detection using the Max Logit Score
• Familiarity Hypothesis
• Challenges for the Future
Outline
• Data Cleaning
  • Remove corrupted data from the training data
  • Example: typos in feature values, feature values interchanged, test results from two patients combined
• Fault Detection, Fraud Detection, Cyber Attack Detection
  • At training or test time, faulty or illegal behavior creates anomalous data
• Open Category Detection
  • At test time, the classifier is given an instance of a novel category
  • Example: a self-driving car (trained in Europe) encounters a kangaroo (in Australia)
• Out-of-Distribution Detection
  • At test time, the classifier is given an instance collected in a different way
  • Example: a chest X-ray classifier trained only on front views is shown a side view
  • Example: a self-driving car trained in clear conditions must operate during rainy conditions
Use Cases
• Claim: Every deployed ML classifier should include an anomaly detector to detect queries that lie outside the region of competence of the classifier
• Also useful as a performance indicator to detect that you need to retrain the classifier
Protecting a Classifier
[Diagram: a query x_q is sent to an anomaly detector trained on examples (x_i, y_i); if A(x_q) > τ, the query is rejected; otherwise the classifier f outputs ŷ = f(x_q)]
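A minimal sketch of this reject-or-classify wrapper (the detector choice, toy data, and threshold τ below are invented for illustration, not from the talk):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 2))       # nominal training inputs
y_train = (X_train[:, 0] > 0).astype(int)       # labels for the classifier

f = LogisticRegression().fit(X_train, y_train)  # classifier f
detector = IsolationForest(random_state=0).fit(X_train)

def predict_or_reject(x_q, tau=0.6):
    """Return f(x_q), or None (reject) when the anomaly score exceeds tau."""
    # score_samples is high for inliers; negate so larger = more anomalous
    A = -detector.score_samples(x_q.reshape(1, -1))[0]
    if A > tau:
        return None                              # reject: outside competence
    return int(f.predict(x_q.reshape(1, -1))[0])

print(predict_or_reject(np.array([0.5, -0.2])))  # in-distribution query
print(predict_or_reject(np.array([8.0, 8.0])))   # far outside training data
```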
• Definition: An "anomaly" is a data point generated by a process different from the process generating the "nominal" data
• Let D_0 be the probability distribution of the nominal process
• Let D_a be the probability distribution of the anomaly process
• Two formal settings:
  • Clean training data
  • Contaminated training data
Anomaly Detection Definitions
• Given:
  • Training data: x_1, x_2, …, x_N
  • All data come from D_0, the "nominal" distribution
  • Test data: x_{N+1}, …, x_{N+M} from a mixture of D_0 and D_a (the anomaly distribution)
• Find:
  • The data points in the test data that belong to D_a
• Examples:
  • Protecting a classifier
  • Detecting manufacturing defects / equipment failure
Clean Training Data
• Given:
  • Training data: x_1, x_2, …, x_N from a mixture of D_0 and D_a (the anomaly distribution)
• Find:
  • The data points in the training data that belong to D_a
• Use Cases:
  • Data cleaning
  • Fraud detection, insider threat detection
• The two settings can be combined: contaminated training data + separate contaminated test data
Contaminated Training Data
• Distance-Based Methods
  • Anomaly score: A(x_q) = min_{x ∈ D} ‖x_q − x‖
• Density Estimation Methods
  • Model the joint distribution P_D(x) of the input data points x_1, … ∈ D
  • Surprise: A(x_q) = −log P_D(x_q)
• Quantile Methods
  • Find a smooth function f such that {x : f(x) ≥ 0} contains 1 − α of the training data
  • Anomaly score: A(x) = −f(x)
• Reconstruction Methods
  • Train an auto-encoder x ≈ D(E(x)), where E is the encoder and D is the decoder
  • Anomaly score: A(x_q) = ‖x_q − D(E(x_q))‖
Theoretical Approaches to Anomaly Detection
• Define a distance d(x_i, x_j)
• Anomaly score: A(x_q) = min_{x ∈ D} d(x_q, x)
• Requires a good distance metric
Approach 1: Distance-Based Methods
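As a concrete sketch of a distance-based detector, here is the k-NN variant that appears in the benchmarking study later in the talk (mean distance to the k nearest neighbors); the data and the choice k = 5 are made up:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
D = rng.normal(0, 1, size=(300, 2))     # nominal training data

# Anomaly score = mean Euclidean distance to the k nearest training points
knn = NearestNeighbors(n_neighbors=5).fit(D)

def A(x_q):
    dists, _ = knn.kneighbors(x_q.reshape(1, -1))
    return float(dists.mean())

print(A(np.array([0.0, 0.0])))    # small: deep inside the data
print(A(np.array([10.0, 10.0])))  # large: far from every training point
```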
• Approximates the L1 (Manhattan) distance (Guha, et al., ICML 2016)
• Construct a fully random binary tree:
  • choose an attribute j at random
  • choose a splitting threshold θ uniformly from [min x_{·j}, max x_{·j}]
  • recurse until every data point is in its own leaf
  • let d(x_i) be the depth of point x_i
• Repeat L times; let d̄(x_i) be the average depth of x_i
• Anomaly score: A(x_i) = 2^{−d̄(x_i)/r(x_i)}, where r(x_i) is the expected depth
Isolation Forest [Liu, Ting, Zhou, 2008]
[Figure: an example random tree; an isolated point x_i reaches a leaf at small depth]
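The construction above is implemented in scikit-learn; a minimal sketch (data and parameters invented for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
X = rng.normal(0, 1, size=(400, 3))   # nominal data

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

# score_samples returns the NEGATED anomaly score: higher = more normal.
# Negate it so that larger values mean "more anomalous" (easily isolated).
inlier  = -forest.score_samples(np.zeros((1, 3)))[0]
outlier = -forest.score_samples(np.full((1, 3), 9.0))[0]
print(inlier, outlier)   # the easily isolated point gets the larger score
```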
• Given a data set x_1, …, x_N, where x_i ∈ ℝ^d
• Assume the data have been drawn i.i.d. from an unknown probability density: x_i ∼ P(x)
• Goal: estimate P
• Anomaly score: A(x_q) = −log P(x_q), the "surprisal" from information theory
• Why density estimation? It gives a more global view by combining distances to all data points
Approach 2: Density Estimation
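A small sketch of the surprisal score using a kernel density estimate (the data and the choice of SciPy's `gaussian_kde` are illustrative assumptions, not from the talk):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
X = rng.normal(0, 1, size=(1, 500))   # gaussian_kde expects shape (d, N)

P = gaussian_kde(X)                   # kernel density estimate of P

def A(x_q):
    # surprisal -log P(x_q): rare points get large scores
    return float(-np.log(P(np.atleast_2d(x_q))[0]))

print(A(0.0))   # typical point: low surprisal
print(A(6.0))   # point in the tail: high surprisal
```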
• Introduce L sparse random projections Π_l into 1-dimensional space
• Fit a density estimator P_l(Π_l x) in each 1-d space
• Anomaly score: A(x_q) = (1/L) Σ_{l=1}^{L} −log P_l(Π_l x_q)
Example: LODA (Pevný, 2016)
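A simplified LODA-style sketch, with histograms as the 1-d density estimators (all sizes, the sparsity pattern, and the out-of-range handling are assumptions for illustration; the published algorithm differs in details):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(1000, 10))   # nominal data, d = 10

L, d = 50, X.shape[1]
# Sparse random projections: each uses only 3 nonzero Gaussian weights
projs = np.zeros((L, d))
for l in range(L):
    idx = rng.choice(d, size=3, replace=False)
    projs[l, idx] = rng.normal(size=3)

# One histogram density estimator per 1-d projection
hists = []
for l in range(L):
    z = X @ projs[l]
    counts, edges = np.histogram(z, bins=30, density=True)
    hists.append((counts, edges))

def A(x_q, eps=1e-9):
    """Mean surprisal -log P_l across the L one-dimensional histograms."""
    total = 0.0
    for l, (counts, edges) in enumerate(hists):
        z = x_q @ projs[l]
        # clip out-of-range points into the nearest edge bin
        b = np.clip(np.searchsorted(edges, z) - 1, 0, len(counts) - 1)
        total += -np.log(counts[b] + eps)
    return total / L

print(A(np.zeros(10)))      # typical point: low score
print(A(np.full(10, 8.0)))  # extreme point: high score
```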
• Vapnik's principle: we only need to estimate the "decision boundary" between nominal and anomalous
• Surround the data with a function f that captures 1 − ε of the training data
• One-Class Support Vector Machine (OCSVM)
  • f is a hyperplane in "kernel space"
• Support Vector Data Description (SVDD)
  • f is a sphere in "kernel space"
• Issue: ε must be chosen at learning time rather than at run time
Approach 3: Quantile Methods
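A minimal OCSVM sketch with scikit-learn (data and the `nu` value are invented; `nu` plays the role of ε, bounding the fraction of training points allowed outside the surface):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X = rng.normal(0, 1, size=(300, 2))

# nu ~ epsilon: chosen at learning time, as the slide notes
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)

# decision_function is f(x): >= 0 inside the learned region, < 0 outside.
# Anomaly score A(x) = -f(x), as on the quantile-methods slide.
A_in  = -ocsvm.decision_function([[0.0, 0.0]])[0]
A_out = -ocsvm.decision_function([[6.0, 6.0]])[0]
print(A_in, A_out)   # A_out > A_in
```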
• NavLab self-driving van (Pomerleau, 1992)
• Primary head: predict the steering angle from the input image
• Secondary head: predict the input image (an "auto-encoder")
• Anomaly score: A(x_q) = ‖x_q − x̂_q‖
• If the reconstruction is poor, this suggests that the steering angle should not be trusted
• Principle: Anomaly Detection through Failure
  • Define a task on which the learned system should fail for anomalies
Approach 4: Reconstruction Methods
Pomerleau, NIPS 1992
• NASA Mars Science Laboratory ChemCam instrument
  • Collects 6144 spectral bands on rock samples from 7 m distance using laser stimulation
• Goal: active learning to find interesting spectra
• DEMUD (Wagstaff, et al., 2013)
  • Incremental PCA applied to samples one at a time
  • Fit only to the samples labeled "uninteresting" by the user
  • Show the user the most un-uninteresting sample (the sample with the highest PCA reconstruction error)
  • Rapidly discovers interesting samples
Application: Finding Unusual Chemical Spectra
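DEMUD itself is incremental; this static-PCA sketch only illustrates the reconstruction-error ranking it relies on (the synthetic "uninteresting subspace" data are invented):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# "Uninteresting" samples lie near a 2-d subspace of a 20-d space
basis = rng.normal(size=(2, 20))
X = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 20))

pca = PCA(n_components=2).fit(X)

def recon_error(x):
    """PCA reconstruction error; large for samples off the learned subspace."""
    x_hat = pca.inverse_transform(pca.transform(x.reshape(1, -1)))
    return float(np.linalg.norm(x - x_hat))

boring = X[0]
novel = rng.normal(size=20) * 3.0   # not confined to the subspace
print(recon_error(boring), recon_error(novel))
```

The sample with the largest error is the "most un-uninteresting" one shown to the user next.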
• Distance-Based Methods
• k-NN: Mean distance to 𝑘𝑘-nearest neighbors
• LOF: Local Outlier Factor (Breunig, et al., 2000)
• ABOD: kNN Angle-Based Outlier Detector (Kriegel, et al., 2008)
• IFOR: Isolation Forest (Liu, et al., 2008)
• Density-Based Approaches
• RKDE: Robust Kernel Density Estimation (Kim & Scott, 2008)
• EGMM: Ensemble Gaussian Mixture Model (our group)
• LODA: Lightweight Online Detector of Anomalies (Pevny, 2016)
• Quantile-Based Methods
• OCSVM: One-class SVM (Schoelkopf, et al., 1999)
• SVDD: Support Vector Data Description (Tax & Duin, 2004)
Benchmarking Study [Andrew Emmott, 2015, 2020]
• Select 19 data sets from UC Irvine repository
• Choose one or more classes to be “anomalies”; the rest are “nominals”
• Manipulate
• Relative frequency
• Point difficulty
• Irrelevant features
• Clusteredness
• 20 replicates of each configuration
• Result: 11,888 Non-trivial Benchmark Datasets
Benchmarking Methodology
• Linear ANOVA: log(AUC / (1 − AUC)) ~ rf + pd + cl + ir + pset + algo
  • rf: relative frequency
  • pd: point difficulty
  • cl: normalized clusteredness
  • ir: irrelevant features
  • pset: "parent" set
  • algo: anomaly detection algorithm
• Assess the algo effect while controlling for all other factors
• AUC: area under the ROC curve for the nominal vs. anomaly binary decision
Analysis of Variance
• 19 UCI Datasets
• 9 Leading “feature-based” algorithms
• 11,888 non-trivial benchmark datasets
• Mean AUC effect for “nominal” vs. “anomaly” decisions
• Controlling for
• Parent data set
• Difficulty of individual queries
• Fraction of anomalies
• Irrelevant features
• Clusteredness of anomalies
• Baseline method: Distance to nominal mean (“tmd”)
• Best methods: K-nearest neighbors and Isolation Forest
• Worst methods: Kernel-based OCSVM and SVDD
Benchmarking Study Results
[Bar chart: mean AUC effect (≈0.62–0.78) by algorithm, in order: knn, iforest, egmm, rkde, lof, abod, loda, svdd, tmd, ocsvm]
• Show the top-ranked candidate to the user
• The user labels the candidate
• The label is used to update the anomaly detector
• Two methods:
  • AAD [Das, et al., ICDM 2016]
  • GLAD-OMD (modified version of iForest) [Siddiqui, et al., KDD 2018]
Incorporating User Feedback: Initial Work
[Diagram: Data → Anomaly Detection → Best Candidate → User; the user's yes/no label is fed to Anomaly Analysis and back to update the detector]
User Feedback Yields Big Improvements in
Anomaly Discovery
APT Engagement 3 Results
• K-nearest neighbors in the latent space
• Issue: what distance metric to use?
• Cosine distance is the most popular: d(z_1, z_2) = 1 − (z_1 · z_2) / (‖z_1‖ ‖z_2‖)
Distance-Based Methods
• Mahalanobis Method
  • Fit a joint multivariate Gaussian
  • Each class k has its own mean μ_k
  • All classes share a covariance matrix Σ
• Given a new x, score it by the squared Mahalanobis distance to the nearest class mean:
  A(x) = min_k (x − μ_k)^⊤ Σ^{−1} (x − μ_k)
  (up to constants, −log P(x) grows with this quantity)
Density-Based Methods
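A sketch of the Mahalanobis method on toy latent vectors (the two-class data and dimensions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
# Toy latent vectors for two classes in 4-d, tied covariance by construction
z0 = rng.normal(0, 1, size=(200, 4)) + np.array([3, 0, 0, 0])
z1 = rng.normal(0, 1, size=(200, 4)) + np.array([-3, 0, 0, 0])

mus = [z0.mean(axis=0), z1.mean(axis=0)]        # per-class means
centered = np.vstack([z0 - mus[0], z1 - mus[1]])
Sigma_inv = np.linalg.inv(np.cov(centered.T))   # shared covariance, inverted

def A(x):
    """Squared Mahalanobis distance to the nearest class mean."""
    return min(float((x - mu) @ Sigma_inv @ (x - mu)) for mu in mus)

print(A(np.array([3.0, 0, 0, 0])))    # near class-0 mean: small
print(A(np.array([0.0, 9.0, 0, 0])))  # far from both means: large
```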
• Residual Flow Deep Density Estimator
• (Chen, Behrmann, Duvenaud, et al. NeurIPS 2019)
• Standard Cross-Entropy Supervised Loss
• Claim: This helps focus 𝑃𝑃 𝑥𝑥 on relevant aspects of the images
• Anomaly Score: 𝐴𝐴 𝑥𝑥𝑞𝑞 = − log 𝑃𝑃(𝑥𝑥𝑞𝑞)
Open Hybrid: Classification + Density Estimation
(Zhang, Li, Guo, Guo, ECCV 2020)
• The method is somewhat tricky to work with
• Set 𝑐𝑐 as the mean of a small set of points passed through the untrained network
• No bias weights
• These help prevent “hypersphere collapse”
Quantile Method: Deep SVDD (Ruff, et al. ICML 2018)
• Encoder: z = E(x)
• Decoder: x̂ = D(z)
• Challenge: how to constrain E and D so that the autoencoder fails on anomalies but succeeds on nominal images?
• Autoencoders often learn general-purpose image compression methods
Reconstruction Methods: Deep Autoencoders
[Diagram: x → E → z → D → x̂]
• Garrepalli (2020)
  • Train a classifier to optimize the softmax likelihood (minimize the "cross-entropy loss")
  • The maximum logit score is better than two distance methods:
    • Isolation Forest
    • LOF (a nearest-neighbor method)
Surprise: The Max Logit Score
[Bar chart, "Anomaly Measures on Latent Representations for CIFAR-100" (AUROC): H(y|x) 0.68, Max SoftMax-prob. 0.67, Max BCE-prob 0.63, Max-logit 0.72, IForest 0.51, LOF 0.44]
• Vaze, Han, Vedaldi, Zisserman (2021): "Open-Set Recognition: A Good Closed-Set Classifier is All You Need" (ICLR 2022; arXiv 2110.06207)
• Carefully train a classifier using the latest tricks: standard cross-entropy combined with the following:
  • cosine learning rate schedule
  • learning rate warmup
  • RandAugment augmentations
  • label smoothing
• Anomaly score: max logit, A(x) = −max_k ℓ_k
More Evidence for Max Logit
Protocol from Lawrence Neal et al. (2018)
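The max-logit score itself is one line; a sketch on made-up logit vectors (the numbers are invented for illustration):

```python
import numpy as np

def max_logit_score(logits):
    """Anomaly score A(x) = -max_k logit_k: familiar inputs excite some
    class logit strongly, so -max is low; unfamiliar inputs excite none."""
    return float(-np.max(logits, axis=-1))

familiar = np.array([9.2, 1.1, 0.3])   # one class strongly activated
novel    = np.array([1.4, 1.2, 1.0])   # no class strongly activated
print(max_logit_score(familiar))   # -9.2
print(max_logit_score(novel))      # -1.4
```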
•Novel class difficulty based on
semantic distance
• CUB: Bird species
• Air: Aircraft
• ImageNet
Still More Evidence for Max Logit
• DenseNet with a 384-dimensional latent space
• CIFAR-10: 6 known classes, 4 novel classes
• UMAP visualization: light green = novel classes; darker greens = known classes
• Note that many novel classes stay toward the center of the space; others overlap with known classes
• Training was not required to "pull them out" so that they could be discriminated
How are open set images represented by deep learning?
Alex Guyer
[UMAP plot: 6 known classes, 4 novel classes]
Similar Results from Other Groups
[Tack, et al. NeurIPS 2020] [Vaze, et al. arXiv 2110.06207]
• A convolutional neural network learns "features" that detect image patches relevant to the classification task
• The logit layer weights these features to make the classification decision
• Novel classes activate fewer of these features, so their activation vectors are smaller
• Hypothesis: the network doesn't detect that an elephant is novel because of its trunk and tusks, but because its head doesn't activate known features
The Familiarity Hypothesis
The network doesn't detect novelty; it detects the absence of familiarity
Novel images strongly activate fewer features
• CIFAR-10: 6 known classes; 4 novel classes
• DenseNet (z has 324 dimensions)
• Choose an activation threshold θ and count the number of features whose activation exceeds θ
• OOD images activate fewer features
Evidence: Number of Activated Features
Alex Guyer (unpublished)
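The counting metric is simple; a sketch on toy latent vectors (all numbers invented to mimic the claimed effect, not taken from the study):

```python
import numpy as np

def n_activated(z, theta=0.5):
    """Number of latent features whose activation exceeds theta."""
    return int(np.sum(z > theta))

rng = np.random.default_rng(8)
# Toy latent vectors: familiar inputs have many strong activations,
# novel inputs mostly weak ones
z_known = np.abs(rng.normal(1.0, 0.5, size=324))
z_novel = np.abs(rng.normal(0.2, 0.2, size=324))
print(n_activated(z_known), n_activated(z_novel))
```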
Are the features "on" the object vs. the background?
• Strategy: blur the object and see how the feature activations change
  • Activations that change must be on the object
• Details:
  • PASCAL VOC segmented images
  • Blur the original image (31×31 kernel; sd = 31)
  • Form a composite image where the blurred region replaces the segmented region
Which features are responsible for the drop in activation?
https://www.peko-step.com/en/tool/blur.html
Blurring Examples
Note: this does not remove all object-related information (e.g., the object boundary), so we don't detect all on-object features
• Define the "blurring effect" of feature j on image i: BE(i, j) = z_ij − z̃_ij, where
  • z_ij is the activation of latent feature j on image i
  • z̃_ij is the activation of latent feature j on the blurred version of image i
• "Presence feature": BE(i, j) > 0. Blurring decreases the activity of the feature; its net effect is to measure the presence of one or more image patterns, and its activity is high when those patterns are present
• "Absence feature": BE(i, j) < 0. Blurring increases the activity of the feature; its net effect is to measure the absence of one or more image patterns, and its activity is high when those patterns are absent
Blurring Effect
• On average, the activation of a feature changes when the object (of class k) is blurred:
  OO(j, k) = (1/N_k) Σ_{i : y_i = k} (z_ij − z̃_ij)
• Feature j is a net presence feature for class k if OO(j, k) > 0.02
• Feature j is a net absence feature for class k if OO(j, k) < −0.02
• Otherwise j is net neutral for class k
"On Object" score of feature j for class k
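A sketch of the OO score and the resulting feature labels on made-up activation matrices (the 0.02 threshold is from the slide; everything else is invented):

```python
import numpy as np

rng = np.random.default_rng(9)
N, J = 100, 6           # images of one class, latent features
z  = rng.normal(0.5, 0.1, size=(N, J))  # activations on original images
zt = z.copy()                            # activations on blurred images
zt[:, 0] -= 0.2   # feature 0: drops when the object is blurred -> presence
zt[:, 1] += 0.2   # feature 1: rises when the object is blurred -> absence

OO = (z - zt).mean(axis=0)   # OO(j, k) for this class

def label(oo, thresh=0.02):
    if oo > thresh:
        return "presence"
    if oo < -thresh:
        return "absence"
    return "neutral"

print([label(o) for o in OO])  # ['presence', 'absence', 'neutral', ...]
```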
• Logit score: ℓ_ik = Σ_j w_jk z_ij
• Contribution of feature j in image i to class k:
  • c_ijk = w_jk z_ij (in normal images)
  • c̃_ijk = w_jk z̃_ij (in blurred images)
• Mean contributions over images of class k:
  • c̄_jk = (1/N_k) Σ_{i : y_i = k} c_ijk
  • c̄̃_jk = (1/N_k) Σ_{i : y_i = k} c̃_ijk
Feature Taxonomy

|                  | w_jk > 0          | w_jk < 0          |
| OO(j, k) > 0.02  | positive presence | negative presence |
| OO(j, k) < −0.02 | positive absence  | negative absence  |

Sun & Li: On the Effectiveness of Sparsification for Detecting the Deep Unknowns. arXiv 2111.09805
Mean feature types for class 3
[Scatter plot: on-object index (presence) vs. on-object index (absence), each from 0.00 to 1.00, for positive and negative features; red = presence, blue = absence]
• Blurring reduces the contribution of positive presence features (red dots)
• Blurring reduces the contribution of negative absence features (blue dots)
Zoomed View: Blurring reduces c̄_jk
[Scatter plot: mean unblurred contribution vs. mean blurred contribution; on-object index (presence/absence) from 0.00 to 1.00]
Decomposing the Logit Score: Four Cases
• Positive presence: w_jk > 0 and OO(j, k) > 0
• Positive absence: w_jk > 0 and OO(j, k) < 0
• Negative presence: w_jk < 0 and OO(j, k) > 0
• Negative absence: w_jk < 0 and OO(j, k) < 0
• Note that the Positive Presence features dominate the max logit score
• The Negative Absence and Positive Absence features (purple and blue lines) make a small contribution
• Negative Presence features make no contribution
• Conclusion: decreases in the activations of positive presence features account for most of the max logit score
Decomposing the Novelty Scores
• Red line: trend of the Positive Presence contribution to the max logit score
• Black line: smooth estimate of classification accuracy ("known" vs. "novel")
Decreases in Positive Presence Features Account for Novelty Detection Accuracy
• Blakemore, Colin, and Grahame F. Cooper. "Development of the brain depends on the visual environment." Nature 228 (1970): 477–478.
  • Kittens raised in environments with only horizontal or only vertical lines
  • "They were virtually blind for contours perpendicular to the orientation they had experienced."
• Chomsky: "Poverty of the stimulus"
Can we expect computer vision systems to perceive things they have not been trained on?
Source: Li Yang Ku, https://computervisionblog.wordpress.com/2013/06/01/cats-and-vision-is-vision-acquired-or-innate/
• Advantages of familiarity-based anomaly detection:
  • Easy to implement: the anomaly signal (max logit) can be extracted from the classifier, so no separate anomaly detection model is needed
  • Training on additional, auxiliary classes improves both classification and anomaly detection performance
• Weaknesses of familiarity-based anomaly detection:
  • Partially occluded nominal objects will be flagged as anomalies
  • If an image contains both a novel object and a known object, the novel object will not be detected
  • Adversarial attacks can easily cause false anomalies and missed anomalies
Implications
• Can we learn deep representations that can represent outliers?
• Nonstationarity: as the world changes, the anomaly detection model must also change
• Explanation: users often want explanations of why something is labeled anomalous, in order to provide feedback or take other actions
• Setting alarm thresholds: how can we set a threshold to control the false-alarm and missed-alarm rates?
• Incremental (continual) learning in deep networks: how can we efficiently update a trained neural network to incorporate user feedback?
• Anomaly detection in temporal, spatial, and spatio-temporal data, in video data, etc.
• Anomaly detection at multiple scales
Challenges for Anomaly Detection
• Four Basic Methods
• Distances, densities, density quantiles, and reconstruction
• Distances work best; Isolation Forest is very robust
• Anomaly Detection in Deep Learning
• The four basic methods have been extended to deep learning
• They often do not work well when applied to learned representations
• Classifier Max Logit Score Gives Very Competitive Performance
• Computed as a side effect of standard deep classifiers
• Measures familiarity rather than novelty, which makes it risky in many settings
• Advances in Deep Anomaly Detection Require Learning Better Representations
Shallow and Deep Methods for Anomaly Detection