Concepts and types of anomaly detection, with a step-by-step explanation of how to detect anomalies using the normal distribution and the multivariate normal distribution.
2. Agenda
● What are anomalies?
● Types of anomaly detection
● Normal distribution
● Multivariate normal distribution
● Case study - body fat dataset
● Conclusions
● References
3. What are anomalies?
Concepts and Definitions
● A small number of observations that do not conform to the behavior of the rest of the dataset
● Also known as outliers
● Generally less than 5% of the total population
7. Concepts and Definitions
What are anomalies?
Credit Card Fraud
JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
BRL 200 210 190 180 200 190 210 180 200 180 950 250
USD 12 0 0 1 0 12 0 0 2 2 200 0
EUR 0 0 0 0 0 120 0 0 1 2 5 0
9. Concepts and Definitions
What are anomalies?
Factory Inspection
ID Temperature (Celsius) Rotation(RPM)
100 10.89 10
110 9.78 10
120 45.23 15
130 9.91 10
140 9.23 11
11. Concepts and Definitions
What are anomalies?
Cyber Security
timestamp IP command
2015-01-10 15:05:05 10.10.1.10 open port 80
2015-01-10 15:25:10 10.10.1.10 request content
2015-01-10 15:27:25 10.10.1.10 open port 22
2015-01-10 16:15:36 10.10.1.10 send command as root
12. Concepts and Definitions
Types of Anomaly Detection
● By learning method
○ Supervised
○ Unsupervised
○ Semi-supervised
● By dimensionality
○ Univariate (one dimension)
○ Multivariate (multiple dimensions)
● By characteristic
○ Point
○ Contextual
13. Concepts and Definitions - By learning method
Types of Anomaly Detection
● Supervised Learning
Column A Column B Column C Anomalous (label)
... ... ... FALSE
... ... ... FALSE
... ... ... TRUE
14. Concepts and Definitions - By learning method
Types of Anomaly Detection
● Unsupervised Learning
Column A Column B Column C
... ... ...
... ... ...
... ... ...
15. Concepts and Definitions - By learning method
Types of Anomaly Detection
● Semi-Supervised Learning
Column A Column B Column C Anomalous (label)
... ... ... TRUE
... ... ... TRUE
... ... ... TRUE
16. Concepts and Definitions - By dimensionality
Types of Anomaly Detection
● Univariate
Temperature
121
118
121
104
120
...
26. Algorithms
Normal Distribution
● Probability density function
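The formula itself did not survive in this transcript. For reference, the standard density of a normal distribution with mean \mu and standard deviation \sigma (symbols chosen here, not taken from the slide) is:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

A low value of f(x) means x is far from the mean relative to \sigma, which is what makes the density usable as an anomaly score.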
Data distribution
● 68.27% = values within 1 sd of the mean
● 95.45% = values within 2 sd of the mean
● 99.73% = values within 3 sd of the mean
Therefore
● 31.73% = values beyond 1 sd from the mean
● 4.55% = values beyond 2 sd from the mean
● 0.27% = values beyond 3 sd from the mean
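The percentages above can be checked with the normal cumulative distribution function. This is a minimal sketch in base R; the variable names are illustrative and not from the slides.

```r
# Share of a normal distribution lying within k standard deviations of the mean.
k <- 1:3
inside  <- pnorm(k) - pnorm(-k)   # P(-k < Z < k) for a standard normal Z
outside <- 1 - inside             # share lying farther than k sd from the mean

coverage <- data.frame(
  sd         = k,
  within_pct = round(100 * inside, 2),
  beyond_pct = round(100 * outside, 2)
)
print(coverage)
```

The two percentage columns should reproduce the figures listed on the slide.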
28. Algorithms
Normal Distribution
● Probability density function
Temp. Mean SD Density
121 120 1.224745 0.2333993
118 120 1.224745 0.08586282
119 120 1.224745 0.2333993
120 120 1.224745 0.325735
104 120 1.224745 2.838368e-38
121 120 1.224745 0.2333993
119 120 1.224745 0.2333993
122 120 1.224745 0.08586282
120 120 1.224745 0.325735
120 120 1.224745 0.325735
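The table can be reproduced with `dnorm`. The slide does not say how Mean and SD were obtained; the sketch below assumes they were estimated from the readings with the 104 value left out, because that is what yields 120 and 1.224745. Variable names are illustrative.

```r
# Temperature readings from the slide.
temp <- c(121, 118, 119, 120, 104, 121, 119, 122, 120, 120)

# Assumption: parameters are estimated without the 104 reading, which
# reproduces the slide's Mean = 120 and SD = 1.224745.
baseline <- temp[temp != 104]
mu    <- mean(baseline)
sigma <- sd(baseline)   # sample standard deviation (n - 1 denominator)

# Normal probability density of every reading under N(mu, sigma^2).
dens <- dnorm(temp, mean = mu, sd = sigma)

result <- data.frame(Temp = temp, Mean = mu, SD = sigma, Density = dens)
print(result)
```

Estimating the parameters with the outlier included would pull the mean down and inflate the standard deviation, so the 104 reading would look far less extreme.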
30. Temp. Mean SD Density Dens < 0.3 Dens < 0.2 Dens < 0.1 Dens < 0.05
121 120 1.224745 0.2333993 T F F F
118 120 1.224745 0.08586282 T T T F
119 120 1.224745 0.2333993 T F F F
120 120 1.224745 0.325735 F F F F
104 120 1.224745 2.838368e-38 T T T T
121 120 1.224745 0.2333993 T F F F
119 120 1.224745 0.2333993 T F F F
122 120 1.224745 0.08586282 T T T F
120 120 1.224745 0.325735 F F F F
120 120 1.224745 0.325735 F F F F
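The T/F columns are the result of comparing each density with a candidate threshold. A minimal sketch in base R, using the density values printed on the slide (the object names are illustrative):

```r
temp <- c(121, 118, 119, 120, 104, 121, 119, 122, 120, 120)
dens <- c(0.2333993, 0.08586282, 0.2333993, 0.325735, 2.838368e-38,
          0.2333993, 0.2333993, 0.08586282, 0.325735, 0.325735)

thresholds <- c(0.3, 0.2, 0.1, 0.05)

# One logical column per threshold: TRUE where the density falls below it.
flags <- sapply(thresholds, function(eps) dens < eps)
colnames(flags) <- paste("Dens <", thresholds)

print(data.frame(Temp = temp, Density = dens, flags, check.names = FALSE))

# Number of readings flagged at each threshold.
print(colSums(flags))
```

Lowering the threshold flags fewer readings; at 0.05 only the 104 reading remains.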
32. Algorithms
Multivariate Normal Distribution
● Point, multivariate and supervised
Temp. Weight Anomaly
121 67 F
118 66 F
119 74 F
120 75 F
104 45 T
121 86 F
119 56 F
122 55 F
120 99 T
120 65 F
34. Temp. Temp. Dens. Weight Weight Dens. Final Dens. (Temp. Dens. x Weight Dens.) Anomaly
121 0.2276141 67 0.0387217449 0.2276141 * 0.0387217449 = 0.008813 F
118 0.09488369 66 0.0381732505 0.09488369 * 0.0381732505 = 0.0036220 F
119 0.2276141 74 0.0327846726 0.2276141 * 0.0327846726 = 0.0074622 F
120 0.3046972 75 0.0308192796 0.3046972 * 0.0308192796 = 0.0093905 F
104 1.139061e-33 45 0.0031441128 1.139061e-33 * 0.0031441128 = 3.58133e-36 T
121 0.2276141 86 0.0083344364 0.2276141 * 0.0083344364 = 0.0018970 F
119 0.2276141 56 0.0196165609 0.2276141 * 0.0196165609 = 0.0044650 F
122 0.09488369 55 0.0174177236 0.09488369 * 0.0174177236 = 0.0016526 F
120 0.3046972 99 0.0004030011 0.3046972 * 0.0004030011 = 0.0001227 T
120 0.3046972 65 0.0372763042 0.3046972 * 0.0372763042 = 0.0113579 F
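Multiplying the per-feature densities is the same as evaluating a multivariate normal whose covariance matrix is diagonal, that is, one that treats the features as independent. The sketch below recomputes the table in base R. It assumes the means and standard deviations were estimated from the rows labelled F only, since that choice matches the densities shown; the slide does not state this, and the object names are illustrative.

```r
obs <- data.frame(
  Temp    = c(121, 118, 119, 120, 104, 121, 119, 122, 120, 120),
  Weight  = c(67, 66, 74, 75, 45, 86, 56, 55, 99, 65),
  Anomaly = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Assumption: parameters are estimated from the non-anomalous rows only.
normal_rows <- obs[!obs$Anomaly, c("Temp", "Weight")]
mu    <- colMeans(normal_rows)
sigma <- sapply(normal_rows, sd)   # sample standard deviation per feature

# Univariate density of each feature, then their product.
obs$Temp.Dens   <- dnorm(obs$Temp,   mean = mu[["Temp"]],   sd = sigma[["Temp"]])
obs$Weight.Dens <- dnorm(obs$Weight, mean = mu[["Weight"]], sd = sigma[["Weight"]])
obs$Final.Dens  <- obs$Temp.Dens * obs$Weight.Dens

print(obs)
```

If the features were correlated, a full covariance matrix would be needed and the simple product would no longer equal the joint density.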
35. Temp. Weight Final Dens. < 0.005 < 0.002 < 0.001 Anomaly
121 67 0.008813 F F F F
118 66 0.0036220 T F F F
119 74 0.0074622 F F F F
120 75 0.0093905 F F F F
104 45 3.58133e-36 T T T T
121 86 0.0018970 T T F F
119 56 0.0044650 T F F F
122 55 0.0016526 T T F F
120 99 0.0001227 T T T T
120 65 0.0113579 F F F F
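Because the data are labelled, each candidate threshold can be scored against the Anomaly column. A minimal sketch in base R using the final densities from the table; the function and column names are illustrative.

```r
final_dens <- c(0.008813, 0.0036220, 0.0074622, 0.0093905, 3.58133e-36,
                0.0018970, 0.0044650, 0.0016526, 0.0001227, 0.0113579)
label <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)

thresholds <- c(0.005, 0.002, 0.001)

# For each threshold, count flagged rows, false alarms and missed anomalies.
evaluation <- do.call(rbind, lapply(thresholds, function(eps) {
  flagged <- final_dens < eps
  data.frame(
    threshold       = eps,
    flagged         = sum(flagged),
    false_positives = sum(flagged & !label),
    false_negatives = sum(!flagged & label)
  )
}))
print(evaluation)
```

Of the three thresholds, only 0.001 flags exactly the two labelled anomalies without any false alarm.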
39. Case Study - Body Fat Dataset
library(dplyr); library(ggplot2)  # dplyr supplies %>%, ggplot2 the plotting functions
data %>% ggplot(aes(x = Height, y = Weight)) + geom_point()
Plot the data
40. Case Study - Body Fat Dataset
Plot a histogram of each feature
hist(data$Height, breaks = 35)
hist(data$Weight, breaks = 30)
Is the distribution similar to a bell-shaped curve?
42. Case Study - Body Fat Dataset
df_prob %>% ggplot(aes(x=Height, y=Weight, col = Dens.Prob)) +
geom_point() +
scale_colour_gradient(low="red", high="blue")
Plot observations colored by probability density
The farther an observation is from normal values, the redder its color. The closer to normal, the bluer it gets.
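The slide that builds `df_prob` is not part of this transcript. One plausible construction, assuming the product-of-densities approach from the earlier slides, is sketched below in base R. The data frame `sim_data` is simulated with arbitrary parameters purely so the code runs; it is not the body fat dataset, and nothing here is confirmed as the author's method.

```r
set.seed(42)

# Simulated stand-in with arbitrary parameters; NOT the real body fat data.
sim_data <- data.frame(
  Height = rnorm(250, mean = 70, sd = 2.6),
  Weight = rnorm(250, mean = 179, sd = 29)
)

# Density of each feature under its own fitted normal, multiplied together
# (this treats Height and Weight as independent, which is an assumption).
df_prob <- transform(
  sim_data,
  Dens.Prob = dnorm(Height, mean = mean(Height), sd = sd(Height)) *
              dnorm(Weight, mean = mean(Weight), sd = sd(Weight))
)

# Rows with the lowest density are the most anomalous candidates.
print(head(df_prob[order(df_prob$Dens.Prob), ]))
```

A data frame with this shape can be passed to the plotting code above unchanged, since it carries the Height, Weight and Dens.Prob columns that code expects.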
44. Case Study - Body Fat Dataset
df_prob %>%
mutate(Anom = Dens.Prob <= 1.247143e-04) %>%
mutate(Height.Mean = mean(Height), Weight.Mean = mean(Weight)) %>%
ggplot(aes(x=Height, y=Weight, col=Anom)) + geom_point() +
geom_point(aes(x=Height.Mean, y = Weight.Mean), col = "black")
Plot the most anomalous observations in a different color (the black point marks the mean Height and Weight)
45. Conclusions
● Anomalies are a small group of observations that behave differently from what is considered normal
● There are many applications for anomaly detection
● There are several types of anomalies
● The normal distribution can be used to detect anomalies, provided that the features follow a bell-shaped distribution
● The multivariate normal distribution can be used when more than one feature must be considered
46. References
1. Anomaly Detection: A Tutorial - Arindam Banerjee, Varun Chandola, Vipin Kumar, Jaideep Srivastava, Aleksandar Lazarevic
a. https://www.siam.org/meetings/sdm08/TS2.ppt
2. Anomaly Detection
a. http://www.holehouse.org/mlclass/15_Anomaly_Detection.html
3. Anomaly Detection - Machine Learning Class Notes - Lecture 16: Anomaly Detection
a. http://dnene.bitbucket.org/docs/mlclass-notes/lecture16.html
4. Normal Distribution
a. https://en.wikipedia.org/wiki/Normal_distribution