SlideShare a Scribd company logo
1 of 67
Download to read offline
Outlier Selection and
One-Class Classification
.

Jeroen Janssens @jeroenhjanssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Overview

•

Anomalies and outliers

•

Stochastic Outlier Selection

•

Experiments and results

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Anomalies and outliers

Outlier Selection and One-Class Classification

Jeroen Janssens
.
.
.
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Definition (Anomaly)
An anomaly is an observation or event that deviates qualitatively from what is
considered to be normal, according to a domain expert.

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Detecting anomalies is important

•

Expensive

•

Dangerous

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Detecting anomalies is important

•

Expensive

•

Dangerous

•

Mess up your model

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Human anomaly detection may suffer from

•

Fatigue

•

Information overload

•

Emotional bias

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Computers work with numbers
Observations

.

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Computers work with numbers
Observations

Data points
width height label
8.4
7.3
apple
6.7
7.1
orange
8.0
6.8
apple
7.4
7.2
apple
9.6
9.2
orange
…
…
…

Outlier Selection and One-Class Classification

.

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Computers work with numbers
Data points
width height label
8.4
7.3
apple
6.7
7.1
orange
8.0
6.8
apple
7.4
7.2
apple
9.6
9.2
orange
…
…
…

Outlier Selection and One-Class Classification

Visualisation
10
8

.

6

6
. .apple
. .orange
4
8
10
width

height

Observations

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

From anomaly to outlier

rate of turn

. .anomalous vessel
. .normal vessel

.
.

Outlier Selection and One-Class Classification

speed over ground

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Definition (Outlier)
An outlier is a data point that deviates quantitatively from the majority of the
data points, according to an outlier-selection algorithm.

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Three standard deviations
. . outlier
. . inlier

3σ

.

−3σ

Outlier Selection and One-Class Classification

−2σ

−1σ

μ
x

1σ

2σ

3σ

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Euler diagram

Set legend
real-world observations

(a)

Experiments and results
. . . . . . . . . . .

Set notation

X

.
X

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Euler diagram

Set legend

Experiments and results
. . . . . . . . . . .

Set notation

real-world observations

labeled by expert as anomalous
labeled by expert as normal

A
X ∩A

.
X

(a)

(b)

X

A

Outlier Selection and One-Class Classification

X

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Euler diagram

Set legend

Experiments and results
. . . . . . . . . . .

Set notation

real-world observations

A

X
X

(c)

labeled by expert as anomalous
labeled by expert as normal

A
X ∩A

data points
unrecorded

D
X ∩D

.
X

(a)

(b)

X

A

Outlier Selection and One-Class Classification

D

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Euler diagram

Set legend

X
(d)

A

.

Outlier Selection and One-Class Classification

D

anomalies represented as data points
normalities represented as data points

Experiments and results
. . . . . . . . . . .

Set notation

CA = D ∩ A
CN = D ∩ A

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Euler diagram

Set legend

X
(d)

A

.

D

X
(e)

A

CO D

Outlier Selection and One-Class Classification

Experiments and results
. . . . . . . . . . .

Set notation

anomalies represented as data points
normalities represented as data points

CA = D ∩ A
CN = D ∩ A

classified by algorithm as an outlier
classified by algorithm as an inlier

CO
CI = D ∩ CO

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Euler diagram

Set legend

X
(d)

A

.

D

X
(e)

CO D

A
X

(f)

A

CO D

Outlier Selection and One-Class Classification

Experiments and results
. . . . . . . . . . .

Set notation

anomalies represented as data points
normalities represented as data points

CA = D ∩ A
CN = D ∩ A

classified by algorithm as an outlier
classified by algorithm as an inlier

CO
CI = D ∩ CO

hits
false alarms
misses
correct rejects

H = CA ∩ CO
FA = CN ∩ CO
M= CA ∩ CI
CR = CN ∩ CI
Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Confusion matrix
Expert labels the observation as a(n)

Outlier Selection and One-Class Classification

Outlier (CO )

Normality (CN )

hit
.
Hi

false alarm
.
FA

.
Inlier (CI )

Algorithm classi es the data point as an

Anomaly (CA )

miss
.
Mi

correct. reject
CR

Jeroen Janssens
Labels by the expert

Data set

. .anomaly .normality
.

. .data point

.

.

.

.
Classi cations by the algorithm
. .inlier .outlier
.

.

Outcome
. .hit .false alarm .miss .correct reject
.
.
.

.
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Stochastic Outlier Selection

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Stochastic Outlier Selection
•

Unsupervised outlier selection algorithm

•

Employs concept of affinity

•

Computes outlier probabilities

•

One parameter: perplexity

•

Inspired by t-SNE

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

A data point is selected as an outlier when all the other data points
have insufficient affinity with it.

Outlier Selection and One-Class Classification

Jeroen Janssens
.
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

From input to output
Subsections:

4.2.1

4.2.2

n

X
1
2
3
4
5
6

.1

4.2.3

D
2

8
6
4
2

.

.

0

m

Outlier Selection and One-Class Classification

1
2
3
4
5
6

.1

A

2 3 4 5 6

8
6
4
2

.
n

4.2.4 – 6

0

1
2
3
4
5
6

.1

B

2 3 4 5 6

.
n

1
0.8
0.6
0.4
0.2
0

1
2
3
4
5
6

.1

2 3 4 5 6

0.3
0.2
0.1

.
n

0

1
2
3
4
5
6

Φ
.1

.

1
0.8
0.6
0.4
0.2
0

1

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Demo: SOS on the
command-line
(see http://sos.jeroenjanssens.com)

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

From feature matrix to dissimilarity matrix
second feature x2

4

.

3

x4

0.
0

= 5.056
d2,6 ≡ d6,2

x5

Outlier Selection and One-Class Classification

1
2
3
4
5
6

x6

x2

x1

1

X

x3

2
1

Equation 2.1

.
data point

2

3

4
5
6
rst feature x1

7

8

9

.1

D
2

8
6
4
2

.

0

1
2
3
4
5
6

.1

2 3 4 5 6

8
6
4
2

.

0

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

From feature matrix to dissimilarity matrix
second feature x2

4

.

3

x4

0.
0

= 5.056
d2,6 ≡ d6,2

x5

2

3

4
5
6
rst feature x1

dij =
Outlier Selection and One-Class Classification

1
2
3
4
5
6

x6

x2

x1

1

X

x3

2
1

Equation 2.1

.
data point

7
m

8

9

.1

D
2

8
6
4
2

.

0

1
2
3
4
5
6

.1

2 3 4 5 6

8
6
4
2

.

0

2

∑ (xjk − xik )
k =1

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Affinity between data points
.
.
.

affinity aij

1
0.8
0.6

. σ 2 = 0 .1
i
.σ 2 = 1
i
. σ 2 = 10
i

Equation 4.2
D
1
2
3
4
5
6

0.4
0.2
0.
0

2

Outlier Selection and One-Class Classification

4
6
dissimilarity dij

8

10

.1

A

2 3 4 5 6

8
6
4
2

.

0

1
2
3
4
5
6

.1

2 3 4 5 6

.

1
0.8
0.6
0.4
0.2
0

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Affinity between data points
.
.
.

affinity aij

1
0.8
0.6

. σ 2 = 0 .1
i
.σ 2 = 1
i
. σ 2 = 10
i

Equation 4.2
D
1
2
3
4
5
6

0.4
0.2
0.
0

2

4
6
dissimilarity dij

8

.1

10

⎧
⎪ exp (−dij2 / 2σi2 )
⎪
aij = ⎨
⎪0
⎪
⎩

Outlier Selection and One-Class Classification

A

2 3 4 5 6

8
6
4
2

.

0

1
2
3
4
5
6

.1

2 3 4 5 6

.

1
0.8
0.6
0.4
0.2
0

if i ≠ j
if i = j
Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Smooth neighborhoods
6

.
.
.

second feature x2

5

.σ 2 .
1
.σ 2 .
3
.σ 2 .
5

4
3

x4

6

x3
σ 62

x5

2
1

.σ 2
2
.σ 2
4
.σ 2

x6
x2

x1

0

−1 .
−1
Outlier Selection and One-Class Classification

0

1

2

4
5
3
rst feature x1

6

7

8

9
Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

From affinity to binding probability
Equation 4.5 / 4.6
A

.

Outlier Selection and One-Class Classification

1
2
3
4
5
6

.1

B

2 3 4 5 6

.

1
0.8
0.6
0.4
0.2
0

1
2
3
4
5
6

.1

2 3 4 5 6

0.3
0.2
0.1

.

0

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

From affinity to binding probability
Equation 4.5 / 4.6
A

.

1
2
3
4
5
6

.1

B

2 3 4 5 6

.

1
0.8
0.6
0.4
0.2
0

1
2
3
4
5
6

.1

2 3 4 5 6

0.3
0.2
0.1

.

0

bij = p (i → j ∈ EG ) ∝ aij
aij
bij = n
∑k=1 aik
Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Binding probabilities
B

v3

v4
.21

.

(a)

.25

v5

v6

.25

v.1

.24

v2

v3

v4
(b)
Outlier Selection and One-Class Classification

v5

.04

1
2
3
4
5
6

.1

2 3 4 5 6

.

0.25
0.2
0.15
0.1
0.05
0

B

.10
.29
.21

v6

1
2
3
4
5

.1

2 3 4 5 6

0.3
0.2

.

0.1
Jeroen Janssens
v3

v4

Anomalies and outliers
. . . . . . . . . . . .

.21

(a)

.25

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

v5

Binding probabilities
v

v6

.25

.24

v.1

2

v3

v4

.

.04

.29

v6

.21

v2

v1

.1

2 3 4 5 6

Experiments and results
. .0.25. . . . . . .
. .

.

0.2
0.15
0.1
0.05
0

B

.10

v5

(b)

1
2
3
4
5
6

.30

1
2
3
4
5
6

.1

2 3 4 5 6

0.3
0.2
0.1

.

.

0

.10

B

v3

v4
(c)

v5

.24
.22

Outlier Selection and One-Class Classification

.18

v6

1
2
3
4
5

.1

2 3 4 5 6

0.25
0.2
0.15
0.1
0.05Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

v3

v4

.10

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

.29

v5

(b)

Binding probabilities
v

.30

2

v1

v6

.21

1
2
3
4
5
6

.1

2 3 4 5 6

Experiments and results
. .0.3 . . . . . . .
. .

0.2
0.1

.

.

0

.

0.25
0.2
0.15
0.1
0.05
0

.10

B

v3

v4

.

(c)

v5

.24
.22

v1

v6

.18

v2

1
2
3
4
5
6

.1

2 3 4 5 6

.23
.10
.04

(d)

B

v3

v4
v5

.05
.04

v6
.05

Outlier Selection and One-Class Classification

1
2
3
4
5

.1

2 3 4 5 .6

0.05
0.04
0.03
0.02
0.01Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

v3

v4

(c)

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

v5

.24

v6

Binding probabilities
v
.22

v1

.18

2

1
2
3
4
5
6

.1

2 3 4 5 6

Experiments and results
. .0.25. . . . . . .
. .

.

0.2
0.15
0.1
0.05
0

.

0.05
0.04
0.03
0.02
0.01
0

.23
.10
.04

.

.05
.04

v5

(d)

B

v3

v4

v6
.05

v1

Outlier Selection and One-Class Classification

v2

.04

1
2
3
4
5
6

.1

2 3 4 5 .6

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Stochastic Neighbor Graph
G = (V, EG )

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Stochastic Neighbor Graph
G = (V, EG )
p(G) = ∏ bij .
i→j ∈ EG

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Stochastic Neighbor Graph
G = (V, EG )
p(G) = ∏ bij .
i→j ∈ EG

CO ∣ G = {xi ∈ X ∣ deg− (vi ) = 0}
G
= {xi ∈ X ∣ ∄vj ∈ V ∶ j → i ∈ EG }
= {xi ∈ X ∣ ∀vj ∈ V ∶ j → i ∉ EG }
Outlier Selection and One-Class Classification

Jeroen Janssens
v3

v4

p(Ga ) = 3.931 ⋅ 10−4

v5

(Ga )

v6

v3

v4

p(Gb ) = 4.562 ⋅ 10−5

v5

(Gb )

v6

v3

v4

p(Gc ) = 5.950 ⋅ 10−7

v5

v1

CO ∣Gb = {x5 , x6 }

v2

v1

(Gc )

CO ∣Ga = {x1 , x4 , x6 }

v2

v.1

v6
v2

CO ∣Gc = {x1 , x3 }
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Set of all SNGs
⋅10−4

8
6

Ga

4
2

Gb

Gc

0.

Outlier Selection and One-Class Classification

G

cumulative probability mass

probability mass

10

1

.
.
.

0.8

. h = 4.0
. h = 4.5
. h = 5.0

0.6
0.4
0.2
0.

G

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Approximating outlier probabilities by sampling SNGs
1

.
.
.
.
.
.

outlier probability

0.8
0.6

. x1
. x2
. x3
. x4
. x5
. x6

0.4
0.2
0 .0
10

101

102

p(xi ∈ CO ) = lim

S→∞

Outlier Selection and One-Class Classification

1
S

103
sampling iteration

S

104

∑ I{xi ∈ CO ∣ G(s) } ,
s=1

105

106

G(s) ∼ P(G)
Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Demo: Sampling SNGs in
CoffeeScript and D3
(see http://sos.jeroenjanssens.com)

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Computing outlier probabilities through marginalisation

p(xi ∈ CO ) = ∑ I{xi ∈ CO ∣ G} ⋅ p(G)
G∈G

= ∑ I{xi ∈ CO ∣ G} ⋅ ∏ bqr .
G∈G

q→r ∈ EG

∣G∣ = (n − 1)n

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Computing outlier probabilities in closed form
4

B
1
2
3
4
5
6

.1 2 3 4 5 6

0.3
0.2
0.1

.

0

1
2
3
4
5
6

Φ
.1

.

1
0.8
0.6
0.4
0.2
0

second feature x2

Equation 4.14

.

3

x4

x3
x5

2
1
0.
0

.
data point

x6
x2

x1

1

2

3

4
5
6
rst feature x1

7

8

9

p(xi ∈ CO ) = ∏ (1 − bji )
j≠ i

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

xi ∈ CO ∣ G ⇐⇒ deg− (vi ) = 0
G

Experiments and results
. . . . . . . . . . .

(1)

p(xi ∈ CO ) = EG [I{deg− (vi ) = 0}]
G

(2)

p(xi ∈ CO ) = EG [∏ I{ j → i ∉ EG }]

(3)

p(xi ∈ CO ) = EG [∏ (1 − I{ j → i ∈ EG })]

(4)

p(xi ∈ CO ) = ∏ (1 − EG [I{ j → i ∈ EG }])

(5)

p(xi ∈ CO ) = ∏ (1 − p( j → i ∈ EG ))

(6)

p(xi ∈ CO ) = ∏ (1 − bji )

(7)

j≠ i

j≠ i

j≠ i

j≠ i

j≠ i

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Selecting outliers
5

. . outlier
. . inlier

second feature x2

4
3

x4

x3
x5

2
1

x6
x2

x1

0

−1 .
−1

0

1

2

⎧
⎪ outlier
⎪
f(x ) = ⎨
⎪ inlier
⎪
⎩
Outlier Selection and One-Class Classification

4
5
3
rst feature x1

6

7

8

9

if p(x ∈ CO ) > θ,
if p(x ∈ CO ) ≤ θ.
Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Adaptive variances via the perplexity parameter
.
.
.
.
.
.
.

perplexity h(bi )

5
4
3

. h = 4.5
. x1
. x2
. x3
. x4
. x5
. x6

2
1
.

10−1

.
100
variance

.
101
σ i2

102

5

15

10
variance

σ i2

n

h(bi ) = 2H(bi ) , H(bi ) = − ∑ bij log2 (bij )
j= 1
j≠ i

Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Continuous binary search
.
.
.
.
.
.

variance σ i2

15
10

perplexity h(bi )

5

. x1
. x2
. x3
. x4
. x5
. x6

.
4.8 .
4.6
4.4
4.2
0

Outlier Selection and One-Class Classification

1

2

3

4
5
7
6
binary search iteration

8

9

10

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Perplexity influences outlier probabilities
1

.
.
.
.
.
.

outlier probability

0.8
0.6

. x1
. x2
. x3
. x4
. x5
. x6

0.4
0.2
0.
0

Outlier Selection and One-Class Classification

1

2

4
3
perplexity h

5

6

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Experiments and results

Outlier Selection and One-Class Classification

Jeroen Janssens
SOS (h = 10)

KNNDD (k = 10)

LOF (k = 10)

LOCI

LSOD

A

B

Densities

Ring

Banana

.

.
.
.

.
.
.H

.

J .

.

.
.
.

.
.
.G

.

I

.
.
.

C

.

D

.

.

.

F .
.
.

.

E
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

DM = Iris ower data set

One-class datasets
Petal width

3

D1

.
. .CA = Versicolor ∪ Virginica
. .CN = Setosa
Outlier Selection and One-Class Classification

. . C1 = Setosa
. . C2 = Versicolor
. . C3 = Virginica

2
1
0.
4

.

Experiments and results
. . . . . . . . . . .

5

7
6
Sepal length

8

D3

D2

.
.
. .CA = Setosa ∪ Virginica
. .CN = Versicolor

.
. .CA = Setosa ∪ Versicolor
. .CN = Virginica
Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Weighted AUC
1

1

. . hit . false alarm . miss . correct reject
.
.
.

′

1

θ = 0.57

0.8

0.8

θ

0.5

′

0.6
0.6
0.4

CI

0.

hit rate

.

outlier score

CO

CA

Outlier Selection and One-Class Classification

CN

0.4
0.2
0.

0

0.2

0.4
0.6
false alarm rate

0.8

.
1

0.2

Jeroen Janssens
Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

. LSOD

. . . ..

...
.

.

. . . ..

..
.. .

.. .
.

. .

. . ..
.

.

. .. .

.

. .

.

. ..

. ..

. ..

..

.. .

.

.

. ..

.

..

.
.

.

.

..

0.5
.

Eco Bre Del Wa Win Co Bio Veh Gla Bos He Arrh Ha Bre Liv SPE Hep
b
e
ast ft P ve
t a
M
a
s
e lo
li
C
a
W. um form Recon Ge ed icle S s Iden on Hort Dis ythm erma st W. r Diso TF H titis
ilho ti
Ori p
usi ease ia n’s S New rde ear
Ge gnit ne
gin
uet cat ng
rs t
ur v
ner ion
al
tes ion
iva
ato
l
r

Outlier Selection and One-Class Classification

. .

..

.

.

Iris

.

0.

. LOCI .

.
.
.. .

.

.

..

.

0.6

. LOF .

.

.
.. .

.. . .
.

.. . .
.

. . ..
.

... . .

.
...

.. . .

. ..
.
.. . .

0.7

Experiments and results
. . . . . . . . . . .

. KNNDD .

.

weighted AUC

. SOS .

.
. ..

.

0.8

.

. .. . .

.. .

0.9

.

. .. .

1

....

Real-world datasets

. .. ..

Anomalies and outliers
. . . . . . . . . . . .

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Synthetic datasets
Parameter λ
Data set

Determines

(a)
(b)
(c)
(d)
(e)
(f)
(g)

Radius of ring
Cardinality of cluster and ring
Distance between clusters
Cardinality of one cluster and ring
Density of one cluster and ring
Radius of square ring
Radius of square ring

Outlier Selection and One-Class Classification

λ start

step size

λ end

5
100
4
0
1
2
2

0.1
5
0.1
5
0.05
0.05
0.05

2
5
0
45
0
0.8
0.8

Jeroen Janssens
(a)

(b)

(c)

(d)

(e)

(f)

(g)

.

.

.

.

.

.

.

λ inter

. .

.

.
.
.
.
.
.

λ start

. .

.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.

.

.

.
.
.
.
.
.

. .
. .
.
.
.
.

.

.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.

.

.
.
.
.
.
.

.
.
.
.
.
.

.

.

. .
. .
.
.
.
.

. .
. .
.
.
.
.

. .
. .
.
.
.
.

. .
. .
.
.
.
.

.

λ end
.
.
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Results on synthetic datasets
(a)

(b)

1
AUC

0.9
0.8
0.7
0.6 .
4

.

3.5

3
λ

2.5

2

40

30

20

10

30

40

.

λ

(c)

.

(d)

1
AUC

0.9
0.8
0.7
.

0.6
4

Outlier Selection and One-Class Classification

3

2
λ
(e)

1

0
.

10

20
λ
(f)

Jeroen Janssens
Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

0.9
AUC

Anomalies and outliers
. . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

0.8
0.7

Results on synthetic datasets

.

0.6

4

3

1

2
λ

0
.

10

20

30

40

λ

(e)

(f)

1
AUC

0.9
0.8
0.7
0.6
0 .5

.

.
0.4

0.3

0.2

0.1

1.4

0

λ

1 .2

0.8

1
λ

.

(g)
.
.
.
.
.

1
AUC

0.9
0.8
0.7
0.6
1.4

1.2

1

0.8

. SOS
. KNNDD
. LOF
. LOCI
. LSOD

.

λ
Outlier Selection and One-Class Classification

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

SOS performs significantly better
cd (p < .05)

5
.

4

3

2

1

LOCI
KNNDD

SOS
LOF
LSOD
cd (p < .01)

5
LOCI
LSOD
KNNDD
Outlier Selection and One-Class Classification

4

3

2

1
SOS
LOF

Jeroen Janssens
Anomalies and outliers
. . . . . . . . . . . .

Stochastic Outlier Selection
. . . . . . . . . . . . . . . . . . . . . . .

Experiments and results
. . . . . . . . . . .

Conclusion

•

Outlier selection can support the detection of anomalies

•

SOS is an intuitive and probabilistic algorithm to select outliers

•

SOS has a very good performance

•

No free lunch

Outlier Selection and One-Class Classification

Jeroen Janssens
Outlier Selection and
One-Class Classification
.

Jeroen Janssens @jeroenhjanssens

More Related Content

Similar to Outlier Selection and One-Class Classification Explained

RFP_2016_Zhenjie_CEN
RFP_2016_Zhenjie_CENRFP_2016_Zhenjie_CEN
RFP_2016_Zhenjie_CENZhenjie Cen
 
Respiration and the human transport system homework project
Respiration and the human transport system homework projectRespiration and the human transport system homework project
Respiration and the human transport system homework projectMrs Parker
 
M1 - Photoconductive Emitters
M1 - Photoconductive EmittersM1 - Photoconductive Emitters
M1 - Photoconductive EmittersThanh-Quy Nguyen
 
Final_project_watermarked
Final_project_watermarkedFinal_project_watermarked
Final_project_watermarkedNorbert Naskov
 
Solutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George CasellaSolutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George CasellaJamesR0510
 
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...Artur Filipowicz
 
Manual de tecnicas de bioestadística basica
Manual de tecnicas de bioestadística basica Manual de tecnicas de bioestadística basica
Manual de tecnicas de bioestadística basica KristemKertzeif1
 
(The R Series) Dick J. Brus - Spatial Sampling with R-CRC Press_Chapman & Hal...
(The R Series) Dick J. Brus - Spatial Sampling with R-CRC Press_Chapman & Hal...(The R Series) Dick J. Brus - Spatial Sampling with R-CRC Press_Chapman & Hal...
(The R Series) Dick J. Brus - Spatial Sampling with R-CRC Press_Chapman & Hal...Rafa Fonseca
 
2009 ESC Guideline of Syncope
2009 ESC Guideline of Syncope2009 ESC Guideline of Syncope
2009 ESC Guideline of Syncopeksmalechu
 
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)Denis Zuev
 

Similar to Outlier Selection and One-Class Classification Explained (20)

outiar.pdf
outiar.pdfoutiar.pdf
outiar.pdf
 
HonsTokelo
HonsTokeloHonsTokelo
HonsTokelo
 
RFP_2016_Zhenjie_CEN
RFP_2016_Zhenjie_CENRFP_2016_Zhenjie_CEN
RFP_2016_Zhenjie_CEN
 
Respiration and the human transport system homework project
Respiration and the human transport system homework projectRespiration and the human transport system homework project
Respiration and the human transport system homework project
 
M1 - Photoconductive Emitters
M1 - Photoconductive EmittersM1 - Photoconductive Emitters
M1 - Photoconductive Emitters
 
Final_project_watermarked
Final_project_watermarkedFinal_project_watermarked
Final_project_watermarked
 
Solutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George CasellaSolutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George Casella
 
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
 
Manual de tecnicas de bioestadística basica
Manual de tecnicas de bioestadística basica Manual de tecnicas de bioestadística basica
Manual de tecnicas de bioestadística basica
 
edu
eduedu
edu
 
thesis_Eryk_Kulikowski
thesis_Eryk_Kulikowskithesis_Eryk_Kulikowski
thesis_Eryk_Kulikowski
 
main
mainmain
main
 
(The R Series) Dick J. Brus - Spatial Sampling with R-CRC Press_Chapman & Hal...
(The R Series) Dick J. Brus - Spatial Sampling with R-CRC Press_Chapman & Hal...(The R Series) Dick J. Brus - Spatial Sampling with R-CRC Press_Chapman & Hal...
(The R Series) Dick J. Brus - Spatial Sampling with R-CRC Press_Chapman & Hal...
 
edc_adaptivity
edc_adaptivityedc_adaptivity
edc_adaptivity
 
2009 ESC Guideline of Syncope
2009 ESC Guideline of Syncope2009 ESC Guideline of Syncope
2009 ESC Guideline of Syncope
 
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
 
Thesis Abstract
Thesis AbstractThesis Abstract
Thesis Abstract
 
M.Sc thesis
M.Sc thesisM.Sc thesis
M.Sc thesis
 
Isl1408681688437
Isl1408681688437Isl1408681688437
Isl1408681688437
 
thesis
thesisthesis
thesis
 

More from Hakka Labs

Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Hakka Labs
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchHakka Labs
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceHakka Labs
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartHakka Labs
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleHakka Labs
 
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataHakka Labs
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale Hakka Labs
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQHakka Labs
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...Hakka Labs
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...Hakka Labs
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestHakka Labs
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringHakka Labs
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresHakka Labs
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkHakka Labs
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesHakka Labs
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityHakka Labs
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...Hakka Labs
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInHakka Labs
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 

More from Hakka Labs (20)

Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series search
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data Science
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at Instacart
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scale
 
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineering
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data Structures
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with Ourselves
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 

Recently uploaded

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Recently uploaded (20)

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

Outlier Selection and One-Class Classification Explained

  • 1. Outlier Selection and One-Class Classification . Jeroen Janssens @jeroenhjanssens
  • 2. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Overview • Anomalies and outliers • Stochastic Outlier Selection • Experiments and results Outlier Selection and One-Class Classification Jeroen Janssens
  • 3. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Anomalies and outliers Outlier Selection and One-Class Classification Jeroen Janssens
  • 4. .
  • 5. .
  • 6. .
  • 7. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Definition (Anomaly) An anomaly is an observation or event that deviates qualitatively from what is considered to be normal, according to a domain expert. Outlier Selection and One-Class Classification Jeroen Janssens
  • 8. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Detecting anomalies is important • Expensive • Dangerous Outlier Selection and One-Class Classification Jeroen Janssens
  • 9. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Detecting anomalies is important • Expensive • Dangerous • Mess up your model Outlier Selection and One-Class Classification Jeroen Janssens
  • 10. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Human anomaly detection may suffer from • Fatigue • Information overload • Emotional bias Outlier Selection and One-Class Classification Jeroen Janssens
  • 11. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Computers work with numbers Observations . Outlier Selection and One-Class Classification Jeroen Janssens
  • 12. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Computers work with numbers Observations Data points width height label 8.4 7.3 apple 6.7 7.1 orange 8.0 6.8 apple 7.4 7.2 apple 9.6 9.2 orange … … … Outlier Selection and One-Class Classification . Jeroen Janssens
  • 13. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Computers work with numbers Data points width height label 8.4 7.3 apple 6.7 7.1 orange 8.0 6.8 apple 7.4 7.2 apple 9.6 9.2 orange … … … Outlier Selection and One-Class Classification Visualisation 10 8 . 6 6 . .apple . .orange 4 8 10 width height Observations Jeroen Janssens
  • 14. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . From anomaly to outlier rate of turn . .anomalous vessel . .normal vessel . . Outlier Selection and One-Class Classification speed over ground Jeroen Janssens
  • 15. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Definition (Outlier) An outlier is a data point that deviates quantitatively from the majority of the data points, according to an outlier-selection algorithm. Outlier Selection and One-Class Classification Jeroen Janssens
  • 16. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Three standard deviations . . outlier . . inlier 3σ . −3σ Outlier Selection and One-Class Classification −2σ −1σ μ x 1σ 2σ 3σ Jeroen Janssens
  • 17. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Euler diagram Set legend real-world observations (a) Experiments and results . . . . . . . . . . . Set notation X . X Outlier Selection and One-Class Classification Jeroen Janssens
  • 18. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Euler diagram Set legend Experiments and results . . . . . . . . . . . Set notation real-world observations labeled by expert as anomalous labeled by expert as normal A X ∩A . X (a) (b) X A Outlier Selection and One-Class Classification X Jeroen Janssens
  • 19. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Euler diagram Set legend Experiments and results . . . . . . . . . . . Set notation real-world observations A X X (c) labeled by expert as anomalous labeled by expert as normal A X ∩A data points unrecorded D X ∩D . X (a) (b) X A Outlier Selection and One-Class Classification D Jeroen Janssens
  • 20. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Euler diagram Set legend X (d) A . Outlier Selection and One-Class Classification D anomalies represented as data points normalities represented as data points Experiments and results . . . . . . . . . . . Set notation CA = D ∩ A CN = D ∩ A Jeroen Janssens
  • 21. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Euler diagram Set legend X (d) A . D X (e) A CO D Outlier Selection and One-Class Classification Experiments and results . . . . . . . . . . . Set notation anomalies represented as data points normalities represented as data points CA = D ∩ A CN = D ∩ A classified by algorithm as an outlier classified by algorithm as an inlier CO CI = D ∩ CO Jeroen Janssens
  • 22. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Euler diagram Set legend X (d) A . D X (e) CO D A X (f) A CO D Outlier Selection and One-Class Classification Experiments and results . . . . . . . . . . . Set notation anomalies represented as data points normalities represented as data points CA = D ∩ A CN = D ∩ A classified by algorithm as an outlier classified by algorithm as an inlier CO CI = D ∩ CO hits false alarms misses correct rejects H = CA ∩ CO FA = CN ∩ CO M= CA ∩ CI CR = CN ∩ CI Jeroen Janssens
  • 23. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Confusion matrix Expert labels the observation as a(n) Outlier Selection and One-Class Classification Outlier (CO ) Normality (CN ) hit . Hi false alarm . FA . Inlier (CI ) Algorithm classi es the data point as an Anomaly (CA ) miss . Mi correct. reject CR Jeroen Janssens
  • 24. Labels by the expert Data set . .anomaly .normality . . .data point . . . . Classi cations by the algorithm . .inlier .outlier . . Outcome . .hit .false alarm .miss .correct reject . . . .
  • 25. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Stochastic Outlier Selection Outlier Selection and One-Class Classification Jeroen Janssens
  • 26. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Stochastic Outlier Selection • Unsupervised outlier selection algorithm • Employs concept of affinity • Computes outlier probabilities • One parameter: perplexity • Inspired by t-SNE Outlier Selection and One-Class Classification Jeroen Janssens
  • 27. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . A data point is selected as an outlier when all the other data points have insufficient affinity with it. Outlier Selection and One-Class Classification Jeroen Janssens
  • 28. .
  • 29. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . From input to output Subsections: 4.2.1 4.2.2 n X 1 2 3 4 5 6 .1 4.2.3 D 2 8 6 4 2 . . 0 m Outlier Selection and One-Class Classification 1 2 3 4 5 6 .1 A 2 3 4 5 6 8 6 4 2 . n 4.2.4 – 6 0 1 2 3 4 5 6 .1 B 2 3 4 5 6 . n 1 0.8 0.6 0.4 0.2 0 1 2 3 4 5 6 .1 2 3 4 5 6 0.3 0.2 0.1 . n 0 1 2 3 4 5 6 Φ .1 . 1 0.8 0.6 0.4 0.2 0 1 Jeroen Janssens
  • 30. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Demo: SOS on the command-line (see http://sos.jeroenjanssens.com) Outlier Selection and One-Class Classification Jeroen Janssens
  • 31. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . From feature matrix to dissimilarity matrix second feature x2 4 . 3 x4 0. 0 = 5.056 d2,6 ≡ d6,2 x5 Outlier Selection and One-Class Classification 1 2 3 4 5 6 x6 x2 x1 1 X x3 2 1 Equation 2.1 . data point 2 3 4 5 6 rst feature x1 7 8 9 .1 D 2 8 6 4 2 . 0 1 2 3 4 5 6 .1 2 3 4 5 6 8 6 4 2 . 0 Jeroen Janssens
  • 32. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . From feature matrix to dissimilarity matrix second feature x2 4 . 3 x4 0. 0 = 5.056 d2,6 ≡ d6,2 x5 2 3 4 5 6 rst feature x1 dij = Outlier Selection and One-Class Classification 1 2 3 4 5 6 x6 x2 x1 1 X x3 2 1 Equation 2.1 . data point 7 m 8 9 .1 D 2 8 6 4 2 . 0 1 2 3 4 5 6 .1 2 3 4 5 6 8 6 4 2 . 0 2 ∑ (xjk − xik ) k =1 Jeroen Janssens
  • 33. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Affinity between data points . . . affinity aij 1 0.8 0.6 . σ 2 = 0 .1 i .σ 2 = 1 i . σ 2 = 10 i Equation 4.2 D 1 2 3 4 5 6 0.4 0.2 0. 0 2 Outlier Selection and One-Class Classification 4 6 dissimilarity dij 8 10 .1 A 2 3 4 5 6 8 6 4 2 . 0 1 2 3 4 5 6 .1 2 3 4 5 6 . 1 0.8 0.6 0.4 0.2 0 Jeroen Janssens
  • 34. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Affinity between data points . . . affinity aij 1 0.8 0.6 . σ 2 = 0 .1 i .σ 2 = 1 i . σ 2 = 10 i Equation 4.2 D 1 2 3 4 5 6 0.4 0.2 0. 0 2 4 6 dissimilarity dij 8 .1 10 ⎧ ⎪ exp (−dij2 / 2σi2 ) ⎪ aij = ⎨ ⎪0 ⎪ ⎩ Outlier Selection and One-Class Classification A 2 3 4 5 6 8 6 4 2 . 0 1 2 3 4 5 6 .1 2 3 4 5 6 . 1 0.8 0.6 0.4 0.2 0 if i ≠ j if i = j Jeroen Janssens
  • 35. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Smooth neighborhoods 6 . . . second feature x2 5 .σ 2 . 1 .σ 2 . 3 .σ 2 . 5 4 3 x4 6 x3 σ 62 x5 2 1 .σ 2 2 .σ 2 4 .σ 2 x6 x2 x1 0 −1 . −1 Outlier Selection and One-Class Classification 0 1 2 4 5 3 rst feature x1 6 7 8 9 Jeroen Janssens
  • 36. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . From affinity to binding probability Equation 4.5 / 4.6 A . Outlier Selection and One-Class Classification 1 2 3 4 5 6 .1 B 2 3 4 5 6 . 1 0.8 0.6 0.4 0.2 0 1 2 3 4 5 6 .1 2 3 4 5 6 0.3 0.2 0.1 . 0 Jeroen Janssens
  • 37. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . From affinity to binding probability Equation 4.5 / 4.6 A . 1 2 3 4 5 6 .1 B 2 3 4 5 6 . 1 0.8 0.6 0.4 0.2 0 1 2 3 4 5 6 .1 2 3 4 5 6 0.3 0.2 0.1 . 0 bij = p (i → j ∈ EG ) ∝ aij aij bij = n ∑k=1 aik Outlier Selection and One-Class Classification Jeroen Janssens
  • 38. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Binding probabilities B v3 v4 .21 . (a) .25 v5 v6 .25 v.1 .24 v2 v3 v4 (b) Outlier Selection and One-Class Classification v5 .04 1 2 3 4 5 6 .1 2 3 4 5 6 . 0.25 0.2 0.15 0.1 0.05 0 B .10 .29 .21 v6 1 2 3 4 5 .1 2 3 4 5 6 0.3 0.2 . 0.1 Jeroen Janssens
  • 39. v3 v4 Anomalies and outliers . . . . . . . . . . . . .21 (a) .25 Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . v5 Binding probabilities v v6 .25 .24 v.1 2 v3 v4 . .04 .29 v6 .21 v2 v1 .1 2 3 4 5 6 Experiments and results . .0.25. . . . . . . . . . 0.2 0.15 0.1 0.05 0 B .10 v5 (b) 1 2 3 4 5 6 .30 1 2 3 4 5 6 .1 2 3 4 5 6 0.3 0.2 0.1 . . 0 .10 B v3 v4 (c) v5 .24 .22 Outlier Selection and One-Class Classification .18 v6 1 2 3 4 5 .1 2 3 4 5 6 0.25 0.2 0.15 0.1 0.05Jeroen Janssens
  • 40. Anomalies and outliers . . . . . . . . . . . . v3 v4 .10 Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . .29 v5 (b) Binding probabilities v .30 2 v1 v6 .21 1 2 3 4 5 6 .1 2 3 4 5 6 Experiments and results . .0.3 . . . . . . . . . 0.2 0.1 . . 0 . 0.25 0.2 0.15 0.1 0.05 0 .10 B v3 v4 . (c) v5 .24 .22 v1 v6 .18 v2 1 2 3 4 5 6 .1 2 3 4 5 6 .23 .10 .04 (d) B v3 v4 v5 .05 .04 v6 .05 Outlier Selection and One-Class Classification 1 2 3 4 5 .1 2 3 4 5 .6 0.05 0.04 0.03 0.02 0.01Jeroen Janssens
  • 41. Anomalies and outliers . . . . . . . . . . . . v3 v4 (c) Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . v5 .24 v6 Binding probabilities v .22 v1 .18 2 1 2 3 4 5 6 .1 2 3 4 5 6 Experiments and results . .0.25. . . . . . . . . . 0.2 0.15 0.1 0.05 0 . 0.05 0.04 0.03 0.02 0.01 0 .23 .10 .04 . .05 .04 v5 (d) B v3 v4 v6 .05 v1 Outlier Selection and One-Class Classification v2 .04 1 2 3 4 5 6 .1 2 3 4 5 .6 Jeroen Janssens
  • 42. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Stochastic Neighbor Graph G = (V, EG ) Outlier Selection and One-Class Classification Jeroen Janssens
  • 43. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Stochastic Neighbor Graph G = (V, EG ) p(G) = ∏ bij . i→j ∈ EG Outlier Selection and One-Class Classification Jeroen Janssens
  • 44. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Stochastic Neighbor Graph G = (V, EG ) p(G) = ∏ bij . i→j ∈ EG CO ∣ G = {xi ∈ X ∣ deg− (vi ) = 0} G = {xi ∈ X ∣ ∄vj ∈ V ∶ j → i ∈ EG } = {xi ∈ X ∣ ∀vj ∈ V ∶ j → i ∉ EG } Outlier Selection and One-Class Classification Jeroen Janssens
  • 45. v3 v4 p(Ga ) = 3.931 ⋅ 10−4 v5 (Ga ) v6 v3 v4 p(Gb ) = 4.562 ⋅ 10−5 v5 (Gb ) v6 v3 v4 p(Gc ) = 5.950 ⋅ 10−7 v5 v1 CO ∣Gb = {x5 , x6 } v2 v1 (Gc ) CO ∣Ga = {x1 , x4 , x6 } v2 v.1 v6 v2 CO ∣Gc = {x1 , x3 }
  • 46. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Set of all SNGs ⋅10−4 8 6 Ga 4 2 Gb Gc 0. Outlier Selection and One-Class Classification G cumulative probability mass probability mass 10 1 . . . 0.8 . h = 4.0 . h = 4.5 . h = 5.0 0.6 0.4 0.2 0. G Jeroen Janssens
  • 47. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Approximating outlier probabilities by sampling SNGs 1 . . . . . . outlier probability 0.8 0.6 . x1 . x2 . x3 . x4 . x5 . x6 0.4 0.2 0 .0 10 101 102 p(xi ∈ CO ) = lim S→∞ Outlier Selection and One-Class Classification 1 S 103 sampling iteration S 104 ∑ I{xi ∈ CO ∣ G(s) } , s=1 105 106 G(s) ∼ P(G) Jeroen Janssens
  • 48. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Demo: Sampling SNGs in CoffeeScript and D3 (see http://sos.jeroenjanssens.com) Outlier Selection and One-Class Classification Jeroen Janssens
  • 49. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Computing outlier probabilities through marginalisation p(xi ∈ CO ) = ∑ I{xi ∈ CO ∣ G} ⋅ p(G) G∈G = ∑ I{xi ∈ CO ∣ G} ⋅ ∏ bqr . G∈G q→r ∈ EG ∣G∣ = (n − 1)n Outlier Selection and One-Class Classification Jeroen Janssens
  • 50. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Computing outlier probabilities in closed form 4 B 1 2 3 4 5 6 .1 2 3 4 5 6 0.3 0.2 0.1 . 0 1 2 3 4 5 6 Φ .1 . 1 0.8 0.6 0.4 0.2 0 second feature x2 Equation 4.14 . 3 x4 x3 x5 2 1 0. 0 . data point x6 x2 x1 1 2 3 4 5 6 rst feature x1 7 8 9 p(xi ∈ CO ) = ∏ (1 − bji ) j≠ i Outlier Selection and One-Class Classification Jeroen Janssens
  • 51. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . xi ∈ CO ∣ G ⇐⇒ deg− (vi ) = 0 G Experiments and results . . . . . . . . . . . (1) p(xi ∈ CO ) = EG [I{deg− (vi ) = 0}] G (2) p(xi ∈ CO ) = EG [∏ I{ j → i ∉ EG }] (3) p(xi ∈ CO ) = EG [∏ (1 − I{ j → i ∈ EG })] (4) p(xi ∈ CO ) = ∏ (1 − EG [I{ j → i ∈ EG }]) (5) p(xi ∈ CO ) = ∏ (1 − p( j → i ∈ EG )) (6) p(xi ∈ CO ) = ∏ (1 − bji ) (7) j≠ i j≠ i j≠ i j≠ i j≠ i Outlier Selection and One-Class Classification Jeroen Janssens
  • 52. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Selecting outliers 5 . . outlier . . inlier second feature x2 4 3 x4 x3 x5 2 1 x6 x2 x1 0 −1 . −1 0 1 2 ⎧ ⎪ outlier ⎪ f(x ) = ⎨ ⎪ inlier ⎪ ⎩ Outlier Selection and One-Class Classification 4 5 3 rst feature x1 6 7 8 9 if p(x ∈ CO ) > θ, if p(x ∈ CO ) ≤ θ. Jeroen Janssens
  • 53. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Adaptive variances via the perplexity parameter . . . . . . . perplexity h(bi ) 5 4 3 . h = 4.5 . x1 . x2 . x3 . x4 . x5 . x6 2 1 . 10−1 . 100 variance . 101 σ i2 102 5 15 10 variance σ i2 n h(bi ) = 2H(bi ) , H(bi ) = − ∑ bij log2 (bij ) j= 1 j≠ i Outlier Selection and One-Class Classification Jeroen Janssens
  • 54. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Continuous binary search . . . . . . variance σ i2 15 10 perplexity h(bi ) 5 . x1 . x2 . x3 . x4 . x5 . x6 . 4.8 . 4.6 4.4 4.2 0 Outlier Selection and One-Class Classification 1 2 3 4 5 7 6 binary search iteration 8 9 10 Jeroen Janssens
  • 55. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Perplexity influences outlier probabilities 1 . . . . . . outlier probability 0.8 0.6 . x1 . x2 . x3 . x4 . x5 . x6 0.4 0.2 0. 0 Outlier Selection and One-Class Classification 1 2 4 3 perplexity h 5 6 Jeroen Janssens
  • 56. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Experiments and results Outlier Selection and One-Class Classification Jeroen Janssens
  • 57. SOS (h = 10) KNNDD (k = 10) LOF (k = 10) LOCI LSOD A B Densities Ring Banana . . . . . . .H . J . . . . . . . .G . I . . . C . D . . . F . . . . E
  • 58. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . DM = Iris ower data set One-class datasets Petal width 3 D1 . . .CA = Versicolor ∪ Virginica . .CN = Setosa Outlier Selection and One-Class Classification . . C1 = Setosa . . C2 = Versicolor . . C3 = Virginica 2 1 0. 4 . Experiments and results . . . . . . . . . . . 5 7 6 Sepal length 8 D3 D2 . . . .CA = Setosa ∪ Virginica . .CN = Versicolor . . .CA = Setosa ∪ Versicolor . .CN = Virginica Jeroen Janssens
  • 59. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Weighted AUC 1 1 . . hit . false alarm . miss . correct reject . . . ′ 1 θ = 0.57 0.8 0.8 θ 0.5 ′ 0.6 0.6 0.4 CI 0. hit rate . outlier score CO CA Outlier Selection and One-Class Classification CN 0.4 0.2 0. 0 0.2 0.4 0.6 false alarm rate 0.8 . 1 0.2 Jeroen Janssens
  • 60. Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . . LSOD . . . .. ... . . . . . .. .. .. . .. . . . . . . .. . . . .. . . . . . . .. . .. . .. .. .. . . . . .. . .. . . . . .. 0.5 . Eco Bre Del Wa Win Co Bio Veh Gla Bos He Arrh Ha Bre Liv SPE Hep b e ast ft P ve t a M a s e lo li C a W. um form Recon Ge ed icle S s Iden on Hort Dis ythm erma st W. r Diso TF H titis ilho ti Ori p usi ease ia n’s S New rde ear Ge gnit ne gin uet cat ng rs t ur v ner ion al tes ion iva ato l r Outlier Selection and One-Class Classification . . .. . . Iris . 0. . LOCI . . . .. . . . .. . 0.6 . LOF . . . .. . .. . . . .. . . . . . .. . ... . . . ... .. . . . .. . .. . . 0.7 Experiments and results . . . . . . . . . . . . KNNDD . . weighted AUC . SOS . . . .. . 0.8 . . .. . . .. . 0.9 . . .. . 1 .... Real-world datasets . .. .. Anomalies and outliers . . . . . . . . . . . . Jeroen Janssens
  • 61. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Synthetic datasets Parameter λ Data set Determines (a) (b) (c) (d) (e) (f) (g) Radius of ring Cardinality of cluster and ring Distance between clusters Cardinality of one cluster and ring Density of one cluster and ring Radius of square ring Radius of square ring Outlier Selection and One-Class Classification λ start step size λ end 5 100 4 0 1 2 2 0.1 5 0.1 5 0.05 0.05 0.05 2 5 0 45 0 0.8 0.8 Jeroen Janssens
  • 62. (a) (b) (c) (d) (e) (f) (g) . . . . . . . λ inter . . . . . . . . . λ start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . λ end . .
  • 63. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Results on synthetic datasets (a) (b) 1 AUC 0.9 0.8 0.7 0.6 . 4 . 3.5 3 λ 2.5 2 40 30 20 10 30 40 . λ (c) . (d) 1 AUC 0.9 0.8 0.7 . 0.6 4 Outlier Selection and One-Class Classification 3 2 λ (e) 1 0 . 10 20 λ (f) Jeroen Janssens
  • 64. Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . 0.9 AUC Anomalies and outliers . . . . . . . . . . . . Experiments and results . . . . . . . . . . . 0.8 0.7 Results on synthetic datasets . 0.6 4 3 1 2 λ 0 . 10 20 30 40 λ (e) (f) 1 AUC 0.9 0.8 0.7 0.6 0 .5 . . 0.4 0.3 0.2 0.1 1.4 0 λ 1 .2 0.8 1 λ . (g) . . . . . 1 AUC 0.9 0.8 0.7 0.6 1.4 1.2 1 0.8 . SOS . KNNDD . LOF . LOCI . LSOD . λ Outlier Selection and One-Class Classification Jeroen Janssens
  • 65. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . SOS performs significantly better cd (p < .05) 5 . 4 3 2 1 LOCI KNNDD SOS LOF LSOD cd (p < .01) 5 LOCI LSOD KNNDD Outlier Selection and One-Class Classification 4 3 2 1 SOS LOF Jeroen Janssens
  • 66. Anomalies and outliers . . . . . . . . . . . . Stochastic Outlier Selection . . . . . . . . . . . . . . . . . . . . . . . Experiments and results . . . . . . . . . . . Conclusion • Outlier selection can support the detection of anomalies • SOS is an intuitive and probabilistic algorithm to select outliers • SOS has a very good performance • No free lunch Outlier Selection and One-Class Classification Jeroen Janssens
  • 67. Outlier Selection and One-Class Classification . Jeroen Janssens @jeroenhjanssens