Anomaly detection using deep one class classifier

Anomaly Detection using
Deep One-Class Classifier
Proceedings of the 35th International Conference on Machine
Learning, Stockholm, Sweden, PMLR 80, 2018

Anomaly Detection and Localization
Using GAN and One-Class Classifier
Satellite Image Forgery Detection and Localization
Using GAN and One-Class Classifier
https://arxiv.org/abs/1802.04881
Previous
Approach I

Anomaly Detection
• Detect observations that deviate from normal values → one-class classification,
or one-class description
In this approach:
• A generative adversarial network or an auto-encoder maps normal images to
features, and a one-class support vector machine (SVM) then determines their
distribution. A query image is checked for whether it lies within that
distribution.

Problem formulation
• When an unseen or unfamiliar object appears that was not in the training images,
mark its region with a binary mask, as in the figure
[Figure: trained image and its mask; query image with an unfamiliar object and its mask with the object's region marked.]

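Method
[Figure: auto-encoder. The encoder A_e maps X to the feature h; the decoder A_d maps h to the reconstruction \hat{X}.]
$\min_G \max_D V(D, G) = E_{X \sim p_{data}} [\log D(X) + \log(1 - D(G(X)))]$
$\hat{X} = G(X) = A_d(h) = A_d(A_e(X))$
• An auto-encoder extracts the feature (h) from an image and then reconstructs
the image. A GAN is trained on the reconstructed and original images → a slight
performance gain over a plain auto-encoder
• This finds the latent-space distribution of normal images.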

Method
[Figure: cluster of normal-image features; features of abnormal images lie outside it.]
- Feed the query image to the encoder of the trained auto-encoder to compute
its latent vector
- Decide whether the computed latent vector falls inside the cluster of normal images
→ Here a one-class SVM with radial basis functions (Gaussian kernel,
parametric modeling of the cluster) is used
Features from normal patches(i.e., red dots) cluster together, whereas
features from abnormal patches (i.e., blue dots) are more distant.

We solve the problem of classifying nonlinearly separable patterns in a hybrid
manner involving two stages:
• First: Transform a given set of nonlinearly separable patterns into a new
set for which, under certain conditions, the likelihood of the transformed
patterns becoming linearly separable is high.
• Second: the solution of the classification problem is completed by using
Stochastic Gradient Descent.
Non-linear SVM Classifier
using the RBF(Radial-basis function) kernel

We find w and b by solving the following objective function using quadratic
programming.
To define an optimal hyperplane we need to maximize the width of the
margin, 2/‖w‖, which is equivalent to minimizing ‖w‖.
Linear SVM(Support Vector Machines)
Support vector

• The simplest way to separate two groups of data is with a straight line (1
dimension), flat plane (2 dimensions) or an N-dimensional hyperplane.
• However, there are situations where a nonlinear region can separate the
groups more efficiently.
• The kernel function transforms the data into a higher-dimensional feature
space to make a linear separation possible.
Non-Linear SVM(Support Vector Machines)
kernel trick

Map from the input space to a feature space to simplify the classification task.
A non-linear SVM classifier using the RBF (radial basis function) kernel is
adopted.
Non-Linear SVM(Support Vector Machines)
Inner product in feature space (a measure of similarity)

Key Idea of Kernel Methods
$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$

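Normal condition:
$\exp\left\{-\frac{(x_1 - c_1)^2 + (x_2 - c_2)^2}{2\sigma^2}\right\} \ge \text{Threshold}, \quad 0 < \text{Threshold} \ll 1$
Cluster bound:
$(x_1 - c_1)^2 + (x_2 - c_2)^2 \le r^2$
$K_1 + K_2 \le r^2$, where $K_1 = (x_1 - c_1)^2$ and $K_2 = (x_2 - c_2)^2$
[Figure: in the input space (x1, x2) the bound is a circle of radius r centered at (c1, c2); in the space (K1, K2) it is the region below the line K1 + K2 = r², which meets each axis at r².]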
Key Idea of Kernel Methods
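The equivalence between the normal condition (a threshold on the Gaussian response) and the cluster bound (a radius test) can be checked numerically. The sketch below is illustrative only: the center, σ, and threshold values are assumptions, and `radius_squared` simply solves the threshold inequality for r².

```python
import math

def rbf_response(x, c, sigma):
    """Gaussian RBF response exp(-||x - c||^2 / (2 sigma^2))."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def radius_squared(threshold, sigma):
    """r^2 such that response >= threshold  <=>  ||x - c||^2 <= r^2."""
    return -2.0 * sigma ** 2 * math.log(threshold)

# Illustrative values (assumptions, not taken from the slides)
c, sigma, threshold = (1.0, 2.0), 0.5, 0.05
r2 = radius_squared(threshold, sigma)

inside = (1.3, 2.4)    # squared distance to c is 0.25
outside = (3.0, 2.0)   # squared distance to c is 4.0

def squared_distance(x):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
```

Both tests give the same verdict for any point, so thresholding the kernel value is the same as drawing a circle of radius r around the center.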

RBFN architecture
[Figure: RBFN with an input layer (x1, x2, ..., xn; no weights), a hidden layer of RBFs, and an output layer that sums the hidden outputs with weights W1, W2, ..., WM to give f(x).]
Each of the n components of the input vector x feeds forward to M basis
functions, whose outputs are linearly combined with weights w (i.e. a dot
product of the basis-function outputs with w) into the network output f(x).
The output layer performs a simple weighted sum of the hidden-layer outputs.
If the RBFN is used for regression then this output is fine.
However, if pattern classification is required, then a hard-limiter or sigmoid
function could be placed on the output neurons to give 0/1 output values.
Input data set: $X = \{x_1, x_2, \ldots, x_N\}$

RBFN architecture
• For Gaussian basis functions
$s(x_p) = w_0 + \sum_{i=1}^{M} w_i \varphi_i(\|x_p - c_i\|) = w_0 + \sum_{i=1}^{M} w_i \exp\left(-\sum_{j=1}^{n} \frac{(x_{pj} - c_{ij})^2}{2\sigma_{ij}^2}\right)$
• Assume the variances σ across each dimension are equal
$s(x_p) = w_0 + \sum_{i=1}^{M} w_i \exp\left(-\frac{1}{2\sigma_i^2} \sum_{j=1}^{n} (x_{pj} - c_{ij})^2\right)$
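As a minimal sketch of the equal-variance form of the network output, the function below evaluates s(x_p) = w_0 + Σ_i w_i exp(−‖x_p − c_i‖² / (2σ_i²)) for a batch of inputs. The function name and the toy centers, widths, and weights are assumptions chosen for illustration.

```python
import numpy as np

def rbfn_forward(X, centers, sigmas, weights, w0):
    """s(x_p) = w0 + sum_i w_i * exp(-||x_p - c_i||^2 / (2 sigma_i^2))."""
    # Squared distances between every sample and every center: shape (N, M)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))   # hidden-layer outputs
    return w0 + phi @ weights                           # weighted sum at the output

# Toy example: two inputs, two basis functions (illustrative values)
X = np.array([[0.0, 0.0], [1.0, 1.0]])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
sigmas = np.array([1.0, 1.0])
weights = np.array([2.0, -1.0])
out = rbfn_forward(X, centers, sigmas, weights, w0=0.5)
```

The input layer carries no weights; all trainable quantities are the centers, the widths, and the output weights.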

[Figure: RBFN with two summation (Σ) output units, one for Category 1 and one for Category 2.]
RBFN for classification

RBFN Learning
• Design decision
• number of hidden neurons
• max of neurons = number of input patterns
• more neurons – more complex, smaller tolerance
• Parameters to be learnt
• centers
• radii
• A hidden neuron is more sensitive to data points near its center.
This sensitivity may be tuned by adjusting the radius.
• smaller radius → fits training data better (overfitting)
• larger radius → less sensitivity, less overfitting, network of
smaller size, faster execution
• weights between hidden and output layers

The question now is:
How to train the RBF network?
In other words, how to find:
• The number and the parameters of hidden units (the basis functions)
using unlabeled data (unsupervised learning).
  - K-means clustering algorithm
• The weights between the hidden layer and the output layer.
  - Recursive least-squares estimation algorithm
RBFN Learning

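[Figure: RBFN learning pipeline. The input x_p feeds K-means, which yields the centers c_i; a K-nearest-neighbor rule yields the widths σ_i; the basis functions give the matrix A; linear regression yields the weights w.]
RBFN Learning

• Use the K-means algorithm to find the centers c_i
RBFN Learning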

K-means Algorithm
step1: K initial cluster means are chosen randomly from the samples
to form K groups.
step2: Each new sample is added to the group whose mean is
the closest to this sample.
step3: Adjust the mean of the group to take account of the new
points.
step4: Repeat step2 until the distance between the old means
and the new means of all clusters is smaller than a
predefined tolerance.

Outcome: There are K clusters, with means representing
the centroid of each cluster.
Advantages: (1) A fast and simple algorithm.
(2) Reduces the effects of noisy samples.
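The four steps above can be sketched as follows. This is the batch (Lloyd) variant, which reassigns all samples at once rather than one sample at a time; the function name, tolerance, and toy data are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick K distinct samples as the initial means
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # step 2: assign every sample to the closest mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 3: recompute each mean (keep the old one if a group is empty)
        new_means = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        # step 4: stop once no mean has moved more than the tolerance
        shift = np.linalg.norm(new_means - means, axis=1).max()
        means = new_means
        if shift < tol:
            break
    return means, labels

# Two well-separated blobs as toy data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
means, labels = kmeans(X, k=2)
```

The returned means can serve directly as the RBF centers c_i.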

• Use the K-nearest-neighbor rule to find the function
width σ_i
$\sigma_i = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \|c_k - c_i\|^2}$
where c_k is the k-th nearest neighbor of c_i
• The objective is to cover the training points so that a
smooth fit of the training samples can be achieved

RBF learning by gradient descent
• Let
$\varphi_i(x_p) = \exp\left(-\sum_{j=1}^{n} \frac{(x_{pj} - c_{ij})^2}{2\sigma_{ij}^2}\right)$ and $e(x_p) = d(x_p) - s(x_p)$
$E = \frac{1}{2} \sum_{p=1}^{N} e(x_p)^2$
N: number of samples in the batch
Apply $\frac{\partial E}{\partial w_i}$, $\frac{\partial E}{\partial c_{ij}}$, and $\frac{\partial E}{\partial \sigma_{ij}}$;

we have the following update equations
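A sketch of where such updates come from, assuming the network output s(x_p) = w_0 + Σ_i w_i φ_i(x_p), the Gaussian basis φ_i(x_p) = exp(−Σ_j (x_pj − c_ij)² / (2σ_ij²)), the error e(x_p) = d(x_p) − s(x_p), and the cost E = ½ Σ_p e(x_p)². The learning rate η is a symbol introduced here for illustration. Differentiating E by the chain rule gives:

```latex
\begin{aligned}
\frac{\partial E}{\partial w_i} &= -\sum_{p=1}^{N} e(x_p)\,\varphi_i(x_p), \\
\frac{\partial E}{\partial c_{ij}} &= -\sum_{p=1}^{N} e(x_p)\, w_i\, \varphi_i(x_p)\, \frac{x_{pj} - c_{ij}}{\sigma_{ij}^2}, \\
\frac{\partial E}{\partial \sigma_{ij}} &= -\sum_{p=1}^{N} e(x_p)\, w_i\, \varphi_i(x_p)\, \frac{(x_{pj} - c_{ij})^2}{\sigma_{ij}^3}, \\
\theta &\leftarrow \theta - \eta\, \frac{\partial E}{\partial \theta}, \qquad \theta \in \{w_i,\ c_{ij},\ \sigma_{ij}\}.
\end{aligned}
```

Each parameter therefore moves in proportion to the error, weighted by how strongly the corresponding basis function responds to the sample.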
RBF learning by gradient descent

Gaussian Mixture Models and
Expectation-Maximization
Algorithm

Normal Distribution (1D Gaussian)
$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
μ: mean
σ: standard deviation

2D Gaussians
• d = 2
• x = random data point (2D vector)
• μ = mean value (2D vector)
• Σ = covariance matrix (2×2 matrix)
$f(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
• The same equation holds for a 3D Gaussian


Exploring Covariance Matrix
$x_i = (w_i, h_i)^T$ is a random vector
$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T = \begin{pmatrix} \sigma_w^2 & \mathrm{cov}(w, h) \\ \mathrm{cov}(h, w) & \sigma_h^2 \end{pmatrix}$
• Σ is symmetric
• Σ has an eigendecomposition (SVD)
$\Sigma = V D V^T$
$\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d$

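Covariance Matrix Geometry
$\Sigma = V D V^T$
$a = \sqrt{\lambda_1}\, v_1$
$b = \sqrt{\lambda_2}\, v_2$
[Figure: ellipse with semi-axes a and b along the eigenvectors v1 and v2.]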
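The covariance matrix, its eigendecomposition Σ = V D Vᵀ, and the ellipse axes can be computed as in the sketch below. The synthetic data and variable names are assumptions; scaling each eigenvector by the square root of its eigenvalue is the usual convention for the one-standard-deviation ellipse.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative 2-D data (w, h) with correlated components
true_cov = np.array([[4.0, 1.5], [1.5, 1.0]])
X = rng.multivariate_normal(mean=[10.0, 5.0], cov=true_cov, size=5000)

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)     # (1/N) sum (x_i - mu)(x_i - mu)^T

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]          # reorder so lambda_1 >= lambda_2
lam, V = eigvals[order], eigvecs[:, order]

a = np.sqrt(lam[0]) * V[:, 0]              # major semi-axis of the ellipse
b = np.sqrt(lam[1]) * V[:, 1]              # minor semi-axis of the ellipse
```

Because Σ is symmetric, the eigenvectors are orthogonal, so a and b are perpendicular.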

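3D Gaussians
$x_i = (r_i, g_i, b_i)^T$
$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T = \begin{pmatrix} \sigma_r^2 & \mathrm{cov}(g, r) & \mathrm{cov}(b, r) \\ \mathrm{cov}(r, g) & \sigma_g^2 & \mathrm{cov}(b, g) \\ \mathrm{cov}(r, b) & \mathrm{cov}(g, b) & \sigma_b^2 \end{pmatrix}$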

GMMs – Gaussian Mixture Models
• Suppose we have 1000 data points in 2D space (w, h)
[Figure: scatter plot of the data points, axes W and H.]

GMMs – Gaussian Mixture Models
• Assume each data point is normally distributed
• Obviously, there are 5 sets of underlying Gaussians
[Figure: scatter plot of the data points, axes W and H.]

The GMM assumption
• There are K components (Gaussians)
• Each component j is specified by three parameters: weight, mean,
covariance matrix
• The total density function is:
$f(x \mid \Theta) = \sum_{j=1}^{K} \alpha_j \frac{1}{\sqrt{(2\pi)^d \det(\Sigma_j)}} \exp\left(-\frac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)\right)$
$\Theta = \{\alpha_j, \mu_j, \Sigma_j\}_{j=1}^{K}$
$\alpha_j$ = weight of component j, with $\alpha_j \ge 0$ and $\sum_{j=1}^{K} \alpha_j = 1$

The EM algorithm (Dempster, Laird and Rubin, 1977)
[Figure: three panels: raw data, GMMs (K = 6), total density function.]

EM Basics
• Objective:
Given N data points, find the maximum likelihood estimate of Θ:
$\hat{\Theta} = \arg\max_{\Theta} f(x_1, \ldots, x_N \mid \Theta)$
• Algorithm:
1. Guess initial Θ
2. Perform E step (expectation)
  • Based on Θ, associate each data point with a specific Gaussian
3. Perform M step (maximization)
  • Based on the data point clustering, maximize over Θ
4. Repeat 2-3 until convergence (~tens of iterations)

EM Details
• E-Step (estimate the probability that point t is associated with Gaussian j):
$w_{t,j} = \frac{\alpha_j f(x_t \mid \mu_j, \Sigma_j)}{\sum_{i=1}^{K} \alpha_i f(x_t \mid \mu_i, \Sigma_i)}, \quad j = 1, \ldots, K, \quad t = 1, \ldots, N$
• M-Step (estimate new parameters):
$\alpha_j^{new} = \frac{1}{N} \sum_{t=1}^{N} w_{t,j}$
$\mu_j^{new} = \frac{\sum_{t=1}^{N} w_{t,j} x_t}{\sum_{t=1}^{N} w_{t,j}}$
$\Sigma_j^{new} = \frac{\sum_{t=1}^{N} w_{t,j} (x_t - \mu_j^{new})(x_t - \mu_j^{new})^T}{\sum_{t=1}^{N} w_{t,j}}$

EM Example
[Figure: data point t and Gaussian j; blue shading shows the weight w_{t,j}.]
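The E-step and M-step can be sketched for the one-dimensional case, where each covariance matrix reduces to a scalar variance. The function names, the quantile-based initial guess, the fixed iteration count, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def gauss(x, mu, var):
    """1-D normal density with mean mu and variance var."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, K, n_iter=100):
    # Initial guess: equal weights, means at spread-out quantiles, shared variance
    alpha = np.full(K, 1.0 / K)
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E-step: w[t, j] = alpha_j f(x_t | j) / sum_i alpha_i f(x_t | i)
        dens = alpha[None, :] * gauss(x[:, None], mu[None, :], var[None, :])
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from w
        Nj = w.sum(axis=0)
        alpha = Nj / len(x)
        mu = (w * x[:, None]).sum(axis=0) / Nj
        var = (w * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nj
    return alpha, mu, var

# Synthetic data: 30% from N(-2, 0.5^2), 70% from N(3, 1^2)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
alpha, mu, var = em_gmm_1d(x, K=2)
```

A production implementation would usually work with log-densities and stop on a likelihood tolerance; the fixed loop here keeps the two steps easy to read.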

                          RBF networks                        MLP
Learning speed            Very fast                           Very slow
Convergence               Almost guaranteed                   Not guaranteed
Response time             Slow                                Fast
Memory requirement        Very large                          Small
Hardware implementation   IBM ZISC036, Nestor Ni1000          Voice Direct 364
                          (www-5.ibm.com/fr/cdlab/zisc.html)  (www.sensoryinc.com)
Generalization            Usually better                      Usually poorer
Hyper-parameter?
Initial values are given!

Simulation
• The color image under analysis is split into patches (either
overlapping or not) of size 64x64 pixels.
• An adversarially trained auto-encoder encodes the patches into a
low-dimensional representation called the feature vector h (a
2,048-dimensional vector).
• A one-class SVM fed with h is used to detect forged patches as
anomalies with respect to features distribution learned from normal
patches.
• Once all patches are classified, a label mask for the entire image is
obtained by grouping together all the patch labels.
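The patch-level pipeline can be sketched as below. The encoder and one-class SVM are replaced by a hypothetical stand-in classifier (`toy_classifier`), and the sketch uses non-overlapping patches and drops ragged borders; these are assumptions made to keep the example self-contained.

```python
import numpy as np

PATCH = 64

def split_into_patches(image, patch=PATCH):
    """Yield (row, col, patch) for non-overlapping patches; ragged borders are dropped."""
    H, W = image.shape[:2]
    for r in range(0, H - patch + 1, patch):
        for c in range(0, W - patch + 1, patch):
            yield r, c, image[r:r + patch, c:c + patch]

def label_mask(image, is_anomalous, patch=PATCH):
    """Build a binary mask by marking every patch that the classifier flags."""
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for r, c, p in split_into_patches(image, patch):
        if is_anomalous(p):
            mask[r:r + patch, c:c + patch] = 1
    return mask

# Hypothetical stand-in for "encoder + one-class SVM": flag unusually bright patches.
def toy_classifier(p):
    return p.mean() > 0.5

image = np.zeros((128, 192, 3))
image[64:128, 64:128] = 1.0            # one "forged" 64x64 region
mask = label_mask(image, toy_classifier)
```

In the described method, `is_anomalous` would encode the patch into h and ask the one-class SVM whether h lies outside the learned distribution.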

• Small - Object size is smaller than the patch size (approximately 32
pixels per side).
• Medium - Object size is comparable to the patch size (approximately 64
pixels per side).
• Large - Object size is larger than the patch size (approximately 128 pixels
per side).
Simulation
Performance is evaluated according to the size of the object to be detected

Simulation
[Figure: Query Image I and Query Image II, each with an unfamiliar object, and their ground-truth masks (GT mask I, GT mask II).]

Unsupervised Anomaly Detection with
GANs to Guide Marker Discovery
https://arxiv.org/abs/1703.05921
TensorFlow code implemented by Lee Doyup (이도엽) of POSTECH
https://github.com/LeeDoYup/AnoGAN
Previous
Approach II

As in the figure below, this work uses a GAN model trained only on normal data
to decide whether query data is normal and, when it is not, to locate the
abnormal region.

Anomaly detection consists of the following two steps:
1. Train the generator and discriminator on normal data
- Using a deep convolutional generative adversarial network, train the
discriminator to distinguish real images from images that the generator
produces from the latent space (z)
→ learns the latent space (z) distribution of normal data
2. Determine whether data is abnormal and locate the abnormal region
- With the parameters of the trained generator and discriminator fixed,
map the query image to the latent space (z)
- Normal data maps into the already learned latent space (z) of normal data,
whereas abnormal data falls outside it
→ an error appears in the cost function

1. Modeling normal data with a GAN
: the generative model (distribution) of the normal data is learned with a GAN
Normal data $I_m$, with m = 1, 2, ..., M, where $I_m \in R^{a \times b}$
K 2-D image patches of size c×c are extracted at random positions:
x = $x_{k,m} \in \mathcal{X}$ with k = 1, 2, ..., K.
D and G are simultaneously optimized through the following two-player
minimax game with value function V(G, D):
$\min_G \max_D V(D, G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))]$
The discriminator is trained to maximize the probability of assigning
real training examples the “real” label and samples from $p_g$ the “fake” label

2. Mapping query data to the latent space
Given a query image x, find the point z in latent space whose generated
image G(z) is most similar to x.
How similar x and G(z) can be depends on how closely the query image follows the
distribution $p_g$ of the normal data used to train the generator.
To find z, a point z1 sampled at random from the latent space distribution Z is
fed to the trained generator; the difference (loss function) between the output
G(z1) and x is minimized by backpropagation, which updates the latent point to z2.

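Assume the latent space (z) of normal images is one-dimensional and that
Z follows the distribution shown below.
[Figure: 1-D distribution of z centered at μ_z, with the allowable range marked and the iterates z1, z2, ..., zΓ moving along the z axis.]
Mapping a query image to the latent space (z):
i) start from an arbitrary value z1 and update it to minimize the loss function
ii) after the given Γ-th iteration, classify the image as normal or abnormal
according to whether zΓ lies within the allowable range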
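The iterative mapping can be illustrated with a toy generator. In the sketch below the "generator" is a fixed linear map, its Jacobian is supplied by hand, and a squared-error loss replaces the loss of the original method so that the gradient is simple; all names and values are assumptions.

```python
import numpy as np

def map_to_latent(x, G, grad_G, z1, steps=500, lr=0.05):
    """Gradient descent on z for the loss 0.5 * ||x - G(z)||^2 with the generator fixed."""
    z = np.array(z1, dtype=float)
    for _ in range(steps):
        residual = G(z) - x
        z -= lr * grad_G(z).T @ residual     # chain rule: J_G(z)^T (G(z) - x)
    return z

# Toy "generator": a fixed linear map from a 1-D latent space to 3-D "images".
W = np.array([[1.0], [2.0], [-1.0]])
G = lambda z: W @ z
grad_G = lambda z: W                          # Jacobian of G, constant here

x_normal = G(np.array([0.7]))                 # lies on the generator's manifold
x_abnormal = x_normal + np.array([0.0, 0.0, 3.0])

z_n = map_to_latent(x_normal, G, grad_G, z1=[-1.0])
z_a = map_to_latent(x_abnormal, G, grad_G, z1=[-1.0])
res_n = np.abs(x_normal - G(z_n)).sum()       # close to zero
res_a = np.abs(x_abnormal - G(z_a)).sum()     # stays large
```

Only z changes during the mapping; the generator's weights stay frozen, which is why an image off the learned manifold keeps a large residual however long the search runs.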

• Overall loss or Anomaly score:
• Anomaly score consists of two parts:
• Residual Loss - visual similarity
• Discrimination Loss - enforces the generated image to lie on the manifold
Definition of the loss function for mapping the query image

Improved discrimination loss based on feature matching
• f(.) – output of intermediate layer of the discriminator
• It is some statistics of an input image
This approach utilizes the trained discriminator not as classifier
but as a feature extractor

3. Anomaly Detection
Anomaly score A(x): how well the query image x fits the normal images
R(x): residual loss after Γ backpropagation steps
D(x): discrimination loss after Γ backpropagation steps
Abnormal image: A(x) is large
Normal image: A(x) is small
$x_R = |x - G(z_\Gamma)|$
Residual error: indicates the abnormal region within the image
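A minimal sketch of the score, assuming the weighting used in the AnoGAN paper, A(x) = (1 − λ)·R(x) + λ·D(x), with R(x) = Σ|x − G(z_Γ)| and the feature-matching form D(x) = Σ|f(x) − f(G(z_Γ))|. The feature extractor `f` below is a hypothetical stand-in for an intermediate discriminator layer, and the images are synthetic.

```python
import numpy as np

def anomaly_score(x, g_z, f, lam=0.1):
    """Return A(x) = (1 - lam) * R(x) + lam * D(x) and the residual image x_R."""
    x_r = np.abs(x - g_z)                      # residual image |x - G(z_Gamma)|
    R = x_r.sum()                              # residual loss
    D = np.abs(f(x) - f(g_z)).sum()            # feature-matching discrimination loss
    return (1.0 - lam) * R + lam * D, x_r

# Hypothetical feature extractor standing in for a discriminator layer.
f = lambda img: np.array([img.mean(), img.std()])

g_z = np.full((8, 8), 0.2)                     # reconstruction G(z_Gamma)
x_ok = g_z.copy()                              # query that the generator reproduces
x_bad = g_z.copy()
x_bad[2:4, 2:4] = 1.0                          # small anomalous region

score_ok, _ = anomaly_score(x_ok, g_z, f)
score_bad, x_r = anomaly_score(x_bad, g_z, f)
```

The nonzero pixels of `x_r` mark exactly the region the generator could not reproduce, which is how the residual image localizes the anomaly.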

4. Experiments
The experiments use optical coherence tomography (OCT) images, which capture the retinal layers in three dimensions
• Data, Data Selection and Preprocessing
i) Training sets :
- 2D image patches extracted from 270 clinical OCT volumes of healthy subjects
- The gray values were normalized to range from -1 to 1.
- Extracted in total 1,000,000 2D training patches with an image resolution of
64x64 pixels at randomly sampled positions.

ii) Testing sets :
- patches were extracted from 10 additional healthy cases and 10
pathological cases, which contained retinal fluid
- Test set in total consisted of 8,192 image patches and comprised
normal and pathological samples

iii) Model description
- Adopt DCGAN architecture that resulted in stable GAN training on
images of sizes 64x64 pixels.
- Utilized intermediate representations with 512-256-128-64 channels
(instead of 1024-512-256-128)
- Discrimination loss: feature representations of the last convolution
layer of the discriminator were used
- Training was performed for 20 epochs utilizing Adam optimizer.
- Ran 500 backpropagation steps for the mapping of new images to the
latent space.
- Used λ= 0.1 in loss function

i) Generative capability of the DCGAN
5. Experiments
[Figure: for a normal image and an anomalous image, rows show the given image, the generated image, the residual overlay, and the pixel-level annotations of retinal fluid.]

ii) Detection performance
ROC curves
Distribution of the residual score (c)
and of the discrimination score (d)
In latent space, the distributions of the normal data (training data and normal
test data) are similar to each other, while the abnormal test data differ clearly from them

Problems in Previous Approach
- Can’t control the shape and boundary of cluster
- Can’t control the ambiguous points at the boundary
→ Let’s find a way to control the shape of the cluster
and the ambiguous points at the boundary

SVDD is the smallest enclosing ball problem, and its variants are
• The minimum enclosing ball problem with errors
• The minimum enclosing ball problem in a RKHS (Reproducing
Kernel Hilbert Space)
• The two-class support vector data description (SVDD)
Support Vector Data Description (SVDD)

• One class is the target class, and all other data is outlier data.
• Create a spherically shaped boundary around the complete target set.
• To minimize the chance of accepting outliers, the volume of this description
is minimized.
• Outlier sensitivity can be controlled by changing the ball-shaped boundary
into a more flexible boundary.
• Example outliers can be included into the training procedure to find a more
efficient description.
SOLUTIONS FOR SOLVING DATA DESCRIPTION

1. The minimum enclosing ball problem [Tax and Duin, 2004]
[Figure: enclosing ball with its center and radius R marked.]

2. The minimum enclosing ball problem with errors

- We assume vectors x are column vectors.
- We have a training set {xi }, i = 1, . . , N for which we want to obtain a description.
- We further assume that the data shows variances in all feature directions.
NORMAL DATA DESCRIPTION
• The sphere is characterized by center a and radius R > 0.
• We minimize the volume of the sphere by minimizing R², and demand that
the sphere contains all training objects xi.
• To allow the possibility of outliers in the training set, the squared distance
from xi to the center a should not be strictly smaller than R², but larger
distances should be penalized.
- Minimization problem:
F(R, a) = R² + C∑ξi
with constraints ||xi − a||² ≤ R² + ξi, ξi ≥ 0

Lagrange function:
L(R, a, αi, γi, ξi) = R² + C∑ξi − ∑αi {R² + ξi − (‖xi‖² − 2a · xi + ‖a‖²)} − ∑γi ξi
L should be minimized with respect to R, a, ξi and maximized
with respect to αi and γi, subject to: 0 ≤ αi ≤ C

There are 3 cases; the objects with αi > 0 are the support vectors.
The hypersphere’s center can be determined as
a = ∑i αi xi
The hypersphere’s radius can be determined by selecting an arbitrary
support vector on the boundary, xb:
R² = ‖xb − a‖² = xb · xb − 2∑i αi (xi · xb) + ∑i,j αi αj (xi · xj)
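For reference, a sketch of the standard derivation in Tax and Duin (2004), worked out from the Lagrangian L(R, a, αi, γi, ξi) = R² + C∑ξi − ∑αi {R² + ξi − ‖xi − a‖²} − ∑γi ξi rather than copied from the slide. Setting the partial derivatives with respect to R, a, and ξi to zero and substituting back gives the dual problem and the three cases for each training object:

```latex
\begin{aligned}
\frac{\partial L}{\partial R} = 0 &\;\Rightarrow\; \sum_i \alpha_i = 1, \\
\frac{\partial L}{\partial a} = 0 &\;\Rightarrow\; a = \sum_i \alpha_i x_i, \\
\frac{\partial L}{\partial \xi_i} = 0 &\;\Rightarrow\; C - \alpha_i - \gamma_i = 0, \\
\max_{\alpha}\; L &= \sum_i \alpha_i (x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j), \qquad 0 \le \alpha_i \le C, \\
\|x_i - a\|^2 < R^2 &\;\Rightarrow\; \alpha_i = 0 \quad \text{(inside the sphere)}, \\
\|x_i - a\|^2 = R^2 &\;\Rightarrow\; 0 < \alpha_i < C \quad \text{(support vector on the boundary)}, \\
\|x_i - a\|^2 > R^2 &\;\Rightarrow\; \alpha_i = C \quad \text{(outlier, } \xi_i > 0\text{)}.
\end{aligned}
```

Only objects with a nonzero coefficient contribute to the center, which is why the description depends on the support vectors alone.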

TEST A NEW DATA POINT xk
To test whether a new data point xk is within the sphere, its distance to the center
of the sphere has to be calculated. A test point xk is normal when this
distance is smaller than or equal to the radius:
‖xk − a‖² ≤ R²

Please refer to Python Code for SVDD :
https://wikidocs.net/3431

SVDD with negative examples
- When negative examples (objects which should be rejected) are available,
they can be incorporated in the training to improve the description.
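- In contrast with the training (target) examples which should be within the
sphere, the negative examples should be outside it.
→ Minimization problem (targets indexed by i, negative examples by l):
F(R, a, ξi, ξl) = R² + C1∑i ξi + C2∑l ξl
With constraints:
‖xi − a‖² ≤ R² + ξi, ‖xl − a‖² ≥ R² − ξl, ξi ≥ 0, ξl ≥ 0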

3. The minimum enclosing ball problem in a RKHS
• Minimum enclosing ball problem with errors
• The inner product can be substituted by a general kernel function such as the
Gaussian kernel
Gaussian kernel: K(xi, xj) = exp(−‖xi − xj‖² / s²)
Subject to: 0 ≤ αi ≤ C
‖xk − a‖² = K(xk, xk) − 2∑i αi K(xi, xk) + ∑i,j αi αj K(xi, xj) ≤ R²
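The kernelized distance test can be sketched as follows. The coefficients α are assumed to be given (the sketch does not solve the dual problem), the four training points are illustrative, and the Gaussian kernel is written with a width parameter s as exp(−‖x − y‖² / s²), which is an assumption about the intended form.

```python
import numpy as np

def gaussian_kernel(x, y, s=1.0):
    return np.exp(-np.sum((x - y) ** 2) / s ** 2)

def kernel_dist2(x_k, X, alpha, kernel):
    """K(x_k,x_k) - 2 sum_i alpha_i K(x_i,x_k) + sum_ij alpha_i alpha_j K(x_i,x_j)."""
    k_kk = kernel(x_k, x_k)
    k_ik = np.array([kernel(x_i, x_k) for x_i in X])
    K = np.array([[kernel(x_i, x_j) for x_j in X] for x_i in X])
    return k_kk - 2.0 * alpha @ k_ik + alpha @ K @ alpha

# Illustrative training points and coefficients alpha (they sum to 1). In practice
# alpha comes from solving the dual problem, which this sketch does not do.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
alpha = np.full(4, 0.25)

linear = lambda x, y: float(x @ y)
R2_lin = kernel_dist2(X[0], X, alpha, linear)           # radius from a boundary point
R2_rbf = kernel_dist2(X[0], X, alpha, gaussian_kernel)

def is_normal(x_k, kernel, R2):
    return kernel_dist2(x_k, X, alpha, kernel) <= R2 + 1e-12
```

With the linear kernel this reduces to the ordinary sphere test around a = Σ αi xi; swapping in the Gaussian kernel changes only the kernel function, not the test.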

3. The minimum enclosing ball problem in a RKHS
- For small values of s all objects become support vectors.
- For very large s the solution approximates the original
spherically shaped solution.
- Decreasing the parameter C constrains the values for αi
more, and more objects become support vectors.
- Also with decreasing C the error on the target class
increases, but the covered volume of the data description
decreases.
Test object is accepted when:

4. The two class Support vector data description (SVDD)

The two class SVDD vs. one class SVDD

Deep SVDD learns a neural network transformation Φ(· ; W) with weights W
from input space X ⊆ R^d to output space F ⊆ R^p that attempts to map most of the
data network representations into a hypersphere, characterized by center c
and radius R, of minimum volume.
Mappings of normal examples fall within, whereas mappings of anomalies fall
outside the hypersphere.
Deep Support Vector Data Description (Deep SVDD)

Given some training data on X, we define
the soft-boundary Deep SVDD objective as
- First term: minimizing R² minimizes the volume of the hypersphere.
- Second term: a penalty for points lying outside the sphere after
being passed through the network, i.e. points whose distance to the center
is greater than the radius R.
- Last term: a regularizer on the network parameters W.

To achieve this the network must extract the common factors of variation
of the data.
As a result, normal examples of the data are closely mapped to center c,
whereas anomalous examples are mapped further away from the center
or outside of the hypersphere.
Through this we obtain a compact description of the normal class.
[Figure: two panels showing normal data and anomalous data.]

One-Class Deep SVDD objective
One-Class Deep SVDD simply employs a quadratic loss for penalizing the distance of
every network representation to c
One-Class Deep SVDD contracts the sphere by minimizing the mean
distance of all data representations to the center.

For a given test point x ∈ X, an anomaly score s can be defined for both
variants of Deep SVDD by the distance of the point to the center of the hypersphere.
[Figure: anomaly score distributions of normal and anomalous data, for the conventional approach and for Deep SVDD.]
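The two Deep SVDD objectives and the distance-based score can be sketched numerically. The formulas in the docstrings follow the Deep SVDD paper as usually stated (an assumption, since the slides show them only as images), and `phi` stands for already computed network outputs rather than a real network; ν, λ, and the toy values are illustrative.

```python
import numpy as np

def soft_boundary_objective(phi, c, R, nu, weights, lam):
    """R^2 + 1/(nu*n) * sum max(0, ||phi_i - c||^2 - R^2) + lam/2 * sum ||W_l||_F^2."""
    d2 = ((phi - c) ** 2).sum(axis=1)
    penalty = np.maximum(0.0, d2 - R ** 2).sum() / (nu * len(phi))
    reg = 0.5 * lam * sum((W ** 2).sum() for W in weights)
    return R ** 2 + penalty + reg

def one_class_objective(phi, c, weights, lam):
    """1/n * sum ||phi_i - c||^2 + lam/2 * sum ||W_l||_F^2."""
    d2 = ((phi - c) ** 2).sum(axis=1)
    reg = 0.5 * lam * sum((W ** 2).sum() for W in weights)
    return d2.mean() + reg

def anomaly_score(phi_x, c, R=0.0):
    """Squared distance to the center; with R > 0, positive means outside the sphere."""
    return ((phi_x - c) ** 2).sum() - R ** 2

# phi stands for network outputs phi(x_i; W); here it is just a fixed array.
phi = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
c = np.zeros(2)
weights = [np.ones((2, 2))]
soft = soft_boundary_objective(phi, c, R=1.0, nu=0.5, weights=weights, lam=1e-6)
hard = one_class_objective(phi, c, weights, lam=1e-6)
scores = [anomaly_score(p, c, R=1.0) for p in phi]
```

In training, the gradients of these objectives flow back into the network weights W, so the network itself learns to pull normal data toward c.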

One-class classification on MNIST and CIFAR-10
Each convolutional module consists of a convolutional layer followed by
leaky ReLU activations and 2x2 max-pooling.
On MNIST, a CNN with two modules, 8x(5x5x1)-filters followed by 4x(5x5x1)-
filters, and a final dense layer of 32 units.
On CIFAR-10, a CNN with three modules, 32x(5x5x3)-filters,
64x(5x5x3)-filters, and 128x(5x5x3)-filters, followed by a final dense layer of
128 units.
Training uses a batch size of 200 and a weight decay hyper-parameter of λ = 10⁻⁶.
Network architectures

Both MNIST and CIFAR-10 have ten different classes from which we
create ten one-class classification setups.
In each setup, one of the classes is the normal class and samples from the
remaining classes are used to represent anomalies.
Only train with training set examples from the respective normal class.
Training set sizes of n≈6,000 for MNIST and n=5,000 for CIFAR-10.
Both test sets have 10,000 samples including samples from the nine
anomalous classes for each setup.
Pre-process all images with global contrast normalization using the L1
norm and finally rescale to [0, 1] via min-max scaling.
Data setup
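The preprocessing can be sketched as below. Global contrast normalization with the L1 norm is taken here to mean subtracting the image mean and dividing by the mean absolute deviation, and the min-max step uses the extremes of the given batch; both readings are assumptions, and the random array only stands in for real images.

```python
import numpy as np

def global_contrast_normalize_l1(img):
    """Subtract the image mean, then divide by the mean absolute deviation (L1 scale)."""
    centered = img - img.mean()
    scale = np.abs(centered).mean()
    return centered / scale if scale > 0 else centered

def min_max_scale(batch):
    """Rescale to [0, 1] using the minimum and maximum over the whole batch."""
    lo, hi = batch.min(), batch.max()
    return (batch - lo) / (hi - lo)

rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(5, 28, 28))      # stand-in for MNIST digits
normalized = np.stack([global_contrast_normalize_l1(im) for im in images])
scaled = min_max_scale(normalized)
```

Normalizing contrast per image before the shared rescaling keeps brightness differences between images from dominating the learned representation.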

Average AUCs in % with StdDevs (over 10 seeds) per method and one-class
experiment on MNIST and CIFAR-10

Anomaly Detection using
One-Class Neural Networks
arXiv:1802.06360v1
Code : https://github.com/raghavchalapathy/oc-nn

Model architecture of Auto-encoder and
the proposed one-class neural networks

One-Class Support Vector Machine
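The objective is to find a hyperplane, and its distance from the origin, that is
positive on subset A and negative on everything outside A.
Maximize the distance from the hyperplane to the origin.
[Figure: subset A enclosed by a hypersphere and separated from the origin by a hyperplane with normal w at distance r; the other side is labeled negative.]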

In order to obtain w and r , we need to solve the following
optimization problem,
One-Class Support Vector Machine
where w is the normal vector perpendicular to the hyperplane and r is the
distance of the hyperplane from the origin.
Distance of Feature vector from origin

A simple feed forward network with one hidden layer
having linear or sigmoid activation g(·) and one output node
OC-NN objective can be formulated as:
where w is the scalar output obtained from the hidden to output
layer, V is the weight matrix from input to hidden units. Xn is an
input vector
One-Class NN
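A sketch of the OC-NN objective, assuming the form given in the OC-NN paper: ½‖w‖² + ½‖V‖²_F + (1/ν)·(1/N)·Σ_n max(0, r − ⟨w, g(V x_n)⟩) − r. The random inputs, weight scales, and ν are illustrative. For fixed w and V the objective is convex in r, and setting its derivative to zero shows that the ν-quantile of the scores minimizes it.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ocnn_objective(X, V, w, r, nu, g=sigmoid):
    """0.5||w||^2 + 0.5||V||_F^2 + 1/(nu*N) * sum max(0, r - <w, g(V x_n)>) - r."""
    scores = g(X @ V.T) @ w                   # y_n = <w, g(V x_n)>, shape (N,)
    hinge = np.maximum(0.0, r - scores).mean() / nu
    return 0.5 * (w @ w) + 0.5 * (V ** 2).sum() + hinge - r, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                 # illustrative inputs
V = rng.normal(scale=0.1, size=(3, 4))        # input-to-hidden weights
w = rng.normal(scale=0.1, size=3)             # hidden-to-output weights
nu = 0.1

_, scores = ocnn_objective(X, V, w, r=0.0, nu=nu)
r = np.quantile(scores, nu)                   # minimizer over r for fixed (w, V)
obj, _ = ocnn_objective(X, V, w, r, nu)
flagged = scores < r                          # points on the negative side
```

The parameter ν thus controls the fraction of training points allowed to fall on the negative side of the hyperplane.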

Discriminative Feature Learning

A Discriminative Feature Learning
For generic object, scene or action recognition. The deeply learned
features need to be not only separable but also discriminative.

• Only softmax loss has been considered in classification problems
→ SOFTMAX LOSS: encourages the separability of features.
• The discriminative feature learning approach considers center loss as well
→ CENTER LOSS: simultaneously learns a center for the deep
features of each class and penalizes the distances between
the deep features and their corresponding class centers.
→ JOINT SUPERVISION: minimizes the intra-class variations while
keeping the features of different classes separable
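Joint supervision can be sketched as a weighted sum of the two losses, L = L_S + λ·L_C with L_C = ½ Σ_i ‖x_i − c_{y_i}‖², which is the form used in the center loss paper. The function names and the toy features, logits, and centers are assumptions for illustration; a real model would produce features and logits from a network and update the centers per mini-batch.

```python
import numpy as np

def center_loss(features, labels, centers):
    """L_C = 0.5 * sum_i ||x_i - c_{y_i}||^2."""
    diff = features - centers[labels]
    return 0.5 * (diff ** 2).sum()

def softmax_loss(logits, labels):
    """Sum of cross-entropy terms, computed with a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].sum()

def joint_loss(features, logits, labels, centers, lam):
    """L = L_S + lam * L_C."""
    return softmax_loss(logits, labels) + lam * center_loss(features, labels, centers)

features = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
labels = np.array([0, 1, 0])
centers = np.array([[1.5, 0.0], [0.0, 1.0]])
logits = np.array([[2.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
lc = center_loss(features, labels, centers)
total = joint_loss(features, logits, labels, centers, lam=0.5)
```

Setting `lam=0` recovers plain softmax training, while larger values pull the features of each class more tightly around its center.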

Detailed Discussion on Center Loss
• Easy-to-Implement. The gradient and update equation
are easy to derive and the resulting CNN model is
trainable.
• Easy-to-Train. Centers are updated based on mini-batch
with an adjustable learning rate.
• Easy-to-Input. Center loss has the same input requirements as
the softmax loss and needs no complex sample mining
and recombination, which is inevitable in contrastive loss
and triplet loss.
• Easy-to-Converge. Converges faster than with softmax loss alone

• With only softmax loss (λ=0), the deeply learned features are
separable, but not discriminative (significant intra-class variations).
• With proper λ, the discriminative power of deep features can be
significantly enhanced, which is crucial for classification problem
