Anomaly detection

Anomaly Detection
고등 지능 기술 연구회 (Advanced Intelligence Technology Research Society)
김철 (ki4420@gmail.com)
2016-07-09

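What is anomaly detection?
An anomaly is a sample that departs from the main stream of the data.
In data mining, anomaly detection means identifying items, events, or observations that do not conform to an expected pattern or normal range.
Such a sample is also called an outlier.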

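What is anomaly detection? (cont.)
Min/Max ≠ outlier: the minimum and maximum of a data set are not automatically outliers.
1.5×IQR rule: a value is an outlier if it lies below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
IQR (interquartile range) = Q3 − Q1
(Box plot labels: Max, Min)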
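
The 1.5×IQR rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name and the sample data are hypothetical, and the quartiles use the "inclusive" interpolation of the standard library, which is only one of several common quartile conventions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: the maximum (102) is an outlier, the minimum (10) is not.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(data))
```

In this sample the minimum stays inside the fences while the maximum does not, which is the point of "Min/Max ≠ outlier".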

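An anomalous value is typically interpreted as a symptom of a problem.
Some anomalies, such as bursts of activity, do not fit the usual statistical definition of an outlier as a rare object.

In that case a clustering algorithm can detect the micro-clusters formed by the anomalous patterns.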

History
Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986.
Early approaches relied on normal-range thresholds and statistics, and later on soft computing and inductive learning.

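Applications
Cyber intrusion detection, credit card fraud, fault detection, system health monitoring, IoT, etc.
Detecting ecosystem disturbances
Often used to remove anomalous values from a data set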

Three categories
1. Unsupervised anomaly detection
- Detects anomalies in unlabeled data
- Example: detection with the K-means clustering algorithm
2. Supervised anomaly detection
- Both normal and abnormal labels exist
- Uses a classification model (SVM, random forests, logistic regression, robust regression, KNN, etc.)

Three categories (cont.)
3. Semi-supervised anomaly detection
- Only the normal label exists; anomalies are extracted by comparing the likelihood produced by a model of normal data
- NKIA's LRSTSD based Anomaly Detection
- Twitter's Seasonal Hybrid ESD (S-H-ESD) based Anomaly Detection
(Figures: NKIA's Anomaly Detection; Twitter's Anomaly Detection)
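
The likelihood comparison used in the semi-supervised case can be sketched as follows. This is a minimal illustration that assumes a single Gaussian as the normal model; it is not the NKIA or Twitter algorithm, and the function names, the training data, and the 3-sigma cutoff are all hypothetical.

```python
import math
import statistics

def fit_normal_model(normal_values):
    """Fit a Gaussian to samples that are all labeled normal."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def log_likelihood(x, mean, std):
    """Log density of x under the fitted Gaussian."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2 * math.pi))

def detect(values, normal_values, cutoff_sigmas=3.0):
    """Return values less likely than a point cutoff_sigmas away from the mean."""
    mean, std = fit_normal_model(normal_values)
    threshold = log_likelihood(mean + cutoff_sigmas * std, mean, std)
    return [x for x in values if log_likelihood(x, mean, std) < threshold]

# Hypothetical training data containing only normal observations.
train = [50, 52, 49, 51, 50, 48, 53, 50, 51, 49]
print(detect([50, 47, 75, 52, 20], train))
```

No abnormal label is needed for training; the cutoff is the only parameter that decides what counts as too unlikely.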

Input data
(Figures: univariate; multivariate)

Input data (cont.)
Data types
- Binary
- Categorical
- Continuous
- Hybrid

Types of anomalies
Point Anomalies
- Values that fall outside the bulk of the data set

Types of anomalies (cont.)
Contextual Anomalies
- Values that are out of place in their context
- Require a notion of context
- Also referred to as conditional anomalies (rules)

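Types of anomalies (cont.)
Collective Anomalies
- A collection of related values that is anomalous as a group, even though each value may look normal on its own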

Output of Anomaly Detection
Label
- Label of normal or anomaly
- In a classification approach: true/false, or a class
Score
- Rank
- Range: 0 to 1
- Requires a threshold parameter

Evaluating anomaly detection
F-Measure
- For supervised learning; evaluates classification problems
- Formula:
Recall(R) = TP / (TP + FN)
Precision(P) = TP / (TP + FP)
F-measure = 2*R*P/(R+P)
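
The three formulas above can be checked with a short Python sketch. The function name, the label strings, and the sample labels are hypothetical; the anomaly class is treated as the positive class, which is an assumption rather than something the slide states.

```python
def f_measure(actual, predicted, positive="anomaly"):
    """Return (recall, precision, F-measure) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * recall * precision / (recall + precision)

# Hypothetical labels: 3 real anomalies, 2 of them found, plus 1 false alarm.
actual    = ["anomaly", "anomaly", "anomaly", "normal", "normal", "normal", "normal", "normal"]
predicted = ["anomaly", "anomaly", "normal", "anomaly", "normal", "normal", "normal", "normal"]
print(f_measure(actual, predicted))
```

Here TP = 2, FN = 1, and FP = 1, so recall, precision, and the F-measure all come out at 2/3.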
Area under the ROC curve
- AUC (Area Under the Curve)
- Detection rate = TP / (TP + FN); false alarm rate = FP / (FP + TN)
- Range: 0 to 1
- Equation:
Confusion matrix (crosstable), with anomaly as the positive class:
                     Actual Normal   Actual Anomaly
Predicted Normal     TN              FN
Predicted Anomaly    FP              TP
AUC grading table:
Score          Label
0.90 to 1.00   Excellent (A)
0.80 to 0.90   Good (B)
0.70 to 0.80   Fair (C)
0.60 to 0.70   Poor (D)
0.50 to 0.60   Fail (F)
(Figure: ROC (Receiver Operating Characteristic) curves)
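Symbols in the AUC equation: m = number of actual positives (anomalies), n = number of actual negatives (normals), p_i = score given to the i-th positive, p_j = score given to the j-th negative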
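
The equation itself did not survive on this slide. One common rank-based form is AUC = (1/(m·n)) Σ_i Σ_j 1[p_i > p_j], where m and n count the actual positives and negatives and p_i, p_j are the scores given to them. Reading the slide's symbols this way is an assumption. The sketch below implements that form with hypothetical scores and counts ties as one half.

```python
def auc(scores, labels):
    """Rank-based AUC; labels use 1 for anomaly (positive) and 0 for normal."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p_i in pos:
        for p_j in neg:
            if p_i > p_j:
                wins += 1.0   # positive ranked above negative
            elif p_i == p_j:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical anomaly scores: one anomaly (0.35) is ranked below one normal (0.6).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auc(scores, labels))
```

Eight of the nine positive/negative pairs are ordered correctly, so the result is 8/9, which falls in the "Good (B)" band of the grading table.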

Well-known anomaly detection techniques

Twitter's Anomaly Detection R package
Twitter open-sourced their R package for anomaly detection.
They call their algorithm Seasonal Hybrid ESD (S-H-ESD), which is built on Generalized ESD.
Sometimes anomalies can mess up your modeling.

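Twitter's Anomaly Detection R package (cont.)
# Install the dependencies first, then the package itself from GitHub.
install.packages(c("devtools", "gtable", "scales"))
devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

# Example 1: the bundled raw_data set (timestamp column + count column).
data(raw_data)
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02,
                          direction = "both", plot = TRUE)
res$plot

# Example 2: observations without timestamps.
# Assumes the CSV holds a single column of observations;
# period is the number of observations in one seasonal cycle.
v <- read.csv("D:/r/tsd_paper/cpu_5m_02.csv")
res2 <- AnomalyDetectionVec(v, max_anoms = 0.02, period = 72,
                            direction = "both", plot = TRUE)
res2$plot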

Usage
AnomalyDetectionTs(x, max_anoms = 0.1, direction = "pos", alpha = 0.05,
                   only_last = NULL, threshold = "None", e_value = FALSE,
                   longterm = FALSE, piecewise_median_period_weeks = 2,
                   plot = FALSE, y_log = FALSE, xlabel = "", ylabel = "count",
                   title = NULL, verbose = FALSE)
Arguments
x : Time series as a two column data frame where the first column consists of the timestamps and the second column consists of the observations.
max_anoms : Maximum number of anomalies that S-H-ESD will detect as a percentage of the data.
direction : Directionality of the anomalies to be detected. Options are: 'pos' | 'neg' | 'both'.
alpha : The level of statistical significance with which to accept or reject anomalies.
only_last : Find and report anomalies only within the last day or hr in the time series. NULL | 'day' | 'hr'.
threshold : Only report positive going anoms above the threshold specified. Options are: 'None' | 'med_max' | 'p95' | 'p99'.
e_value : Add an additional column to the anoms output containing the expected value.
longterm : Increase anom detection efficacy for time series that are greater than a month. See Details below.
piecewise_median_period_weeks : The piecewise median time window as described in Vallis, Hochenbaum, and Kejariwal (2014).
Defaults to 2.

Arguments(cont.)
plot : A flag indicating if a plot with both the time series and the estimated anoms, indicated by circles, should also be returned.
y_log : Apply log scaling to the y-axis. This helps with viewing plots that have extremely large positive anomalies relative to the
rest of the data.
xlabel : X-axis label to be added to the output plot.
ylabel : Y-axis label to be added to the output plot.
title : Title for the output plot.
verbose : Enable debug messages

To understand how Twitter's algorithm works, you need to know:
- Student t-distribution
- Extreme Studentized Deviate (ESD) test
- Generalized ESD
- Linear regression
- LOESS
- STL(Seasonal Trend LOESS)

Student t-distribution
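A distribution mainly used when estimating the mean of a normally distributed population
(Figure: probability density function (PDF) of t)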
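
For reference, the standard density of the t-distribution with ν degrees of freedom, and the t statistic for a sample mean, can be written as below. The symbols ν, x̄, μ, s, and n are the usual textbook ones and do not appear on the slide.

```latex
f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}
            {\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}
       \left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}},
\qquad
t = \frac{\bar{x} - \mu}{s / \sqrt{n}}, \quad \nu = n - 1
```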

Extreme Studentized Deviate (ESD) test

Generalized ESD
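
The formulas for these two slides are not preserved. As a reference sketch, the textbook definitions of the ESD (Grubbs) statistic and of the generalized ESD test are given below; x̄ and s are the sample mean and standard deviation, r is the assumed upper bound on the number of outliers, and α is the significance level. The notation is the standard one and is not taken from the slides.

```latex
% ESD (Grubbs) statistic for a single outlier
G = \frac{\max_{k} \lvert x_k - \bar{x} \rvert}{s}

% Generalized ESD: for i = 1, \dots, r, compute R_i on the n - i + 1
% observations that remain, then remove the observation that maximizes it
R_i = \frac{\max_{k} \lvert x_k - \bar{x} \rvert}{s}

% Critical value for step i, using the t-distribution quantile t_{p,\,n-i-1}
\lambda_i = \frac{(n-i)\, t_{p,\,n-i-1}}
                 {\sqrt{\left(n-i-1+t_{p,\,n-i-1}^{2}\right)(n-i+1)}},
\qquad
p = 1 - \frac{\alpha}{2(n-i+1)}
```

The number of outliers is the largest i for which R_i > λ_i.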

Seasonality(linear regression, LOESS, STL)
The generalized ESD works when you have a set of points from a normal distribution,
but real data has some seasonality. This is where STL comes in. It decomposes the data
into a season part, a trend and whatever’s left over using local regression (LOESS), which
fits a low order polynomial to a subset of the data and stitches them together by
weighting them. Since you can remove the trend and seasonal part with loess, you
should be left with something that is more or less normally distributed. You can apply
generalized ESD on what’s left over to detect anomalies.
STL: "Seasonal and Trend decomposition using Loess"
(Figures: seasonality; local regression (LOESS); polynomial regression)
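
The idea of removing seasonality and testing only what is left over can be sketched as below. This is a deliberately simplified stand-in: the seasonal part is a per-phase median rather than STL/LOESS, no trend is removed, and the residuals are screened with a robust z-score (median and MAD) rather than the generalized ESD test. The function name, the 3.5 cutoff, and the sample series are hypothetical.

```python
import statistics

def seasonal_residual_anomalies(series, period, threshold=3.5):
    """Return indices whose deseasonalized residual has a large robust z-score."""
    # Seasonal component: median of all observations sharing the same phase.
    seasonal = [statistics.median(series[phase::period]) for phase in range(period)]
    residuals = [x - seasonal[i % period] for i, x in enumerate(series)]
    center = statistics.median(residuals)
    mad = statistics.median([abs(r - center) for r in residuals])
    if mad == 0:
        return [i for i, r in enumerate(residuals) if r != center]
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [i for i, r in enumerate(residuals)
            if 0.6745 * abs(r - center) / mad > threshold]

# Hypothetical series: a 4-step cycle with small noise and one injected spike.
cycle = [10, 20, 30, 20]
noise = [0, 1, -1, 0, 1, 0, -1, 1, 0, -1, 0, 1,
         -1, 0, 1, 0, 0, -1, 1, 0, -1, 1, 0, 0]
series = [cycle[i % 4] + e for i, e in enumerate(noise)]
series[13] += 15  # value 35, close to the seasonal peak of about 30
print(seasonal_residual_anomalies(series, period=4))
```

The injected value 35 is only slightly above the normal seasonal peak, so a global threshold on the raw series would struggle to isolate it; after the seasonal part is removed it stands out clearly.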

Twitter: Introducing practical and robust
anomaly detection in a time series
Global/Local
At Twitter, we observe distinct seasonal patterns in most of the time series.
Global: global anomalies typically extend above or below expected seasonality and are
therefore not subject to seasonality and underlying trend
Local: anomalies which occur inside seasonal patterns, are masked and thus are much
more difficult to detect in a robust fashion.
Positive/Negative
Positive: e.g. a surge in tweets during the Super Bowl (used for capacity planning for events)
Negative: e.g. a drop in queries per second (QPS); helps discover potential hardware or data collection issues

Subspace- and correlation-based outlier
detection for high-dimensional data.
Reduce dimensionality using principal component analysis (PCA) or factor analysis
Detect anomalies by computing the contrast of subspaces

Subspace- and correlation-based outlier
detection for high-dimensional data.(cont.)
HiCS: High Contrast Subspaces for Density-Based Outlier Ranking

RNN(Replicator neural networks)
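A method that learns to reproduce the input pattern at the output while minimizing the reconstruction error
Builds a model of normal data; samples that are reproduced poorly are extracted as anomalies
(Figure: a schematic view of a fully connected Replicator Neural Network)
OF_i = \frac{1}{n} \sum_{j=1}^{n} (x_{ij} - o_{ij})^2
OF_i = outlier (anomaly) factor score of the i-th element
n = number of features
x_ij = observed value in column j of the i-th element
o_ij = value reproduced by the RNN for column j of the i-th element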
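
Given the symbols above, the score is the mean squared difference between each observed value and its reproduced value. The sketch below computes only that score; it assumes the reproduced values are already available from some trained network, and the function name and sample rows are hypothetical.

```python
def outlier_factor(x_row, o_row):
    """Mean squared reconstruction error of one element over its n features."""
    n = len(x_row)
    if n == 0 or n != len(o_row):
        raise ValueError("rows must be non-empty and of equal length")
    return sum((x - o) ** 2 for x, o in zip(x_row, o_row)) / n

# Hypothetical observed rows and the values a trained network reproduced.
observed   = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
reproduced = [[1.0, 2.0, 3.0], [1.0, 2.0, 5.0]]
scores = [outlier_factor(x, o) for x, o in zip(observed, reproduced)]
print(scores)
```

Elements can then be ranked by this score; the poorly reproduced second row gets the higher value.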

LOF(Local Outlier Factor)
Density-based anomaly detection by KNN
Provides a score, which makes results easy to interpret, but computation takes some time.
Unsupervised anomaly detection
Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. A has a much lower
density than its neighbors

LOF(Local Outlier Factor)(cont.)
Formula:
(Figure: illustration of the reachability distance. Objects B and C have the same reachability distance (k=3), while D is not a k-nearest neighbor.)
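
The formula on this slide is not preserved. In the standard definition, the reachability distance of A from B is max(k-distance(B), d(A, B)), the local reachability density (lrd) of A is the inverse of the mean reachability distance from A to its k nearest neighbors, and LOF(A) is the mean of lrd(B) / lrd(A) over those neighbors. The brute-force sketch below follows that definition; the function name and sample points are hypothetical, and it assumes there are no heavily duplicated points.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor of every point (brute force, O(n^2) distances)."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_dist = [], []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][others[k - 1]]          # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # The neighborhood keeps every point tied at the k-distance.
        neighbors.append([j for j in others if dist[i][j] <= kd])

    def reach_dist(a, b):
        return max(k_dist[b], dist[a][b])

    lrd = []
    for i in range(n):
        total = sum(reach_dist(i, j) for j in neighbors[i])
        lrd.append(len(neighbors[i]) / total if total > 0 else math.inf)

    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]

# Hypothetical data: four points on a unit square and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

Scores near 1 mean a point is about as dense as its neighbors; the isolated point gets a score well above 1.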

LOF scores as visualized by ELKI. While the upper right cluster has a
comparable density to the outliers close to the bottom left cluster, they
are detected correctly.

LOF scores of cpu util. vs. Time by Rlof

LRSTSD(Log regression seasonality based
approach of time series decomposition)
Anomaly score formula:
(Figures: anomaly score; 1-day network traffic Tx; 7-day network traffic Tx)
E_i = i-th error
A_i = i-th observed value
U_i = i-th predicted upper bound
L_i = i-th predicted lower bound
P = overall value (parameter)

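Conclusion
Anomaly detection can remove noise when building a predictive model
→ Improved prediction accuracy can be expected
It detects false readings and collection failures in the data
→ Enables appropriate responses such as resampling or correction
It supports analysis of how observed anomalies relate to problems
→ Usable as an early-detection technique for problems
→ Failure prediction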

References
• https://en.wikipedia.org/wiki/Anomaly_detection
• http://datascience.stackexchange.com/questions/2313/machine-learning-where-is-the-difference-between-one-class-binary-class-and-m
• https://en.wikipedia.org/wiki/Outlier#Detection
• https://www.semanticscholar.org/paper/Outlier-Detection-Using-Replicator-Neural-Networks-Hawkins-He/87a09c777dcecab4883e328669ef2af1ba8dd7be
• http://neuro.bstu.by/ai/To-dom/My_research/Papers-0/For-research/D-mining/Anomaly-D/KDD-cup-99/NN/dawak02.pdf
• http://slideplayer.com/slide/4194183/
• http://link.springer.com/chapter/10.1007%2F978-981-10-0281-6_118#page-1
• https://cran.r-project.org/web/packages/Rlof/index.html
• https://warrenmar.wordpress.com/tag/seasonal-hybrid-esd/
• https://ko.wikipedia.org/wiki/%EC%8A%A4%ED%8A%9C%EB%8D%98%ED%8A%B8_t_%EB%B6%84%ED%8F%AC
• https://en.wikipedia.org/wiki/Soft_computing
• https://www.google.com/trends/explore#q=anomaly%2C%20%2Fm%2F02vnd10%2C%20%2Fm%2F0bs2j8q&cmpt=q&tz=Etc%2FGMT-9
• http://www.slideserve.com/sidonie/data-mining-for-anomaly-detection
• http://www.physics.csbsju.edu/stats/box2.html
• http://study.com/academy/lesson/maximums-minimums-outliers-in-a-data-set-lesson-quiz.html
• http://www.sfu.ca/~jackd/Stat203/Wk02_1_Full.pdf
• http://slideplayer.com/slide/6321088/
• http://gim.unmc.edu/dxtests/roc3.htm
• http://www.cs.ru.nl/~tomh/onderwijs/dm/dm_files/roc_auc.pdf
• http://togaware.com/papers/dawak02.pdf
• https://en.wikipedia.org/wiki/Grubbs%27_test_for_outliers
• https://github.com/twitter/AnomalyDetection
• https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series
