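8. Applications
Cyber intrusion detection, credit card fraud detection, fault detection, system health monitoring, IoT, etc.
Detecting ecosystem disturbances
Often used to remove anomalous values from data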
9. Three categories
1. Unsupervised anomaly detection
- Detects anomalies in unlabeled data
- Example: anomaly detection with the K-means clustering algorithm
2. Supervised anomaly detection
- Both Normal and Abnormal labels are available
- Uses classification models (SVM, Random forests, Logistic, Robust, KNN, etc.)
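As a rough illustration of the unsupervised case, the sketch below clusters unlabeled scalar values with a plain K-means loop and scores each point by its distance to the nearest centroid. The function names, the quantile-based initialization, and the "3 times the median score" cutoff are assumptions made for this example; they are not taken from the slides.

```python
from statistics import median

def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd-style K-means on scalar values; returns the centroids."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: spread the initial centroids over the sorted data.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def anomaly_scores(values, centroids):
    """Score = distance from each value to its nearest centroid."""
    return [min(abs(v - c) for c in centroids) for v in values]

data = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8, 9.5]
centroids = kmeans_1d(data, k=2)
scores = anomaly_scores(data, centroids)
threshold = 3 * median(scores)  # assumed cutoff, purely illustrative
flagged = [v for v, s in zip(data, scores) if s > threshold]
print(flagged)
```

No labels are used anywhere: the value far from both clusters stands out only because of its distance to the learned centroids.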
10. Three categories (cont.)
3. Semi-supervised anomaly detection
- Only Normal labels are available; anomalies are extracted by comparing the likelihood that a model of normal behavior assigns to each observation
- NKIA’s LRSTSD based Anomaly Detection
- Twitter’s Seasonal Hybrid ESD (S-H-ESD) based Anomaly Detection
[Figures: NKIA’s Anomaly Detection; Twitter’s Anomaly Detection]
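The semi-supervised idea (fit a model on normal-only data, then compare likelihoods) can be sketched as follows. This is a minimal example that assumes a single Gaussian as the normal model and uses the lowest training log-likelihood as the cutoff; both choices are assumptions for illustration and are unrelated to how LRSTSD or S-H-ESD work internally.

```python
from math import log, pi
from statistics import mean, stdev

def fit_normal_model(samples):
    """Fit a Gaussian to data that is known to be normal (non-anomalous)."""
    return mean(samples), stdev(samples)

def log_likelihood(x, model):
    """Log-density of x under the fitted Gaussian."""
    mu, sigma = model
    z = (x - mu) / sigma
    return -0.5 * z * z - log(sigma) - 0.5 * log(2 * pi)

# Training data carries only the Normal label.
normal_only = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4]
model = fit_normal_model(normal_only)

# Assumed cutoff: the least likely value seen during training.
threshold = min(log_likelihood(x, model) for x in normal_only)

new_observations = [10.0, 10.3, 12.5, 7.0]
flagged = [x for x in new_observations if log_likelihood(x, model) < threshold]
print(flagged)
```

Observations that the normal model considers less likely than anything seen in training are reported as anomalies.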
20. Twitter’s Anomaly Detection R pack.
Twitter open-sourced their R package for anomaly detection.
They call their algorithm Seasonal Hybrid ESD (S-H-ESD), which is built on Generalized ESD.
Anomalies can distort a model fitted to the data, so it helps to detect them before modeling.
21. Twitter’s Anomaly Detection R pack.(cont.)
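# Install devtools and the plotting dependencies, then the package from GitHub
install.packages("devtools")
install.packages("gtable")
install.packages("scales")
devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)
# raw_data is the example data set bundled with the package
data(raw_data)
res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=TRUE)
# The result holds the detected anomalies (res$anoms) and the plot (res$plot)
res$plot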
22. Twitter’s Anomaly Detection R pack.(cont.)
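# Local path on the author's machine; the CSV is assumed to hold a single column of observations
v <- read.csv("D:/r/tsd_paper/cpu_5m_02.csv")
# period is the number of observations in one seasonal cycle
res2 = AnomalyDetectionVec(v, max_anoms=0.02, period=72, direction='both', plot=TRUE)
res2$plot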
23. Twitter’s Anomaly Detection R pack.(cont.)
Usage
AnomalyDetectionTs(x, max_anoms = 0.1, direction = "pos", alpha = 0.05, only_last = NULL, threshold = "None", e_value = FALSE, longterm = FALSE, piecewise_median_period_weeks = 2, plot = FALSE, y_log = FALSE, xlabel = "", ylabel = "count", title = NULL, verbose = FALSE)
Arguments
x : Time series as a two-column data frame where the first column consists of the timestamps and the second column consists of the observations.
max_anoms : Maximum number of anomalies that S-H-ESD will detect as a percentage of the data.
direction : Directionality of the anomalies to be detected. Options are: 'pos' | 'neg' | 'both'.
alpha : The level of statistical significance with which to accept or reject anomalies.
only_last : Find and report anomalies only within the last day or hr in the time series. NULL | 'day' | 'hr'.
threshold : Only report positive going anoms above the threshold specified. Options are: 'None' | 'med_max' | 'p95' | 'p99'.
e_value : Add an additional column to the anoms output containing the expected value.
longterm : Increase anom detection efficacy for time series that are greater than a month.
piecewise_median_period_weeks : The piecewise median time window as described in Vallis, Hochenbaum, and Kejariwal (2014). Defaults to 2.
24. Twitter’s Anomaly Detection R pack.(cont.)
Arguments (cont.)
plot : A flag indicating if a plot with both the time series and the estimated anoms, indicated by circles, should also be returned.
y_log : Apply log scaling to the y-axis. This helps with viewing plots that have extremely large positive anomalies relative to the rest of the data.
xlabel : X-axis label to be added to the output plot.
ylabel : Y-axis label to be added to the output plot.
title : Title for the output plot.
verbose : Enable debug messages.
25. Twitter’s Anomaly Detection R pack.(cont.)
To understand how Twitter’s algorithm works, you need to know:
- Student t-distribution
- Extreme Studentized Deviate (ESD) test
- Generalized ESD
- Linear regression
- LOESS
- STL (Seasonal and Trend decomposition using LOESS)
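To make the Generalized ESD item concrete, here is a minimal sketch of the two-sided test in its textbook form: repeatedly remove the most extreme point, compare its studentized deviation with a critical value derived from the t-distribution, and keep the largest number of removals that still passes the test. The function names are chosen for this example, and the t quantile is approximated with a short series expansion around the normal quantile because the Python standard library has no t-distribution; a statistics library would normally be used instead. This is not the code of Twitter's package.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def t_quantile(p, df):
    """Approximate Student-t quantile (series expansion around the normal quantile)."""
    z = NormalDist().inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    return z + g1 / df + g2 / df**2

def generalized_esd(values, max_outliers, alpha=0.05):
    """Return the values flagged as outliers by the generalized ESD test."""
    remaining = list(values)
    n = len(values)
    removed, n_outliers = [], 0
    for i in range(1, max_outliers + 1):
        m, s = mean(remaining), stdev(remaining)
        if s == 0:
            break
        # Test statistic R_i: the largest studentized deviation.
        candidate = max(remaining, key=lambda v: abs(v - m))
        r_i = abs(candidate - m) / s
        # Critical value lambda_i for the i-th removal.
        p = 1 - alpha / (2 * (n - i + 1))
        t = t_quantile(p, n - i - 1)
        lam = (n - i) * t / sqrt((n - i - 1 + t * t) * (n - i + 1))
        remaining.remove(candidate)
        removed.append(candidate)
        if r_i > lam:
            n_outliers = i  # keep the largest i whose statistic exceeds lambda_i
    return removed[:n_outliers]

data = [10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 9.7, 10.1, 9.9, 10.0,
        10.2, 9.8, 10.0, 10.1, 9.9, 30.0, 10.0, 9.9, 10.1, 10.0]
outliers = generalized_esd(data, max_outliers=3)
print(outliers)
```

Here max_outliers plays the role that max_anoms plays in the package (an upper bound on how many anomalies may be reported), and alpha is the significance level of each comparison.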
29. Twitter’s Anomaly Detection R pack.(cont.)
Seasonality(linear regression, LOESS, STL)
The generalized ESD test assumes that the points come from an approximately normal distribution, but real data usually has seasonality. This is where STL comes in. It decomposes the data into a seasonal component, a trend, and a remainder using local regression (LOESS), which fits a low-order polynomial to each local subset of the data and stitches the fits together by weighting the points. Once the trend and the seasonal component have been removed, the remainder should be more or less normally distributed, and generalized ESD can be applied to it to detect anomalies.
STL: “Seasonal and Trend decomposition using Loess”
[Figures: Seasonality; Local regression (LOESS); Polynomial regression]
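The "remove seasonality, then test the remainder" idea can be sketched without a full STL implementation. The example below is a simplified stand-in, not STL: it estimates the seasonal component as the median of each phase of the cycle, subtracts it together with the overall median, and flags residuals with a median/MAD-based score. The function names, the synthetic series, and the 3.5 cutoff are assumptions for illustration only.

```python
from statistics import median

def seasonal_residuals(series, period):
    """Subtract the overall median (level) and a per-phase median (seasonal part)."""
    level = median(series)
    seasonal = [median(series[phase::period]) - level for phase in range(period)]
    return [x - level - seasonal[i % period] for i, x in enumerate(series)]

def robust_outlier_indices(residuals, threshold=3.5):
    """Flag residuals whose median/MAD-based score exceeds the threshold."""
    center = median(residuals)
    mad = median([abs(r - center) for r in residuals])
    if mad == 0:
        return []
    return [i for i, r in enumerate(residuals)
            if 0.6745 * abs(r - center) / mad > threshold]

# Synthetic series: 8 cycles of a 6-step seasonal pattern plus small noise.
pattern = [0, 2, 5, 7, 5, 2]
series = [20 + pattern[i % 6] + ((i * 7) % 5 - 2) * 0.1 for i in range(48)]
series[24] += 6  # spike at a seasonal trough; still below the seasonal peak

flagged = robust_outlier_indices(seasonal_residuals(series, period=6))
print(flagged)
```

The spiked value stays below the ordinary seasonal peak, so a fixed threshold on the raw series would miss it; it only becomes visible after the seasonal component is removed.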
30. Twitter: Introducing practical and robust
anomaly detection in a time series
Global/Local
At Twitter, we observe distinct seasonal patterns in most of the time series.
Global: global anomalies typically extend above or below the expected seasonality and are therefore not subject to seasonality and the underlying trend.
Local: local anomalies occur inside seasonal patterns, are masked by them, and are thus much more difficult to detect in a robust fashion.
Positive/Negative
Positive: e.g., a surge in Tweets during the Super Bowl (used for capacity planning around events)
Negative: e.g., a drop in queries per second (QPS); helps uncover potential hardware or data collection issues
31. Subspace- and correlation-based outlier
detection for high-dimensional data.
Reduces dimensionality using principal component analysis (PCA) and factor analysis
Detects anomalies by computing the contrast of subspaces
32. Subspace- and correlation-based outlier
detection for high-dimensional data.(cont.)
HiCS: High Contrast Subspaces for Density-Based Outlier Ranking
33. RNN(Replicator neural networks)
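A network trained to reproduce its input pattern at the output while minimizing the error
Builds a model of normal behavior and uses it to extract anomalous values
[Figure: A schematic view of a fully connected Replicator Neural Network.]
OF_i = anomaly factor score of the i-th element
n = number of features
x_ij = observed value of column j for the i-th element
o_ij = value of column j for the i-th element as reproduced by the RNN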
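The symbols defined above are consistent with scoring each element by its mean squared reconstruction error, OF_i = (1/n) * sum over j of (x_ij - o_ij)^2. That form is assumed here, since the formula itself is not reproduced in the text. The sketch below only computes the score; the reconstructed rows are hypothetical values standing in for the output of a trained network.

```python
def outlier_factors(observed, reconstructed):
    """Mean squared difference between each observed row and its reconstruction."""
    return [
        sum((x - o) ** 2 for x, o in zip(x_row, o_row)) / len(x_row)
        for x_row, o_row in zip(observed, reconstructed)
    ]

observed = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [5.0, 0.0, 9.0],
]
# Hypothetical network outputs: the first two rows are reproduced well,
# the third is not, because it does not resemble the learned normal pattern.
reconstructed = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 5.0],
]

scores = outlier_factors(observed, reconstructed)
most_anomalous = max(range(len(scores)), key=lambda i: scores[i])
print(scores, most_anomalous)
```

Elements that the network reproduces poorly receive a high score, which is what makes the reconstruction error usable as an anomaly ranking.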
34. LOF(Local Outlier Factor)
Density-based anomaly detection using k-nearest neighbors (KNN)
Provides a score, so results are easy to interpret, but computing it involves some delay.
Unsupervised anomaly detection
[Figure: Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. A has a much lower density than its neighbors.]
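The density comparison behind LOF can be sketched in a few lines. This is a simplified version that assumes distinct points and takes exactly k neighbors per point (the full definition also includes every point tied at the k-distance); the function name and the sample points are made up for the example.

```python
from math import dist

def lof_scores(points, k):
    """Simplified Local Outlier Factor: values near 1 are inliers, larger means sparser."""
    n = len(points)
    # For each point, its k nearest other points as (distance, index).
    neighbors = []
    for i, p in enumerate(points):
        ranked = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append(ranked[:k])
    k_distance = [nb[-1][0] for nb in neighbors]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average density of the neighbors relative to the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
point_a = (5.0, 5.0)  # isolated point, playing the role of A in the figure
scores = lof_scores(cluster + [point_a], k=3)
print([round(s, 2) for s in scores])
```

Every pairwise distance is computed here, which is the kind of cost that makes LOF comparatively slow on large data sets.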
36. LOF(Local Outlier Factor)(cont.)
LOF scores as visualized by ELKI. While the upper right cluster has a
comparable density to the outliers close to the bottom left cluster, they
are detected correctly.
38. LRSTSD(Log regression seasonality based
approach of time series decomposition)
Anomaly score formula:
[Figures: Anomaly score; 1-day network traffic (Tx); 7-day network traffic (Tx)]
E_i = i-th error
A_i = i-th observed value
U_i = i-th predicted upper bound
L_i = i-th predicted lower bound
P = overall value (parameter)
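39. Conclusion
Anomaly detection can remove noise when building a predictive model
Expected to improve prediction accuracy
Detects false readings and data collection failures
Enables appropriate responses such as resampling and correction
Analyzes the relationship between observed anomalies and problems
Can be used as an early-detection technique for problems
Failure prediction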