Anomaly detection

Introduction to anomaly detection

Published in: Data & Analytics

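  1. 1. Anomaly Detection (이상 감지). Advanced Intelligence Technology Research Society (고등 지능 기술 연구회). Kim Cheol (김철), 2016-07-09
  2. 2. What is anomaly detection? An anomaly is a sample that deviates from the mainstream of the data. In data mining, anomaly detection is the identification of items, events, or observations that do not conform to an expected pattern or normal range. Such a sample is called an outlier.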
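  3. 3. What is anomaly detection? (cont.) The minimum and maximum of a dataset are not necessarily outliers (Min/Max ≠ Outlier). 1.5×IQR rule: with the interquartile range IQR = Q3 − Q1, values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are treated as outliers. [Figure: box plot with Min and Max marked]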
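A minimal sketch of the 1.5×IQR rule from this slide, using only the Python standard library. The sample values and the function name `iqr_outliers` are made up for illustration, and the quartile method ("inclusive") is an assumption; other quartile conventions give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k*IQR rule."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical sample: the minimum (10) is not an outlier, the maximum (102) is.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```

In this sample the smallest value stays inside the fences while the largest does not, which is the point of "Min/Max ≠ Outlier": being an extreme of the data is not the same as being flagged.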
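  4. 4. What is anomaly detection? (cont.) An anomalous value is typically interpreted as a symptom of a problem. It is a rare phenomenon that does not follow the usual statistical definition of normal behavior.
  5. 5. What is anomaly detection? (cont.) A clustering algorithm can detect the micro-clusters formed by anomalous patterns.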
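  6. 6. History. Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986. Early approaches relied on thresholds for normal behavior, preprocessing with statistics, soft computing, and inductive learning.
  7. 7. History (cont.)
  8. 8. Applications. Cyber intrusion detection, credit card fraud detection, fault detection, system health monitoring, IoT, etc. Detecting ecosystem disturbances. Frequently used to remove anomalous values from data.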
  9. 9. Three categories
      1. Unsupervised anomaly detection
         - Detects anomalies in unlabeled data
         - Example: anomaly detection with the K-means clustering algorithm
      2. Supervised anomaly detection
         - Both Normal and Abnormal labels exist
         - Uses a classification model (SVM, Random forests, Logistic, Robust, KNN, etc.)
  10. 10. Three categories (cont.)
      3. Semi-supervised anomaly detection
         - Only the Normal label exists; anomalous values are extracted by comparing the likelihood produced by the model of normal behavior
         - NKIA’s LRSTSD based Anomaly Detection
         - Twitter’s Seasonal Hybrid ESD (S-H-ESD) based Anomaly Detection
      [Figures: NKIA’s Anomaly Detection; Twitter’s Anomaly Detection]
  11. 11. Input data: Univariate, Multivariate
  12. 12. Input data (cont.) Data structure: Binary, Categorical, Continuous, Hybrid
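  13. 13. Types of anomalies. Point Anomalies: values that fall outside the bulk of the dataset.
  14. 14. Types of anomalies (cont.) Contextual Anomalies: values that are out of place in their context. They require a notion of context. Also referred to as conditional anomalies (Rules).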
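  15. 15. Types of anomalies (cont.) Collective Anomalies: a group of related data instances that is anomalous as a whole, even if the individual values are not anomalous on their own.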
  16. 16. Output of Anomaly Detection
      Label
      - Label of normal or anomaly
      - In a classification approach: true|false or a class
      Score
      - Rank
      - Range: 0 to 1
      - Requires a threshold parameter
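  17. 17. Evaluating anomaly detection
      F-Measure
      - Used to evaluate supervised learning (classification) approaches
      - Formula:
        Recall (R) = TP / (TP + FN)
        Precision (P) = TP / (TP + FP)
        F-measure = 2 × R × P / (R + P)
      The Area Under an ROC Curve
      - AUC (Area Under the Curve)
      - Plots the detection rate (true positive rate) against the false alarm rate (false positive rate)
      - Range: 0 to 1
      - Equation: $AUC = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\mathbf{1}[p_i > p_j]$, where m = number of actual positives, n = number of actual negatives, $p_i$ = score assigned to the i-th positive, $p_j$ = score assigned to the j-th negative
      Confusion matrix (cross table)
                             Actual Normal   Actual Anomaly
        Predicted Normal     TP              FP
        Predicted Anomaly    FN              TN
      Rating table
        Score          Label
        0.90 to 1      Excellent (A)
        0.80 to 0.90   Good (B)
        0.70 to 0.80   Fair (C)
        0.60 to 0.70   Poor (D)
        0.50 to 0.60   Fail (F)
      [Figure: ROC (Receiver Operating Characteristic) curves]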
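The evaluation measures on this slide can be sketched in a few lines of standard-library Python. The counts and scores below are invented for illustration, the function names are hypothetical, and counting a tied pair as 0.5 in the AUC is a common convention assumed here rather than something the slide states.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # assumed tie handling
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f_measure = precision_recall_f(tp=8, fp=2, fn=4)

# Hypothetical scores for 3 actual positives and 4 actual negatives.
auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
print(precision, recall, f_measure, auc)
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; a rank-based computation gives the same value more cheaply on large data.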
  18. 18. Taxonomy*
  19. 19. Well-known anomaly detection techniques
  20. 20. Twitter’s Anomaly Detection R pack. Twitter open-sourced their R package for anomaly detection. They call their algorithm Seasonal Hybrid ESD (S-H-ESD), which is built on Generalized ESD. Sometimes anomalies can mess up your modeling.
  21. 21. Twitter’s Anomaly Detection R pack.(cont.)
        # Install the package from GitHub (requires devtools) and the plotting dependencies
        install.packages("devtools")
        devtools::install_github("twitter/AnomalyDetection")
        install.packages("gtable")
        install.packages("scales")
        library(AnomalyDetection)
        # raw_data is the example time series bundled with the package
        data(raw_data)
        res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=TRUE)
        res$plot
  22. 22. Twitter’s Anomaly Detection R pack.(cont.)
        # Load observations from a CSV file and run detection with an explicit period
        v <- read.csv("D:/r/tsd_paper/cpu_5m_02.csv")
        res2 = AnomalyDetectionVec(v, max_anoms=0.02, period=72, direction='both', plot=TRUE)
        res2$plot
  23. 23. Twitter’s Anomaly Detection R pack.(cont.)
      Usage
        AnomalyDetectionTs(x, max_anoms = 0.1, direction = "pos", alpha = 0.05,
                           only_last = NULL, threshold = "None", e_value = FALSE,
                           longterm = FALSE, piecewise_median_period_weeks = 2,
                           plot = FALSE, y_log = FALSE, xlabel = "", ylabel = "count",
                           title = NULL, verbose = FALSE)
      Arguments
      - x : Time series as a two column data frame where the first column consists of the timestamps and the second column consists of the observations.
      - max_anoms : Maximum number of anomalies that S-H-ESD will detect as a percentage of the data.
      - direction : Directionality of the anomalies to be detected. Options are: 'pos' | 'neg' | 'both'.
      - alpha : The level of statistical significance with which to accept or reject anomalies.
      - only_last : Find and report anomalies only within the last day or hr in the time series. NULL | 'day' | 'hr'.
      - threshold : Only report positive going anoms above the threshold specified. Options are: 'None' | 'med_max' | 'p95' | 'p99'.
      - e_value : Add an additional column to the anoms output containing the expected value.
      - longterm : Increase anom detection efficacy for time series that are greater than a month. See Details below.
      - piecewise_median_period_weeks : The piecewise median time window as described in Vallis, Hochenbaum, and Kejariwal (2014). Defaults to 2.
  24. 24. Twitter’s Anomaly Detection R pack.(cont.)
      Arguments (cont.)
      - plot : A flag indicating if a plot with both the time series and the estimated anoms, indicated by circles, should also be returned.
      - y_log : Apply log scaling to the y-axis. This helps with viewing plots that have extremely large positive anomalies relative to the rest of the data.
      - xlabel : X-axis label to be added to the output plot.
      - ylabel : Y-axis label to be added to the output plot.
      - title : Title for the output plot.
      - verbose : Enable debug messages.
  25. 25. Twitter’s Anomaly Detection R pack.(cont.) To understand how Twitter’s algorithm works, you need to know:
      - Student t-distribution
      - Extreme Studentized Deviate (ESD) test
      - Generalized ESD
      - Linear regression
      - LOESS
      - STL (Seasonal Trend LOESS)
  26. 26. Twitter’s Anomaly Detection R pack.(cont.) Student t-distribution: a distribution used mainly when estimating the mean of a normally distributed population. [Figure: PDF of the t-distribution]
  27. 27. Twitter’s Anomaly Detection R pack.(cont.) Extreme Studentized Deviate (ESD) test
  28. 28. Twitter’s Anomaly Detection R pack.(cont.) Generalized ESD
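The formulas for these two slides did not survive in the transcript. For reference, the generalized ESD test is usually stated as follows (this is the textbook form, not taken from the slides), assuming n observations, an upper bound r on the number of outliers, and significance level α:

```latex
\begin{align}
R_i &= \frac{\max_j \lvert x_j - \bar{x} \rvert}{s}, \\
\lambda_i &= \frac{(n-i)\, t_{p,\,n-i-1}}
                  {\sqrt{\left(n-i-1+t_{p,\,n-i-1}^{2}\right)(n-i+1)}}, \\
p &= 1 - \frac{\alpha}{2(n-i+1)}, \qquad i = 1, \dots, r.
\end{align}
```

Here the mean $\bar{x}$ and standard deviation $s$ in $R_i$ are computed on the observations that remain after removing the $i-1$ points that maximized the earlier statistics, and $t_{p,\nu}$ is the $p$-quantile of the Student t-distribution with $\nu$ degrees of freedom. The number of outliers is the largest $i$ for which $R_i > \lambda_i$. The single-outlier ESD test of the previous slide corresponds to $r = 1$.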
  29. 29. Twitter’s Anomaly Detection R pack.(cont.) Seasonality (linear regression, LOESS, STL). The generalized ESD test works when the points come from an approximately normal distribution, but real data often has seasonality. This is where STL ("Seasonal and Trend decomposition using Loess") comes in. It decomposes the data into a seasonal part, a trend, and a remainder using local regression (LOESS), which fits a low-order polynomial to subsets of the data and stitches the fits together by weighting them. Once the trend and seasonal parts are removed, the remainder should be more or less normally distributed, and generalized ESD can be applied to it to detect anomalies. [Figures: Seasonality; Local regression (LOESS); Polynomial regression]
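The idea of removing seasonality before testing the remainder can be illustrated with a deliberately simplified sketch. It is not STL and not S-H-ESD: the seasonal part is estimated as a per-phase median (no trend, no LOESS), and the remainder is screened with a modified z-score built from the median and MAD, with the conventional 0.6745 constant and 3.5 cutoff assumed. The series and function names are invented.

```python
import statistics

def seasonal_residuals(series, period):
    """Subtract a crude seasonal estimate (the median of each phase) from every point."""
    phase_medians = [statistics.median(series[phase::period]) for phase in range(period)]
    return [x - phase_medians[i % period] for i, x in enumerate(series)]

def robust_outliers(residuals, threshold=3.5):
    """Return indices whose median/MAD-based modified z-score exceeds the threshold."""
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    if mad == 0:
        # Degenerate case: more than half the residuals are identical.
        return [i for i, r in enumerate(residuals) if r != med]
    return [i for i, r in enumerate(residuals)
            if 0.6745 * abs(r - med) / mad > threshold]

# Hypothetical series with period 4; index 9 holds 28 where about 20 is expected.
series = [
    10, 21, 30, 19,
    11, 20, 29, 20,
    10, 28, 31, 21,
     9, 19, 30, 20,
    10, 20, 30, 19,
    11, 21, 29, 20,
]
anomalies = robust_outliers(seasonal_residuals(series, period=4))
print(anomalies)
```

The flagged value lies below the seasonal peaks of the raw series, so a single global threshold would miss it; it only stands out once the seasonal pattern has been subtracted.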
  30. 30. Twitter: Introducing practical and robust anomaly detection in a time series
      Global/Local: At Twitter, we observe distinct seasonal patterns in most of the time series.
      - Global: anomalies that typically extend above or below the expected seasonality and are therefore not subject to seasonality and underlying trend.
      - Local: anomalies that occur inside seasonal patterns; they are masked and thus much more difficult to detect in a robust fashion.
      Positive/Negative
      - Positive: e.g., a surge in Tweets during the Super Bowl (used for capacity planning for events).
      - Negative: e.g., a drop in queries per second (QPS), which can reveal potential hardware or data collection issues.
  31. 31. Subspace- and correlation-based outlier detection for high-dimensional data. Reduce dimensionality using principal component analysis (PCA) or factor analysis, then detect anomalies by computing the contrast of subspaces.
  32. 32. Subspace- and correlation-based outlier detection for high-dimensional data.(cont.) HiCS: High Contrast Subspaces for Density-Based Outlier Ranking
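  33. 33. RNN (Replicator neural networks). A method that reproduces the input pattern at the output while minimizing the reconstruction error. A model of normal behavior is built and used to extract anomalous values. [Figure: A schematic view of a fully connected Replicator Neural Network.] Outlier factor: $OF_i = \frac{1}{n}\sum_{j=1}^{n}(x_{ij} - o_{ij})^2$, where $OF_i$ = anomaly factor score of the i-th record, $n$ = number of features, $x_{ij}$ = observed value of column j of the i-th record, $o_{ij}$ = value of column j of the i-th record as reproduced by the RNN.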
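A small sketch of how the score defined on this slide could be computed once a replicator network has produced its outputs. It assumes the outlier factor is the mean squared reconstruction error over the n features; the network itself is not implemented, and the "reconstructed" rows below are made-up placeholders for its output.

```python
def outlier_factors(observed, reconstructed):
    """Mean squared reconstruction error per record (assumed form of OF_i)."""
    factors = []
    for x_row, o_row in zip(observed, reconstructed):
        n = len(x_row)  # number of features
        factors.append(sum((x - o) ** 2 for x, o in zip(x_row, o_row)) / n)
    return factors

# Hypothetical data: the second record is reproduced poorly in its first column.
observed = [[0.1, 0.2, 0.3], [0.9, 0.1, 0.5]]
reconstructed = [[0.1, 0.2, 0.3], [0.4, 0.1, 0.5]]
factors = outlier_factors(observed, reconstructed)
print(factors)
```

Records that the model of normal behavior reproduces well get a score near zero, and poorly reproduced records get larger scores, which can then be ranked or thresholded.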
  34. 34. LOF (Local Outlier Factor). Density-based anomaly detection using KNN. It provides a score, which makes results easy to interpret, but it incurs some delay time. Unsupervised anomaly detection. Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. [Figure: A has a much lower density than its neighbors]
  35. 35. LOF (Local Outlier Factor) (cont.) [Formula: LOF definition] [Figure: Illustration of the reachability distance. Objects B and C have the same reachability distance (k=3), while D is not a k nearest neighbor]
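Since the formula on this slide is not in the transcript, here is a minimal standard-library sketch of LOF following the usual definitions: the reachability distance of a from b is max(k-distance(b), d(a, b)), the local reachability density is the inverse of the mean reachability distance to the k nearest neighbors, and LOF is the mean neighbor density divided by the point's own density. The points, k, and function names are illustrative assumptions, and duplicate points (zero distances) are not handled.

```python
import math

def k_nearest(points, i, k):
    """Return (k-distance, neighbor indices) for point i, keeping ties at the k-distance."""
    dists = sorted((math.dist(points[i], p), j) for j, p in enumerate(points) if j != i)
    k_dist = dists[k - 1][0]
    neighbors = [j for d, j in dists if d <= k_dist]
    return k_dist, neighbors

def lof_scores(points, k):
    """Local Outlier Factor for every point (brute force, O(n^2) distances)."""
    n = len(points)
    knn = [k_nearest(points, i, k) for i in range(n)]

    def reach_dist(a, b):
        # Reachability distance of a from b.
        return max(knn[b][0], math.dist(points[a], points[b]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        neighbors = knn[i][1]
        mean_reach = sum(reach_dist(i, j) for j in neighbors) / len(neighbors)
        lrd.append(1.0 / mean_reach)

    # LOF: average density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for j in knn[i][1]) / (len(knn[i][1]) * lrd[i])
            for i in range(n)]

# Hypothetical data: four points on a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print(scores)
```

Scores near 1 mean a point is about as dense as its neighbors; the distant point gets a score well above 1 because its neighbors sit in a much denser region than it does.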
  36. 36. LOF(Local Outlier Factor)(cont.) LOF scores as visualized by ELKI. While the upper right cluster has a comparable density to the outliers close to the bottom left cluster, they are detected correctly.
  37. 37. LOF (Local Outlier Factor) (cont.) [Figure: LOF scores of CPU utilization vs. time, computed with Rlof]
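  38. 38. LRSTSD (Log regression seasonality based approach of time series decomposition). Anomaly score formula, with $E_i$ = i-th error, $A_i$ = i-th observed value, $U_i$ = i-th predicted upper bound, $L_i$ = i-th predicted lower bound, $P$ = overall value (Parameter). [Figures: Anomaly score; 1-day network traffic (Tx); 7-day network traffic (Tx)]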
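  39. 39. Conclusion
      - Anomaly detection can remove noise when building a predictive model → improved prediction accuracy is expected.
      - It detects false readings and data collection failures → appropriate responses such as resampling and correction become possible.
      - Analyzing the association between observed anomalies and problems → use as an early detection technique for problems → failure prediction.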