Statistical Learning Based Anomaly Detection @ Twitter
Arun Kejariwal (@arun_kejariwal)
Joint work with Jordan Hochenbaum and Owen Vallis
November 2014

Internet trends
• Real-time [1]
[1] http://techcrunch.com/2014/05/05/amazon-extends-its-shopping-cart-to-twitter/

Twitter: Global Town Square

Data Fidelity
• Data-driven decision making
  ❑ Evolving product landscape
• Data partners
  ❑ Nielsen
  ❑ Dataminr
• Operational
  ❑ Performance and availability

Data Fidelity: Challenges
• Anomalies
  ❑ Exogenic factors
    ▪ User behavior
    ▪ Events
    ▪ Data center
  ❑ Endogenic factors
    ▪ Agile development
      o Fail fast
    ▪ Data collection
• Millions of time series [1,2]
  ❑ Scalability
[1] http://strata.oreilly.com/2013/09/how-twitter-monitors-millions-of-time-series.html
[2] http://strataconf.com/strata2014/public/schedule/detail/32431

Anomaly Detection: Why Bother?
• Analyze user engagement
  ❑ Events
    ▪ Super Bowl, Japanese New Year
  ❑ Year-over-year analysis (input to forecasting)
• Identify attacks
  ❑ DoS
  ❑ Malware attacks
• Identify bots
  ❑ Separating actual users from spam

Anomaly Detection
• Visual
  ❑ Prone to errors
  ❑ Not scalable
    ▪ Machine-generated data: 11% of the digital universe in 2005, growing to > 40% by 2020 [1]
    ▪ Cloud infrastructure: 2013-2017 CAGR ~50% [2]
• Algorithmic approach
  ❑ Automate!
[1] http://www.emc.com/about/news/press/2012/20121211-01.htm
[2] http://www.forbes.com/sites/gilpress/2013/12/12/16-1-billion-big-data-market-2014-predictions-from-idc-and-iia/

Anomaly Detection: Background
• Over 50 years of research [1]
  ❑ Statistics
    ▪ Extreme Value Theory
    ▪ Robust Statistics, Grubbs' Test, ESD
  ❑ Econometrics
  ❑ Finance
    ▪ Value at Risk (VaR)
  ❑ Signal Processing
  ❑ Music Information Retrieval
  ❑ Networking
  ❑ E-Commerce
  ❑ Performance Regression
[1] “Anomaly Detection” by Chandola et al. ACM Computing Surveys, 2009.
[Photos: Jon from Etsy; Toufic from Metafor]

Anomaly Detection: Overview
• Definition
  ❑ “An anomaly is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [1,2]
[1] “Identification of Outliers” by Douglas M. Hawkins. London: Chapman and Hall, 1980.
[2] “Outlier Analysis” by Charu C. Aggarwal. Springer, 2013.

Anomaly Detection
• Characterization
  ❑ Magnitude
  ❑ Width
  ❑ Frequency
  ❑ Direction

Anomaly Detection (contd.)
• Two flavors
  ❑ Global
    ▪ Max value
  ❑ Local
    ▪ Intra-day
[Figure: time series with one anomaly labeled "Global" and one labeled "Local"]

Anomaly Detection (contd.)
• Traditional approaches
  ❑ Metrics
    ▪ Mean μ
    ▪ Standard deviation σ
  ❑ Rule of thumb
    ▪ μ + 3*σ
  ❑ Which time series?
    ▪ Raw
    ▪ Moving averages
      o SMA, EWMA, PEWMA
[Figure: time series with the 3 * σ threshold marked]
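The rule of thumb above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `ewma` and `three_sigma_flags`, the smoothing weight `alpha=0.3`, and the toy series are all hypothetical and do not come from the slides. It applies the μ + 3σ threshold to a raw series and to its EWMA-smoothed version.

```python
from statistics import mean, pstdev

def ewma(series, alpha=0.3):
    """Exponentially weighted moving average; alpha is an assumed smoothing weight."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

def three_sigma_flags(series, k=3.0):
    """Return the indices of points that lie above mu + k * sigma."""
    mu = mean(series)
    sigma = pstdev(series)
    threshold = mu + k * sigma
    return [i for i, x in enumerate(series) if x > threshold]

# Toy data (hypothetical): a flat series with one spike at the end.
raw = [10] * 20 + [100]
flags_raw = three_sigma_flags(raw)            # threshold computed on the raw series
flags_smoothed = three_sigma_flags(ewma(raw)) # threshold computed on the EWMA series
print(flags_raw, flags_smoothed)
```

Note that the spike itself enters the computation of μ and σ, so a large anomaly raises the very threshold it is compared against.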

Anomaly Detection (contd.)
• Impact of multi-modal distribution
  ❑ Shifts μ by ~0.2%
  ❑ Inflates σ by 4.5%
    ▪ Misses quite a few anomalies
  ❑ What do multiple modes correspond to?
    ▪ Seasonality

Anomaly Detection (contd.)
• Robust Statistics
  ❑ MAD (Median Absolute Deviation)
    ▪ Robust: breakdown point
      o Median 50% vs. mean 0%
  ❑ σ_MAD = K * MAD
    ▪ K = 1.4826 for normally distributed data
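As a minimal sketch of the robust scale estimate on this slide, the following Python computes MAD and σ_MAD = K * MAD. The function names `mad` and `sigma_mad` and the sample values are hypothetical; only the constant K = 1.4826 comes from the slide.

```python
from statistics import median, pstdev

def mad(values):
    """Median absolute deviation: median(|x - median(x)|)."""
    m = median(values)
    return median(abs(v - m) for v in values)

def sigma_mad(values, k=1.4826):
    """Robust estimate of sigma; K = 1.4826 assumes normally distributed data."""
    return k * mad(values)

# Toy data (hypothetical): one extreme value among otherwise small numbers.
values = [1, 2, 3, 4, 100]
print("sigma (classical):", pstdev(values))
print("sigma (MAD-based):", sigma_mad(values))
```

The single extreme value dominates the classical standard deviation, while the MAD-based estimate is driven by the bulk of the data, which is the practical meaning of the 50% breakdown point of the median.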

Anomaly Detection (contd.)
• Limitations of using MAD

Anomaly Detection (contd.)
• Grubbs' Test
  ❑ Critical value is derived from the data at a chosen significance level (α)
• Limitations
  ❑ Assumes the data is normally distributed
  ❑ Detects ONLY one outlier
  ❑ Seasonality unaware
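For reference, the standard two-sided form of Grubbs' test can be written as follows. The notation is an assumption, since the slide gives no formula: n is the sample size, x̄ and s are the sample mean and standard deviation, and t is the upper critical value of the t distribution with n − 2 degrees of freedom at level α/(2n).

```latex
G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s},
\qquad
\text{reject } H_0 \text{ (no outlier) if } \;
G > \frac{n-1}{\sqrt{n}}
\sqrt{\frac{t^{2}_{\alpha/(2n),\,n-2}}{n - 2 + t^{2}_{\alpha/(2n),\,n-2}}}
```

Because both x̄ and s are computed from the full sample, a second outlier can inflate s and mask the first, which is why the test is suited to a single outlier.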

Anomaly Detection (contd.)
• ESD (Generalized Extreme Studentized Deviate) [1]
  ❑ Critical value (λ_i) re-calculated every iteration
  ❑ Largest i such that R_i > λ_i determines the number of anomalies
  ❑ An upper bound on the number of anomalies is an input parameter
• Limitations
  ❑ Generalized ESD assumes a “normal” distribution
  ❑ Seasonality unaware
[1] Rosner, Bernard. “Percentage Points for a Generalized ESD Many-outlier Procedure.” Technometrics 25, no. 2 (1983): 165–172.
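The test statistic R_i and critical value λ_i mentioned above follow the standard formulation of the generalized ESD procedure. The symbols n (sample size), k (upper bound on the number of anomalies), and α (significance level) are notation assumed here, not taken from the slide.

```latex
R_i = \frac{\max_{j} \lvert x_j - \bar{x} \rvert}{s}
\quad \text{(computed on the } n - i + 1 \text{ observations still remaining)},
\\[4pt]
\lambda_i = \frac{(n-i)\, t_{p,\,n-i-1}}
                 {\sqrt{\left(n-i-1+t_{p,\,n-i-1}^{2}\right)(n-i+1)}},
\qquad p = 1 - \frac{\alpha}{2(n-i+1)},
\qquad i = 1, \dots, k
```

After each iteration the observation that maximizes |x_j − x̄| is removed, and x̄ and s are recomputed before the next R_i is evaluated.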

Our Approach

Anomaly Detection (contd.)
• Addressing seasonality
  ❑ Key idea
    ▪ Time series decomposition

Anomaly Detection (contd.)
• Determining seasonal component
  ❑ Regression on sub-cycle plots [1]
[1] “STL: A seasonal-trend decomposition procedure based on loess” by Cleveland et al. Journal of Official Statistics, Vol. 6, Issue 1, 1990.

Anomaly Detection (contd.)
• Impact of removing the seasonal and trend components
  ❑ Transforms our multi-modal data into unimodal data
    ▪ Amenable to ESD/MAD!
The decomposed residual becomes unimodal, which significantly shrinks σ.
The original multi-modal raw data has a much larger σ, which leads ESD to miss many of the outliers.

Anomaly Detection (contd.)
• Challenges remain!
Trend smoothing distortion creates “phantom” anomalies

Anomaly Detection (contd.)
• Marrying robust statistics with seasonal decomposition
The median is free from distortion

Anomaly Detection (contd.)
• Applying ESD on the residual
Decomposition exposes anomalies

Anomaly Detection (contd.)
• Recap
  ❑ Extract the seasonal component using STL
    ▪ Filters out periodic spikes
  ❑ Residual = Raw - Seasonal_raw - Median_raw
  ❑ Run ESD on the residual (using median and MAD)
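The recap steps can be sketched end to end in plain Python. This is a simplified illustration, not the production implementation: a per-phase median stands in for the STL seasonal component, the t quantile in the ESD critical value is replaced by a normal quantile (close for large samples and slightly more liberal for small ones), and every function name, parameter, and data value below is hypothetical.

```python
import math
from statistics import NormalDist, median

def sigma_mad(values, k=1.4826):
    """Robust scale estimate: K * median(|x - median(x)|)."""
    m = median(values)
    return k * median(abs(v - m) for v in values)

def seasonal_component(series, period):
    """Centered per-phase median: a crude stand-in for STL (assumption)."""
    phase = [median(series[p::period]) for p in range(period)]
    center = median(phase)
    return [phase[i % period] - center for i in range(len(series))]

def robust_esd(residual, max_anoms, alpha=0.05):
    """Generalized ESD using median and MAD instead of mean and std."""
    n = len(residual)
    remaining = list(enumerate(residual))
    removed, n_anoms = [], 0
    for i in range(1, max_anoms + 1):
        values = [v for _, v in remaining]
        center, scale = median(values), sigma_mad(values)
        if scale == 0:
            break
        # Most extreme remaining point and its test statistic R_i.
        idx, val = max(remaining, key=lambda iv: abs(iv[1] - center))
        r_i = abs(val - center) / scale
        remaining.remove((idx, val))
        removed.append(idx)
        # Critical value lambda_i, with a normal approximation to the t quantile.
        p = 1 - alpha / (2 * (n - i + 1))
        t = NormalDist().inv_cdf(p)
        lam = (n - i) * t / math.sqrt((n - i - 1 + t * t) * (n - i + 1))
        if r_i > lam:
            n_anoms = i  # largest i with R_i > lambda_i
    return sorted(removed[:n_anoms])

def detect_anomalies(series, period, max_anoms):
    """Residual = Raw - Seasonal - Median(Raw), then robust ESD on the residual."""
    seasonal = seasonal_component(series, period)
    med = median(series)
    residual = [x - s - med for x, s in zip(series, seasonal)]
    return robust_esd(residual, max_anoms)

# Toy data (hypothetical): period-4 pattern, small deterministic noise, one spike.
pattern = [10, 20, 30, 20]
series = [pattern[i % 4] + ((7 * i) % 5 - 2) * 0.1 for i in range(48)]
series[17] += 15
anomalies = detect_anomalies(series, period=4, max_anoms=3)
print(anomalies)
```

On the raw toy series the seasonal swing between 10 and 30 dwarfs the noise, so the spike is judged against a wide spread; once the seasonal component and median are subtracted, the residual spread is set by the noise alone and the spike stands out clearly.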

Anomaly Detection (contd.)
• Illustrative example

Anomaly Detection (contd.)
• Applications
  ❑ Three perspectives
    ▪ Capacity
      o CPU utilization
      o Garbage collection
      o Network activity
    ▪ User behavior
      o Events
        • Impressions
        • Link clicks
      o Spam
    ▪ Forecasting

Anomaly Detection (contd.)
• Deployed in production
  ❑ Used by a large number of services at Twitter
  ❑ Automatic e-mail notification
    ▪ Only sent if anomalies are present
    ▪ Anomalies annotated
    ▪ CSV with anomaly locations attached

Open Sourcing
• Skyline from Etsy
  ❑ https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py
• Coming soon!
  ❑ R package

Join the Flock
Like problem solving? Like challenges? Be at the cutting edge. Make an impact.
• We are hiring!!
  ❑ https://twitter.com/JoinTheFlock
  ❑ https://twitter.com/jobs
  ❑ Contact us: @arun_kejariwal
