3. eBay’s Challenges in Monitoring
Large Scale in Real Time
10+ large Hadoop clusters
10,000+ nodes
50,000+ jobs per day
50,000,000+ tasks per day
500+ types of Hadoop/HBase metrics
Billions of audit events per day
Various Business Logic
Hadoop
HBase
Spark
Data Security
Hardware
Cloud
Database
Complex and Scalable Policy
Join multiple data sources
Threshold based, windows based
Multiple metrics correlation
Metrics pre-aggregations
Machine learning based
Engineering Modularization
Varieties of data sources
Varieties of data collectors
Complex business logic
Alert rules can’t be hot deployed
Scalability issue with single process
4. What’s Eagle
A uniform monitoring and alerting framework for monitoring large-scale distributed systems such as Hadoop, Spark, and cloud platforms in real time.
Eagle = Eagle Framework + Eagle Apps
5. Eagle Ecosystem
Apps
DAM
JPA
HBase
Spark
Interface
Web Portal
REST Services
Ambari Plugin
Integration
Kafka
Storm
HBase
Druid
Elasticsearch
Eagle Framework
Provides a full-stack monitoring framework for efficiently developing highly scalable real-time monitoring applications.
Eagle Apps
Provide built-in monitoring applications for domains such as Hadoop, Spark, HBase, Storm, and cloud.
Eagle Integration
Integrates with distributed real-time execution environments such as Storm, message buses such as Kafka, and storage layers such as HBase, and also supports extensions.
Eagle Interface
Allows users to access or manage Eagle through the REST service, the web UI, or the Ambari plugin.
7. JPA: Job Performance Analyzer
Historical job analysis
Running job analysis
Anomaly host detection
Job data skew detection
Job performance suggestion
Anomaly Prediction based on machine learning
Monitor and analyze job performance in real-time
8. Historical Job Analyzer
• Job historical performance trend
• Task and attempt distribution
• Resource utilization at various levels (cluster/job/user/host)
• Historical performance anomaly detection
• TooLowBytesConsumedPerCPUSecond
• JobStatisticLongDuration
• TooLargeReduceNumAlert
• TooLargeShuffleSizeAlert
9. Running Job Analyzer
• Monitor running jobs in real time
• Minute-level job progress snapshots
• Minute-level resource usage snapshots
• CPU, HDFS I/O, Disk I/O, slot seconds
• Roll up to user/queue/cluster level
• Running status anomaly detection
• TooLongJobDuration
• NoProgressForLong
• TooManyTaskFailure
10. Task Failure Based Anomaly Host Detection
Use Case: Detect node anomalies by analyzing the task failure ratio across all nodes
Assumption: The task failure ratio should be approximately equal for every node
Algorithm: Node-by-node comparison (symmetry violation) and per-node trend
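The node-by-node comparison above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, per-node task counts are assumed to be aggregated already, and the rule "flag a node whose failure ratio exceeds the cluster median by more than a margin" is an assumed stand-in for the symmetry check, not Eagle's actual implementation. The per-node trend part is not covered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NodeFailureAnomaly {

    /** Failure ratio of one node: failed tasks / total tasks (0 when no tasks ran). */
    public static double failureRatio(long failed, long total) {
        return total == 0 ? 0.0 : (double) failed / total;
    }

    /** Median of the given values; the list is copied before sorting. */
    public static double median(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    /**
     * Symmetry check: returns every node whose failure ratio exceeds the
     * cluster median by more than margin (an assumed, tunable threshold).
     */
    public static List<String> anomalousNodes(Map<String, Double> ratioByNode, double margin) {
        List<String> flagged = new ArrayList<>();
        if (ratioByNode.isEmpty()) {
            return flagged;
        }
        double med = median(new ArrayList<>(ratioByNode.values()));
        for (Map.Entry<String, Double> e : ratioByNode.entrySet()) {
            if (e.getValue() - med > margin) {
                flagged.add(e.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<String, Double> ratios = new LinkedHashMap<>();
        ratios.put("node1", failureRatio(2, 100));
        ratios.put("node2", failureRatio(3, 100));
        ratios.put("node3", failureRatio(40, 100));
        ratios.put("node4", failureRatio(1, 100));
        System.out.println(anomalousNodes(ratios, 0.10));
    }
}
```

The median is used instead of the mean so that a single bad node does not shift the baseline it is compared against.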
12. Real-time Data Skew Detection
Use Case: Detect data skew from the statistics and distributions of attempt execution durations and counters
Assumption: Durations and counters should follow a normal distribution
Counters & Features
mapDuration
reduceDuration
mapInputRecords
reduceInputRecords
combineInputRecords
mapSpilledRecords
reduceShuffleRecords
mapLocalFileBytesRead
reduceLocalFileBytesRead
mapHDFSBytesRead
reduceHDFSBytesRead
Modeling & Statistics
Avg
Min
Max
Distributions
Max z-score
Top-N
Correlation
Threshold & Detection
Counters correlation > 0.9 & Max(Z-Score) > 90%
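Two of the statistics listed above, the maximum z-score and the correlation between counters, can be computed as in the following Java sketch. The class name, method names, and sample values are hypothetical, and the sketch does not try to interpret the slide's detection thresholds; it only shows how the two quantities are derived from per-attempt values.

```java
public class SkewStats {

    /** Arithmetic mean of the values. */
    public static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Population standard deviation of the values. */
    public static double stdDev(double[] x) {
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) ss += (v - m) * (v - m);
        return Math.sqrt(ss / x.length);
    }

    /** Largest absolute z-score; 0 when all values are identical. */
    public static double maxZScore(double[] x) {
        double m = mean(x);
        double sd = stdDev(x);
        if (sd == 0.0) return 0.0;
        double max = 0.0;
        for (double v : x) max = Math.max(max, Math.abs(v - m) / sd);
        return max;
    }

    /** Pearson correlation of two equal-length series; 0 when either is constant. */
    public static double pearson(double[] x, double[] y) {
        double mx = mean(x), my = mean(y);
        double cov = 0.0, vx = 0.0, vy = 0.0;
        for (int i = 0; i < x.length; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx += (x[i] - mx) * (x[i] - mx);
            vy += (y[i] - my) * (y[i] - my);
        }
        if (vx == 0.0 || vy == 0.0) return 0.0;
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        // Hypothetical per-attempt values: one attempt is far slower and reads far more.
        double[] mapDuration = {10, 11, 9, 10, 60};
        double[] mapInputRecords = {100, 110, 90, 100, 600};
        System.out.println("max z-score = " + maxZScore(mapDuration));
        System.out.println("correlation = " + pearson(mapDuration, mapInputRecords));
    }
}
```

A high correlation between duration and an input counter, combined with one attempt standing far from the mean, is the pattern that suggests skewed input rather than a slow host.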
14. Anomaly Prediction based on Machine Learning
• Anomaly Metric Predictive Detection
• Offline: Analyze and combine 500+ metrics for causal anomaly detection (IG -> PCA -> GMM -> MCC)
• Online: Predictively alert on anomalous metrics
[Figures: PCA (Principal Component Analysis); normal (green) and abnormal (red) data; probability distribution and threshold selection]
16. DAM: Data Activity Monitoring
Secure Hadoop in real time
Security Use Cases
Security Architecture Overview
Security Components Highlights
Security Machine Learning Integration
17. Security Use Cases
Data Loss Prevention
Get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster.
Malicious Logins
Detect logins in which a malicious user tries to guess a password. Eagle creates user profiles with machine learning algorithms to detect anomalies.
Unauthorized access
Detect and stop a malicious user trying to access classified data without privilege.
Malicious user operation
Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types.
19. Security Component Highlights
Policy Manager
Expressive language: create and modify policies for alerting and remediation on certain data activity monitoring events.
Data classification
Integrate with Dataguise & Apache Ranger.
Policy-based Remediation
Ability to detect and stop a threat, improve operational efficiencies, and reduce regulatory compliance costs.
User Profiling
Uses machine learning to automatically generate anomaly detection policies.
User Activity Exploration
Ability to drill down into alert details to understand the data security threat.
20. Security Machine Learning Integration
• User Activity Profiling
• Offline: Determine the bandwidth and the kernel density function parameters (KDE) from the training dataset
• Online: If a test data point lies outside the trained bandwidth, it is an anomaly (Policy)
[Figures: PCs (Principal Components) in EVD (Eigenvalue Decomposition); Kernel Density Function]
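The offline/online split for kernel density estimation can be sketched in Java as follows. Everything here is an assumption made for illustration: the class name, the Gaussian kernel, Silverman's rule of thumb for the bandwidth, and the use of a minimum-density threshold as the online anomaly test. The slide does not say which kernel or bandwidth rule Eagle uses, and the EVD step is not shown.

```java
public class KdeProfile {
    private final double[] samples;
    private final double bandwidth;

    /** Offline step: keep the training samples and derive a bandwidth from them. */
    public KdeProfile(double[] trainingSamples) {
        this.samples = trainingSamples.clone();
        this.bandwidth = silvermanBandwidth(trainingSamples);
    }

    /**
     * Silverman's rule of thumb: h = 1.06 * sd * n^(-1/5).
     * Needs at least two samples that are not all equal.
     */
    public static double silvermanBandwidth(double[] x) {
        int n = x.length;
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= n;
        double ss = 0.0;
        for (double v : x) ss += (v - mean) * (v - mean);
        double sd = Math.sqrt(ss / (n - 1));
        return 1.06 * sd * Math.pow(n, -0.2);
    }

    /** Gaussian kernel density estimate at point x. */
    public double density(double x) {
        double sum = 0.0;
        for (double s : samples) {
            double u = (x - s) / bandwidth;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
        }
        return sum / (samples.length * bandwidth);
    }

    /** Online step: a point is treated as anomalous when its density is below minDensity. */
    public boolean isAnomaly(double x, double minDensity) {
        return density(x) < minDensity;
    }

    public static void main(String[] args) {
        // Hypothetical feature: number of read operations per interval for one user.
        KdeProfile profile = new KdeProfile(new double[] {10, 12, 11, 9, 10, 11, 12, 10});
        System.out.println("typical point anomalous? " + profile.isAnomaly(10.5, 1e-3));
        System.out.println("extreme point anomalous? " + profile.isAnomaly(100.0, 1e-3));
    }
}
```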
21. Security Machine Learning Integration
• User Activity Profiling on Spark
Historical Audit Events
Real-time Audit Events
Batch Preprocess
User Profile Model Generation (KDE + EVD Algorithm)
Eagle Storage
HDFS
Stream Preprocess
Policy Engine
Online detection on Storm
Offline training on Spark
Archived data
Real-time stream
Kafka
Persist model
Dynamically load models & policies
Alert Consumer
Persist alert
Eagle Security Plugins
23. Monitoring Programming Paradigm
• Data collector -> data processing -> metric pre-agg/alert engine -> storage -> dashboards
• We need to create a framework that covers the full stack of a monitoring system
28. Extensible & Scalable Policy Framework
Usability
• Declarative Policy Definition Syntax
• Stream Metadata (event attribute name, attribute type, attribute value resolver, …)
Scalability
• Dynamic policy partitioning across compute nodes based on a configurable partition class
• Dynamic policy deployment
• Event partitioning by Storm and policy partitioning by Eagle (N events * M policies)
Extensibility
• Supports new policy evaluation engines, for example Siddhi, Esper, machine learning, etc.
29. Usability of Policy Framework
Case: HBase RegionServer high call queue length
Policy: In the past 30 minutes, the call queue length exceeded 2000 more than 20 times
from RegionCallQueueLength[value>2000]#window.Extension:messageTimeWindow(30 min)
select host, value, avg(value) as avgValue, count(*) as count
group by host
having count >= 20
insert into HighRegionServerCallQueueLengthStream;
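To make the semantics of the policy above concrete, here is a plain-Java sketch of the same idea: count, per host, the events whose value exceeds a threshold inside a sliding time window, and fire once the count reaches a limit. The class and method names are hypothetical, events are assumed to arrive in timestamp order per host, and the comparison uses `>=` to match the `having count >= 20` clause of the query. The exact windowing behavior of Siddhi and of the `messageTimeWindow` extension may differ from this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class CallQueueWindow {
    private final long windowMillis;
    private final double valueThreshold;
    private final int countThreshold;
    // Timestamps of qualifying events per host, oldest first.
    private final Map<String, Deque<Long>> hitsByHost = new HashMap<>();

    public CallQueueWindow(long windowMillis, double valueThreshold, int countThreshold) {
        this.windowMillis = windowMillis;
        this.valueThreshold = valueThreshold;
        this.countThreshold = countThreshold;
    }

    /**
     * Feeds one event and returns true when the host has at least
     * countThreshold qualifying events in the window ending at timestampMillis.
     */
    public boolean onEvent(String host, long timestampMillis, double value) {
        Deque<Long> hits = hitsByHost.computeIfAbsent(host, h -> new ArrayDeque<>());
        if (value > valueThreshold) {
            hits.addLast(timestampMillis);
        }
        // Drop qualifying events that have fallen out of the window.
        while (!hits.isEmpty() && hits.peekFirst() <= timestampMillis - windowMillis) {
            hits.pollFirst();
        }
        return hits.size() >= countThreshold;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        CallQueueWindow policy = new CallQueueWindow(30 * minute, 2000, 20);
        boolean fired = false;
        for (int i = 0; i < 20; i++) {
            fired = policy.onEvent("regionserver-1", i * minute, 2500);
        }
        System.out.println("alert after 20 high readings in 30 minutes: " + fired);
    }
}
```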
30. Scalability of Policy Evaluation
Dynamic Policy Partition
• N users with 3 partitions and M policies with 2 partitions give 3*2 physical tasks
• Physical partition + Policy-level partition
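The 3*2 arithmetic above can be illustrated with a small Java sketch that maps an event key and a policy ID to one of the physical tasks. The class name and the hash-modulo scheme are assumptions chosen for illustration; the slides only say that the partition class is configurable.

```java
public class PolicyPartitioner {
    private final int eventPartitions;
    private final int policyPartitions;

    public PolicyPartitioner(int eventPartitions, int policyPartitions) {
        this.eventPartitions = eventPartitions;
        this.policyPartitions = policyPartitions;
    }

    /** Partition of an event, chosen by its key (for example the user name). */
    public int eventPartition(String eventKey) {
        return Math.floorMod(eventKey.hashCode(), eventPartitions);
    }

    /** Partition of a policy, chosen by its ID. */
    public int policyPartition(String policyId) {
        return Math.floorMod(policyId.hashCode(), policyPartitions);
    }

    /** Total number of physical tasks: one per (event partition, policy partition) pair. */
    public int physicalTasks() {
        return eventPartitions * policyPartitions;
    }

    /** Index of the task that evaluates the given policy for events with the given key. */
    public int taskIndex(String eventKey, String policyId) {
        return eventPartition(eventKey) * policyPartitions + policyPartition(policyId);
    }

    public static void main(String[] args) {
        PolicyPartitioner p = new PolicyPartitioner(3, 2);
        System.out.println("physical tasks: " + p.physicalTasks());
        System.out.println("task for (alice, policyA): " + p.taskIndex("alice", "policyA"));
    }
}
```

With this layout each task sees roughly one third of the events and evaluates roughly half of the policies against them, so neither the event volume nor the policy count has to fit in a single process.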
31. Extensibility of Policy Framework
public interface PolicyEvaluatorServiceProvider {
    public String getPolicyType();
    public Class<? extends PolicyEvaluator> getPolicyEvaluator();
    public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
    public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
    public List<Module> getBindingModules();
}
Policy evaluator providers use SPI to register policy engine implementations
Built-in Supported Policy Engines
• Siddhi Complex Event Processing Engine
• Machine Learning based Policy Engine
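As a rough illustration of SPI-based registration, the sketch below uses the standard `java.util.ServiceLoader` to discover providers and index them by policy type. The `EngineProvider` interface is a simplified, hypothetical stand-in for the provider interface on the slide, and the registry class and the policy type string are invented for the example; Eagle's real wiring is not shown in the slides.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

public class PolicyEngineRegistry {

    /** Simplified, hypothetical stand-in for the provider interface on the slide. */
    public interface EngineProvider {
        String getPolicyType();
    }

    private final Map<String, EngineProvider> providers = new HashMap<>();

    /** Registers a provider under its policy type, replacing any earlier one. */
    public void register(EngineProvider provider) {
        providers.put(provider.getPolicyType(), provider);
    }

    /**
     * Discovers providers declared in META-INF/services files on the classpath
     * (the file is named after the binary name of the service interface).
     * Finds nothing when no such file exists.
     */
    public void loadFromClasspath() {
        for (EngineProvider provider : ServiceLoader.load(EngineProvider.class)) {
            register(provider);
        }
    }

    /** Returns the provider for a policy type or fails if none is registered. */
    public EngineProvider lookup(String policyType) {
        EngineProvider provider = providers.get(policyType);
        if (provider == null) {
            throw new IllegalArgumentException("No policy engine registered for type: " + policyType);
        }
        return provider;
    }

    public static void main(String[] args) {
        PolicyEngineRegistry registry = new PolicyEngineRegistry();
        registry.loadFromClasspath();
        registry.register(() -> "siddhi"); // manual registration, hypothetical type name
        System.out.println(registry.lookup("siddhi").getPolicyType());
    }
}
```

Keeping the lookup keyed by policy type means a new engine can be added by shipping a jar with a provider declaration, without changing the code that evaluates policies.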
32. Eagle Query Framework
Persistence
• Metric
• Event
• Metadata
• Alert
• Log
• Customized Structure
• …
Query
• Search
• Filter
• Aggregation
• Sort
• Expression
• ….
A lightweight, metadata-driven storage layer that serves the storage and query requirements shared by most monitoring systems
33. Customizable Dashboard
Provides real-time interactive visualization and analytics, supporting a variety of data sources such as Eagle, Druid, and so on.
• Interactive: IPython notebook-like interactive visualization analysis and troubleshooting.
• Dashboard: Customizable dashboard layout and drill-down path, persist and share.
35. Open Source
First Use Case
Eagle to secure Hadoop in real time, based on the Eagle framework
External Partners
Hortonworks, Dataguise, PayPal and Apache Ranger
Components to Be Open Sourced Next
JPA (“Job Performance Analyzer”), HBase and GC monitoring, and others will be open sourced soon
36. Reference
Eagle at Hadoop Summit 2015, San Jose
http://2015.hadoopsummit.org
Slides | Video
Eagle at Big Data Summit 2014, Shanghai
http://2014ebay.csdn.net/m/zone/ebay_en
Slides | Video
37. The End & Thanks
If you want to go fast, go alone.
If you want to go far, go together.
-- African Proverb
Hao Chen
hchen9@ebay.com | @haozch
38. We are Hiring Now
https://careers.ebayinc.com
Or contact me: hchen9@ebay.com
Editor's Notes
Anomaly detection algorithm
Continuously crawl each job history file immediately after the job completes
Calculate the minute-level job failure ratio for each node
A node is identified as anomalous when either of the following two conditions holds:
The node continuously fails the tasks that run on it
The node has a significantly higher failure ratio than the rest of the nodes in the cluster
Inspired by TSDB, Ganglia, Nagios, Zabbix, etc. Most of them focus on infrastructure-level data collection and alerting, but they don’t consider the complexity of business logic, that is, how to prepare data.
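IG: Information Gain. A concept from probability and information theory; an asymmetric measure of the difference between two probability distributions P and Q. Information gain describes the difference between encoding with Q and then encoding with P. P usually represents the distribution of samples or observations, or possibly a precisely computed theoretical distribution. Q represents a theory, model, description, or approximation of P. Purpose: feature selection.
PCA: Principal Component Analysis. A multivariate statistical analysis method that applies a linear transformation to many variables in order to select a smaller number of important variables. http://baike.baidu.com/view/45376.htm?fromtitle=principal+component+analysis&type=syn
GMM: Gaussian Mixture Model (or mixture of Gaussians), also abbreviated MOG (Mixture of Gaussian). It uses Gaussian probability density functions (normal distribution curves) to quantify things precisely, decomposing one thing into several models that are each based on a Gaussian probability density function (normal distribution curve). http://baike.baidu.com/view/3767607.htm
MCC: Matthews correlation coefficient, http://baike.baidu.com/view/3767607.html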
Data loss prevention
Get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster.
Malicious Logins
Detect logins in which a malicious user tries to guess a password. Eagle creates user profiles with machine learning algorithms to detect anomalies. This anomaly detection, together with the policy for user logins, would trigger an alert and block the user from accessing sensitive datasets.
Unauthorized access
Detect and stop a malicious user trying to access classified data without privilege. Unauthorized access is one facet of Eagle user profiles, and the machine learning algorithm will detect the anomaly. This anomaly detection, together with the policy on user access to classified data without authorization, would trigger an alert to the user's manager.
Malicious user operation
Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types.
Dataguise delivers data privacy protection and risk assessment analytics that allow organizations to safely leverage and share enterprise data. Our solutions simplify governance as they proactively locate sensitive data, automatically protect it with appropriate remediation policies, and provide actionable compliance intelligence to decision makers in real time. In Hadoop deployments, our solutions inspect incoming data and protect it before it is stored. These capabilities simplify risk management, improve operational efficiencies, and reduce regulatory compliance costs.
Histogram density estimation
Kernel density estimation (KDE)
EVD: Eigenvalue Decomposition of matrices, from linear algebra, http://www.stats.ox.ac.uk/~sejdinov/teaching/HT15_lecture2-nup.pdf
Gaussian mixture models are used more for classification, while KDE methods such as Parzen windows are used more for probability density estimation
http://blog.sina.com.cn/s/blog_6923201d01010tjo.html
As a framework, Eagle does not assume:
Data source (where, what)
Business logic execution path (how)
Policy engine implementation (how)
Data sink (where, what)
As a framework, Eagle provides the following:
SQL-like service API
High-performing query framework
Lightweight stream processing Java API
Extensible policy engine implementation
Scalable and distributed rule evaluation
Metadata driven stream processing
Data source extensibility
Data sink extensibility
Interactive dashboard
Supports syntax:
Search
Aggregate
Time Series Histogram
Expression Filter
Pagination
Metadata definition ORM
High performance RESTful API
SQL-like declarative query syntax
Supporting HBase and RDBMS as storage
Logical partitioning by tags defined in annotations
Co-processor support
Secondary index support
Generic service client library
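Within eBay, as more and more large distributed systems are deployed on the enterprise platform, the need for monitoring large-scale distributed systems has become especially strong. Eagle will take the Eagle framework as its core foundation, keep incorporating business logic to gradually grow the Eagle Apps ecosystem, and at the same time keep optimizing the core framework itself.
We also believe that not only eBay but most enterprise platforms run into the same problems when deploying and maintaining these large distributed systems: the larger the cluster, the greater the monitoring challenges in every respect, and the more prominent the advantages of Eagle, which targets the monitoring of large-scale distributed systems. We have long looked forward to exchanging ideas and discussing these topics with the community, so, as a starting point, we will release Eagle's code as open source. On one hand, eBay's efforts in monitoring large-scale distributed systems may help, or serve as a reference for, companies that need to solve similar problems. On the other hand, we hope to receive feedback from the industry and to discuss our approach in depth, which will benefit us as well. Beyond that, we could all work together to build an open-source monitoring platform aimed at large-scale distributed systems.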