4. Apache Eagle
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop from eBay.
Open sourced as an Apache Incubator project on Oct 26th, 2015.
Secure Hadoop in real time: a data activity monitoring solution to instantly identify access to sensitive data, recognize attacks and malicious activity, and block access in real time.
See http://eagle.incubator.apache.org or http://github.com/apache/incubator-eagle
5. Eagle was initiated at the end of 2013 for Hadoop ecosystem monitoring, because existing tools such as Zabbix and Ganglia could not handle the huge volume of metrics and logs generated by the Hadoop systems at eBay.
• 2007: 1-10 nodes
• 2009: 50+ nodes
• 2010: 100+ nodes, 1000+ cores, 1 PB
• 2011: 1000+ nodes, 10,000+ cores, 10+ PB
• 2012: 3000+ nodes, 10,000+ cores, 50+ PB
• 2014/2015: 10,000 nodes, 150,000+ cores, 200 PB, 2000+ users
Hadoop Data
• Security
• Activity
Hadoop Platform
• Health
• Availability
• Performance
Hadoop @ eBay Inc
Initiative – Why build Eagle?
• 7+ CLUSTERS
• 10000+ NODES
• 200+ PB DATA
• 10 B+ EVENTS / DAY
• 500+ METRIC TYPES
• 50,000+ JOBS / DAY
• 50,000,000+ TASKS / DAY
10. Distributed Policy Engine
[Diagram labels: Real-time Event Stream (Stream_{1} … Stream_{*}), Dynamic Stream Schema, Stream Processing, AlertExecutor_{1} … AlertExecutor_{N}, Real-time Alerts, Metadata Manager, Policy Management, Dynamic Policy Deployment]
• Real-time Streaming: Apache Storm (Execution Engine) + Kafka (Message Bus)
• Declarative Policy: SQL (CEP) on Streaming + Hot Deploy
• Linear Scalability: Data volume scale + Computation scale
• Metadata-Driven: Schema Management and Dynamic Policy Lifecycle
11. Distributed Policy Engine
from MetricStream[(name == 'ReplLag') and (value > 1000)]
select * insert into outputStream;
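The filter policy on this slide (`from MetricStream[...] select * insert into outputStream`) can be read as a predicate applied to each event. Below is a minimal Python sketch of that reading, assuming events are plain dictionaries with `name` and `value` keys. The dictionary layout and the function names are illustrative assumptions, not Eagle's API.

```python
from typing import Dict, Iterable, Iterator

Event = Dict[str, object]  # assumed event shape: {"name": ..., "value": ...}

def repl_lag_policy(event: Event) -> bool:
    """Mirror of the filter: (name == 'ReplLag') and (value > 1000)."""
    return event.get("name") == "ReplLag" and float(event.get("value", 0)) > 1000

def evaluate(stream: Iterable[Event]) -> Iterator[Event]:
    """Yield every matching event unchanged ('select * insert into outputStream')."""
    for event in stream:
        if repl_lag_policy(event):
            yield event

# Invented sample events for illustration.
metric_stream = [
    {"name": "ReplLag", "value": 1500},
    {"name": "ReplLag", "value": 200},
    {"name": "HeapUsed", "value": 5000},
]
output_stream = list(evaluate(metric_stream))
```

Only the first sample event satisfies both conditions, so it is the only one forwarded to the output.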
12. Declarative Policy
• Filter
• Join
• Aggregation: Avg, Sum, Min, Max, etc.
• Group by
• Having
• Stream handlers for window: TimeWindow, Batch Window, Length Window
• Conditions and Expressions: and, or, not, ==, !=, >=, >, <=, <, and arithmetic operations
• Pattern Processing
• Sequence Processing
• Event Tables: integrate historical data in real-time processing
• SQL-Like Query: Query, Stream Definition and Query Plan compilation
Distributed SQL on Streaming: Siddhi CEP + Storm by default
from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;
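To make the window and aggregation handlers on this slide more concrete, here is a small Python sketch of an average over a length window (the most recent N events). It only illustrates the idea and is not how Siddhi implements windows; the class name `LengthWindowAvg` is made up for this example.

```python
from collections import deque

class LengthWindowAvg:
    """Average over the most recent `length` values (a sliding length window)."""

    def __init__(self, length: int) -> None:
        if length <= 0:
            raise ValueError("length must be positive")
        # deque with maxlen evicts the oldest value automatically
        self._values = deque(maxlen=length)

    def add(self, value: float) -> float:
        """Add a value and return the average of the current window."""
        self._values.append(value)
        return sum(self._values) / len(self._values)

window = LengthWindowAvg(length=3)
averages = [window.add(v) for v in [10, 20, 30, 40]]
```

A time window would follow the same pattern, with eviction driven by event timestamps instead of a fixed count.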
13. Declarative Policy - Examples
Example 1: Alert if Hadoop NameNode capacity usage exceeds 90 percent
from hadoopJmxMetricEventStream
[metric == "hadoop.namenode.fsnamesystemstate.capacityused" and value > 0.9]
select metric, host, value, timestamp, component, site insert into alertStream;
Example 2: Alert if Hadoop NameNode HA switches
from every
a = hadoopJmxMetricEventStream[metric=="hadoop.namenode.fsnamesystem.hastate"]
->
b = hadoopJmxMetricEventStream[metric==a.metric and b.host == a.host and
a.value != value]
within 10 min
select a.host, a.value as oldHaState, b.value as newHaState, b.timestamp as
timestamp, b.metric as metric, b.component as component, b.site as site insert
into alertStream;
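Example 2 is a pattern query: an event `a` followed by an event `b` for the same metric and host, with a different value, within 10 minutes. A rough Python sketch of that matching logic follows. It assumes events are dictionaries with `metric`, `host`, `value`, and a `timestamp` in epoch milliseconds, and it uses string state values for readability; all of these are assumptions. It also tracks only the latest state per host, which simplifies the semantics of Siddhi's `every`.

```python
from typing import Dict, Iterable, List

HA_METRIC = "hadoop.namenode.fsnamesystem.hastate"
WINDOW_MS = 10 * 60 * 1000  # "within 10 min"

def detect_ha_switches(events: Iterable[Dict]) -> List[Dict]:
    """Return one alert each time a host's HA state changes within the window."""
    last_seen: Dict[str, Dict] = {}  # host -> previous HA-state event
    alerts: List[Dict] = []
    for b in events:
        if b["metric"] != HA_METRIC:
            continue
        a = last_seen.get(b["host"])
        if (a is not None
                and a["value"] != b["value"]
                and b["timestamp"] - a["timestamp"] <= WINDOW_MS):
            alerts.append({
                "host": a["host"],
                "oldHaState": a["value"],
                "newHaState": b["value"],
                "timestamp": b["timestamp"],
            })
        last_seen[b["host"]] = b
    return alerts

# Invented sample events for illustration.
events = [
    {"metric": HA_METRIC, "host": "nn1", "value": "active", "timestamp": 0},
    {"metric": HA_METRIC, "host": "nn1", "value": "active", "timestamp": 60_000},
    {"metric": HA_METRIC, "host": "nn1", "value": "standby", "timestamp": 120_000},
    {"metric": HA_METRIC, "host": "nn2", "value": "standby", "timestamp": 120_000},
]
alerts = detect_ha_switches(events)
```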
14. Distributed Policy Engine - Scalability
[Diagram labels: Distributed Streaming Cluster Environment, Stream_{1} … Stream_{*}, Stream Processing, AlertExecutor_{1} … AlertExecutor_{N}]
Linear Scalability Principle
Dynamic policy partition by {event} * {policy}
• N users with 3 partitions, M policies with 2 partitions, then 3*2 physical tasks
• Physical partition + policy-level partition
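The {event} * {policy} partitioning can be sketched as a two-level routing function: the event key selects one of the event partitions and the policy id selects one of the policy partitions, so 3 event partitions and 2 policy partitions give 3*2 = 6 physical tasks. The hash choice (CRC32) and the function names below are assumptions made for this illustration, not Eagle's actual partitioner.

```python
import zlib
from itertools import product
from typing import Tuple

EVENT_PARTITIONS = 3   # e.g. users spread over 3 partitions
POLICY_PARTITIONS = 2  # policies spread over 2 partitions

def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() of str is salted per process)."""
    return zlib.crc32(key.encode("utf-8"))

def event_partition(user: str) -> int:
    """Partition index for events of this user."""
    return stable_hash(user) % EVENT_PARTITIONS

def policy_partition(policy_id: str) -> int:
    """Partition index for this policy."""
    return stable_hash(policy_id) % POLICY_PARTITIONS

def physical_task(user: str, policy_id: str) -> Tuple[int, int]:
    """Task that evaluates `policy_id` against the events of `user`."""
    return (event_partition(user), policy_partition(policy_id))

# Every (event partition, policy partition) pair is one physical task.
all_tasks = set(product(range(EVENT_PARTITIONS), range(POLICY_PARTITIONS)))
```

Adding event partitions scales with data volume, while adding policy partitions scales with the number of policies to evaluate.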
15. Distributed Policy Engine – Optimization
Stream Partition Problem: https://en.wikipedia.org/wiki/Partition_problem
Stream Partition Skew (15:1)
Algorithm | Weights of Executors (Partition by User)
Random | 0.0484 0.152 0.3535 0.105 0.203 0.072 0.042 0.024
Greedy | 0.0837 0.0837 0.0837 0.0837 0.0737 0.0637 0.0437 0.0837
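The slide relates stream partitioning to the classic partition problem. One common greedy heuristic for that problem places the heaviest keys first, each onto the currently least-loaded executor. The sketch below shows this heuristic next to a random assignment. It is an illustration under that assumption, since the slides do not spell out which greedy variant Eagle uses, and the per-user weights are invented sample data.

```python
import heapq
import random
from typing import Dict, List

def greedy_partition(weights: Dict[str, float], executors: int) -> List[float]:
    """Assign heaviest keys first to the least-loaded executor; return executor loads."""
    heap = [(0.0, i) for i in range(executors)]  # (current load, executor index)
    heapq.heapify(heap)
    loads = [0.0] * executors
    for _, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        loads[idx] = load + w
        heapq.heappush(heap, (loads[idx], idx))
    return loads

def random_partition(weights: Dict[str, float], executors: int, seed: int = 0) -> List[float]:
    """Assign each key to a random executor; return executor loads."""
    rng = random.Random(seed)
    loads = [0.0] * executors
    for w in weights.values():
        loads[rng.randrange(executors)] += w
    return loads

# Invented sample: a few heavy users and many light ones.
user_weights = {f"user{i}": w for i, w in enumerate([50, 30, 20, 10, 10, 5, 5, 5, 3, 2])}
greedy_loads = greedy_partition(user_weights, executors=4)
random_loads = random_partition(user_weights, executors=4)
```

With this sample the heaviest user alone sets the lower bound on the busiest executor, and the greedy assignment reaches that bound.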
16. Distributed Real-time Policy Engine
Siddhi CEP Policy Evaluator
Machine Learning Policy Evaluator
• Support WSO2 Siddhi CEP as first class
• Extensible Policy Engine Implementation
• Extensible Policy Lifecycle Management
• Metadata-based Module Management
Extensible Policy Evaluator
import java.util.List;
public interface PolicyEvaluatorServiceProvider {
    String getPolicyType();         // literal string to identify one type of policy
    Class<?> getPolicyEvaluator();  // get policy evaluator implementation
    List<?> getBindingModules();    // policy text with json format to object mapping
}
METADATA MANAGER: Policy/Metadata
Distributed Policy Engine – Extensibility
Policy Engine Extensibility
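The provider interface on this slide suggests a registry keyed by policy type: the engine looks up the evaluator class registered for a policy's type and instantiates it. A Python sketch of that idea follows. All class and function names here are hypothetical stand-ins for illustration, not Eagle classes.

```python
from typing import Callable, Dict, Type

class PolicyEvaluator:
    """Base class: decides whether an event violates a policy."""

    def __init__(self, policy_def: dict) -> None:
        self.policy_def = policy_def

    def evaluate(self, event: dict) -> bool:
        raise NotImplementedError

_REGISTRY: Dict[str, Type[PolicyEvaluator]] = {}

def register(policy_type: str) -> Callable[[Type[PolicyEvaluator]], Type[PolicyEvaluator]]:
    """Class decorator playing the role of getPolicyType() + getPolicyEvaluator()."""
    def wrap(cls: Type[PolicyEvaluator]) -> Type[PolicyEvaluator]:
        _REGISTRY[policy_type] = cls
        return cls
    return wrap

@register("threshold")
class ThresholdEvaluator(PolicyEvaluator):
    """Toy evaluator: alert when event['value'] exceeds policy_def['max']."""

    def evaluate(self, event: dict) -> bool:
        return event["value"] > self.policy_def["max"]

def create_evaluator(policy: dict) -> PolicyEvaluator:
    """Instantiate the evaluator registered for policy['type']."""
    try:
        cls = _REGISTRY[policy["type"]]
    except KeyError:
        raise ValueError(f"no evaluator registered for type {policy['type']!r}")
    return cls(policy["definition"])

evaluator = create_evaluator({"type": "threshold", "definition": {"max": 0.9}})
```

A new evaluator type, for example a machine learning one, would be added by registering another class, without touching the engine code.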
21. Multi-Tenants - Dynamic Correlation
• Dynamic Correlation at Runtime:
  • Sort, Group by, Join, Window
  • Hot Deploy Logic
  • Policy Management
• Multi-Correlation on a Single Stream
  • Group by different fields of the same stream
  • Re-sort the same stream in a different order
  • Join a certain stream in different ways
• Multi-Correlation on Multiple Streams
  • Cross-stream join
  • Real-time & historical stream join
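As a small illustration of multi-correlation on a single stream, the Python sketch below groups the same list of events by two different fields. The event fields and sample values are invented for this example.

```python
from collections import Counter
from typing import Dict, Iterable

def count_by(events: Iterable[Dict[str, str]], field: str) -> Counter:
    """Group events by `field` and count the events in each group."""
    return Counter(e[field] for e in events)

# Invented sample audit events.
audit_events = [
    {"user": "alice", "host": "h1", "cmd": "open"},
    {"user": "alice", "host": "h2", "cmd": "delete"},
    {"user": "bob", "host": "h1", "cmd": "open"},
]

# Two correlations over the same stream, each with its own grouping key.
by_user = count_by(audit_events, "user")
by_host = count_by(audit_events, "host")
```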
23. Eagle Use Cases
• Data Activity Monitoring: Secure Hadoop in real time. A data activity monitoring solution to instantly identify access to sensitive data, recognize attacks and malicious activity, and block access in real time.
• Job Performance Monitoring: Hadoop, Spark Job Profiling & Performance Monitoring, Cluster Health Anomaly Detection
• eBay Unified Monitoring Platform: Provide unified monitoring-as-a-service for everything around infrastructure or business.
24. Eagle Data Activity Monitoring
• Data Loss Prevention: Get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster.
• Malicious Logins: Detect logins where a malicious user tries to guess a password. Eagle creates user profiles using a machine learning algorithm to detect anomalies.
• Unauthorized Access: Detect and stop a malicious user trying to access classified data without privilege.
• Malicious User Operation: Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types.
[Diagram labels: User, Privileges, Common Data Sets, Patterns, Commands, Zones, Query, Columns]
25. Eagle Data Activity Monitoring
Hadoop Data Security: Detect anomalies in accessing HDFS and Hive
§ HDFS Policies
  § Access to sensitive files
  § HDFS commands used (read, write, update…)
  § Client host
  § Destination
  § Security zones
§ Hive Policies
  § Access to tables with PII data
  § SQL query profiles
  § Client host
  § Security zone
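An HDFS policy of the kind listed on this slide combines several attributes of an audit event. The Python sketch below checks a single event against a list of sensitive path prefixes, a set of watched commands, and a set of allowed security zones. The event fields, paths, and zone names are invented for illustration and do not reflect Eagle's actual audit log schema.

```python
from typing import Dict, Set, Tuple

# Hypothetical policy inputs, chosen only for this example.
SENSITIVE_PREFIXES: Tuple[str, ...] = ("/data/pii/", "/finance/")
WATCHED_COMMANDS: Set[str] = {"open", "delete", "rename"}
ALLOWED_ZONES: Set[str] = {"secure-zone"}

def violates_hdfs_policy(event: Dict[str, str]) -> bool:
    """True if a watched command touches a sensitive file from outside an allowed zone."""
    touches_sensitive = event["src"].startswith(SENSITIVE_PREFIXES)
    watched = event["cmd"] in WATCHED_COMMANDS
    outside_zone = event["zone"] not in ALLOWED_ZONES
    return touches_sensitive and watched and outside_zone

# Invented sample audit events.
events = [
    {"src": "/data/pii/customers.csv", "cmd": "open", "zone": "public-zone", "host": "10.0.0.7"},
    {"src": "/data/pii/customers.csv", "cmd": "open", "zone": "secure-zone", "host": "10.0.1.3"},
    {"src": "/tmp/scratch.txt", "cmd": "delete", "zone": "public-zone", "host": "10.0.0.7"},
]
alerts = [e for e in events if violates_hdfs_policy(e)]
```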
26. Eagle Machine Learning User Profile
Kernel Density Function
• Offline: Determine the bandwidth (the kernel density function parameters, KDE) from the training dataset
• Online: If a test data point lies outside the trained bandwidth, it is an anomaly (Policy)
PCs (Principal Components) in EVD (Eigenvalue Decomposition)
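The offline/online split for the kernel density approach can be sketched in a few lines of Python. This version makes several assumptions that the slide does not state: a one-dimensional feature, a Gaussian kernel, a normal-reference rule of thumb for the bandwidth, and an anomaly threshold equal to the lowest density seen on the training data. It is a sketch of the general technique, not Eagle's user profile model.

```python
import math
from statistics import stdev
from typing import Sequence, Tuple

def rule_of_thumb_bandwidth(samples: Sequence[float]) -> float:
    """Normal-reference rule of thumb: 1.06 * sigma * n^(-1/5)."""
    return 1.06 * stdev(samples) * len(samples) ** (-0.2)

def kde(x: float, samples: Sequence[float], h: float) -> float:
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

def train(samples: Sequence[float]) -> Tuple[float, float]:
    """Offline step: fit the bandwidth and derive a density threshold."""
    h = rule_of_thumb_bandwidth(samples)
    threshold = min(kde(s, samples, h) for s in samples)  # assumed threshold rule
    return h, threshold

def is_anomaly(x: float, samples: Sequence[float], h: float, threshold: float) -> bool:
    """Online step: a point is anomalous if its density falls below the threshold."""
    return kde(x, samples, h) < threshold

# Invented training data, e.g. operations per minute for one user.
training = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]
h, threshold = train(training)
```

A value far from the training range, such as 100, has near-zero density and is flagged, while a value in the middle of the range is not.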
27. Eagle Job Performance Monitoring
Use Case: Detect node anomalies by analyzing the task failure ratio across all nodes
Assumption: The task failure ratio should be approximately equal for every node
Algorithm: Node-by-node comparison (symmetry violation) and per-node trend
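The node-by-node comparison can be sketched as an outlier test against the cluster median. The Python example below flags nodes whose task failure ratio exceeds the median by more than `k` times the median absolute deviation. The choice of median/MAD, the factor `k`, and the `floor` value are assumptions made for this illustration, and the per-node trend part of the algorithm is not covered.

```python
from statistics import median
from typing import Dict, List

def anomalous_nodes(failure_ratio: Dict[str, float],
                    k: float = 5.0,
                    floor: float = 0.01) -> List[str]:
    """Flag nodes whose failure ratio is above median + k * MAD.

    `floor` keeps the limit meaningful when all nodes are nearly identical
    and the MAD collapses to zero.
    """
    ratios = list(failure_ratio.values())
    med = median(ratios)
    mad = median(abs(r - med) for r in ratios)
    limit = med + k * max(mad, floor)
    return sorted(node for node, r in failure_ratio.items() if r > limit)

# Invented sample: failed tasks / total tasks per node.
ratios = {"n1": 0.02, "n2": 0.03, "n3": 0.02, "n4": 0.25, "n5": 0.03}
suspects = anomalous_nodes(ratios)
```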
29. Hadoop Job Data Skew Detection
Use Case: Detect data skew by statistics and distributions for attempt execution durations and counters
Assumption: Duration and counters should follow a normal distribution
Counters & Features
• mapDuration
• reduceDuration
• mapInputRecords
• reduceInputRecords
• combineInputRecords
• mapSpilledRecords
• reduceShuffleRecords
• mapLocalFileBytesRead
• reduceLocalFileBytesRead
• mapHDFSBytesRead
• reduceHDFSBytesRead
Modeling & Statistics
• Avg
• Min
• Max
• Distributions
• Max z-score
• Top-N
• Correlation
Threshold & Detection
• Counters Correlation > 0.9 & Max(Z-Score) > 90%
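The detection rule on this slide combines a correlation between counters with a maximum z-score. A Python sketch of that combination follows, using `reduceDuration` and `reduceInputRecords` as the two series. The 0.9 correlation threshold comes from the slide; the z-score threshold of 2.5 is a placeholder assumption, because the slide's "> 90%" does not map directly onto a z-score value. The sample data is invented.

```python
import math
from statistics import mean, pstdev
from typing import Sequence

def max_z_score(values: Sequence[float]) -> float:
    """Largest absolute z-score in the series (0.0 if all values are equal)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    return max(abs(v - mu) / sigma for v in values)

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient (0.0 if either series is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def is_skewed(durations: Sequence[float], input_records: Sequence[float],
              corr_threshold: float = 0.9, z_threshold: float = 2.5) -> bool:
    """Skew: duration tracks input size AND one attempt is a strong outlier."""
    return (pearson(durations, input_records) > corr_threshold
            and max_z_score(durations) > z_threshold)

# Invented per-attempt samples: one reducer receives far more records.
reduce_duration = [10, 11, 9, 10, 12, 10, 11, 60]
reduce_input_records = [1000, 1100, 900, 1000, 1200, 1000, 1100, 6000]
skewed = is_skewed(reduce_duration, reduce_input_records)
```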
31. Learn More about Apache Eagle
Community
• Website: http://eagle.incubator.apache.org
• GitHub: http://github.com/apache/incubator-eagle
• Mailing list: dev@eagle.incubator.apache.org
Resources
• Documentation: http://eagle.incubator.apache.org/docs/
• Docker images: https://hub.docker.com/r/apacheeagle/sandbox/
Publications & Patents
• EAGLE: USER PROFILE-BASED ANOMALY DETECTION IN HADOOP CLUSTER (IEEE)
• EAGLE: DISTRIBUTED REALTIME MONITORING FRAMEWORK FOR HADOOP CLUSTER