SlideShare una empresa de Scribd logo
1 de 32
Descargar para leer sin conexión
Apache Eagle: Secure Hadoop in Real Time
http://eagle.incubator.apache.org/
Hao Chen (@haozch) | Ralph Su (@raphaelsu)
About Us
Apache Eagle Co-creator,PMC & Committer
hao@apache.org
Sr. Software Engineer at eBay
hchen9@ebay.com
Apache Eagle PMC & Committer
ralphsu@apache.org
MTS. Software Engineer at eBay
liasu@ebay.com
Ralph Su / 苏良飞
Hao Chen / 陈浩
Agenda
• Introduction
• Architecture
• Case Study
• Q & A
is a distributed real-time monitoring and alerting
engine for hadoop from eBay
Open sourced as Apache Incubator Project on Oct 26th 2015
Secure Hadoop in Realtime a data activity monitoring solution to instantly identify
access to sensitive data, recognize attacks/ malicious activity and block access in real
time.
See http://eagle.incubator.apache.org or http://github.com/apache/incubator-eagle
Apache Eagle
Eagle	
  was	
  initialized	
  by	
  end	
  of	
  2013	
  for	
  hadoop ecosystem	
  monitoring	
  as	
  any	
  existing	
  
tool	
  like	
  zabbix,	
  ganglia	
  can	
  not	
  handle	
  the	
  huge	
  volume	
  of	
  metrics/logs	
  generated	
  by	
  
hadoop system	
  in	
  eBay.
2014/2015
10,000	
  nodes
150,000+	
  cores
200	
  PB
2000+	
  user
3000+	
  nodes
10,000+	
  cores
50+	
  PB
2012
2011
1000+	
  nodes
10,000+	
  cores
10+	
  PB
100+	
  nodes
1000	
  +	
  cores
1	
  PB
2010
2009
50+	
  nodes
2007
1-­‐10	
  nodes
Hadoop	
  Data	
  
• Security
• Activity
Hadoop	
  Platform	
  
• Heath
• Availability
• Performance
Hadoop	
  @	
  eBay	
  Inc
Initiative	
  – Why	
  build	
  Eagle?
7+	
   CLUSTERS
10000+ NODES
200+	
  PB DATA
10	
  B+	
   EVENTS	
  /	
  DAY
500+	
   METRIC	
  TYPES
50,000+	
   JOBS	
  /	
  DAY
50,000,000+	
   TASKS	
  /	
  DAY
Initiative	
  – Core	
  Challenges
Large	
  Scale	
  Processing	
  and	
  Storage
• Scale	
  IO	
  Complexity	
  (Stream)
• Scale	
  Computation	
  Complexity	
  (Policy)
• Scale	
  Storage	
  &	
  Query	
  (Event,	
  Log,	
  Metric)
1
Real-­‐Time	
  Alerting
• Real-­‐time	
  Data	
  Collection
• Real-­‐time	
  Stream	
  Processing
• Real-­‐time	
  Alerting
2
Expressive Correlation	
  Model
• Complex	
  policy	
  model
• Stream	
  GroupBy,	
  Join,	
  Window
• Machine	
  Learning
3
Hadoop	
  Ecosystem	
  Integration
• Data	
  Source	
  Integration
• Anomaly	
  Detection	
  Model
4
1
Eagle	
  Framework
Distributed	
  real-­‐time	
  framework	
  for	
  efficiently	
  developing	
   highly	
  
scalable	
  monitoring	
   applications
Eagle	
  Ecosystem
2 Eagle	
  Apps
Security/	
  Hadoop/	
  Operational	
  Intelligence	
  /	
  …
3 Eagle	
  Interface
REST	
  Service	
  /	
  Management	
  UI	
  /	
  Customizable	
  Analytics	
  
Visualization
4 Eagle	
  Integration
Ambari /	
  Docker /	
  Ranger	
  /	
  Dataguise
Apps
à Security
à Hadoop
à Cloud
à Database
Interface
à Web Portal
à REST Services
à Analytics Visualization
Integration
à Ambari
à Docker
à Ranger
à Dataguise Eagle
Framework
Open	
  Source
Community-­‐driven	
   and	
  Cross-­‐community	
  cooperation
5
Eagle	
  @	
  eBay	
  Inc.
7+	
   CLUSTERS
10000+ NODES
200+	
  PB DATA
10	
  B+	
   EVENTS	
  /	
  DAY
500+	
   METRIC	
  TYPES
50,000+	
   JOBS	
  /	
  DAY
50,000,000+	
   TASKS	
  /	
  DAY
Eagle	
  Deployment	
  at	
  eBay	
  Production
• 100+	
  security	
  policies
• 8	
  nodes
• 30	
  worker	
  process
• 64	
  kafka partition
Eagle	
  Performance
• Avg Latency:	
   ~	
  50	
  ms
• Max	
  Throughput/Cluster:	
  300	
  k	
  /s
Eagle	
  Architecture
Distributed	
  Policy	
  Engine
METADATA   MANAGER
AlertExecutor_{1}
AlertExecutor_{2}
…
AlertExecutor_{N}
Real	
  Time	
  Alerts
Alerts
Policy	
  
Management
Policy
Dynamical	
   Policy	
   Deployment
Real-­‐time	
  
Event	
  Stream
Stream_{1}
Stream_{*}
Dynamical	
   Stream	
  Schema
Stream	
  
Processing
• Real-­‐time	
  Streaming: Apache	
  Storm	
  (Execution	
  Engine)	
  +	
  Kafka	
  (Message	
  Bus)
• Declarative	
  Policy: SQL	
  (CEP)	
  on	
  Streaming	
  +	
  Hot	
  Deploy
• Linear	
  Scalability:	
  Data	
  volume scale +	
  Computation	
  scale
• Metadata-­‐Driven:	
  Schema	
  Management	
  and	
  Dynamical	
  Policy	
  Lifecycle
METADATA   MANAGER
AlertExecutor_{1}
AlertExecutor_{2}
…
AlertExecutor_{N}
Real	
  Time	
  Alerts
Alerts
Policy	
  
Management
Policy
Dynamical	
   Policy	
   Deployment
Real-­‐time	
  
Event	
  Stream
Stream_{1}
Stream_{*}
Dynamical	
   Stream	
  Schema
Stream	
  
Processing
from MetricStream[(name == 'ReplLag') and (value > 1000)]
select * insert into outputStream;
• Real-­‐time	
  Streaming: Apache	
  Storm	
  (Execution	
  Engine)	
  +	
  Kafka	
  (Message	
  Bus)
• Declarative	
  Policy: SQL	
  (CEP)	
  on	
  Streaming	
  +	
  Hot	
  Deploy
• Linear	
  Scalability:	
  Data	
  volume scale +	
  Computation	
  scale
• Metadata-­‐Driven:	
  Schema	
  Management	
  and	
  Dynamical	
  Policy	
  Lifecycle
Distributed	
  Policy	
  Engine
Declarative	
  Policy
• Filter
• Join
• Aggregation:	
  Avg,	
  Sum	
  ,	
  Min,	
  Max,	
  etc
• Group by
• Having
• Stream handlers	
  for	
  window:	
  TimeWindow,	
  Batch	
  Window,	
  
Length	
  Window	
  
• Conditions	
  and	
  Expressions:	
  and,	
  or,	
  not,	
  ==,!=,	
  >=,	
  >,	
  <=,	
  <,	
  
and	
  arithmetic	
  operations
• Pattern	
  Processing
• Sequence	
  processing
• Event	
  Tables:	
  intergrate historical	
  data	
  in	
  realtime processing
• SQL-­‐Like	
  Query:	
  Query,	
  Stream	
  Definition	
   and	
  Query	
  Plan	
  
compilation
Distributed	
  SQL	
  on	
  Streaming	
  :	
  
Siddhi	
  CEP	
  +	
  Storm	
  by	
  default
from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into
outputStream;
Declarative	
  Policy	
  -­‐ Examples
from hadoopJmxMetricEventStream
[metric == "hadoop.namenode.fsnamesystemstate.capacityused" and value > 0.9]
select metric, host, value, timestamp, component, site insert into alertStream;
Example	
  1:	
  Alert	
  if	
  hadoop namenode capacity	
  usage	
  exceed	
  90	
  percentages
from every
a = hadoopJmxMetricEventStream[metric=="hadoop.namenode.fsnamesystem.hastate"]
->
b = hadoopJmxMetricEventStream[metric==a.metric and b.host == a.host and
a.value != value)]
within 10 min
select a.host, a.value as oldHaState, b.value as newHaState, b.timestamp as
timestamp, b.metric as metric, b.component as component, b.site as site insert
into alertStream;
Example	
  2:	
  Alert	
  if	
  hadoop namenode HA	
  switches
1
Distributed	
  Policy	
  Engine	
  -­‐ Scalability
2
Distributed   Streaming   Cluster  Environment
AlertExecutor_{1}
AlertExecutor_{2}
…
AlertExecutor_{N}
Stream_{1}
Stream_{*}
Stream	
  
Processing
Dynamic	
  policy	
  partition	
  by	
  {event}	
  *	
  {policy}
• N	
  Users	
  with	
  3	
  partitions,	
  M	
  policies	
  with	
  2	
  partitions,	
  then	
  3*2	
  physical	
  tasks
• Physical	
  partition	
  +	
  policy-­‐level	
  partition
Linear	
  Scalability Principle
3
Algorithm Weights	
  of	
  Executors	
   By	
  Partition	
   User
Random 0.0484 0.152 0.3535 0.105 0.203 0.072 0.042 0.024
Greedy 0.0837 0.0837 0.0837 0.0837 0.0737 0.0637 0.0437 0.0837
Stream	
  Partition	
  Skew	
  (15:1)
Distributed	
  Policy	
  Engine – Optimization
Stream	
  Partition	
  Problem	
  	
  
https://en.wikipedia.org/wiki/Partition_problem
Distributed  Real-­time    Policy  Engine
Siddhi	
  CEP	
  Policy	
  
Evaluator
Machine	
  Learning	
  
Policy	
  Evaluator• Support	
  WSO2	
  Siddhi	
  CEP	
  as	
  first	
  class
• Extensible	
  Policy	
  Engine	
  Implementation
• Extensible	
  Policy	
  Lifecycle	
  Management
• Metadata-­‐based	
  Module	
  Management
Extensible	
   Policy	
  
Evaluator
public interface PolicyEvaluatorServiceProvider {
public String getPolicyType(); // literal string to identify one type of policy
public Class getPolicyEvaluator(); // get policy evaluator implementation
public List getBindingModules(); // policy text with json format to object mapping
}
METADATA  MANAGER
Policy/Metadata
Distributed	
  Policy	
  Engine – Extensibility
Policy	
  Engine	
  Extensibility
Stream	
  Processing	
  Framework
Optimizer
1.	
  Development 2.	
  Optimization 3.	
  Compile	
  to	
  native	
  app
Use	
  eagle	
  alert	
  framework	
  as	
  library
• Light-­‐weight	
  ORM	
  Framework	
  for	
  HBase/RDMBS
• Full-­‐function	
   SQL-­‐Like	
  REST	
  Query	
  
• Optimized	
  Rowkey design	
  for	
  time-­‐series	
  data
• Native	
  HBase Coprocessor
• Secondary	
  Index	
  Support
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@JsonIgnoreProperties(ignoreUnknown = true)
@TimeSeries(false)
@Tags({"site", "dataSource", "alertExecutorId", "policyId",
"policyType"})
@Indexes({
@Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID"
}, unique = true),
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{
@Column("a")
private String desc;
@Column("b")
private String policyDef;
@Column("c")
private String dedupeDef;
Query=AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}
Large	
  Scale	
  Storage	
  and	
  Query
Uniform	
  HBase rowkey design
• Metric
• Entity
• Log
Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …
Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
Rowvalue ::= Log Content
Large	
  Scale	
  Storage	
  and	
  Query
http://opentsdb.net
Multi-­‐Tenants	
  – Topology	
  Scheduler
• Dynamical	
  Topology	
  Management
• No-­‐downtime	
  Topology	
  Maintenance
• Topology	
  High	
  Availability	
  &	
  Balance
• Resource	
  Scheduling	
  &	
  Isolation:	
  Runtime,Woker,	
  Topology	
  or	
  Cluster
Multi-­‐Tenants	
  -­‐ Dynamical	
  Correlation
• Dynamical	
  Correlation	
  on	
  Runtime:
• Sort,	
  Groupby,	
  Join,	
  Window
• Hot	
  Deploy	
  Logic
• Policy	
  Management
• Multi-­‐Correlation	
  on	
  the	
  Single	
  Stream
• Group	
  by	
  different	
  fields	
  of	
  same	
  stream
• Resort	
  same	
  stream	
  by	
  different	
  order
• Join	
  certain	
  stream	
  in	
  different	
  way
• Multi	
  Correlation	
  on	
  Multi	
  Streams
• Cross	
  Streams	
  Join
• Real-­‐time	
  &	
  Historical	
  Stream	
  Join
Multi-­‐Tenants	
  -­‐ Dynamical	
  Correlation
Eagle	
  Use	
  Cases
Data	
  Activity	
  Monitoring
Secure	
  Hadoop	
  in	
  Realtime a	
  data	
  activity	
  monitoring	
  solution	
  to	
  instantly	
  identify	
  access	
  to	
  sensitive	
  data,	
  	
  recognize	
  
attacks/	
  malicious	
  activity	
  and	
  block	
  access	
  in	
  real	
  time.
Job	
  Performance	
  Monitoring
Hadoop,	
  Spark	
  Job	
  Profiling	
  &	
  Performance	
  Monitoring,	
  Cluster	
  Health	
  Anomaly	
  Detection
1
2
4
eBay	
  Unified	
  Monitoring	
  Platform
Provide	
  unified	
  monitoring-­‐as-­‐service	
  for	
  everything	
  around	
  infrastructure	
  or	
  business.
3
Eagle	
  Data	
  Activity	
  Monitoring
Data	
  Loss	
  Prevention
Get	
  alerted	
  and	
  stop	
  a	
  malicious	
  user	
  trying	
  to	
  copy,	
  delete,	
  move	
  
sensitive	
  data	
  from	
  the	
  Hadoop	
  cluster.
Malicious	
  Logins
Detect	
  login	
  when	
  malicious	
  user	
  tries	
  to	
  guess	
  password.	
  Eagle	
  
creates	
  user	
  profiles	
  using	
  machine	
  learning	
  algorithm	
  to	
  detect	
  
anomalies
Unauthorized	
  access
Detect	
  and	
  stop	
  a	
  malicious	
  user	
  trying	
  to	
  access	
  classified	
  data	
  
without	
  privilege.	
  
Malicious	
  user	
  operation
Detect	
  and	
  stop	
  a	
  malicious	
  user	
  trying	
  to	
  delete	
  large	
  amount	
  of	
  
data.	
  Operation	
  type	
  is	
  one	
  parameter	
  of	
  Eagle	
  user	
  profiles.	
  Eagle	
  
supports	
  multiple	
  native	
  operation	
  types.
User
Privileges
Common	
  
Data	
  Sets
Patterns
CommandsZones
Query
Columns
§ HDFS	
  Policies
§ Access	
  to	
  Sensitive	
  files
§ HDFS	
  Commands	
  used	
  (read,	
  write,	
  update…)
§ Client	
  host	
  
§ Destination
§ Security	
  Zones
§ Hive	
  Policies
§ Access	
  to	
  tables	
  with	
  PII	
  Data
§ SQL	
  Query	
  Profiles
§ Client	
  Host
§ Security	
  Zone
Eagle	
  Data	
  Activity	
  Monitoring
Hadoop	
  Data	
  Security:	
  Detect	
  anomalies	
  in	
  accessing	
  HDFS	
  and	
  Hive
Offline:	
  Determine	
  bandwidth	
  from	
  training	
  dataset	
  the	
  kernel	
  density	
  function	
   parameters	
  (KDE)
Online:	
  If	
  a	
  test	
  data	
  point	
  lies	
  outside	
  the	
  trained	
  bandwidth,	
   it	
  is	
  anomaly	
  (Policy)
PCs(Principle	
  Components)	
  in	
  EVD
(Eigenvalue	
  Value	
  Decomposition)
Kernel	
  Density	
  Function
Eagle	
  Machine	
  Learning	
  User	
  Profile
Use	
  Case Detect	
  node	
  anomaly	
  by	
  analyzing	
  task	
  failure	
  ratio	
  across	
  all	
  nodes
Assumption Task	
  failure	
  ratio	
  for	
  every	
  node	
  should	
  be	
  approximately	
  equal
Algorithm Node	
  by	
  node	
  compare	
  (symmetry	
  violation)	
  and	
  per	
  node	
  trend
Eagle	
  Job	
  Performance	
  Monitoring
Alerting:	
  
Anomaly	
  Detection	
  Alerting
Insight:	
  
Task	
  failure	
  drill-­‐down
Insight:	
  
Task	
  failure	
  drill-­‐down
Task	
  Failure	
  based	
  Anomaly	
  Host	
  Detection	
  
Counters  &  Features
Use	
  Case Detect	
  data	
  skew	
  by	
  statistics	
  and	
  distributions	
  for	
  attempt	
  execution	
  durations	
  and	
  
counters
Assumption	
  	
  	
  Duration	
  and	
  counters	
  should	
  be	
  in	
  normal	
  distribution
mapDuration
reduceDuration
mapInputRecords
reduceInputRecords
combineInputRecords
mapSpilledRecords
reduceShuffleRecords
mapLocalFileBytesRead
reduceLocalFileBytesRead
mapHDFSBytesRead
reduceHDFSBytesRead
Modeling  &  Statistics
Avg
Min
Max  
Distributions
Max  z-­score
Top-­N
Correlation
Threshold  &  Detection
Counters
Correlation  >  0.9  
&  Max(Z-­Score)  >  90%  
Hadoop	
  Job	
  Data	
  Skew	
  Detection
Counters  &  Features
Use	
  Case Detect	
  data	
  skew	
  by	
  statistics	
  and	
  distributions	
  for	
  attempt	
  execution	
  durations	
  and	
  
counters
Assumption	
  	
  	
  Duration	
  and	
  counters	
  should	
  be	
  in	
  normal	
  distribution
mapDuration
reduceDuration
mapInputRecords
reduceInputRecords
combineInputRecords
mapSpilledRecords
reduceShuffleRecords
mapLocalFileBytesRead
reduceLocalFileBytesRead
mapHDFSBytesRead
reduceHDFSBytesRead
Modeling  &  Statistics
Avg
Min
Max  
Distributions
Max  z-­score
Top-­N
Correlation
Threshold  &  Detection
Counters
Correlation  >  0.9  
&  Max(Z-­Score)  >  90%  
Hadoop	
  Job	
  Data	
  Skew	
  Detection
Learn	
  More	
  about	
  Apache	
  Eagle
Community
• Website:	
  http://eagle.incubator.apache.org
• Github:	
  http://github.com/apache/incubator-­‐eagle
• Mailing	
  list:	
  dev@eagle.incubator.apache.org
Resources
• Documentation:	
  http://eagle.incubator.apache.org/docs/
• Docker images:	
  https://hub.docker.com/r/apacheeagle/sandbox/
Publications	
  &	
  Patents
• EAGLE:	
  USER	
  PROFILE-­‐BASED	
  ANOMALY	
  DETECTION	
  IN	
  HADOOP	
  CLUSTER	
  (IEEE)
• EAGLE:	
  DISTRIBUTED	
  REALTIME	
  MONITORING	
  FRAMEWORK	
  FOR	
  HADOOP	
  CLUSTER
Q	
  &	
  A
http://eagle.incubator.apache.org
The	
  slide	
  is	
  licensed	
  under	
  Creative	
  Commons	
  Attribution	
  4.0	
  International	
  license.

Más contenido relacionado

La actualidad más candente

Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieDataWorks Summit/Hadoop Summit
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseDataWorks Summit
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsDataWorks Summit
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
 
Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesDataWorks Summit
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...DataWorks Summit
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseDataWorks Summit
 
Machine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataMachine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataDataWorks Summit/Hadoop Summit
 
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Hortonworks
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceDataWorks Summit/Hadoop Summit
 
Evolving Hadoop into an Operational Platform with Data Applications
Evolving Hadoop into an Operational Platform with Data ApplicationsEvolving Hadoop into an Operational Platform with Data Applications
Evolving Hadoop into an Operational Platform with Data ApplicationsDataWorks Summit
 
Build Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsightBuild Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsightDataWorks Summit/Hadoop Summit
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...Data Con LA
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataDataWorks Summit
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...DataWorks Summit
 

La actualidad más candente (20)

Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache Oozie
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Automated Analytics at Scale
Automated Analytics at ScaleAutomated Analytics at Scale
Automated Analytics at Scale
 
Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management Challenges
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
 
Machine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataMachine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of Data
 
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
 
What's new in Ambari
What's new in AmbariWhat's new in Ambari
What's new in Ambari
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
 
Evolving Hadoop into an Operational Platform with Data Applications
Evolving Hadoop into an Operational Platform with Data ApplicationsEvolving Hadoop into an Operational Platform with Data Applications
Evolving Hadoop into an Operational Platform with Data Applications
 
Build Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsightBuild Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsight
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government data
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
 

Destacado

Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Arun Karthick Manoharan
 
Apache Eagle in Action
Apache Eagle in ActionApache Eagle in Action
Apache Eagle in ActionHao Chen
 
Apache Eagle Dublin Hadoop Summit 2016
Apache Eagle   Dublin Hadoop Summit 2016Apache Eagle   Dublin Hadoop Summit 2016
Apache Eagle Dublin Hadoop Summit 2016Edward Zhang
 
WSO2 and 2 Degrees Case Study
WSO2 and 2 Degrees Case StudyWSO2 and 2 Degrees Case Study
WSO2 and 2 Degrees Case StudyWSO2
 
WSO2 & eBay Case Study
WSO2 & eBay Case StudyWSO2 & eBay Case Study
WSO2 & eBay Case StudyWSO2
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInCarl Steinbach
 
Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015Hao Chen
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureRyan Hennig
 
IoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJIoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJDaniel Madrigal
 
Making the leap to BI on Hadoop by Mariani, dave @ atscale
Making the leap to BI on Hadoop by Mariani, dave @ atscaleMaking the leap to BI on Hadoop by Mariani, dave @ atscale
Making the leap to BI on Hadoop by Mariani, dave @ atscaleTin Ho
 
Lightbend Training for Scala, Akka, Play Framework and Apache Spark
Lightbend Training for Scala, Akka, Play Framework and Apache SparkLightbend Training for Scala, Akka, Play Framework and Apache Spark
Lightbend Training for Scala, Akka, Play Framework and Apache SparkLightbend
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it DataWorks Summit/Hadoop Summit
 

Destacado (20)

Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Apache Eagle in Action
Apache Eagle in ActionApache Eagle in Action
Apache Eagle in Action
 
Apache Eagle Dublin Hadoop Summit 2016
Apache Eagle   Dublin Hadoop Summit 2016Apache Eagle   Dublin Hadoop Summit 2016
Apache Eagle Dublin Hadoop Summit 2016
 
WSO2 and 2 Degrees Case Study
WSO2 and 2 Degrees Case StudyWSO2 and 2 Degrees Case Study
WSO2 and 2 Degrees Case Study
 
WSO2 & eBay Case Study
WSO2 & eBay Case StudyWSO2 & eBay Case Study
WSO2 & eBay Case Study
 
eBay Case Study
eBay Case StudyeBay Case Study
eBay Case Study
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedIn
 
Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and Future
 
Real Time BI with Hadoop
Real Time BI with HadoopReal Time BI with Hadoop
Real Time BI with Hadoop
 
Omid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBaseOmid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBase
 
IoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJIoT Crash Course Hadoop Summit SJ
IoT Crash Course Hadoop Summit SJ
 
Using Hadoop for Cognitive Analytics
Using Hadoop for Cognitive AnalyticsUsing Hadoop for Cognitive Analytics
Using Hadoop for Cognitive Analytics
 
Making the leap to BI on Hadoop by Mariani, dave @ atscale
Making the leap to BI on Hadoop by Mariani, dave @ atscaleMaking the leap to BI on Hadoop by Mariani, dave @ atscale
Making the leap to BI on Hadoop by Mariani, dave @ atscale
 
Curb your insecurity with HDP
Curb your insecurity with HDPCurb your insecurity with HDP
Curb your insecurity with HDP
 
The Path to Wellness through Big Data
The Path to Wellness through Big DataThe Path to Wellness through Big Data
The Path to Wellness through Big Data
 
Lightbend Training for Scala, Akka, Play Framework and Apache Spark
Lightbend Training for Scala, Akka, Play Framework and Apache SparkLightbend Training for Scala, Akka, Play Framework and Apache Spark
Lightbend Training for Scala, Akka, Play Framework and Apache Spark
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it
 

Similar a Apache Eagle: Secure Hadoop in Real Time

Apache Eagle: 来自eBay的分布式实时Hadoop数据安全引擎
Apache Eagle: 来自eBay的分布式实时Hadoop数据安全引擎Apache Eagle: 来自eBay的分布式实时Hadoop数据安全引擎
Apache Eagle: 来自eBay的分布式实时Hadoop数据安全引擎Qingwen zhao
 
Apache Eagle Architecture Evolvement
Apache Eagle Architecture EvolvementApache Eagle Architecture Evolvement
Apache Eagle Architecture EvolvementHao Chen
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudScott Miao
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Cask Data
 
Apache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New FeaturesApache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New FeaturesHao Chen
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshSion Smith
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
Log Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin KropovLog Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin KropovSoftServe
 
Log Data Analysis Platform
Log Data Analysis PlatformLog Data Analysis Platform
Log Data Analysis PlatformValentin Kropov
 
Regain Control Thanks To Prometheus
Regain Control Thanks To PrometheusRegain Control Thanks To Prometheus
Regain Control Thanks To PrometheusEtienne Coutaud
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopDataWorks Summit
 
Building a cloud based managed BigData platform for the enterprise
Building a cloud based managed BigData platform for the enterpriseBuilding a cloud based managed BigData platform for the enterprise
Building a cloud based managed BigData platform for the enterpriseHemanth Yamijala
 
Apache Eagle: eBay构建开源分布式实时预警引擎实践
Apache Eagle: eBay构建开源分布式实时预警引擎实践Apache Eagle: eBay构建开源分布式实时预警引擎实践
Apache Eagle: eBay构建开源分布式实时预警引擎实践Hao Chen
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Timothy Spann
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...Amazon Web Services
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemAccumulo Summit
 
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoGimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoRomit Mehta
 

Similar a Apache Eagle: Secure Hadoop in Real Time (20)

ebay
ebayebay
ebay
 
Apache Eagle: 来自eBay的分布式实时Hadoop数据安全引擎
Apache Eagle: 来自eBay的分布式实时Hadoop数据安全引擎Apache Eagle: 来自eBay的分布式实时Hadoop数据安全引擎
Apache Eagle: 来自eBay的分布式实时Hadoop数据安全引擎
 
Apache Eagle Architecture Evolvement
Apache Eagle Architecture EvolvementApache Eagle Architecture Evolvement
Apache Eagle Architecture Evolvement
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Apache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New FeaturesApache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New Features
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Log Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin KropovLog Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin Kropov
 
Log Data Analysis Platform
Log Data Analysis PlatformLog Data Analysis Platform
Log Data Analysis Platform
 
Regain Control Thanks To Prometheus
Regain Control Thanks To PrometheusRegain Control Thanks To Prometheus
Regain Control Thanks To Prometheus
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
 
Building a cloud based managed BigData platform for the enterprise
Building a cloud based managed BigData platform for the enterpriseBuilding a cloud based managed BigData platform for the enterprise
Building a cloud based managed BigData platform for the enterprise
 
Apache Eagle: eBay构建开源分布式实时预警引擎实践
Apache Eagle: eBay构建开源分布式实时预警引擎实践Apache Eagle: eBay构建开源分布式实时预警引擎实践
Apache Eagle: eBay构建开源分布式实时预警引擎实践
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
 
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoGimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
 

Más de DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

Más de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Último

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Último (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

Apache Eagle: Secure Hadoop in Real Time

  • 1. Apache Eagle: Secure Hadoop in Real Time http://eagle.incubator.apache.org/ Hao Chen (@haozch) | Ralph Su (@raphaelsu)
  • 2. About Us Apache Eagle Co-creator,PMC & Committer hao@apache.org Sr. Software Engineer at eBay hchen9@ebay.com Apache Eagle PMC & Committer ralphsu@apache.org MTS. Software Engineer at eBay liasu@ebay.com Ralph Su / 苏良飞 Hao Chen / 陈浩
  • 4. is a distributed real-time monitoring and alerting engine for hadoop from eBay Open sourced as Apache Incubator Project on Oct 26th 2015 Secure Hadoop in Realtime a data activity monitoring solution to instantly identify access to sensitive data, recognize attacks/ malicious activity and block access in real time. See http://eagle.incubator.apache.org or http://github.com/apache/incubator-eagle Apache Eagle
  • 5. Eagle  was  initialized  by  end  of  2013  for  hadoop ecosystem  monitoring  as  any  existing   tool  like  zabbix,  ganglia  can  not  handle  the  huge  volume  of  metrics/logs  generated  by   hadoop system  in  eBay. 2014/2015 10,000  nodes 150,000+  cores 200  PB 2000+  user 3000+  nodes 10,000+  cores 50+  PB 2012 2011 1000+  nodes 10,000+  cores 10+  PB 100+  nodes 1000  +  cores 1  PB 2010 2009 50+  nodes 2007 1-­‐10  nodes Hadoop  Data   • Security • Activity Hadoop  Platform   • Heath • Availability • Performance Hadoop  @  eBay  Inc Initiative  – Why  build  Eagle? 7+   CLUSTERS 10000+ NODES 200+  PB DATA 10  B+   EVENTS  /  DAY 500+   METRIC  TYPES 50,000+   JOBS  /  DAY 50,000,000+   TASKS  /  DAY
  • 6. Initiative  – Core  Challenges Large  Scale  Processing  and  Storage • Scale  IO  Complexity  (Stream) • Scale  Computation  Complexity  (Policy) • Scale  Storage  &  Query  (Event,  Log,  Metric) 1 Real-­‐Time  Alerting • Real-­‐time  Data  Collection • Real-­‐time  Stream  Processing • Real-­‐time  Alerting 2 Expressive Correlation  Model • Complex  policy  model • Stream  GroupBy,  Join,  Window • Machine  Learning 3 Hadoop  Ecosystem  Integration • Data  Source  Integration • Anomaly  Detection  Model 4
  • 7. 1 Eagle  Framework Distributed  real-­‐time  framework  for  efficiently  developing   highly   scalable  monitoring   applications Eagle  Ecosystem 2 Eagle  Apps Security/  Hadoop/  Operational  Intelligence  /  … 3 Eagle  Interface REST  Service  /  Management  UI  /  Customizable  Analytics   Visualization 4 Eagle  Integration Ambari /  Docker /  Ranger  /  Dataguise Apps à Security à Hadoop à Cloud à Database Interface à Web Portal à REST Services à Analytics Visualization Integration à Ambari à Docker à Ranger à Dataguise Eagle Framework Open  Source Community-­‐driven   and  Cross-­‐community  cooperation 5
  • 8. Eagle  @  eBay  Inc. 7+   CLUSTERS 10000+ NODES 200+  PB DATA 10  B+   EVENTS  /  DAY 500+   METRIC  TYPES 50,000+   JOBS  /  DAY 50,000,000+   TASKS  /  DAY Eagle  Deployment  at  eBay  Production • 100+  security  policies • 8  nodes • 30  worker  process • 64  kafka partition Eagle  Performance • Avg Latency:   ~  50  ms • Max  Throughput/Cluster:  300  k  /s
  • 10. Distributed  Policy  Engine METADATA   MANAGER AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Real  Time  Alerts Alerts Policy   Management Policy Dynamical   Policy   Deployment Real-­‐time   Event  Stream Stream_{1} Stream_{*} Dynamical   Stream  Schema Stream   Processing • Real-­‐time  Streaming: Apache  Storm  (Execution  Engine)  +  Kafka  (Message  Bus) • Declarative  Policy: SQL  (CEP)  on  Streaming  +  Hot  Deploy • Linear  Scalability:  Data  volume scale +  Computation  scale • Metadata-­‐Driven:  Schema  Management  and  Dynamical  Policy  Lifecycle
  • 11. METADATA   MANAGER AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Real  Time  Alerts Alerts Policy   Management Policy Dynamical   Policy   Deployment Real-­‐time   Event  Stream Stream_{1} Stream_{*} Dynamical   Stream  Schema Stream   Processing from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream; • Real-­‐time  Streaming: Apache  Storm  (Execution  Engine)  +  Kafka  (Message  Bus) • Declarative  Policy: SQL  (CEP)  on  Streaming  +  Hot  Deploy • Linear  Scalability:  Data  volume scale +  Computation  scale • Metadata-­‐Driven:  Schema  Management  and  Dynamical  Policy  Lifecycle Distributed  Policy  Engine
  • 12. Declarative  Policy • Filter • Join • Aggregation:  Avg,  Sum  ,  Min,  Max,  etc • Group by • Having • Stream handlers  for  window:  TimeWindow,  Batch  Window,   Length  Window   • Conditions  and  Expressions:  and,  or,  not,  ==,!=,  >=,  >,  <=,  <,   and  arithmetic  operations • Pattern  Processing • Sequence  processing • Event  Tables:  intergrate historical  data  in  realtime processing • SQL-­‐Like  Query:  Query,  Stream  Definition   and  Query  Plan   compilation Distributed  SQL  on  Streaming  :   Siddhi  CEP  +  Storm  by  default from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;
  • 13. Declarative  Policy  -­‐ Examples from hadoopJmxMetricEventStream [metric == "hadoop.namenode.fsnamesystemstate.capacityused" and value > 0.9] select metric, host, value, timestamp, component, site insert into alertStream; Example  1:  Alert  if  hadoop namenode capacity  usage  exceed  90  percentages from every a = hadoopJmxMetricEventStream[metric=="hadoop.namenode.fsnamesystem.hastate"] -> b = hadoopJmxMetricEventStream[metric==a.metric and b.host == a.host and a.value != value)] within 10 min select a.host, a.value as oldHaState, b.value as newHaState, b.timestamp as timestamp, b.metric as metric, b.component as component, b.site as site insert into alertStream; Example  2:  Alert  if  hadoop namenode HA  switches
  • 14. 1 Distributed  Policy  Engine  -­‐ Scalability 2 Distributed   Streaming   Cluster  Environment AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Stream_{1} Stream_{*} Stream   Processing Dynamic  policy  partition  by  {event}  *  {policy} • N  Users  with  3  partitions,  M  policies  with  2  partitions,  then  3*2  physical  tasks • Physical  partition  +  policy-­‐level  partition Linear  Scalability Principle
  • 15. 3 Algorithm Weights  of  Executors   By  Partition   User Random 0.0484 0.152 0.3535 0.105 0.203 0.072 0.042 0.024 Greedy 0.0837 0.0837 0.0837 0.0837 0.0737 0.0637 0.0437 0.0837 Stream  Partition  Skew  (15:1) Distributed  Policy  Engine – Optimization Stream  Partition  Problem     https://en.wikipedia.org/wiki/Partition_problem
  • 16. Distributed  Real-­time    Policy  Engine Siddhi  CEP  Policy   Evaluator Machine  Learning   Policy  Evaluator• Support  WSO2  Siddhi  CEP  as  first  class • Extensible  Policy  Engine  Implementation • Extensible  Policy  Lifecycle  Management • Metadata-­‐based  Module  Management Extensible   Policy   Evaluator public interface PolicyEvaluatorServiceProvider { public String getPolicyType(); // literal string to identify one type of policy public Class getPolicyEvaluator(); // get policy evaluator implementation public List getBindingModules(); // policy text with json format to object mapping } METADATA  MANAGER Policy/Metadata Distributed  Policy  Engine – Extensibility Policy  Engine  Extensibility
  • 17. Stream  Processing  Framework Optimizer 1.  Development 2.  Optimization 3.  Compile  to  native  app Use  eagle  alert  framework  as  library
  • 18. • Light-­‐weight  ORM  Framework  for  HBase/RDMBS • Full-­‐function   SQL-­‐Like  REST  Query   • Optimized  Rowkey design  for  time-­‐series  data • Native  HBase Coprocessor • Secondary  Index  Support @Table("alertdef") @ColumnFamily("f") @Prefix("alertdef") @Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME) @JsonIgnoreProperties(ignoreUnknown = true) @TimeSeries(false) @Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"}) @Indexes({ @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true), }) public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{ @Column("a") private String desc; @Column("b") private String policyDef; @Column("c") private String dedupeDef; Query=AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef} Large  Scale  Storage  and  Query
  • 19. Uniform  HBase rowkey design • Metric • Entity • Log Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | … Rowvalue ::= Log Content Large  Scale  Storage  and  Query http://opentsdb.net
  • 20. Multi-­‐Tenants  – Topology  Scheduler • Dynamical  Topology  Management • No-­‐downtime  Topology  Maintenance • Topology  High  Availability  &  Balance • Resource  Scheduling  &  Isolation:  Runtime,Woker,  Topology  or  Cluster
  • 21. Multi-­‐Tenants  -­‐ Dynamical  Correlation • Dynamical  Correlation  on  Runtime: • Sort,  Groupby,  Join,  Window • Hot  Deploy  Logic • Policy  Management • Multi-­‐Correlation  on  the  Single  Stream • Group  by  different  fields  of  same  stream • Resort  same  stream  by  different  order • Join  certain  stream  in  different  way • Multi  Correlation  on  Multi  Streams • Cross  Streams  Join • Real-­‐time  &  Historical  Stream  Join
  • 23. Eagle  Use  Cases Data  Activity  Monitoring Secure  Hadoop  in  Realtime a  data  activity  monitoring  solution  to  instantly  identify  access  to  sensitive  data,    recognize   attacks/  malicious  activity  and  block  access  in  real  time. Job  Performance  Monitoring Hadoop,  Spark  Job  Profiling  &  Performance  Monitoring,  Cluster  Health  Anomaly  Detection 1 2 4 eBay  Unified  Monitoring  Platform Provide  unified  monitoring-­‐as-­‐service  for  everything  around  infrastructure  or  business. 3
  • 24. Eagle  Data  Activity  Monitoring Data  Loss  Prevention Get  alerted  and  stop  a  malicious  user  trying  to  copy,  delete,  move   sensitive  data  from  the  Hadoop  cluster. Malicious  Logins Detect  login  when  malicious  user  tries  to  guess  password.  Eagle   creates  user  profiles  using  machine  learning  algorithm  to  detect   anomalies Unauthorized  access Detect  and  stop  a  malicious  user  trying  to  access  classified  data   without  privilege.   Malicious  user  operation Detect  and  stop  a  malicious  user  trying  to  delete  large  amount  of   data.  Operation  type  is  one  parameter  of  Eagle  user  profiles.  Eagle   supports  multiple  native  operation  types. User Privileges Common   Data  Sets Patterns CommandsZones Query Columns
  • 25. § HDFS  Policies § Access  to  Sensitive  files § HDFS  Commands  used  (read,  write,  update…) § Client  host   § Destination § Security  Zones § Hive  Policies § Access  to  tables  with  PII  Data § SQL  Query  Profiles § Client  Host § Security  Zone Eagle  Data  Activity  Monitoring Hadoop  Data  Security:  Detect  anomalies  in  accessing  HDFS  and  Hive
  • 26. Offline:  Determine  bandwidth  from  training  dataset  the  kernel  density  function   parameters  (KDE) Online:  If  a  test  data  point  lies  outside  the  trained  bandwidth,   it  is  anomaly  (Policy) PCs(Principle  Components)  in  EVD (Eigenvalue  Value  Decomposition) Kernel  Density  Function Eagle  Machine  Learning  User  Profile
  • 27. Use  Case Detect  node  anomaly  by  analyzing  task  failure  ratio  across  all  nodes Assumption Task  failure  ratio  for  every  node  should  be  approximately  equal Algorithm Node  by  node  compare  (symmetry  violation)  and  per  node  trend Eagle  Job  Performance  Monitoring
  • 28. Alerting:   Anomaly  Detection  Alerting Insight:   Task  failure  drill-­‐down Insight:   Task  failure  drill-­‐down Task  Failure  based  Anomaly  Host  Detection  
  • 29. Counters  &  Features Use  Case Detect  data  skew  by  statistics  and  distributions  for  attempt  execution  durations  and   counters Assumption      Duration  and  counters  should  be  in  normal  distribution mapDuration reduceDuration mapInputRecords reduceInputRecords combineInputRecords mapSpilledRecords reduceShuffleRecords mapLocalFileBytesRead reduceLocalFileBytesRead mapHDFSBytesRead reduceHDFSBytesRead Modeling  &  Statistics Avg Min Max   Distributions Max  z-­score Top-­N Correlation Threshold  &  Detection Counters Correlation  >  0.9   &  Max(Z-­Score)  >  90%   Hadoop  Job  Data  Skew  Detection
  • 30. Counters  &  Features Use  Case Detect  data  skew  by  statistics  and  distributions  for  attempt  execution  durations  and   counters Assumption      Duration  and  counters  should  be  in  normal  distribution mapDuration reduceDuration mapInputRecords reduceInputRecords combineInputRecords mapSpilledRecords reduceShuffleRecords mapLocalFileBytesRead reduceLocalFileBytesRead mapHDFSBytesRead reduceHDFSBytesRead Modeling  &  Statistics Avg Min Max   Distributions Max  z-­score Top-­N Correlation Threshold  &  Detection Counters Correlation  >  0.9   &  Max(Z-­Score)  >  90%   Hadoop  Job  Data  Skew  Detection
  • 31. Learn  More  about  Apache  Eagle Community • Website:  http://eagle.incubator.apache.org • Github:  http://github.com/apache/incubator-­‐eagle • Mailing  list:  dev@eagle.incubator.apache.org Resources • Documentation:  http://eagle.incubator.apache.org/docs/ • Docker images:  https://hub.docker.com/r/apacheeagle/sandbox/ Publications  &  Patents • EAGLE:  USER  PROFILE-­‐BASED  ANOMALY  DETECTION  IN  HADOOP  CLUSTER  (IEEE) • EAGLE:  DISTRIBUTED  REALTIME  MONITORING  FRAMEWORK  FOR  HADOOP  CLUSTER
  • 32. Q  &  A http://eagle.incubator.apache.org The  slide  is  licensed  under  Creative  Commons  Attribution  4.0  International  license.