Realtime Components & Architectures
Prepared for Big Data Madison
Ryan Bosshart // Systems Engineer

Agenda
• (Near) Real-Time Problems – Actual Cloudera Use-Cases
• Applicable Frameworks and Architectures
• DDOS Example & Code

Connected Medical Devices
• Batch:
– How does a patient’s disease progress over time?
– How does physician training affect disease state?
– How can we recommend better therapies?
• Realtime:
– What is the patient’s disease state right now?
– Alert on potential device malfunctions.

Connected Cars
• Batch:
– Manufacturer wants to know optimal charge performance.
• Real-time:
– Consumer wants to know if teen is driving car right now. How fast are they accelerating / driving?
– Vehicle Service – e.g. grab an up-to-date “diagnosis bundle” before service.

Security
[Diagram label: Victim’s Infrastructure]
• Batch analytics:
– Which source countries are most common?
• Realtime:
– How do we detect and stop attackers right now?

Netflow Data
Bytes   Contents    Description
0-3     srcaddr     Source IP address
4-7     dstaddr     Destination IP address
8-11    nexthop     IP address of next hop router
12-13   input       SNMP index of input interface
14-15   output      SNMP index of output interface
16-19   dPkts       Packets in the flow
20-23   dOctets     Total number of Layer 3 bytes in the packets of the flow
24-27   first       SysUptime at start of flow
28-31   last        SysUptime at the time the last packet of the flow was received
32-33   srcport     TCP/UDP source port number or equivalent
34-35   dstport     TCP/UDP destination port number or equivalent
36      pad1        Unused (zero) bytes
37      tcp_flags   Cumulative OR of TCP flags
38      prot        IP protocol type (for example, TCP = 6; UDP = 17)
39      tos         IP type of service (ToS)
40-41   src_as      Autonomous system number of the source, either origin or peer
42-43   dst_as      Autonomous system number of the destination, either origin or peer
44      src_mask    Source address prefix mask bits
45      dst_mask    Destination address prefix mask bits
46-47   pad2        Unused (zero) bytes
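The byte offsets in the table are enough to decode a record by hand. The following is a minimal sketch in plain Scala, assuming a 48-byte record laid out exactly as in the table and big-endian (network) byte order. The `FlowRecord` case class and its helper names are illustrative and are not taken from the talk's demo code.

```scala
import java.net.InetAddress
import java.nio.ByteBuffer

// Hypothetical record type; field names follow the "Contents" column of the table.
final case class FlowRecord(
    srcaddr: String, dstaddr: String, nexthop: String,
    input: Int, output: Int,
    dPkts: Long, dOctets: Long, first: Long, last: Long,
    srcport: Int, dstport: Int,
    tcpFlags: Int, prot: Int, tos: Int,
    srcAs: Int, dstAs: Int, srcMask: Int, dstMask: Int)

object FlowRecord {
  val Size = 48 // bytes 0-47 in the table

  // Read 4 bytes and render them as a dotted-quad IPv4 address.
  private def ipv4(buf: ByteBuffer): String = {
    val raw = new Array[Byte](4)
    buf.get(raw)
    InetAddress.getByAddress(raw).getHostAddress
  }
  // The JVM has no unsigned integers, so widen and mask each field.
  private def u8(buf: ByteBuffer): Int = buf.get() & 0xFF
  private def u16(buf: ByteBuffer): Int = buf.getShort() & 0xFFFF
  private def u32(buf: ByteBuffer): Long = buf.getInt() & 0xFFFFFFFFL

  def parse(bytes: Array[Byte]): FlowRecord = {
    require(bytes.length >= Size, s"need $Size bytes, got ${bytes.length}")
    val buf = ByteBuffer.wrap(bytes) // ByteBuffer is big-endian by default (assumed wire order)
    val srcaddr = ipv4(buf)          // bytes 0-3
    val dstaddr = ipv4(buf)          // bytes 4-7
    val nexthop = ipv4(buf)          // bytes 8-11
    val input = u16(buf)             // bytes 12-13
    val output = u16(buf)            // bytes 14-15
    val dPkts = u32(buf)             // bytes 16-19
    val dOctets = u32(buf)           // bytes 20-23
    val first = u32(buf)             // bytes 24-27
    val last = u32(buf)              // bytes 28-31
    val srcport = u16(buf)           // bytes 32-33
    val dstport = u16(buf)           // bytes 34-35
    buf.get()                        // byte 36: pad1, skipped
    val tcpFlags = u8(buf)           // byte 37
    val prot = u8(buf)               // byte 38
    val tos = u8(buf)                // byte 39
    val srcAs = u16(buf)             // bytes 40-41
    val dstAs = u16(buf)             // bytes 42-43
    val srcMask = u8(buf)            // byte 44
    val dstMask = u8(buf)            // byte 45 (bytes 46-47 are pad2 and are ignored)
    FlowRecord(srcaddr, dstaddr, nexthop, input, output, dPkts, dOctets, first, last,
      srcport, dstport, tcpFlags, prot, tos, srcAs, dstAs, srcMask, dstMask)
  }
}
```

Counters are returned as `Long` and ports as `Int` so that values above the signed range (for example port 51000) are not reported as negative numbers.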

Ingesting and Processing Netflow Data
[Diagram labels: IP Traffic; Netflow; Annotate; IP Geolocation; Pub-Sub; Analyze Data & Train Model; Classify Events as DDOS or Legit; Analyze Long Term; https://github.com/Markus-Go/bonesi]
1. Ingest and Annotate
2. Stage Netflow Events
3. Store & process
4. Store Model
5. Process Realtime Events
6. Alert and Analyze

Need to Process in Different Ways
• Stream Ingestion – low latency persistence to HDFS, HBase, Solr, etc.
• Near Real-Time Processing with External Context – alerting, flagging, transforms, filtering.
• Complex Near Real-Time Processing – complex aggregations, windowed computations, machine learning, etc.

Need to Persist in Different Ways
• Kafka – pub-sub messaging, fast, scalable, durable
• Solr – natural language search, low-latency, scalable
• HBase – online, real-time gets, puts, micro-scans
• HDFS – analytical SQL, scans.

Architecture Patterns for Ingest and Annotation

Ingest…
[Diagram labels: IP Traffic; Netflow; Annotate; IP Geolocation; step 1, Ingest and Annotate; https://github.com/Markus-Go/bonesi]

Flume - Capture and Ingest Streaming Data
[Diagram: Logs → Flume → HDFS]
• Sources: kafka, jms, log4j, directory, thrift
• Sinks: solr, elasticsearch, hbase, kafka

Flume – Interceptors
[Diagram labels: NetFlow Logs; Flume Source; Flume Interceptor; HBase Client; GeoDB; Memory; “netflow” topic]
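One way to picture the annotation step is a function that looks up the source address in a geolocation table and attaches the result to the event. The sketch below is plain Scala rather than the Flume interceptor API. `GeoRange`, `GeoAnnotator`, the header names `srcaddr` and `src_country`, and the in-memory table standing in for GeoDB are all assumptions made for illustration.

```scala
// Hypothetical stand-in for one GeoDB row: an IPv4 CIDR block and a country code.
final case class GeoRange(network: Long, prefixLen: Int, country: String) {
  // Bit mask with the top prefixLen bits set (within 32 bits).
  private val mask: Long =
    if (prefixLen == 0) 0L else (0xFFFFFFFFL << (32 - prefixLen)) & 0xFFFFFFFFL

  def contains(ip: Long): Boolean = (ip & mask) == (network & mask)
}

object GeoAnnotator {
  // Convert a dotted-quad IPv4 string to its 32-bit value held in a Long.
  def ipToLong(ip: String): Long =
    ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  // Build a range from CIDR notation, e.g. "192.0.2.0/24".
  def range(cidr: String, country: String): GeoRange = {
    val Array(net, len) = cidr.split('/')
    GeoRange(ipToLong(net), len.toInt, country)
  }

  // Return the headers plus a "src_country" entry; the most specific
  // (longest-prefix) matching range wins, and "unknown" is used when nothing matches.
  def annotate(headers: Map[String, String], db: Seq[GeoRange]): Map[String, String] = {
    val country = headers.get("srcaddr").flatMap { ip =>
      val addr = ipToLong(ip)
      val matches = db.filter(_.contains(addr))
      if (matches.isEmpty) None else Some(matches.maxBy(_.prefixLen).country)
    }
    headers + ("src_country" -> country.getOrElse("unknown"))
  }
}
```

A linear scan is fine for a sketch; a real lookup against a large geolocation database would more likely use a sorted structure or a remote store, which is presumably why the diagram includes an HBase client.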

Ingest with StreamSets
• Intelligent Monitoring
• Adaptable Flows
• Continuous Platform
• Streaming Sanitization
[Diagram label: GeoDB]

Ingest…
[Diagram labels: IP Traffic; Netflow; IP Geolocation; step 1, Ingest and Annotate]

Real-time Pub-Sub
Apache Kafka

Pub-Sub
[Diagram labels: Netflow; Pub-Sub; Analyze Data & Train Model; Classify Events as DDOS or Legit; IP Geolocation]

Why Kafka?
[Diagram label: 2009]

Why Kafka? Increasing complexity
[Diagram labels: 2009; 2014]

Why Kafka? Decoupling
[Diagram labels: 2014; 2015+ ?]

What is Kafka?
• Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
• Kafka is also a pub-sub messaging system.
• Messages can be text (e.g. syslog), but binary is best (preferably Avro!).
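To make "partitioned commit log" concrete, here is a toy in-memory model in Scala. It is not the Kafka client API: `ToyTopic` and its methods are hypothetical, replication is left out, and the key's `hashCode` stands in for whatever partitioner a real producer uses.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model only: one topic, a fixed number of append-only partitions, no replication.
final class ToyTopic(val name: String, numPartitions: Int) {
  private val partitions = Vector.fill(numPartitions)(ArrayBuffer.empty[Array[Byte]])

  // Choose a partition from the key so that one key always lands in the same partition.
  def partitionFor(key: String): Int = Math.floorMod(key.hashCode, numPartitions)

  // Append a binary message to the end of its partition; returns (partition, offset).
  def append(key: String, value: Array[Byte]): (Int, Long) = {
    val p = partitionFor(key)
    val log = partitions(p)
    log += value
    (p, log.length - 1L)
  }

  // Read from an offset onward. Reading removes nothing, so each subscriber
  // can keep its own offset and re-read independently of the others.
  def read(partition: Int, fromOffset: Long): Seq[Array[Byte]] =
    partitions(partition).drop(fromOffset.toInt).toList
}
```

Because messages stay in the log after being read, several consumers (for example an HDFS sink and a streaming job) can follow the same topic at their own pace, which is the pub-sub behavior described in the second bullet.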

Possible Consumer Architectures
[Diagram: a Kafka Cluster topic with Partition A, Partition B, and Partition C, read by the consumers below]
• Flume HDFS Sink (three sinks) → HDFS
• Flume Solr Sink (three sinks) → Solr
• Flume HBase Sink (three sinks) → HBase
• Spark Streaming DirectStream
• Topology
©2014 Cloudera, Inc. All rights reserved.

Capture and Ingest Streaming Data – Now with Kafka!
[Diagram labels: Logs; Flume; HDFS; Kafka Source; Kafka Channel; HDFS Sink; Kafka]

Processing and Consumption

Processing & Consumption
[Diagram: Kafka delivers all events to HDFS (Storage), where the model is trained in Spark (Batch) and long-term trends are analyzed with Impala (SQL). Kafka also delivers realtime events to Spark Streaming, which reads the model, applies it on DStreams, and outputs alerts and classified IPs.]

Unification of Batch & Streaming
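// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with the dataset
kafkaDStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}
Interoperability
// Learn model offline (KMeans is org.apache.spark.mllib.clustering.KMeans)
val model = KMeans.train(dataset, ...)
// Apply model online on stream. KafkaUtils.createStream is the receiver-based API;
// KafkaUtils.createDirectStream is the direct alternative.
val kafkaStream = KafkaUtils.createStream(...)
// featurize is a user-supplied function that turns an event into a feature vector
kafkaStream.map { event => model.predict(featurize(event)) }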
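The "apply model online" step amounts to assigning each incoming event to its nearest cluster center. The following is a minimal sketch in plain Scala without Spark, assuming a hypothetical `featurize` that uses only packet and byte counts; the `Flow` type and the `NearestCentroid` object are illustrative names, and any centroids passed in stand for the output of offline training.

```scala
object NearestCentroid {
  type Vec = Array[Double]

  // Hypothetical event type; a real pipeline would use the parsed NetFlow record.
  final case class Flow(dPkts: Long, dOctets: Long)

  // Hypothetical feature extraction: two raw counters as a 2-d vector.
  def featurize(f: Flow): Vec = Array(f.dPkts.toDouble, f.dOctets.toDouble)

  // Squared Euclidean distance between two vectors of equal dimension.
  def squaredDistance(a: Vec, b: Vec): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // k-means prediction: the index of the closest centroid.
  def predict(centroids: IndexedSeq[Vec], point: Vec): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), point))
}
```

k-means only returns a cluster index; deciding which cluster corresponds to attack traffic and which to legitimate traffic is a separate labeling step that this sketch does not cover.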
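
“Micro-batch” Architecture
// TwitterUtils comes from the spark-streaming-twitter module
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
// saveAsTextFiles writes one output directory per batch under the given prefix
hashTags.saveAsTextFiles("hdfs://...")
Stream composed of small (1-10s) batch computations
[Diagram: for batch @ t, batch @ t+1, and batch @ t+2, the tweets DStream is turned into the hashTags DStream by flatMap, and each hashTags batch is saved]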
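The micro-batch idea can be shown without Spark: cut the stream into fixed time intervals and run the same ordinary batch computation on each slice. The sketch below is plain Scala under that assumption; `Event`, `MicroBatch`, and this version of `getTags` are hypothetical stand-ins and are not the Spark Streaming API.

```scala
// Hypothetical event type for illustration.
final case class Event(timestampMs: Long, text: String)

object MicroBatch {
  // Assign each event to a batch by integer-dividing its timestamp by the interval.
  // Intervals that contain no events produce no batch in this sketch.
  def batches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events.groupBy(_.timestampMs / intervalMs).toSeq.sortBy(_._1)

  // Hypothetical stand-in for getTags: whitespace-separated words starting with '#'.
  def getTags(text: String): Seq[String] =
    text.split("\\s+").filter(_.startsWith("#")).toSeq

  // Run the same batch computation (a flatMap) on every small batch.
  def hashTagsPerBatch(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[String])] =
    batches(events, intervalMs).map { case (batchId, batch) =>
      batchId -> batch.flatMap(e => getTags(e.text))
    }
}
```

With a 1000 ms interval, events stamped 0 and 900 fall into batch 0 and an event stamped 1500 falls into batch 1, mirroring the "batch @ t, batch @ t+1" slices in the diagram.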

Streaming to HDFS

Real-Time Analytics in Hadoop Today
RT Detection in the Real World = Storage Complexity
[Diagram labels: Incoming Data (Messaging System); New Partition; Most Recent Partition; Historic Data; HBase; Parquet File; Impala, Spark, Hive on HDFS]
• Wait for running operations to complete
• Define new Impala partition referencing the newly written Parquet file
[Diagram labels: Logs; Spark Streaming OR Cron Job; /newdata/smallfile; /yesterday/largefile]

Real-Time Analytics in Hadoop with Kudu
Simpler Architecture, Superior Performance over Hybrid Approaches
[Diagram labels: Incoming Data (Messaging System); Impala, Spark on Kudu; Reporting Request]

Demo
Using Netflow Data & Detecting a DDOS Attack

Questions?

Realtime Detection of DDOS attacks using Apache Spark and MLLib
