In this talk we will show how Hadoop ecosystem tools like Apache Kafka, Spark, and MLlib can be used in various real-time architectures, and how they can be used to perform real-time detection of a DDoS attack. We will explain some of the challenges in building real-time architectures, then walk through the DDoS detection example and a live demo. This talk is appropriate for anyone interested in security, IoT, Apache Kafka, Spark, or Hadoop.
Presenter Ryan Bosshart is a Systems Engineer at Cloudera and is the first three-time presenter at BigDataMadison!
2. Agenda
• (Near) Real-Time Problems – Actual Cloudera Use-Cases
• Applicable Frameworks and Architectures
• DDOS Example & Code
3. Connected Medical Devices
• Batch:
  – How does a patient's disease progress over time?
  – How does physician training affect disease state?
  – How can we recommend better therapies?
• Real-time:
  – What is the patient's disease state right now?
  – Alert on potential device malfunctions.
4. Connected Cars
• Batch:
  – Manufacturer wants to know optimal charge performance.
• Real-time:
  – Consumer wants to know if their teen is driving the car right now. How fast are they accelerating / driving?
  – Vehicle service – e.g. grab an up-to-date "diagnosis bundle" before service.
6. Netflow Data
Bytes   Contents    Description
0-3     srcaddr     Source IP address
4-7     dstaddr     Destination IP address
8-11    nexthop     IP address of next hop router
12-13   input       SNMP index of input interface
14-15   output      SNMP index of output interface
16-19   dPkts       Packets in the flow
20-23   dOctets     Total number of Layer 3 bytes in the packets of the flow
24-27   first       SysUptime at start of flow
28-31   last        SysUptime at the time the last packet of the flow was received
32-33   srcport     TCP/UDP source port number or equivalent
34-35   dstport     TCP/UDP destination port number or equivalent
36      pad1        Unused (zero) bytes
37      tcp_flags   Cumulative OR of TCP flags
38      prot        IP protocol type (for example, TCP = 6; UDP = 17)
39      tos         IP type of service (ToS)
40-41   src_as      Autonomous system number of the source, either origin or peer
42-43   dst_as      Autonomous system number of the destination, either origin or peer
44      src_mask    Source address prefix mask bits
45      dst_mask    Destination address prefix mask bits
46-47   pad2        Unused (zero) bytes
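The fixed offsets above make v5 flow records easy to decode directly. A minimal sketch in Scala, assuming each record arrives as a 48-byte array (the case class and its field selection are illustrative, not part of the demo code):

```scala
import java.nio.ByteBuffer

case class NetflowV5Record(srcAddr: String, dstAddr: String, dPkts: Long,
                           dOctets: Long, srcPort: Int, dstPort: Int, prot: Int)

object NetflowV5 {
  // Render the 4 bytes at `off` as a dotted-quad IP string.
  private def ip(buf: ByteBuffer, off: Int): String =
    (0 until 4).map(i => buf.get(off + i) & 0xff).mkString(".")

  // Offsets follow the v5 flow-record layout in the table above.
  def parse(bytes: Array[Byte]): NetflowV5Record = {
    val buf = ByteBuffer.wrap(bytes) // network byte order (big-endian) by default
    NetflowV5Record(
      srcAddr = ip(buf, 0),
      dstAddr = ip(buf, 4),
      dPkts   = buf.getInt(16) & 0xffffffffL,  // unsigned 32-bit counter
      dOctets = buf.getInt(20) & 0xffffffffL,
      srcPort = buf.getShort(32) & 0xffff,     // unsigned 16-bit port
      dstPort = buf.getShort(34) & 0xffff,
      prot    = buf.get(38) & 0xff
    )
  }
}
```

The unsigned masking matters: NetFlow counters and ports are unsigned, but the JVM's numeric types are signed.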
7. Ingesting and Processing Netflow Data
IP traffic is generated with BoNeSi (https://github.com/Markus-Go/bonesi) and captured as netflow. The pipeline:
1. Ingest and annotate (IP geolocation)
2. Stage netflow events in pub-sub
3. Store & process: analyze data and train the model
4. Store the model
5. Process real-time events: classify events as DDoS or legit
6. Alert and analyze long term
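Stage 1, annotation, can be sketched as a plain enrichment function. A toy version assuming an in-memory geolocation map (a real pipeline would consult a geolocation database; all names here are illustrative):

```scala
case class FlowEvent(srcIp: String, dPkts: Long)
case class AnnotatedEvent(srcIp: String, country: String, dPkts: Long)

// Stage 1 sketch: enrich each netflow event with IP geolocation.
// The Map stands in for a geolocation lookup service.
def annotate(e: FlowEvent, geo: Map[String, String]): AnnotatedEvent =
  AnnotatedEvent(e.srcIp, geo.getOrElse(e.srcIp, "unknown"), e.dPkts)
```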
8. Need to Process in Different Ways
• Stream ingestion – low-latency persistence to HDFS, HBase, Solr, etc.
• Near real-time processing with external context – alerting, flagging, transforms, filtering.
• Complex near real-time processing – complex aggregations, windowed computations, machine learning, etc.
Need to Persist in Different Ways
• Kafka – pub-sub messaging; fast, scalable, durable.
• Solr – natural-language search; low-latency, scalable.
• HBase – online, real-time gets, puts, micro-scans.
• HDFS – analytical SQL, scans.
16. Pub-Sub
Netflow events, annotated with IP geolocation, flow into the pub-sub layer; from there they feed both batch analysis / model training and real-time classification of events as DDoS or legit.
20. What is Kafka?
• Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
• Kafka is also a pub-sub messaging system.
• Messages can be text (e.g. syslog), but binary is best (preferably Avro!).
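The "partitioned commit log" idea can be sketched in a few lines of plain Scala. This is a toy model, not the Kafka API: each partition is an append-only sequence, a message key is hashed to pick a partition, and the return value is the message's offset within that partition:

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of a Kafka topic with N partitions. Messages with the same key
// always land in the same partition, which preserves per-key ordering.
class ToyTopic(numPartitions: Int) {
  val partitions: Vector[ArrayBuffer[(String, Array[Byte])]] =
    Vector.fill(numPartitions)(ArrayBuffer.empty)

  def append(key: String, value: Array[Byte]): Int = {
    val p = math.abs(key.hashCode % numPartitions)
    partitions(p) += ((key, value))
    partitions(p).size - 1 // offset within the partition
  }
}
```

Keying netflow messages by source IP, for example, would keep each source's events ordered within one partition.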
24. Processing & Consumption
• Kafka stages two feeds: all events and real-time events.
• All events land in HDFS (storage), where Spark trains the model in batch and Impala (SQL) analyzes long-term trends.
• Spark Streaming reads the model and applies it to DStreams of real-time events, emitting alerts and classified IPs.
25. Unification of Batch & Streaming
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with the dataset
kafkaDStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}

Interoperability
// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream
val kafkaStream = KafkaUtils.createDStream(...)
kafkaStream.map { event => model.predict(featurize(event)) }
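The `featurize(event)` helper above is not shown on the slide. One plausible sketch, assuming the features are packet count, byte count, and packet rate derived from the flow record (both the `Flow` type and the feature choice are assumptions for illustration):

```scala
case class Flow(dPkts: Long, dOctets: Long, durationMs: Long)

// Turn a flow into a numeric feature vector for the clustering model.
// A DDoS-indicative flow tends to show a high packet rate.
def featurize(f: Flow): Array[Double] = {
  val seconds = math.max(f.durationMs / 1000.0, 0.001) // avoid divide-by-zero
  Array(f.dPkts.toDouble, f.dOctets.toDouble, f.dPkts / seconds)
}
```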
26. "Micro-batch" Architecture
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

The stream is composed of small (1-10s) batch computations: each batch of the tweets DStream (batch @ t, t+1, t+2, ...) is transformed with flatMap into a batch of the hashTags DStream, which is then saved.
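The micro-batch model can be illustrated without Spark: events are bucketed by arrival window, and each window is processed as one small batch. A toy sketch (the `Event` type and window size are assumptions, not Spark's actual implementation):

```scala
case class Event(timestampMs: Long, tag: String)

// Bucket events into fixed time windows, mimicking how a DStream
// slices a continuous stream into small batches keyed by window index.
def microBatches(events: Seq[Event], windowMs: Long): Map[Long, Seq[Event]] =
  events.groupBy(e => e.timestampMs / windowMs)
```

Each resulting bucket would then be handed to the same batch code (flatMap, save, etc.) shown above.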
28. Real-Time Analytics in Hadoop Today
RT detection in the real world = storage complexity.
• Incoming data arrives from a messaging system; Spark Streaming writes logs to a new partition (e.g. /newdata/smallfile), while historic data sits in Parquet files (e.g. /yesterday/largefile) and the most recent partition may live in HBase.
• A cron job (or a manual step) rolls partitions over: wait for running operations to complete, then define a new Impala partition referencing the newly written Parquet file.
• Impala, Spark, and Hive query the data on HDFS.
29. Real-Time Analytics in Hadoop with Kudu
Simpler architecture, superior performance over hybrid approaches: incoming data from the messaging system is written directly to Kudu, and Impala and Spark serve reporting requests against the same store.