Event Detection
Pipelines with Apache
Kafka
Hadoop Summit, Brussels 2015
Jeff Holoman
2© Cloudera, Inc. All rights reserved.
The “Is this talk interesting enough to sit through?” slide
• How we got here
• Why Kafka
• Use Case
• Challenges
• Kafka in Context
What I’m going to say: Buzzword Bingo!
If I don’t say all of these I owe you a beverage
Kafka
Machine
Learning
Real-time
Delivery Semantics
Spark Streaming
Hadoop
Storm
Durability
Guarantees
Ingest Pipelines
Event Detection
Avro
JSON
3© Cloudera, Inc. All rights reserved.
How we got here
3
Application
RDBMS
We wanted to do some stuff in Hadoop
Hadoop
RDBMS
RDBMS
RDBMS
Application Application Application
Batch
File
transfer
Application
Reporting
4© Cloudera, Inc. All rights reserved.
How we got here
4
Application
RDBMS
We wanted to do some stuff in Hadoop
Hadoop
RDBMS
RDBMS
RDBMS
Application Application Application
Batch
File
transfer
Application
Reporting
5© Cloudera, Inc. All rights reserved.
About Kafka
• Publish/Subscribe Messaging System From LinkedIn
• High throughput (100s of thousands of messages/sec)
• Low latency (sub-second to low seconds)
• Fault-tolerant (Replicated and Distributed)
• Agnostic messaging (producers and consumers have no dependency on each other)
• Standardizes format and delivery
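The deck never shows client code, so here is a minimal, hedged sketch of publishing an event with the Java producer client; the broker list, topic name, key, and payload are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionPublisher {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // assumed broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Key by account ID so all events for an account land on the same partition.
      producer.send(new ProducerRecord<>("transactions", "account-42", "example-payload"))
              .get(); // block until the broker acknowledges (durability is covered later)
    }
  }
}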
6© Cloudera, Inc. All rights reserved.
Kafka decouples data pipelines
Why Kafka
6
Source System Source System Source System Source System
Hadoop Security Systems
Real-time
monitoring
Data Warehouse
Kafka
Producers
Broker
Consumers
7© Cloudera, Inc. All rights reserved.
Use Case
Fraud Detection in Consumer Banking
8© Cloudera, Inc. All rights reserved.
Event Detection - Fraud
• Offline
• Model Building
• Discovery
• Forensics
• Case Management
• Pattern Analysis
• Online
9© Cloudera, Inc. All rights reserved.
Online Mobile ATM POS
Integration
10© Cloudera, Inc. All rights reserved.
Online Mobile ATM POS
Integration
Event Processing
11© Cloudera, Inc. All rights reserved.
Online Mobile ATM POS
Integration
Event Processing
Repository
Reporting Forensics Analytics
12© Cloudera, Inc. All rights reserved.
Online Mobile ATM POS
Integration
Event Processing
Storage
HDFS
Solr
Processing
Impala
Map/Reduce
Spark
3rd Party R, SAS etc
Mainframe/RDBMS
13© Cloudera, Inc. All rights reserved.
Online Mobile ATM POS
Integration
Event Processing
Storage
HDFS
Solr
Processing
Impala
Map/Reduce
Spark
3rd Party
Rules /
Models
Automated &
Manual
Analytical
Adjustments
and Pattern
detection
R, SAS etc
Mainframe/RDBMS
Case Management
14© Cloudera, Inc. All rights reserved.
Event Detection - Fraud
• Offline
• Model Building
• Discovery
• Forensics
• Case Management
• Pattern Analysis
• Online
• Ingest
• Enrichment (Profiles, feature selection, etc.)
• Early warning / detection (model serving / model application)
• Persistence
15© Cloudera, Inc. All rights reserved.
Online Mobile ATM POS
Integration
Event Processing
Repository
Case Management
Reporting Forensics Analytics
Alerting
Reference
Data
Rules /
Models
16© Cloudera, Inc. All rights reserved.
Event Detection
A Concrete Example
17© Cloudera, Inc. All rights reserved.
18© Cloudera, Inc. All rights reserved.
This is not a Data
Science Talk.
But let's talk about it anyway
19© Cloudera, Inc. All rights reserved.
Event Detection
• Attempt to detect if an event of interest has occurred
• Temporal or Spatial (or both)
• High number of non-events creates challenges
• Fraud Detection - semi-supervised ML
• You want to optimize for accuracy but also balance the risk of
false positives
• Very important to monitor the model
20© Cloudera, Inc. All rights reserved.
Generally
• Learn model for an expected signal value
• Calculate a score based on the current event
• Alert (or don’t) on that value
• Simple right?
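To make those three bullets concrete, here is a deliberately simplified, hypothetical scoring sketch. The real system serves semi-supervised ML models (Java/PMML); the class, fields, and z-score heuristic below are illustrative assumptions only.

/** Hypothetical "score then alert" example, not the production model. */
public class SimpleScorer {
  private final double expectedMean;   // learned offline, e.g. average transaction amount
  private final double expectedStdDev; // learned offline
  private final double threshold;      // tuned to balance accuracy against false positives

  public SimpleScorer(double expectedMean, double expectedStdDev, double threshold) {
    this.expectedMean = expectedMean;
    this.expectedStdDev = expectedStdDev;
    this.threshold = threshold;
  }

  /** Score: how many standard deviations the observed value sits from the expected signal. */
  public double score(double observedValue) {
    return Math.abs(observedValue - expectedMean) / expectedStdDev;
  }

  public boolean shouldAlert(double observedValue) {
    return score(observedValue) > threshold;
  }
}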
21© Cloudera, Inc. All rights reserved.
Some Numbers
• No data loss is acceptable
• Event processing must complete ASAP, <500ms
• Support approximately 400M transactions per day in aggregate (roughly 4,600/s on average)
• Highest Volume Flow:
• Current – 2k transactions/s
• Projected – 10k transactions/s
• Each flow has at least three steps
• Adapter, Persistence, Hadoop Persistence
• Most complex with approximately seven steps
22© Cloudera, Inc. All rights reserved.
Technology Stack
23© Cloudera, Inc. All rights reserved.
Online Mobile ATM POS
Spring Integration
Storage
HDFS
Solr
Processing
Impala
Map/Reduce
Spark
3rd Party
R, SAS etc
Mainframe / DB2, Oracle
JVM JVM JVM HBase
RPC
Java API
Flume via
Avro RPC
client (Netty)
Files
Sqoop
Web Applications
JDBC
REST
Java /
PMML
24© Cloudera, Inc. All rights reserved.
JVM 1
JVM 2
JVM N
Host 1
JVM 1
Host 2
JVM 1
JVM 2
JVM N
Host 3
JVM 1
Host 4
Agent 1-N
Prod Edge Node
File Channel
Storage Processing
HDFS
Impala
Map/Reduce
Spark
Production Hadoop
Agent 1-N
Prod Edge Node
File Channel
Storage Processing
HDFS
Impala
Map/Reduce
Spark
DR Hadoop
DR Edge Node
File Channel
Agent 1-N
DR Edge Node
File Channel
Agent 1-N
Prod Edge Node
File Channel
Agent 1-N
Prod Edge Node
File Channel
Agent 1-N
Flume
25© Cloudera, Inc. All rights reserved.
Challenges
• Fraud prevention is very difficult due to response time requirements.
26© Cloudera, Inc. All rights reserved.
Fraud Processing System
Response time: ~50 ms → >500 ms → >30,000 ms → >90,000 ms
Prevention (Difficulty: High) sits at the fast end; Detection (Difficulty: Low(er)) at the slow end
27© Cloudera, Inc. All rights reserved.
Challenges
• Fraud prevention is very difficult due to response time requirements.
• Disruptions in downstream systems can impact actual processing.
• Problems with HDFS, the network, the SAN, agents, etc.
• Integrating data across multiple systems increases complexity
• Other systems want / need the data.
• System has all of the transactions! Can be used for Customer Events, Analytics
etc.
• Tracking data and metrics is difficult with different protocols
• We need to true up the transaction data with what ends up in HDFS
28© Cloudera, Inc. All rights reserved.
Incoming
Events
Storage
HDFS
SolR
Processing
Impala
MR
Spark
3rd Party
Event Processing
JVM JVM JVM
HBase
Kafka
Kafka
Kafka
Model Serving
Outgoing Events
Model Building
Repository
JVM JVM JVM
Flows (numbered callouts in the original diagram): Txn In, Txn Updates, All Events, Txn Out, Alerts, Case / Alert Management, Other
29© Cloudera, Inc. All rights reserved.
JVM 1
JVM 2
JVM N
Host 1
JVM 1
Host 2
JVM 1
JVM 2
JVM N
Host 3
JVM 1
Host 4
Agent 1-N
Prod Edge Node
File Channel
Storage Processing
HDFS
Impala
Map/Reduce
Spark
Production Hadoop
Agent 1-N
Prod Edge Node
File Channel
Storage Processing
HDFS
Impala
Map/Reduce
Spark
DR Hadoop
Kafka
Kafka Cluster
Broker 1
Broker 2
Broker 3
Broker N
30© Cloudera, Inc. All rights reserved.
Kafka - Considerations
• Data Exchange
31© Cloudera, Inc. All rights reserved.
Data Exchange in Distributed Architectures
• Multiple systems interacting together benefit from a common data exchange
format.
• Choosing the correct standard can significantly impact application design and TCO
Client Client
serialize
serialize
deserialize
deserialize
Common Data Format
32© Cloudera, Inc. All rights reserved.
Goals
• Simple
• Flexible
• Efficient
• Change Tolerant
• Interoperable
As systems become more complex, data endpoints need to be decoupled
33© Cloudera, Inc. All rights reserved.
He means
traffic lights
34© Cloudera, Inc. All rights reserved.
Use Avro
• A data serialization system
• Data always* accompanied by a schema
• Provides
• A compact, fast, binary data format
• A container file to store persistent data
• Remote Procedure Call (RPC)
• Simple integration with dynamic languages
• Schema Evolution
• Similar to Thrift or Protocol Buffers but differs by
• Dynamic typing
• Untagged data
• No manually-assigned field IDs:
• When a schema changes, both the old and new schema are always present when processing
data, so differences may be resolved symbolically, using field names.
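As a quick illustration of the compact binary format and of data travelling with its schema, here is a hedged sketch of serializing a record with Avro's GenericRecord API; the two-field transaction schema is an assumption, and the real schemas are much richer.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
  // Illustrative schema only.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Txn\",\"fields\":["
      + "{\"name\":\"accountId\",\"type\":\"string\"},"
      + "{\"name\":\"amount\",\"type\":\"double\"}]}");

  public static byte[] serialize(String accountId, double amount) throws Exception {
    GenericRecord txn = new GenericData.Record(SCHEMA);
    txn.put("accountId", accountId);
    txn.put("amount", amount);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(SCHEMA).write(txn, encoder);
    encoder.flush();
    return out.toByteArray(); // compact binary payload; field names never go on the wire
  }
}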
35© Cloudera, Inc. All rights reserved.
Schema Registry
• Use a Schema Registry / Repository
• There are open-source options out there
• Exposes a REST interface
• Backend storage can be just about anything
• Can be heavily customized for your environment
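The slide doesn't name a specific registry, so the interface below is only a hypothetical sketch of the operations these registries typically expose over REST: register a schema under a subject and look one up later by ID.

/** Hypothetical client shape; not the API of any particular registry. */
public interface SchemaRegistryClient {
  /** Register a schema under a subject (e.g. a topic name) and get back its ID. */
  int register(String subject, org.apache.avro.Schema schema);

  /** Fetch the schema for the ID that was shipped alongside the message. */
  org.apache.avro.Schema getById(int schemaId);
}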
36© Cloudera, Inc. All rights reserved.
Deploying Kafka
• Data Exchange
• Provide Common Libraries
• There are a number of Kafka clients out there… standardize and develop a
producer / consumer library that is consistent so developers aren’t reinventing
the wheel
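What such a shared library looks like is organization-specific; the interface below is a hypothetical sketch of the idea: application teams code against one thin API while the library applies the standard configuration, Avro serialization, and auditing behind the scenes.

/** Hypothetical in-house abstraction; names and signatures are assumptions. */
public interface EventBus {
  /** Publish an Avro-encoded event with the house-standard producer settings. */
  void publish(String topic, String key, byte[] avroPayload);

  /** Subscribe a handler; the library manages offsets and the de-duplication policy. */
  void subscribe(String topic, java.util.function.BiConsumer<String, byte[]> handler);
}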
37© Cloudera, Inc. All rights reserved.
Deploying Kafka
• Data Exchange
• Provide Common Libraries
• Understand Durability Guarantees and Delivery Semantics
38© Cloudera, Inc. All rights reserved.
• Producers can choose to trade throughput for durability of writes:
• A sane configuration:
Durable Writes
Durability | Behaviour                         | Per-Event Latency | Required Acknowledgements (request.required.acks)
Highest    | ACK all ISRs have received        | Highest           | -1
Medium     | ACK once the leader has received  | Medium            | 1
Lowest     | No ACKs required                  | Lowest            | 0
Property              | Value
replication           | 3
min.insync.replicas   | 2
request.required.acks | -1
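A hedged sketch of how this "sane configuration" maps onto client code: the replication factor and min.insync.replicas are set on the topic (or broker), while the acknowledgement level is a producer property. The older Scala producer calls it request.required.acks; the newer Java producer calls it acks.

import java.util.Properties;

public class DurableProducerConfig {
  /** Producer settings for the durable-writes row of the table above. */
  public static Properties durableProducerProps(String brokers) {
    Properties props = new Properties();
    props.put("bootstrap.servers", brokers);
    // "all" on the newer Java producer is equivalent to request.required.acks=-1 on the
    // older one: the leader waits for every in-sync replica before acknowledging.
    props.put("acks", "all");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    return props;
  }
  // Topic-level settings (applied when the topic is created, not in this class):
  //   replication factor = 3, min.insync.replicas = 2
}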
39© Cloudera, Inc. All rights reserved.
Producer Performance – Single Thread
Type           | Records/sec | MB/s | Avg Latency (ms) | Max Latency (ms) | Median Latency (ms) | 95th %tile (ms)
No Replication | 1,100,182   | 104  | 42               | 1070             | 1                   | 362
3x Async       | 1,056,546   | 101  | 42               | 1157             | 2                   | 323
3x Sync        | 493,855     | 47   | 379              | 4483             | 192                 | 1692
40© Cloudera, Inc. All rights reserved.
Delivery Semantics
• At least once
• Messages are never lost but may be redelivered
• At most once
• Messages are lost but never redelivered
• Exactly once
• Messages are delivered once and only once
Much Harder
(Impossible??)
41© Cloudera, Inc. All rights reserved.
Getting Exactly Once Semantics
• Must consider two components
• Durability guarantees when publishing a message
• Durability guarantees when consuming a message
• Producer
• What happens when a produce request was sent but a network error returned
before an ack?
• Use a single writer per partition and check the latest committed value after
network errors
• Consumer
• Include a unique ID (e.g. UUID) and de-duplicate.
• Consider storing offsets with data
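A hedged sketch of the consumer-side half of this: every message carries a producer-stamped unique ID and the consumer drops repeats. A real implementation would keep the seen-ID state (or the offsets themselves) in durable storage such as HBase rather than an in-memory set.

import java.util.HashSet;
import java.util.Set;

public class DedupingHandler {
  private final Set<String> seenEventIds = new HashSet<>();

  /** Returns true if the event was processed, false if it was a redelivered duplicate. */
  public boolean handle(String eventId, byte[] payload) {
    if (!seenEventIds.add(eventId)) {
      return false; // at-least-once delivery redelivered it; skip
    }
    process(payload);
    return true;
  }

  private void process(byte[] payload) {
    // enrichment, scoring, persistence, etc.
  }
}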
42© Cloudera, Inc. All rights reserved.
Deploying Kafka
• Data Exchange
• Provide Common Libraries
• Understand Durability Guarantees and Delivery Semantics
• Build in auditing from the start
• We can use Kafka in-stream to save some reporting and analytics later
• This will increase your development time but pay off in the long run
43© Cloudera, Inc. All rights reserved.
Auditing and Tracking
• Embed timings in the message itself, eg:
{
"name": "timings",
"type": [
"null",
{
"type": "map",
"values": "long"
}
],
"default": null
}
• Adopt LinkedIn-style Auditing
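One way to use that timings field, sketched under the assumption that events move around as Avro GenericRecords: each hop stamps the wall-clock time it touched the event, so per-stage lag can be reported downstream. The stage names and record wiring are illustrative.

import java.util.HashMap;
import java.util.Map;
import org.apache.avro.generic.GenericRecord;

public class TimingStamper {
  @SuppressWarnings("unchecked")
  public static void stamp(GenericRecord event, String stageName) {
    Map<CharSequence, Long> timings = (Map<CharSequence, Long>) event.get("timings");
    if (timings == null) {
      timings = new HashMap<>();
      event.put("timings", timings);
    }
    timings.put(stageName, System.currentTimeMillis()); // e.g. "adapter", "persistence"
  }
}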
44© Cloudera, Inc. All rights reserved.
Deploying Kafka
• Data Exchange
• Provide Common Libraries
• Understand Durability Guarantees and Delivery Semantics
• Build in auditing from the start
• Use Flume for easy ingest into HDFS / Solr
45© Cloudera, Inc. All rights reserved.
Flume (Flafka)
• Source
• Sink
• Channel
46© Cloudera, Inc. All rights reserved.
Flafka
Sources Interceptors Selectors Channels Sinks
Flume Agent
Kafka
HDFS
Kafka Producer
Producer A
Kafka
Kafka
Data Sources: Logs, JMS, WebServer, etc.
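For reference, a hedged sketch of what a Flafka agent definition might look like (property names follow the Flume 1.6-era Kafka source; the agent name, ZooKeeper quorum, topic, and paths are illustrative assumptions):

# Hypothetical agent: Kafka source -> file channel -> HDFS sink
tier1.sources  = kafkaSrc
tier1.channels = fileCh
tier1.sinks    = hdfsSink

# Assumed ZooKeeper quorum and topic name
tier1.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.kafkaSrc.zookeeperConnect = zk1:2181
tier1.sources.kafkaSrc.topic = transactions
tier1.sources.kafkaSrc.channels = fileCh

tier1.channels.fileCh.type = file
tier1.channels.fileCh.checkpointDir = /flume/checkpoint
tier1.channels.fileCh.dataDirs = /flume/data

tier1.sinks.hdfsSink.type = hdfs
tier1.sinks.hdfsSink.hdfs.path = /data/transactions/%Y-%m-%d
tier1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
tier1.sinks.hdfsSink.hdfs.fileType = DataStream
tier1.sinks.hdfsSink.channel = fileCh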
47© Cloudera, Inc. All rights reserved.
Deploying Kafka
• Data Exchange
• Provide Common Libraries
• Understand Durability Guarantees and Delivery Semantics
• Build in auditing from the start
• Use Flume for Easy Ingest to HDFS / Solr
• Benchmark based on your message size
48© Cloudera, Inc. All rights reserved.
Benchmark Results
49© Cloudera, Inc. All rights reserved.
Benchmark Results
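As a starting point for the "benchmark based on your message size" advice, here is a hedged sketch of a tiny producer harness; the broker list, topic, message size, and count are assumptions to replace with your own values, and Kafka's bundled producer performance tool is usually the first stop.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerBenchmark {
  public static void main(String[] args) throws Exception {
    int messageSize = 500;        // bytes: match your real event size
    int messageCount = 1_000_000;
    byte[] payload = new byte[messageSize];

    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");
    props.put("acks", "all");     // benchmark with the durability level you will run with
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

    long start = System.currentTimeMillis();
    try (Producer<String, byte[]> producer = new KafkaProducer<>(props)) {
      for (int i = 0; i < messageCount; i++) {
        producer.send(new ProducerRecord<>("bench-topic", payload)); // async sends
      }
    } // close() waits for all in-flight sends to complete
    long elapsedMs = System.currentTimeMillis() - start;
    System.out.printf("%,d msgs of %d bytes in %d ms (%.0f msgs/s)%n",
        messageCount, messageSize, elapsedMs, messageCount * 1000.0 / elapsedMs);
  }
}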
50© Cloudera, Inc. All rights reserved.
Deploying Kafka
• Data Exchange
• Provide Common Libraries
• Understand Durability Guarantees and Delivery Semantics
• Build in auditing from the start
• Benchmark based on your message size
• Take the time to setup Kafka metrics
51© Cloudera, Inc. All rights reserved.
Things like
• Consumer Lag
• Message in Rate
• Bytes in Rate
• Bytes out Rate
• (you can publish your own as well)
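These are exposed as JMX MBeans on the brokers and clients. Below is a hedged sketch of reading one broker metric directly over JMX; the host, JMX port, and the assumption that JMX is enabled on the broker are illustrative, metric names can vary between Kafka versions, and most deployments feed these into a monitoring system instead of polling by hand.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerMetricsProbe {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url =
        new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection mbs = connector.getMBeanServerConnection();
      ObjectName messagesIn =
          new ObjectName("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
      Object oneMinuteRate = mbs.getAttribute(messagesIn, "OneMinuteRate");
      System.out.println("MessagesInPerSec (1-min rate): " + oneMinuteRate);
    } finally {
      connector.close();
    }
  }
}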
52© Cloudera, Inc. All rights reserved.
Deploying Kafka
• Data Exchange
• Provide Common Libraries
• Understand Durability Guarantees and Delivery Semantics
• Build in auditing from the start
• Benchmark based on your message size
• Take the time to setup Kafka metrics
• Security
53© Cloudera, Inc. All rights reserved.
Security
• Out-of-the-box security is pretty weak
• Currently must rely on network security
• Upcoming improvements add:
• Authentication
• Authorization
• SSL
54© Cloudera, Inc. All rights reserved.
Recap
• Fraud prevention is very difficult due to response time requirements.
• Disruptions in downstream systems can impact actual processing.
• Integrating data across multiple systems increases complexity
• Other systems want / need the data
• Tracking data and metrics is difficult with different protocols
Thank you.


Editor's notes

  1. Good afternoon. Welcome to Event Detection Pipelines with Apache Kafka. Thank you for coming and I hope that the next 30 or so minutes that we have will be informative and enjoyable. Like the other talks here this week in Brussels we have around 40 minutes, so I’m going to get through the content that we have here and then take some questions towards the end. So lets get started
  2. Almost done with the pre-amble. Today We’re going to blah blah blah An
  3. So all of you here are interested in Hadoop and have either deployed it or are thinking about doing so. Most Hadoop use cases I know of started with doing batch ingest from some type of database, doing some ETL offloading usually. Then perhaps we even move things back to some other database for some reporting We of course realize that hadoop is capable of integrating multiple data sources so then we end up integrating with another system or application. And we realize that we can do some reporting directly from hadoop as well. We might even build other applications that pull data from Hadoop. Soon we have a myriad of applications and upstream systems feeding into Hadoop.
  4. But this original box that I drew is a little bit simplified. In reality these applications tend to be tied together. Particularly as organizations move towards services and micro-services, we have interdependencies with one another, and unless we are fairly disciplined, we likely have different ways that these applications talk to one another. If we believe, as I imagine most of us in the audience today do, that data is extremely valuable, we want to make it easy to exchange data within our overall system and also be flexible and nimble in this process. Unfortunately, all too often, our application stack ends up looking something like this, where applications are coupled together tightly and changes in one system can have a drastic impact on other downstream systems. I tend to work with very large-scale enterprises; usually these applications are separated not just by technology, but by political or organizational barriers as well.
  5. Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn. One of the engineers at LinkedIn has said, "if data is the lifeblood of the organization, then Kafka is the circulatory system." Kafka can handle hundreds of thousands of messages per second, if not more, with very low latency, sub-second in many cases. It is also fault-tolerant, as it runs as a cluster of machines and messages are replicated across multiple machines. When I say agnostic messaging, I mean that producers of messages are not concerned with consumers of messages, and vice versa; there is no dependency on each other.
  6. Producers, Broker, Consumers. Importantly, it gives us a solid system on which to standardize our data exchange. As we'll discuss, we use it as the foundation for moving data between our systems, and so it allows us to reuse code and design patterns across our systems.
  7. Today we'll talk about fraud detection. As I mentioned previously, I have the most experience in this space as it relates to consumer banking, but the architecture here could easily be applied to other businesses. It will be applicable whenever we need to build systems that take in data in real time and efficiently ingest it into Hadoop.
  8. When building fraud systems, you can broadly classify them into two categories: the offline aspect and the online aspect. Another way to think about this is that the offline system is human- or operator-driven, while the online system runs in an automated fashion, during the flow of the actual event. I'll briefly cover the offline aspect to show the architecture of a fraud system, and then we'll get into the details of building the online system. Note that this isn't a contrived example; this type of system is in use today in large banks back in the United States.
  9. So we want to build a multi-channel fraud system. In this system we accept input from online transactions, mobile devices, ATMs, and credit and debit cards. Each of these has a different exchange format, so we have an integration layer that is responsible for converting the data feeds into the appropriate formats for processing. More on this a bit later.
  10. The next stage in our system is event processing. In this segment we take in incoming transactions and, based on the information we have, either from the transaction itself or from other data in our systems, we make a decision about the event as it comes in, and this is returned back to the source systems.
  11. Every transaction is then persisted into a repository. The majority of the reporting that we do is focused on a relatively short time window; however, we keep the data forever so that we can do forensics, discovery, and analytics on all of the transaction data.
  12. So in our case, the repository is Hadoop, and forgive me here as I've overlaid system components with functional boxes, but we store all of the transactions in HDFS and also build Solr indexes to allow faceted searching to assist with our forensics.
  13. So the output of our system is really threefold. First, we generate alerts to send over to the case management system. "Fraud" is actually quite broad; a good portion of it is really handling suspected fraud, so we send updates to the case management system and they work through their investigations. The second is end-user access: analysts run Hive queries and Impala queries and use the search GUI to look for patterns and see the incoming data as close to real time as the ingestion rates allow. And finally, we use our Hadoop cluster for two primary actions. First we generate rules to feed into a rules-engine system that is checked during our event processing. Second, we use the system to build our ML models and fit them with the appropriate parameters; for this we use SAS, or perhaps R, or whatever data analysis tools we need. This brings us to the online system.
  14. So the output of our system is really threefold. First, we generate alerts to send over to the case management system. "Fraud" is actually quite broad; a good portion of it is really handling suspected fraud, so we send updates to the case management system and they work through their investigations. The second is end-user access: analysts run Hive queries and Impala queries and use the search GUI to look for patterns and see the incoming data as close to real time as the ingestion rates allow. And finally, we use our Hadoop cluster for two primary actions. First we generate rules to feed into a rules-engine system that is checked during our event processing. Second, we use the system to build our ML models and fit them with the appropriate parameters; for this we use SAS, or perhaps R, or whatever data analysis tools we need. This brings us to the online system.
  15. This might not be the place to put this slide in.
  16. This might not be the place to put this slide in.
  17. If only it were as easy as just dropping in Kafka and making all of our problems go away.
  18. If only it were as easy as just dropping in Kafka and making all of our problems go away.
  19. If only it were as easy as just dropping in Kafka and making all of our problems go away.
  20. Replication -> "all" with min.insync.replicas ... there is a timeout. The single digit
  21. This is doable with an idempotent producer where the producer tracks committed messages within some configurable window
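Kafka itself did not offer an idempotent producer at the time of this talk, but the "track committed messages within some configurable window" idea can be sketched at the application level, for example with a bounded set of recently seen message IDs. The class name, window size, and ID scheme below are assumptions for illustration:

```java
// Hedged sketch: drop re-delivered messages by remembering recently seen IDs
// inside a fixed-size window (oldest IDs are evicted first).
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class DedupWindow {
  private final Set<String> seen;

  public DedupWindow(final int windowSize) {
    this.seen = Collections.newSetFromMap(new LinkedHashMap<String, Boolean>() {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        return size() > windowSize;   // evict the oldest ID once the window is full
      }
    });
  }

  /** Returns true the first time a message ID is observed inside the window. */
  public boolean firstTime(String messageId) {
    return seen.add(messageId);
  }
}
```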
  22. If only it were as easy as just dropping in Kafka and making all of our problems go away.
  23. If only it were as easy as just dropping in Kafka and making all of our problems go away.
  24. If only it were as easy as just dropping in Kafka and making all of our problems go away.
  25. If only it were as easy as just dropping in Kafka and making all of our problems go away.
  26. If only it were as easy as just dropping in Kafka and making all of our problems go away.