Good afternoon, and welcome to Event Detection Pipelines with Apache Kafka. Thank you for coming; I hope the time we have together will be informative and enjoyable. Like the other talks here this week in Brussels we have around 40 minutes, so I'm going to get through the content and then take some questions towards the end. So let's get started.
So all of you here are interested in Hadoop and have either deployed it or are thinking about doing so.
Most Hadoop use cases I know of started with batch ingest from some type of database, usually doing some ETL offloading. Then perhaps we even move things back to some other database for reporting.
We of course realize that Hadoop is capable of integrating multiple data sources, so then we end up integrating with another system or application.
And we realize that we can do some reporting directly from hadoop as well.
We might even build other applications that pull data from Hadoop.
Soon we have a myriad of applications and upstream systems feeding into Hadoop.
But this original box that I drew is a little bit simplified. In reality these applications tend to be tied together. Particularly as organizations move towards services and microservices, we have interdependencies with one another, and unless we are fairly disciplined, we likely have different ways that these applications talk to one another. If we believe, as I imagine most of us in the audience today do, that data is extremely valuable, we want to make it easy to exchange data within our overall system and also be flexible and nimble in the process.
Unfortunately, all too often, our application stack ends up looking something like this: applications are coupled together tightly, and changes in one system can have drastic impacts on downstream systems. I tend to work with very large-scale enterprises, where these applications are usually separated not just by technology but by political or organizational barriers as well.
Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn. One of the engineers there put it this way: "if data is the lifeblood of the organization, then Kafka is the circulatory system."
Kafka can handle hundreds of thousands of messages per second, if not more, with very low latency, sub-second in many cases. It is also fault-tolerant: it runs as a cluster of machines, and messages are replicated across multiple machines.
When I say agnostic messaging, I mean that producers of messages are not concerned with consumers of messages, and vice versa; there is no dependency between them.
Producers
Broker
Consumers
Importantly, it gives us a solid foundation on which to standardize our data exchange. As we'll discuss, we use it as the basis for moving data between our systems, which lets us reuse code and design patterns across those systems.
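To make the decoupling concrete, here is a toy in-memory sketch of Kafka's log-and-offset model. This is not a real Kafka client and none of these names come from the Kafka API; it only illustrates why producers and consumers have no dependency on each other: each topic is an append-only log, and every consumer tracks its own read position.

```python
# Toy in-memory sketch of the log/offset model -- NOT real Kafka.
class ToyBroker:
    def __init__(self):
        self.topics = {}  # topic name -> append-only list of messages

    def produce(self, topic, message):
        # Producers append and return; they never know who will read this.
        self.topics.setdefault(topic, []).append(message)

    def consume(self, topic, offset):
        # Consumers read from their own offset, independently of producers
        # and of every other consumer.
        log = self.topics.get(topic, [])
        return log[offset:]

broker = ToyBroker()
broker.produce("transactions", {"account": "A-1", "amount": 250})
broker.produce("transactions", {"account": "B-7", "amount": 40})

# Two independent consumers, each with its own offset:
fraud_feed = broker.consume("transactions", offset=0)   # sees both messages
late_reader = broker.consume("transactions", offset=1)  # sees only the second
```

Because the broker owns the log and each consumer owns its offset, we can add a new downstream system later without touching any producer, which is exactly the property we rely on when standardizing data exchange.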
Today we'll talk about fraud detection. As I mentioned previously, I have the most experience in this space as it relates to consumer banking, but the architecture here could easily be applied to other businesses. It is applicable whenever we need to build systems that take data inputs in real time and efficiently ingest them into Hadoop.
When building fraud systems, you can broadly classify them into two categories: the offline aspect and the online aspect. Another way to think about this is that the offline system is human- or operator-driven, while the online system runs in an automated fashion, during the flow of the actual event.
I’ll briefly cover the offline aspect to show the architecture of a fraud system and then we’ll get into the details of building the online system.
Note this isn't a contrived example; this type of system is in use today in large banks back in the United States.
So we want to build a multi-channel fraud system. In this system we accept input from online transactions, mobile devices, ATMs, and credit and debit cards. Each of these has a different exchange format, so we have an integration layer that is responsible for converting the data feeds into the appropriate formats for processing. More on this a bit later.
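A minimal sketch of what that integration layer does, assuming two made-up feed formats; the field names and layouts here are illustrative inventions, not the production schema:

```python
# Each channel arrives in its own format; the integration layer normalizes
# everything to one common event shape before event processing.

def normalize_atm(raw):
    # Hypothetical ATM feed: pipe-delimited "card|amount_cents|terminal".
    card, cents, terminal = raw.split("|")
    return {"channel": "atm", "card": card,
            "amount": int(cents) / 100.0, "location": terminal}

def normalize_online(raw):
    # Hypothetical online feed: already structured, but with different keys.
    return {"channel": "online", "card": raw["card_number"],
            "amount": raw["amt"], "location": raw.get("ip", "unknown")}

event_a = normalize_atm("4111-xxxx|2500|ATM-042")
event_b = normalize_online({"card_number": "4111-xxxx", "amt": 25.0})
```

The payoff is that everything downstream, the event processor, the repository, the reporting, only ever sees the one normalized shape.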
The next stage in our system is event processing. Here we take in incoming transactions and, based on the information we have, either from the transaction itself or from other data in our systems, make a decision about each event as it comes in; that decision is returned to the source systems.
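As a sketch of that decision step, here is a deliberately simple scorer; the rules and thresholds are invented for illustration and are nothing like a production rule set:

```python
# Score a normalized transaction and return a decision to the source system.
def decide(event, recent_amounts):
    # Rule 1: flat threshold on the transaction amount (made-up value).
    if event["amount"] > 10_000:
        return "decline"
    # Rule 2: velocity check against this account's recent activity.
    if recent_amounts and event["amount"] > 5 * max(recent_amounts):
        return "review"
    return "approve"

d1 = decide({"amount": 12_000}, [])             # declines on amount
d2 = decide({"amount": 900}, [50, 80, 120])     # flags an unusual spike
d3 = decide({"amount": 100}, [50, 80, 120])     # approves
```

In the real system these rules come out of the offline side, which we'll get to in a moment, and the decision has to come back within the transaction's latency budget.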
Every transaction is then persisted into a repository. The majority of our reporting focuses on a relatively short time window; however, we keep the data indefinitely so that we can do forensics, discovery, and analytics on all of the transaction data.
In our case, the repository is Hadoop. Forgive me here as I've overlaid system components with functional boxes, but we store all of the transactions in HDFS and also build Solr indexes to allow faceted searching to assist our forensics.
So the output of our system is really threefold.
First, we generate alerts to send over to the case management system. "Fraud" is actually quite broad; a good portion of the work is handling suspected fraud. We send updates to the case management system, and investigators work through their cases.
The second is end-user access. Analysts run Hive and Impala queries and use the search GUI to look for patterns and see the incoming data as close to real time as possible.
And finally, we use our Hadoop cluster for two primary actions. First, we generate rules to feed into a rules engine that is checked during event processing. Second, we use the system to build our ML models and fit them with the appropriate parameters, using SAS, R, or whatever data analysis tools we need. This brings us to the online system.
If only it were as easy as just dropping in Kafka and making all of our problems go away.
Replication: with acks=all, the producer waits for acknowledgment from all in-sync replicas, and min.insync.replicas sets the minimum number of replicas that must be in sync for a write to be accepted. If that acknowledgment doesn't arrive within the request timeout, the send fails and can be retried.
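That durability trade-off is usually expressed through a handful of broker/topic and producer settings along these lines; the values here are illustrative, not recommendations:

```properties
# Topic/broker side: each partition is copied to 3 brokers, and writes are
# only accepted while at least 2 replicas are in sync.
replication.factor=3
min.insync.replicas=2

# Producer side: wait for acknowledgment from all in-sync replicas; if they
# can't respond within the request timeout, the send fails and is retriable.
acks=all
request.timeout.ms=30000
```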
Even with that replication, latencies can stay in the single-digit millisecond range on a well-tuned cluster.
This is doable with an idempotent producer, where the producer tracks committed messages within some configurable window.
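Here is a toy simulation of that idea, tracking recently committed message IDs inside a bounded window so a retry of an already-committed message is not sent twice. This is a sketch of the concept only, not how Kafka's idempotent producer is actually implemented:

```python
from collections import OrderedDict

class DedupWindow:
    """Remember the last `size` committed message IDs; skip resends of those."""
    def __init__(self, size):
        self.size = size
        self.seen = OrderedDict()  # message id -> True, in commit order

    def should_send(self, msg_id):
        if msg_id in self.seen:
            return False  # already committed within the window: skip the retry
        self.seen[msg_id] = True
        if len(self.seen) > self.size:
            self.seen.popitem(last=False)  # evict the oldest id
        return True

window = DedupWindow(size=3)
sends = [window.should_send(i) for i in [1, 2, 2, 3, 4, 1]]
# id 2 is a retry and is skipped; id 1 reappears only after it has
# fallen out of the window, so it is treated as new again.
```

The reappearing-id case is exactly why the window is "configurable": it has to be sized so that legitimate retries always land inside it.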