Streaming Analytics (or Fast Data processing) is becoming an increasingly popular subject in the financial sector. There are two main reasons for this development. First, more and more data has to be analyzed in real time to prevent fraud; all transactions processed by banks have to pass an ever-growing number of tests to make sure that the money is coming from and going to legitimate sources. Second, customers want frictionless mobile experiences while managing their money, such as immediate notifications and personal advice based on their online behavior and other users' actions.
A typical streaming analytics solution follows a ‘pipes and filters’ pattern that consists of three main steps: detecting patterns on raw event data (Complex Event Processing), evaluating the outcomes with the aid of business rules and machine learning algorithms, and deciding on the next action. At the core of this architecture is the execution of predictive models that operate on enormous amounts of never-ending data streams.
In this talk, I’ll present an architecture for streaming analytics solutions that covers many use cases that follow this pattern: actionable insights, fraud detection, log parsing, traffic analysis, factory data, the IoT, and others. I’ll go through a few architectural challenges that arise when dealing with streaming data, such as latency issues, event time vs. server time, and exactly-once processing. The solution is built on the KISSS stack: Kafka, Ignite, and Spark Structured Streaming. The solution is open source and available on GitHub.
5. FAST DATA USE CASES
Sector              Data source             Pattern                          Notification
Finance             Payment data            Fraud detection                  Block money transfer
Finance             Clicks and page visits  Trend analysis                   Actionable insights
Insurance           Page visits             Customer is stuck in a web form  Chat window
Healthcare          Patient data            Heart failure                    Alert doctor
Traffic             Cars passing            Traffic jam                      Update route info
Internet of Things  Machine logs            System failure                   Alert to sys admin
6. FAST DATA PATTERN
The common pattern in all these scenarios (sketched
in code after this list):
1. Detect pattern by combining data (CEP)
2. Determine relevancy (ML)
3. Produce follow-up action
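In code, the shape of this pattern is roughly the following sketch; the types, the three functions, and the threshold are hypothetical placeholders, not part of the Styx code base:
case class Event(customerId: String, payload: String)
case class DetectedPattern(customerId: String, kind: String)
case class Action(customerId: String, message: String)
// hypothetical stubs for the three steps
def detect(events: Seq[Event]): Seq[DetectedPattern] = ??? // 1. CEP over a window of raw events
def relevancy(p: DetectedPattern): Double = ???            // 2. score with an ML model
def act(p: DetectedPattern): Action = ???                  // 3. produce the follow-up action
def pipeline(events: Seq[Event]): Seq[Action] =
  detect(events)
    .filter(p => relevancy(p) > 0.8) // arbitrary relevancy threshold
    .map(act)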
8. THE SOFTWARE STACK
Data stream storage: Kafka
Persisting cache, rules, models, and config:
Cassandra or Ignite
Stream processing: Spark Structured Streaming
Model scoring: PMML and Openscoring.io
13. SPARK-KAFKA INTEGRATION
A Fast Data application is a running job that
processes events in a data store (Kafka)
Jobs can be deployed as ever-running pieces of
software in a big data cluster (Spark)
The basic pattern of a job (put together in the sketch
after this list):
Connect to the stream and consume events
Group and gather events (windowing)
Perform analysis (aggregation) on each window
Write the result to another stream (sink)
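A minimal Spark Structured Streaming skeleton of that pattern could look as follows; the topic names, the one-minute window, and the checkpoint path are placeholder assumptions:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// 1. connect to the stream and consume events
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "transactions")
  .load()

// 2. + 3. group events into windows and aggregate each window
val counts = events
  .withWatermark("timestamp", "1 minute") // allow some slack for late events
  .groupBy(window($"timestamp", "1 minute"))
  .count()

// 4. write the result to another stream (sink)
counts.selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "counts")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
  .awaitTermination()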
14. PARALLELISM
To get high throughput, we have to process the
events in parallel
Parallelism can be configured on cluster level (YARN)
and on job level (number of worker threads)
// job-level parallelism: run locally with 8 worker threads
val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("FraudNumberOfTransactions")

# cluster-level parallelism: set at submit time, here capping dynamic allocation
./bin/spark-submit --name "LowMoneyAlert" --master local[4] \
  --conf "spark.dynamicAllocation.enabled=true" \
  --conf "spark.dynamicAllocation.maxExecutors=2" styx.jar
15. HELLO SPEED!
// connect to Spark
val spark = SparkSession
.builder
.config(conf)
.getOrCreate()
// for using DataFrames
import spark.sqlContext.implicits._
// get the data from Kafka: subscribe to topic
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "transactions")
.option("startingOffsets", "latest")
.load()
21. WINDOWS
In processing infinite streams, we usually look at a
time window
A window can be considered as a bucket of time
There are different types of windows (see the sketch
after this list):
Sliding window
Tumbling window
Session window
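For reference, these window types might look as follows in Spark Structured Streaming, assuming df is a streaming DataFrame with an event_time column (session_window requires Spark 3.2+; the durations are placeholders):
import org.apache.spark.sql.functions.{window, session_window}

// tumbling window: fixed, non-overlapping 10-minute buckets
df.groupBy(window($"event_time", "10 minutes")).count()

// sliding window: 10-minute buckets re-evaluated every 5 minutes (overlapping)
df.groupBy(window($"event_time", "10 minutes", "5 minutes")).count()

// session window: the bucket stays open as long as events keep arriving
// within a 5-minute gap of each other
df.groupBy(session_window($"event_time", "5 minutes")).count()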
23. WINDOW CONSIDERATIONS
Size: large windows lead to big state and long
calculations
Number: many windows (e.g. sliding, session) lead to
more calculations
Evaluation: do all calculations within one window, or
keep a cache across multiple windows (e.g. when
comparing windows, like in trend analysis)
Timing: events for a window can appear early or late
24. WINDOWS
Example: sliding window of 1 day, evaluated every 15
minutes over the field 'customer_id'. The event time is
stored in the field 'transaction_time'
// aggregate, produces a sql.DataFrame
val windowedTransactions = transactionStream
.groupBy(
window($"transaction_time", "1 day", "15 minutes"),
$"customer_id")
.agg(count("t_id") as "count", $"customer_id", $"window.end")
26. WATERMARKS
Watermarks are timestamps that trigger the
computation of the window
They are generated at a time that allows a bit of slack
for late events
Any event that reaches the processor later than the
watermark, but with an event time that should
belong to the former window, is ignored
27. EVENT TIME AND WATERMARKS
Example: sliding window of 60 seconds, evaluated
every 30 seconds. The watermark is set at 1 second,
giving all events some time to arrive.
val windowedTransactions = transactionStream
  // the watermark must be set on the same event-time column as the window
  .withWatermark("transaction_time", "1 second")
  .groupBy(
    window($"transaction_time", "60 seconds", "30 seconds"),
    $"customer_id")
  .agg(...) // e.g. count/sum/...
29. FAULT-TOLERANCE AND CHECKPOINTING
Data is in one of three stages:
Unprocessed ⇛ Kafka consumers provide offsets
that guarantee no data loss for unprocessed data
In transit ⇛ data can be preserved in a checkpoint,
to reload and replay it after a crash
Processed ⇛ Kafka provides an acknowledgement
once data is written
30. SINK THE OUTPUT TO KAFKA
businessEvents
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "business_events")
  .option("checkpointLocation", "/hdfs/checkpoint")
  .start() // this triggers the start of the streaming query
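One caveat: the Kafka sink expects each output row to carry a value column (and optionally a key). A minimal sketch of how businessEvents could be prepared from the windowed aggregation above, with placeholder column names:
val businessEvents = windowedTransactions.selectExpr(
  "CAST(customer_id AS STRING) AS key", // optional Kafka message key
  "to_json(struct(*)) AS value")        // required message payload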
33. MODEL SCORING
To determine the follow-up action of an aggregated
business event (e.g. a detected pattern), we have to
enrich the event with customer data
The resulting data object contains the characteristics
(features) as input for a model
To get the features and score the model, efficiency
plays a role again (see the cache sketch below):
Direct database call > API call
In-memory model cache > model on disk
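A minimal sketch of such an in-memory model cache, assuming the JPMML evaluator library is on the classpath; the /models directory is a placeholder:
import java.io.File
import org.jpmml.evaluator.{Evaluator, LoadingModelEvaluatorBuilder}
import scala.collection.concurrent.TrieMap

object ModelCache {
  private val models = TrieMap.empty[String, Evaluator]

  // parse each PMML file once; later lookups hit memory, not disk
  def get(name: String): Evaluator =
    models.getOrElseUpdate(name,
      new LoadingModelEvaluatorBuilder()
        .load(new File(s"/models/$name.pmml")) // model location is an assumption
        .build())
}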
34. PMML
PMML is the glue between data science and data
engineering
Data scientists can export their machine learning
models to PMML (or PFA) format
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

events_df = pandas.read_csv("events.csv")
pipeline = PMMLPipeline(...)
pipeline.fit(events_df, events_df["notifications"])
sklearn2pmml(pipeline, "LogisticRegression.pmml", with_repr = True)
36. MODEL SCORING
The models can be loaded into memory to enable
split-second performance
By applying map functions over the events we can
process/transform the data in the windows (sketched
after this list):
1. enrich each business event by getting more data
2. filter events based on selection criteria (rules)
3. score a machine learning model on each event
4. write the outcome to a new event / output stream
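As a sketch, those four steps over one window of aggregated events; every name here (enrich, passesRules, score, model, producer, toRecord) is a hypothetical placeholder, with score being a function like the one on the next slide:
val outcomes = businessEvents
  .map(enrich)                    // 1. add customer data to each event
  .filter(passesRules)            // 2. keep only events matching the selection rules
  .map(e => (e, score(e, model))) // 3. score the ML model per event

// 4. write each outcome to the output stream
outcomes.foreach { case (event, relevancy) =>
  producer.send(toRecord(event, relevancy))
}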
37. OPENSCORING.IO
def score(event: RichBusinessEvent, pmmlModel: PmmlModel): Double = {
  val arguments = new util.LinkedHashMap[FieldName, FieldValue]
  for (inputField: InputField <- pmmlModel.getInputFields.asScala) {
    val fieldName = inputField.getField.getName
    // look up each model feature in the enriched event's customer data
    arguments.put(fieldName, inputField.prepare(event.customer.all(fieldName.getValue)))
  }
  // return the notification with a relevancy score
  val results = pmmlModel.evaluate(arguments)
  pmmlModel.getTargetFields.asScala.headOption match {
    case Some(targetField) =>
      results.get(targetField.getName).asInstanceOf[Double]
    case _ => throw new Exception("No valid target")
  }
}
38. ALTERNATIVE STACKS
Message bus               Streaming technology        Database
Kafka                     Spark Structured Streaming  Ignite
Kafka                     Flink                       Cassandra
Azure Event Hubs          Azure Stream Analytics      Cosmos DB
AWS Kinesis Data Streams  AWS Kinesis Data Firehose   DynamoDB
39. WRAP-UP
Fraud detection and actionable insights are examples
of streaming analytics use cases in financial services
The common pattern is: CEP → ML → Notification
Pick the right tools for the job; Kafka, Ignite, and
Spark are amongst the best
Be aware of typical streaming data issues: late
events, state management, windows, etc.
40. THANKS!
Please rate my session on the website or app :)
Read more about streaming analytics at:
The world beyond batch: Streaming 101
Google Dataflow paper
Source code and presentation are available at:
https://github.com/streaming-analytics/Styx