Streaming Analytics (or Fast Data processing) is becoming an increasingly popular subject in the financial sector. There are two main reasons for this development. First, more and more data has to be analyzed in real time to prevent fraud; all transactions processed by banks have to pass an ever-growing number of tests to make sure that the money is coming from and going to legitimate sources. Second, customers want frictionless mobile experiences while managing their money, such as immediate notifications and personal advice based on their online behavior and other users' actions.
A typical streaming analytics solution follows a ‘pipes and filters’ pattern that consists of three main steps: detecting patterns on raw event data (Complex Event Processing), evaluating the outcomes with the aid of business rules and machine learning algorithms, and deciding on the next action. At the core of this architecture is the execution of predictive models that operate on enormous amounts of never-ending data streams.
In this talk, I’ll present an architecture for streaming analytics solutions that covers many use cases that follow this pattern: actionable insights, fraud detection, log parsing, traffic analysis, factory data, the IoT, and others. I’ll go through a few architectural challenges that arise when dealing with streaming data, such as latency issues, event time vs. server time, and exactly-once processing. The solution is built on the KISSS stack: Kafka, Ignite, and Spark Structured Streaming. The solution is open source and available on GitHub.
5. FAST DATA USE CASES
Sector              Data source             Pattern                          Notification
Finance             Payment data            Fraud detection                  Block money transfer
Finance             Clicks and page visits  Trend analysis                   Actionable insights
Insurance           Page visits             Customer is stuck in a web form  Chat window
Healthcare          Patient data            Heart failure                    Alert doctor
Traffic             Cars passing            Traffic jam                      Update route info
Internet of Things  Machine logs            System failure                   Alert to sys admin
6. FAST DATA PATTERN
The common pattern in all these scenarios (sketched
in code after this list):
1. Detect pattern by combining data (CEP)
2. Determine relevancy (ML)
3. Produce follow-up action
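In code, the shape of this pattern is roughly the following sketch; the types, the three functions, and the threshold are hypothetical placeholders, not part of the Styx code base:
case class Event(customerId: String, payload: String)
case class DetectedPattern(customerId: String, kind: String)
case class Action(customerId: String, message: String)
// hypothetical stubs for the three steps
def detect(events: Seq[Event]): Seq[DetectedPattern] = ??? // 1. CEP over a window of raw events
def relevancy(p: DetectedPattern): Double = ???            // 2. score with an ML model
def act(p: DetectedPattern): Action = ???                  // 3. produce the follow-up action
def pipeline(events: Seq[Event]): Seq[Action] =
  detect(events)
    .filter(p => relevancy(p) > 0.8) // arbitrary relevancy threshold
    .map(act)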
8. THE SOFTWARE STACK
Data stream storage: Kafka
Persisting cache, rules, models, and config:
Cassandra or Ignite
Stream processing: Spark Structured Streaming
Model scoring: PMML and Openscoring.io
13. SPARK-KAFKA INTEGRATION
A Fast Data application is a running job that
processes events in a data store (Kafka)
Jobs can be deployed as ever-running pieces of
software in a big data cluster (Spark)
The basic pattern of a job (put together in the sketch
after this list):
Connect to the stream and consume events
Group and gather events (windowing)
Perform analysis (aggregation) on each window
Write the result to another stream (sink)
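A minimal Spark Structured Streaming skeleton of that pattern could look as follows; the topic names, the one-minute window, and the checkpoint path are placeholder assumptions:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// 1. connect to the stream and consume events
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "transactions")
  .load()

// 2. + 3. group events into windows and aggregate each window
val counts = events
  .withWatermark("timestamp", "1 minute") // allow some slack for late events
  .groupBy(window($"timestamp", "1 minute"))
  .count()

// 4. write the result to another stream (sink)
counts.selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "counts")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
  .awaitTermination()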
14. PARALLELISM
To get high throughput, we have to process the
events in parallel
Parallelism can be configured on cluster level (YARN)
and on job level (number of worker threads)
// job-level parallelism: run locally with 8 worker threads
val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("FraudNumberOfTransactions")

# cluster-level parallelism: set at submit time, here capping dynamic allocation
./bin/spark-submit --name "LowMoneyAlert" --master local[4] \
  --conf "spark.dynamicAllocation.enabled=true" \
  --conf "spark.dynamicAllocation.maxExecutors=2" styx.jar
15. HELLO SPEED!
// connect to Spark
val spark = SparkSession
.builder
.config(conf)
.getOrCreate()
// for using DataFrames
import spark.sqlContext.implicits._
// get the data from Kafka: subscribe to topic
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "transactions")
.option("startingOffsets", "latest")
.load()
21. WINDOWS
In processing infinite streams, we usually look at a
time window
A window can be considered as a bucket of time
There are different types of windows (see the sketch
after this list):
Sliding window
Tumbling window
Session window
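For reference, these window types might look as follows in Spark Structured Streaming, assuming df is a streaming DataFrame with an event_time column (session_window requires Spark 3.2+; the durations are placeholders):
import org.apache.spark.sql.functions.{window, session_window}

// tumbling window: fixed, non-overlapping 10-minute buckets
df.groupBy(window($"event_time", "10 minutes")).count()

// sliding window: 10-minute buckets re-evaluated every 5 minutes (overlapping)
df.groupBy(window($"event_time", "10 minutes", "5 minutes")).count()

// session window: the bucket stays open as long as events keep arriving
// within a 5-minute gap of each other
df.groupBy(session_window($"event_time", "5 minutes")).count()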
23. WINDOW CONSIDERATIONS
Size: large windows lead to big state and long
calculations
Number: many windows (e.g. sliding, session) lead to
more calculations
Evaluation: do all calculations within one window, or
keep a cache across multiple windows (e.g. when
comparing windows, like in trend analysis)
Timing: events for a window can appear early or late
24. WINDOWS
Example: sliding window of 1 day, evaluated every 15
minutes over the field 'customer_id'. The event time is
stored in the field 'transaction_time'
// aggregate, produces a sql.DataFrame
val windowedTransactions = transactionStream
.groupBy(
window($"transaction_time", "1 day", "15 minutes"),
$"customer_id")
.agg(count("t_id") as "count", $"customer_id", $"window.end")
26. WATERMARKS
Watermarks are timestamps that trigger the
computation of the window
They are generated at a time that allows a bit of slack
for late events
Any event that reaches the processor later than the
watermark, but with an event time that should
belong to the former window, is ignored
27. EVENT TIME AND WATERMARKS
Example: sliding window of 60 seconds, evaluated
every 30 seconds. The watermark is set at 1 second,
giving all events some time to arrive.
val windowedTransactions = transactionStream
  // the watermark must be set on the same event-time column as the window
  .withWatermark("transaction_time", "1 second")
  .groupBy(
    window($"transaction_time", "60 seconds", "30 seconds"),
    $"customer_id")
  .agg(...) // e.g. count/sum/...
29. FAULT-TOLERANCE AND CHECKPOINTING
Data is in one of three stages:
Unprocessed ⇛ Kafka consumers provide offsets
that guarantee no data loss for unprocessed data
In transit ⇛ data can be preserved in a checkpoint,
to reload and replay it after a crash
Processed ⇛ Kafka provides an acknowledgement
once data is written
30. SINK THE OUTPUT TO KAFKA
businessEvents
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "business_events")
  .option("checkpointLocation", "/hdfs/checkpoint")
  .start() // this triggers the start of the streaming query
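One caveat: the Kafka sink expects each output row to carry a value column (and optionally a key). A minimal sketch of how businessEvents could be prepared from the windowed aggregation above, with placeholder column names:
val businessEvents = windowedTransactions.selectExpr(
  "CAST(customer_id AS STRING) AS key", // optional Kafka message key
  "to_json(struct(*)) AS value")        // required message payload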
33. MODEL SCORING
To determine the follow-up action of an aggregated
business event (e.g. a detected pattern), we have to
enrich the event with customer data
The resulting data object contains the characteristics
(features) as input for a model
To get the features and score the model, efficiency
plays a role again (see the cache sketch below):
Direct database call > API call
In-memory model cache > model on disk
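A minimal sketch of such an in-memory model cache, assuming the JPMML evaluator library is on the classpath; the /models directory is a placeholder:
import java.io.File
import org.jpmml.evaluator.{Evaluator, LoadingModelEvaluatorBuilder}
import scala.collection.concurrent.TrieMap

object ModelCache {
  private val models = TrieMap.empty[String, Evaluator]

  // parse each PMML file once; later lookups hit memory, not disk
  def get(name: String): Evaluator =
    models.getOrElseUpdate(name,
      new LoadingModelEvaluatorBuilder()
        .load(new File(s"/models/$name.pmml")) // model location is an assumption
        .build())
}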
34. PMML
PMML is the glue between data science and data
engineering
Data scientists can export their machine learning
models to PMML (or PFA) format
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

events_df = pandas.read_csv("events.csv")
pipeline = PMMLPipeline(...)
pipeline.fit(events_df, events_df["notifications"])
sklearn2pmml(pipeline, "LogisticRegression.pmml", with_repr = True)
36. MODEL SCORING
The models can be loaded into memory to enable
split-second performance
By applying map functions over the events we can
process/transform the data in the windows (sketched
after this list):
1. enrich each business event by getting more data
2. filter events based on selection criteria (rules)
3. score a machine learning model on each event
4. write the outcome to a new event / output stream
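As a sketch, those four steps over one window of aggregated events; every name here (enrich, passesRules, score, model, producer, toRecord) is a hypothetical placeholder, with score being a function like the one on the next slide:
val outcomes = businessEvents
  .map(enrich)                    // 1. add customer data to each event
  .filter(passesRules)            // 2. keep only events matching the selection rules
  .map(e => (e, score(e, model))) // 3. score the ML model per event

// 4. write each outcome to the output stream
outcomes.foreach { case (event, relevancy) =>
  producer.send(toRecord(event, relevancy))
}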
37. OPENSCORING.IO
def score(event: RichBusinessEvent, pmmlModel: PmmlModel): Double = {
  val arguments = new util.LinkedHashMap[FieldName, FieldValue]
  for (inputField: InputField <- pmmlModel.getInputFields.asScala) {
    val fieldName = inputField.getField.getName
    // look up each model feature in the enriched event's customer data
    arguments.put(fieldName, inputField.prepare(event.customer.all(fieldName.getValue)))
  }
  // return the notification with a relevancy score
  val results = pmmlModel.evaluate(arguments)
  pmmlModel.getTargetFields.asScala.headOption match {
    case Some(targetField) =>
      results.get(targetField.getName).asInstanceOf[Double]
    case _ => throw new Exception("No valid target")
  }
}
38. ALTERNATIVE STACKS
Message bus               Streaming technology        Database
Kafka                     Spark Structured Streaming  Ignite
Kafka                     Flink                       Cassandra
Azure Event Hubs          Azure Stream Analytics      Cosmos DB
AWS Kinesis Data Streams  AWS Kinesis Data Firehose   DynamoDB
39. WRAP-UP
Fraud detection and actionable insights are examples
of streaming analytics use cases in financial services
The common pattern is: CEP → ML → Notification
Pick the right tools for the job; Kafka, Ignite, and
Spark are amongst the best
Be aware of typical streaming data issues: late
events, state management, windows, etc.
40. THANKS!
Please rate my session on the website or app :)
Read more about streaming analytics at:
The world beyond batch: Streaming 101
Google Dataflow paper
Source code and presentation are available at:
https://github.com/streaming-analytics/Styx