STREAMING ANALYTICS FOR FINANCIAL ENTERPRISES
Bas Geerdink | October 16, 2019 | Spark + AI Summit
WHO AM I?
{
"name": "Bas Geerdink",
"role": "Technology Lead",
"background": ["Artificial Intelligence",
"Informatics"],
"mixins": ["Software engineering",
"Architecture",
"Management",
"Innovation"],
"twitter": "@bgeerdink",
"linked_in": "bgeerdink"
}
AGENDA
1. Fast Data in Finance
2. Architecture and Technology
3. Deep dive:
   Event Time, Windows, and Watermarks
   Model scoring
4. Wrap-up
BIG DATA
Volume
Variety
Velocity
FAST DATA USE CASES
Sector             | Data source            | Pattern                         | Notification
Finance            | Payment data           | Fraud detection                 | Block money transfer
Finance            | Clicks and page visits | Trend analysis                  | Actionable insights
Insurance          | Page visits            | Customer is stuck in a web form | Chat window
Healthcare         | Patient data           | Heart failure                   | Alert doctor
Traffic            | Cars passing           | Traffic jam                     | Update route info
Internet of Things | Machine logs           | System failure                  | Alert to sys admin
FAST DATA PATTERN
The common pattern in all these scenarios:
1. Detect pattern by combining data (CEP)
2. Determine relevancy (ML)
3. Produce follow-up action
ARCHITECTURE
THE SOFTWARE STACK
Data stream storage: Kafka
Persisting cache, rules, models, and config:
Cassandra or Ignite
Stream processing: Spark Structured Streaming
Model scoring: PMML and Openscoring.io
APACHE SPARK LIBRARIES
STREAMING ARCHITECTURE
DEEP DIVE PART 1
SPARK-KAFKA INTEGRATION
A Fast Data application is a running job that
processes events in a data store (Kafka)
Jobs can be deployed as ever-running pieces of
software in a big data cluster (Spark)
The basic pattern of a job is:
Connect to the stream and consume events
Group and gather events (windowing)
Perform analysis (aggregation) on each window
Write the result to another stream (sink)
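The four steps above can be sketched without a cluster. The following is a minimal plain-Python illustration, not Spark code. The event tuples, the 60-second tumbling window, and the function name `run_job` are assumptions made for this example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling window size


def run_job(events):
    """Count events per customer and window for (customer_id, event_time) tuples."""
    counts = defaultdict(int)
    # 1. consume events from the (here: in-memory) stream
    for customer_id, event_time in events:
        # 2. windowing: assign the event to the tumbling window it falls in
        window_start = event_time - (event_time % WINDOW_SECONDS)
        # 3. aggregation: count events per customer and window
        counts[(customer_id, window_start)] += 1
    # 4. sink: emit one result record per (customer, window)
    return [
        {"customer_id": c, "window_start": w, "count": n}
        for (c, w), n in sorted(counts.items())
    ]


stream = [("alice", 5), ("alice", 42), ("bob", 61), ("alice", 75)]
results = run_job(stream)
```

In a real job the input is unbounded, so results are emitted per window as the stream progresses rather than once at the end.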
PARALLELISM
To get high throughput, we have to process the
events in parallel
Parallelism can be configured on cluster level (YARN)
and on job level (number of worker threads)
// Scala: set the number of local worker threads on job level
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("FraudNumberOfTransactions")
# Shell: submit the job and cap the number of executors
./bin/spark-submit --name "LowMoneyAlert" --master local[4] \
  --conf "spark.dynamicAllocation.enabled=true" \
  --conf "spark.dynamicAllocation.maxExecutors=2" styx.jar
HELLO SPEED!
import org.apache.spark.sql.SparkSession
// connect to Spark
val spark = SparkSession
  .builder
  .config(conf)
  .getOrCreate()
// for using DataFrames (enables the $"column" syntax)
import spark.implicits._
// get the data from Kafka: subscribe to topic
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "transactions")
  .option("startingOffsets", "latest")
  .load()
EVENT TIME
Events occur at a certain time ⇛ event time
... and are processed later ⇛ processing time
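The difference between the two clocks can be made concrete with a small sketch. The `Event` class and the three sample arrivals below are hypothetical; event "b" is assumed to be delayed in transit so that it reaches the processor after "c".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    event_time: int       # when the event occurred (seconds)
    processing_time: int  # when the processor received it (seconds)


# Hypothetical arrivals: "b" occurred before "c" but arrives after it.
arrivals = [
    Event("a", event_time=10, processing_time=11),
    Event("c", event_time=30, processing_time=31),
    Event("b", event_time=20, processing_time=45),
]

# Order in which the processor sees the events
by_processing_time = [e.event_id for e in sorted(arrivals, key=lambda e: e.processing_time)]
# Order in which the events actually happened
by_event_time = [e.event_id for e in sorted(arrivals, key=lambda e: e.event_time)]
# Delay between occurrence and processing, per event
lag = {e.event_id: e.processing_time - e.event_time for e in arrivals}
```

Here the processing order is a, c, b while the event-time order is a, b, c, which is the out-of-orderness a streaming job has to cope with.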
OUT-OF-ORDERNESS
WINDOWS
In processing infinite streams, we usually look at a
time window
A window can be considered as a bucket of time
There are different types of windows:
Sliding window
Tumbling window
Session window
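The three window types differ in how an event's timestamp maps to buckets. The sketch below illustrates that mapping in plain Python with integer timestamps in seconds. The function names are made up for this example and are not part of any streaming API.

```python
def tumbling_window(t, size):
    """Return the single (start, end) window of length `size` that contains t."""
    start = t - (t % size)
    return (start, start + size)


def sliding_windows(t, size, slide):
    """Return all (start, end) windows of length `size`, starting at
    multiples of `slide`, that contain t."""
    last_start = t - (t % slide)
    # a window starting at s contains t when s <= t < s + size
    starts = range(last_start, t - size, -slide)
    return sorted((s, s + size) for s in starts)


def session_windows(times, gap):
    """Group timestamps into sessions; a pause longer than `gap` starts a new one."""
    sessions = []
    for t in sorted(times):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


tumbling = tumbling_window(70, size=60)
sliding = sliding_windows(70, size=60, slide=30)
sessions = session_windows([1, 2, 3, 20, 21, 50], gap=5)
```

A tumbling window puts each event in exactly one bucket, a sliding window puts it in several overlapping buckets, and a session window has no fixed size because it is closed by inactivity.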
WINDOW CONSIDERATIONS
Size: large windows lead to big state and long
calculations
Number: many windows (e.g. sliding, session) lead to
more calculations
Evaluation: do all calculations within one window, or
keep a cache across multiple windows (e.g. when
comparing windows, like in trend analysis)
Timing: events for a window can appear early or late
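To put a number on the "Number" consideration: with sliding windows, every event is evaluated in several overlapping windows. A small sketch of that arithmetic follows. The helper name `windows_per_event` is an assumption for illustration.

```python
from datetime import timedelta


def windows_per_event(size: timedelta, slide: timedelta) -> int:
    """Upper bound on the number of overlapping sliding windows containing
    one event (exact when `size` is a multiple of `slide`)."""
    return -(-size // slide)  # ceiling division on timedeltas


one_day_every_15_min = windows_per_event(timedelta(days=1), timedelta(minutes=15))
one_min_every_30_sec = windows_per_event(timedelta(seconds=60), timedelta(seconds=30))
tumbling = windows_per_event(timedelta(minutes=5), timedelta(minutes=5))
```

A one-day window that slides every 15 minutes places each event in 96 windows, so state size and computation grow accordingly compared with a tumbling window.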
WINDOWS
Example: sliding window of 1 day, evaluated every 15
minutes over the field 'customer_id'. The event time is
stored in the field 'transaction_time'
// aggregate, produces a sql.DataFrame
import org.apache.spark.sql.functions.{count, window}
val windowedTransactions = transactionStream
  .groupBy(
    window($"transaction_time", "1 day", "15 minutes"),
    $"customer_id")
  .agg(count("t_id") as "count")
  // keep the customer, the end of the window, and the count
  .select($"customer_id", $"window.end", $"count")
WATERMARKS
Watermarks are timestamps that trigger the
computation of the window
They are generated at a time that allows a bit of slack
for late events
Any event that reaches the processor later than the
watermark, but with an event time that should
belong to the former window, is ignored
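The late-event rule can be illustrated with a simplified model in which the watermark trails the highest event time seen so far by a fixed delay. This is a sketch of the idea only: real engines typically advance the watermark per batch and compare it against window boundaries. The class name `WatermarkTracker` and the sample timestamps are assumptions.

```python
class WatermarkTracker:
    """Tracks a watermark that trails the maximum event time seen by `delay`."""

    def __init__(self, delay):
        self.delay = delay
        self.max_event_time = None

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def accept(self, event_time):
        """Return True if the event is on time, False if it is too late."""
        wm = self.watermark
        if wm is not None and event_time < wm:
            return False  # older than the watermark: ignored
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return True


tracker = WatermarkTracker(delay=10)
decisions = [(t, tracker.accept(t)) for t in [100, 105, 97, 120, 108, 111]]
```

The event at time 97 is out of order but still within the slack, so it is kept. The event at 108 arrives after the watermark has moved to 110 and is dropped.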
EVENT TIME AND WATERMARKS
Example: sliding window of 60 seconds, evaluated
every 30 seconds. The watermark is set at 1 second,
giving all events some time to arrive.
val windowedTransactions = transactionStream
  // the watermark must be defined on the same event-time column as the window
  .withWatermark("transaction_time", "1 second")
  .groupBy(
    window($"transaction_time", "60 seconds", "30 seconds"),
    $"customer_id")
  .agg(...) // e.g. count/sum/...
FAULT-TOLERANCE AND CHECKPOINTING
Data is in one of three stages:
Unprocessed ⇛ Kafka consumers provide offsets
that guarantee no data loss for unprocessed data
In transit ⇛ data can be preserved in a checkpoint,
to reload and replay it after a crash
Processed ⇛ Kafka provides an acknowledgement
once data is written
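The interplay of offsets and checkpoints can be sketched with a toy job that records its progress in a file and resumes from it after a crash. This is an illustration under simplifying assumptions (a list as the log, a JSON file as the checkpoint, upper-casing as the processing step), not how Spark stores its checkpoints.

```python
import json
import os
import tempfile


def process(log, checkpoint_path, sink, crash_at=None):
    """Process `log` starting at the last checkpointed offset.

    `crash_at` simulates a failure just before handling that offset.
    """
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    for i in range(offset, len(log)):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        sink.append(log[i].upper())  # the processing step
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": i + 1}, f)  # record progress after writing


log = ["a", "b", "c", "d"]
sink = []
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    process(log, checkpoint, sink, crash_at=2)  # first run dies halfway
except RuntimeError:
    pass
process(log, checkpoint, sink)  # restart resumes from the checkpoint
```

Because progress is recorded after the write, a crash between the two steps would replay one record. That gives at-least-once delivery unless the sink is idempotent.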
SINK THE OUTPUT TO KAFKA
businessEvents
  .writeStream // write the streaming DataFrame to a sink
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "business_events")
  .option("checkpointLocation", "/hdfs/checkpoint")
  .start() // this triggers the start of the streaming query
DEEP DIVE PART 2
MODEL SCORING
To determine the follow-up action of an aggregated
business event (e.g. pattern), we have to enrich the
event with customer data
The resulting data object contains the characteristics
(features) as input for a model
To get the features and score the model, efficiency
plays a role again:
Direct database call > API call
In-memory model cache > model on disk
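The benefit of an in-memory model cache can be shown with a small sketch: the expensive load happens once, and every later score reuses the cached object. The toy linear model, the `load_count` counter, and the function names are assumptions for illustration. A real job would parse a PMML file instead.

```python
from functools import lru_cache

load_count = 0  # counts how often the (simulated) disk read happens


@lru_cache(maxsize=8)
def load_model(model_name):
    """Stand-in for parsing a model file from disk; cached after the first call."""
    global load_count
    load_count += 1
    # Hypothetical model: one weight per feature plus a bias.
    return {"name": model_name, "weights": {"amount": 0.5, "count": 2.0}, "bias": -1.0}


def score(model_name, features):
    """Score one feature vector with the cached model."""
    model = load_model(model_name)
    return model["bias"] + sum(w * features[k] for k, w in model["weights"].items())


scores = [score("relevancy", {"amount": 4.0, "count": n}) for n in range(3)]
```

Three events are scored, but the model is loaded only once.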
PMML
PMML is the glue between data science and data
engineering
Data scientists can export their machine learning
models to PMML (or PFA) format
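import pandas
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# load the training data
events_df = pandas.read_csv("events.csv")
# build and fit the pipeline, then export it to a PMML file
pipeline = PMMLPipeline(...)
pipeline.fit(events_df, events_df["notifications"])
sklearn2pmml(pipeline, "LogisticRegression.pmml", with_repr = True)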
MODEL SCORING
The models can be loaded into memory to enable
split-second performance
By applying map functions over the events we can
process/transform the data in the windows:
1. enrich each business event by getting more data
2. filter events based on selection criteria (rules)
3. score a machine learning model on each event
4. write the outcome to a new event / output stream
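The four steps above compose naturally as map and filter functions. The sketch below uses plain Python. The customer lookup table, the age rule, and the toy relevancy formula are all assumptions made for illustration.

```python
# Hypothetical customer data used for enrichment
CUSTOMERS = {
    "c1": {"age": 30, "balance": 50.0},
    "c2": {"age": 17, "balance": 900.0},
}


def enrich(event):
    """1. Add customer data to the business event."""
    return {**event, **CUSTOMERS[event["customer_id"]]}


def passes_rules(event):
    """2. Selection criterion (assumed rule): only adults are notified."""
    return event["age"] >= 18


def score(event):
    """3. Toy relevancy score in place of a real model."""
    relevancy = event["amount"] / (event["balance"] + event["amount"])
    return {**event, "score": relevancy}


def to_output(event):
    """4. Shape the outcome as a new event for the output stream."""
    return {"customer_id": event["customer_id"], "score": round(event["score"], 2)}


events = [
    {"customer_id": "c1", "amount": 50.0},
    {"customer_id": "c2", "amount": 10.0},
]
output = list(map(to_output, map(score, filter(passes_rules, map(enrich, events)))))
```

Only the event for customer c1 reaches the output, because c2 is removed by the rule before scoring.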
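OPENSCORING.IO
import java.util
import scala.collection.JavaConverters._
// FieldName, FieldValue, and InputField come from the JPMML evaluator library;
// PmmlModel and RichBusinessEvent are the application's own types
def score(event: RichBusinessEvent, pmmlModel: PmmlModel): Double = {
  val arguments = new util.LinkedHashMap[FieldName, FieldValue]
  for (inputField: InputField <- pmmlModel.getInputFields.asScala) {
    val fieldName = inputField.getField.getName
    // assumption: the enriched event exposes the customer's features by name
    arguments.put(fieldName,
      inputField.prepare(event.customer.all(fieldName.getValue)))
  }
  // return the relevancy score of the notification
  val results = pmmlModel.evaluate(arguments)
  pmmlModel.getTargetFields.asScala.headOption match {
    case Some(targetField) =>
      results.get(targetField.getName) match {
        case n: java.lang.Number => n.doubleValue()
        case other => throw new Exception(s"Unexpected target value: $other")
      }
    case _ => throw new Exception("No valid target")
  }
}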
ALTERNATIVE STACKS
Message bus              | Streaming technology       | Database
Kafka                    | Spark Structured Streaming | Ignite
Kafka                    | Flink                      | Cassandra
Azure Event Hubs         | Azure Stream Analytics     | Cosmos DB
AWS Kinesis Data Streams | AWS Kinesis Data Firehose  | DynamoDB
WRAP-UP
Fraud detection and actionable insights are examples
of streaming analytics use cases in financial services
The common pattern is: CEP → ML → Notification
Pick the right tools for the job; Kafka, Ignite, and
Spark are amongst the best
Be aware of typical streaming data issues: late
events, state management, windows, etc.
THANKS!
Please rate my session on the website or app :)
Read more about streaming analytics at:
The world beyond batch: Streaming 101
Google Dataflow paper
Source code and presentation are available at:
https://github.com/streaming-analytics/Styx

Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 

Streaming Analytics for Financial Enterprises

  • 1. STREAMING ANALYTICS FOR FINANCIAL ENTERPRISES
    Bas Geerdink | October 16, 2019 | Spark + AI Summit
  • 2. WHO AM I?
    {
      "name": "Bas Geerdink",
      "role": "Technology Lead",
      "background": ["Artificial Intelligence", "Informatics"],
      "mixins": ["Software engineering", "Architecture", "Management", "Innovation"],
      "twitter": "@bgeerdink",
      "linked_in": "bgeerdink"
    }
  • 3. AGENDA
    1. Fast Data in Finance
    2. Architecture and Technology
    3. Deep dive:
       Event Time, Windows, and Watermarks
       Model scoring
    4. Wrap-up
  • 5. FAST DATA USE CASES
    Sector | Data source | Pattern | Notification
    Finance | Payment data | Fraud detection | Block money transfer
    Finance | Clicks and page visits | Trend analysis | Actionable insights
    Insurance | Page visits | Customer is stuck in a web form | Chat window
    Healthcare | Patient data | Heart failure | Alert doctor
    Traffic | Cars passing | Traffic jam | Update route info
    Internet of Things | Machine logs | System failure | Alert to sys admin
  • 6. FAST DATA PATTERN
    The common pattern in all these scenarios:
    1. Detect pattern by combining data (CEP)
    2. Determine relevancy (ML)
    3. Produce follow-up action
  • 8. THE SOFTWARE STACK
    Data stream storage: Kafka
    Persisting cache, rules, models, and config: Cassandra or Ignite
    Stream processing: Spark Structured Streaming
    Model scoring: PMML and Openscoring.io
  • 11. DEEP DIVE PART 1
  • 13. SPARK-KAFKA INTEGRATION
    A Fast Data application is a running job that processes events in a data store (Kafka)
    Jobs can be deployed as ever-running pieces of software in a big data cluster (Spark)
    The basic pattern of a job is:
      Connect to the stream and consume events
      Group and gather events (windowing)
      Perform analysis (aggregation) on each window
      Write the result to another stream (sink)
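The four steps of the job pattern can be sketched in plain Python. This is only an illustration of the idea, not Spark code: the function `run_job`, the `(timestamp, key)` event shape, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def run_job(events, window_seconds):
    """Count events per (window, key); a toy version of consume/window/aggregate/sink."""
    counts = defaultdict(int)
    for timestamp, key in events:                               # 1. consume events
        window_start = timestamp - timestamp % window_seconds   # 2. windowing
        counts[(window_start, key)] += 1                        # 3. aggregation
    return sorted(counts.items())                               # 4. result handed to a sink


# Hypothetical events: (event time in seconds, customer)
events = [(0, "alice"), (10, "alice"), (65, "alice"), (70, "bob")]
results = run_job(events, window_seconds=60)
```

In a real deployment the loop is replaced by the streaming engine and the returned list by a write to an output topic.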
  • 14. PARALLELISM
    To get high throughput, we have to process the events in parallel
    Parallelism can be configured on cluster level (YARN) and on job level (number of worker threads)
      import org.apache.spark.SparkConf

      // run locally with 8 worker threads
      val conf = new SparkConf()
        .setMaster("local[8]")
        .setAppName("FraudNumberOfTransactions")

      ./bin/spark-submit --name "LowMoneyAlert" --master local[4] \
        --conf "spark.dynamicAllocation.enabled=true" \
        --conf "spark.dynamicAllocation.maxExecutors=2" \
        styx.jar
  • 15. HELLO SPEED!
      import org.apache.spark.sql.SparkSession

      // connect to Spark
      val spark = SparkSession
        .builder
        .config(conf)
        .getOrCreate()

      // for using DataFrames
      import spark.sqlContext.implicits._

      // get the data from Kafka: subscribe to topic
      val df = spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "transactions")
        .option("startingOffsets", "latest")
        .load()
  • 18. EVENT TIME
    Events occur at a certain time ⇛ event time
    ... and are processed later ⇛ processing time
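A small illustration of why the distinction matters, as a plain Python sketch. The event, its two timestamps, and the helper `minute_bucket` are made up for this example: an event that occurs just before a minute boundary but is processed just after it lands in a different one-minute bucket depending on which clock is used.

```python
from datetime import datetime, timedelta

# Hypothetical event: occurred at 12:00:58, processed five seconds later
event = {
    "event_time": datetime(2019, 10, 16, 12, 0, 58),
    "processing_time": datetime(2019, 10, 16, 12, 1, 3),
}


def minute_bucket(ts):
    """Truncate a timestamp to the start of its minute."""
    return ts.replace(second=0, microsecond=0)


lag = event["processing_time"] - event["event_time"]
by_event_time = minute_bucket(event["event_time"])
by_processing_time = minute_bucket(event["processing_time"])
```

Bucketing on event time keeps the result independent of network and processing delays, at the cost of having to wait for late events.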
  • 21. WINDOWS
    In processing infinite streams, we usually look at a time window
    A window can be considered as a bucket of time
    There are different types of windows:
      Sliding window
      Tumbling window
      Session window
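The three window types can be sketched with plain Python functions over integer timestamps. The function names and the half-open `[start, end)` convention are assumptions of this sketch, not an API of Spark or any other engine.

```python
def tumbling_window(ts, size):
    """The single fixed, non-overlapping window [start, end) that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """All overlapping windows [start, start + size) that contain ts."""
    windows = []
    start = ts - ts % slide          # latest window start at or before ts
    while start > ts - size:         # stop once the window no longer covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group timestamps into sessions; a new session starts after a pause longer than gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1][1] = ts     # extend the current session
        else:
            sessions.append([ts, ts])
    return [tuple(s) for s in sessions]
```

For example, `tumbling_window(65, 60)` gives one window, `sliding_windows(65, 60, 30)` gives two overlapping ones, and `session_windows([1, 2, 3, 20, 21], gap=5)` splits the events into two sessions.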
  • 23. WINDOW CONSIDERATIONS
    Size: large windows lead to big state and long calculations
    Number: many windows (e.g. sliding, session) lead to more calculations
    Evaluation: do all calculations within one window, or keep a cache across multiple windows (e.g. when comparing windows, like in trend analysis)
    Timing: events for a window can appear early or late
  • 24. WINDOWS
    Example: sliding window of 1 day, evaluated every 15 minutes over the field 'customer_id'. The event time is stored in the field 'transaction_time'
      import org.apache.spark.sql.functions.{count, window}

      // aggregate, produces a sql.DataFrame
      val windowedTransactions = transactionStream
        .groupBy(
          window($"transaction_time", "1 day", "15 minutes"),
          $"customer_id")
        .agg(count("t_id") as "count", $"customer_id", $"window.end")
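A quick back-of-the-envelope check of what this configuration costs, as a Python sketch (the variable names are chosen for this example): with a 1-day window that slides every 15 minutes, each event falls into many overlapping windows, which is the "Number" consideration in practice.

```python
from datetime import timedelta

window_size = timedelta(days=1)
slide = timedelta(minutes=15)

# Each event belongs to size / slide overlapping sliding windows
windows_per_event = window_size // slide

# A tumbling window of the same size would assign each event to exactly one window
tumbling_windows_per_event = window_size // window_size
```

Here `windows_per_event` is 96, so every transaction updates 96 window states per customer.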
  • 26. WATERMARKS
    Watermarks are timestamps that trigger the computation of the window
    They are generated at a time that allows a bit of slack for late events
    Any event that reaches the processor later than the watermark, but with an event time that should belong to the former window, is ignored
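The effect of a watermark on late events can be sketched in plain Python. This is a simplification made for this example: the watermark is taken as the highest event time seen so far minus an allowed delay and is updated per event, whereas a real engine typically updates it per batch and drops data per window. The function name `apply_watermark` is hypothetical.

```python
def apply_watermark(event_times, delay):
    """Split event times (in arrival order) into accepted and dropped."""
    max_event_time = float("-inf")
    accepted, dropped = [], []
    for event_time in event_times:
        watermark = max_event_time - delay
        if event_time < watermark:
            dropped.append(event_time)       # too late: older than the watermark
        else:
            accepted.append(event_time)
            max_event_time = max(max_event_time, event_time)
    return accepted, dropped


# Event 14 arrives after event 20 and is more than 5 time units behind it
accepted, dropped = apply_watermark([10, 12, 11, 20, 14, 19], delay=5)
```

Event 11 is out of order but within the slack, so it is kept; event 14 is behind the watermark of 15 and is dropped.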
  • 27. EVENT TIME AND WATERMARKS
    Example: sliding window of 60 seconds, evaluated every 30 seconds. The watermark is set at 1 second, giving all events some time to arrive.
      // the watermark is defined on the same event-time column as the window
      val windowedTransactions = transactionStream
        .withWatermark("transaction_time", "1 second")
        .groupBy(
          window($"transaction_time", "60 seconds", "30 seconds"),
          $"customer_id")
        .agg(...) // e.g. count/sum/...
  • 29. FAULT-TOLERANCE AND CHECKPOINTING
    Data is in one of three stages:
      Unprocessed ⇛ Kafka consumers provide offsets that guarantee no data loss for unprocessed data
      In transit ⇛ data can be preserved in a checkpoint, to reload and replay it after a crash
      Processed ⇛ Kafka provides an acknowledgement once data is written
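The interplay of offsets and checkpoints can be sketched in plain Python. Everything here is an assumption made for illustration: the list `log` stands in for a Kafka topic, the list `sink` for the output stream, and a small JSON file for the checkpoint.

```python
import json
import os
import tempfile


def load_offset(path):
    """Return the last checkpointed offset, or 0 when no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["offset"]


def save_offset(path, offset):
    with open(path, "w") as f:
        json.dump({"offset": offset}, f)


def run(log, sink, checkpoint_path, crash_at=None):
    """Process records from the checkpointed offset onward."""
    for position in range(load_offset(checkpoint_path), len(log)):
        if position == crash_at:
            raise RuntimeError("simulated crash")
        sink.append(log[position].upper())          # write the result
        save_offset(checkpoint_path, position + 1)  # then record progress


log, sink = ["a", "b", "c", "d"], []
checkpoint = os.path.join(tempfile.mkdtemp(), "offset.json")
try:
    run(log, sink, checkpoint, crash_at=2)   # crashes before record 2
except RuntimeError:
    pass
run(log, sink, checkpoint)                   # restart resumes at offset 2
```

Because the result is written before the offset is saved, a crash between those two steps would replay one record after restart, so this sketch gives at-least-once rather than exactly-once delivery.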
  • 30. SINK THE OUTPUT TO KAFKA
      businessEvents
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "business_events")
        .option("checkpointLocation", "/hdfs/checkpoint")
        .start() // this triggers the start of the streaming query
  • 31. DEEP DIVE PART 2
  • 33. MODEL SCORING
    To determine the follow-up action of an aggregated business event (e.g. pattern), we have to enrich the event with customer data
    The resulting data object contains the characteristics (features) as input for a model
    To get the features and score the model, efficiency plays a role again:
      Direct database call > API call
      In-memory model cache > model on disk
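The "in-memory model cache > model on disk" point can be sketched in plain Python. The loader, the toy linear model, and the counter below are hypothetical stand-ins for reading and parsing a real model file; only `functools.lru_cache` is a real library function.

```python
from functools import lru_cache

load_count = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=None)
def load_model(name):
    """Stand-in for an expensive load from disk; cached after the first call."""
    global load_count
    load_count += 1
    return {"name": name, "weights": (0.5, 0.25)}


def score(name, features):
    """Score a feature vector with a toy linear model."""
    model = load_model(name)
    return sum(w * x for w, x in zip(model["weights"], features))


scores = [score("fraud", (1.0, 2.0)) for _ in range(1000)]
```

After a thousand scoring calls the model has been loaded only once; every later call is served from memory.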
  • 34. PMML
    PMML is the glue between data science and data engineering
    Data scientists can export their machine learning models to PMML (or PFA) format
      import pandas
      from sklearn.linear_model import LogisticRegression
      from sklearn2pmml import sklearn2pmml
      from sklearn2pmml.pipeline import PMMLPipeline

      events_df = pandas.read_csv("events.csv")
      pipeline = PMMLPipeline(...)
      pipeline.fit(events_df, events_df["notifications"])
      # write the fitted pipeline to a PMML file
      sklearn2pmml(pipeline, "LogisticRegression.pmml", with_repr = True)
  • 36. MODEL SCORING
    The models can be loaded into memory to enable split-second performance
    By applying map functions over the events we can process/transform the data in the windows:
    1. enrich each business event by getting more data
    2. filter events based on selection criteria (rules)
    3. score a machine learning model on each event
    4. write the outcome to a new event / output stream
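The four steps can be sketched as a chain of plain Python functions. The event fields (`customer_id`, `count`, `opted_in`), the rule, and the scoring formula are all invented for this illustration; a real job would call the loaded PMML model in step 3.

```python
def enrich(event, customers):
    """Step 1: add customer data to the business event."""
    return {**event, **customers.get(event["customer_id"], {})}


def passes_rules(event):
    """Step 2: hypothetical rule, only notify customers who opted in."""
    return event.get("opted_in", False)


def score(event):
    """Step 3: stand-in for model scoring, returns a relevancy in [0, 1]."""
    return min(1.0, event["count"] / 10)


def process(events, customers):
    """Step 4: collect the scored notifications for the output stream."""
    output = []
    for event in events:
        rich_event = enrich(event, customers)
        if passes_rules(rich_event):
            output.append({"customer_id": rich_event["customer_id"],
                           "relevancy": score(rich_event)})
    return output


events = [{"customer_id": 1, "count": 5}, {"customer_id": 2, "count": 8}]
customers = {1: {"opted_in": True}, 2: {"opted_in": False}}
notifications = process(events, customers)
```

Customer 2 is filtered out by the rule before any scoring happens, which keeps the expensive step off events that cannot lead to a notification.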
  • 37. OPENSCORING.IO
      def score(event: RichBusinessEvent, pmmlModel: PmmlModel): Double = {
        val arguments = new util.LinkedHashMap[FieldName, FieldValue]
        for (inputField: InputField <- pmmlModel.getInputFields.asScala) {
          val fieldName = inputField.getField.getName
          // `customer` is not defined on the slide; presumably the customer data of the enriched event
          arguments.put(fieldName, inputField.prepare(customer.all(fieldName.getValue)))
        }
        // return the notification with a relevancy score
        val results = pmmlModel.evaluate(arguments)
        pmmlModel.getTargetFields.asScala.headOption match {
          case Some(targetField) =>
            // assumption: the target value is a plain Double
            results.get(targetField.getName).asInstanceOf[Double]
          case _ => throw new Exception("No valid target")
        }
      }
  • 38. ALTERNATIVE STACKS
    Message bus | Streaming technology | Database
    Kafka | Spark Structured Streaming | Ignite
    Kafka | Flink | Cassandra
    Azure Event Hubs | Azure Stream Analytics | Cosmos DB
    AWS Kinesis Data Streams | AWS Kinesis Data Firehose | DynamoDB
  • 39. WRAP-UP
    Fraud detection and actionable insights are examples of streaming analytics use cases in financial services
    The common pattern is: CEP → ML → Notification
    Pick the right tools for the job; Kafka, Ignite, and Spark are amongst the best
    Be aware of typical streaming data issues: late events, state management, windows, etc.
  • 40. THANKS!
    Please rate my session on the website or app :)
    Read more about streaming analytics at:
      The world beyond batch: Streaming 101
      Google Dataflow paper
    Source code and presentation are available at:
      https://github.com/streaming-analytics/Styx