SlideShare a Scribd company logo
1 of 24
Prediction as a service with
Ensemble Model trained in
SparkML and Python ScikitLearn
on 1Bn observed flight prices daily
Josef Habdank
Lead Data Scientist & Data Platform Architect at INFARE
• jha@infare.com www.infare.com
• www.linkedin.com/in/jahabdank
• @jahabdank
Leading provider of
Airfare Intelligence Solutions
to the Aviation Industry
150 Airlines & Airports 7 offices worldwide
Collect and processes
1.2 billion distinct
airfares daily
https://www.youtube.com/watch?v=h9cQTooY92E
What is this talk about?
• Ensemble approach for large scale
DataScience
– Online learning for huge datasets
– thousands simple models are better
than one very complex
• N-billion rows/day machine learning
system architecture
– Implementation of parallel online
training of tens of thousands of
models in Spark Streaming and
Python ScikitLearn
Ensemble approach on
billions of rows
Batch vs Online model training
• Large variety of options available (historical
reasons)
• Often more accurate
• Does not scale well
• Model might be missing critical latest
information
• Train on microbatches or individual observations
• Relies on Learning Rate and Sample Weighing
• Can be used in horizontally scalable environments
• Model can be as up to date as possible
Bn
can stream
Batch
Online Bn
Mn
Especially critical
in prediction of
volatile signals
split data
into subspaces/
clusters
have one model
per cluster When doing
prediction we know
which model is best
for which data
Do prediction
Ensemble approach to prediction in BigData
traditional ensemble mixes multiple models for one prediction,
here we simply select one best for the data segment
In such
system model
training can
be parallelized
• Entire space consists of 1.2Bn time series
• Best results obtained when division is
done using combination of knowledge
(manual division) and clustering methods
• Optimal number of slices/clusters is
between few thousand to hundreds of
thousands
• Clustering methods need dimensionality
reduction if subspace has still too many
dimensions
Segmenting the space
• Fuzzy clustering method which gives
probability of a point being in the
cluster
• The probability could be used as a
model weight, in case of model
mixing
SparkML: org.apache.spark.ml.clustering.GaussianMixture
http://spark.apache.org/docs/latest/ml-clustering.html#gaussian-mixture-model-gmm
(showing only 2 first parameters)
Gaussian Mixture Model
Gaussian Mixture Model Results
• Classification vs regression
– if your problem can be converted into
classification, try this as a first attempt
• Linear online models in Python:
– sklearn.linear_model.SGDClassifier
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
– sklearn.linear_model.SGDRegressor
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
• Other interesting models:
– whole sklearn.svm package
– Kalman and ARIMA models
– Particle Predictor (wrote own library)
Feature and model selection
• OneHotEncoder:
– can capture any nonlinear behavior
– explodes exponentially dims
– can reduce your problem to a
hypercube
• Try assigning values to labels
which carry information
– ℎ𝑜𝑢𝑟 ∈ 0, 1, … , 23 → ℎ𝑜𝑢𝑟′
∈
0.22, 0.45, … , 0.03
• Try to capture nonlinear behavior
using linear model, by adding
meta-features
Prediction results
good accuracy
(28 day prediction)
can model flat endgame
(in exp. processes it is hard)
Some example of
fair prediction
anomaly,
manual price
change
N-billion rows/day
machine learning architecture
using DataBricks
N-billion rows/day machine learning
architecture using DataBricks
Avro nanobatches
compressed with
Snappy
DataWarehouse
Parquet
Permanent Storage
(blob storage)
Temporary Storage
(message broker)
Training models in parallel
in Spark Streaming
get existing models update models
store models
back in the DB
* we already know the clusters
Grouping in Spark DataFrames with collect_list()
Wrapping model training in UDF
Wrapping model training in UDF
Prepare the Matrix with inputs
for model training
Normalize the data using normalization
defined for this particular cluster
Generate sample weights which enable
controlling the learning rate
Get the current model from DB
(use in memory DB for fast response)
Preform partial fit using the sample weights
Important trick:
Converts numpy.ndarray[numpy.float64]
into native python list[float] which then
can be autoconverted to Spark List[Double]
Register UDF which returns Spark List[Double]
Time Series Prediction as a Service
• Provided the labels identify time series and
lookup the model
• Get historical data (performance is the key)
• Recursively predict next price, shifting the
window for the desired length
• The same workflow for any model:
SGDClassifier, SGDRegressor, ARIMA,
Kalman, Particle Predictor
𝑓
𝑥 𝑛−𝑙
⋮
𝑥 𝑛
= 𝑥 𝑛+1 ⇒ 𝑓
𝑥 𝑛−𝑙+1
⋮
𝑥 𝑛+1
= 𝑥 𝑛+2 ⇒ …
Summary
• Spark + Python is AWESOME for
DataScience 
• Large scale DataScience needs correct
infrastructure (Kafka-Kinesis, Spark
Streaming, in memory DB, Notebooks)
• It is much easier to work with large volumes
of models, then very few ones
• Gaussian Mixture is great for fuzzy
clustering, has very mature and fast
implementations
• Spark DataFrames with UDF can be used to
efficiently palletize the model training in tens
and even hundreds of thousands models
Senior Data Scientist, working with
Apache Spark + Python doing Airfare/price forecasting
Senior Big Data Engineer/Senior Backend Developer,
working with Apache Spark/S3/MemSql/Scala + MicroAPIs
job@infare.com
http://www.infare.com/jobs/
Want to work with cutting edge
100% Apache Spark + Python projects? We are hiring!!!
THANK YOU!
Q/ARemember, we are hiring!
Josef Habdank
Lead Data Scientist & Data Platform Architect at INFARE
• jha@infare.com www.infare.com
• www.linkedin.com/in/jahabdank
• @jahabdank
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn

More Related Content

What's hot

Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Databricks
 

What's hot (20)

Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
 
Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
 Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Spark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf JagermanSpark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf Jagerman
 
Harnessing Big Data with Spark
Harnessing Big Data with SparkHarnessing Big Data with Spark
Harnessing Big Data with Spark
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 

Viewers also liked

SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData
 

Viewers also liked (20)

Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache Mesos
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 

Similar to Prediction as a service with ensemble model in SparkML and Python ScikitLearn

Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
inside-BigData.com
 
Presentation
PresentationPresentation
Presentation
butest
 

Similar to Prediction as a service with ensemble model in SparkML and Python ScikitLearn (20)

Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef Habdank
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 
Auto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADBAuto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADB
 
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData Webinar
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache Spark
 
StackNet Meta-Modelling framework
StackNet Meta-Modelling frameworkStackNet Meta-Modelling framework
StackNet Meta-Modelling framework
 
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
 
Intro to Deep Learning with Keras - using TensorFlow backend
Intro to Deep Learning with Keras - using TensorFlow backendIntro to Deep Learning with Keras - using TensorFlow backend
Intro to Deep Learning with Keras - using TensorFlow backend
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache Spark
 
AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...
AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...
AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cl...
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud Infrastructures
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Presentation
PresentationPresentation
Presentation
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
Advanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowAdvanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflow
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 

Prediction as a service with ensemble model in SparkML and Python ScikitLearn

  • 1. Prediction as a service with Ensemble Model trained in SparkML and Python ScikitLearn on 1Bn observed flight prices daily Josef Habdank Lead Data Scientist & Data Platform Architect at INFARE • jha@infare.com www.infare.com • www.linkedin.com/in/jahabdank • @jahabdank
  • 2. Leading provider of Airfare Intelligence Solutions to the Aviation Industry 150 Airlines & Airports 7 offices worldwide Collect and processes 1.2 billion distinct airfares daily https://www.youtube.com/watch?v=h9cQTooY92E
  • 3. What is this talk about? • Ensemble approach for large scale DataScience – Online learning for huge datasets – thousands simple models are better than one very complex • N-billion rows/day machine learning system architecture – Implementation of parallel online training of tens of thousands of models in Spark Streaming and Python ScikitLearn
  • 5. Batch vs Online model training • Large variety of options available (historical reasons) • Often more accurate • Does not scale well • Model might be missing critical latest information • Train on microbatches or individual observations • Relies on Learning Rate and Sample Weighing • Can be used in horizontally scalable environments • Model can be as up to date as possible Bn can stream Batch Online Bn Mn Especially critical in prediction of volatile signals
  • 6. split data into subspaces/ clusters have one model per cluster When doing prediction we know which model is best for which data Do prediction Ensemble approach to prediction in BigData traditional ensemble mixes multiple models for one prediction, here we simply select one best for the data segment In such system model training can be parallelized
  • 7. • Entire space consists of 1.2Bn time series • Best results obtained when division is done using combination of knowledge (manual division) and clustering methods • Optimal number of slices/clusters is between few thousand to hundreds of thousands • Clustering methods need dimensionality reduction if subspace has still too many dimensions Segmenting the space
  • 8. • Fuzzy clustering method which gives probability of a point being in the cluster • The probability could be used as a model weight, in case of model mixing SparkML: org.apache.spark.ml.clustering.GaussianMixture http://spark.apache.org/docs/latest/ml-clustering.html#gaussian-mixture-model-gmm (showing only 2 first parameters) Gaussian Mixture Model
  • 10. • Classification vs regression – if your problem can be converted into classification, try this as a first attempt • Linear online models in Python: – sklearn.linear_model.SGDClassifier http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html – sklearn.linear_model.SGDRegressor http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html • Other interesting models: – whole sklearn.svm package – Kalman and ARIMA models – Particle Predictor (wrote own library) Feature and model selection • OneHotEncoder: – can capture any nonlinear behavior – explodes exponentially dims – can reduce your problem to a hypercube • Try assigning values to labels which carry information – ℎ𝑜𝑢𝑟 ∈ 0, 1, … , 23 → ℎ𝑜𝑢𝑟′ ∈ 0.22, 0.45, … , 0.03 • Try to capture nonlinear behavior using linear model, by adding meta-features
  • 11. Prediction results good accuracy (28 day prediction) can model flat endgame (in exp. processes it is hard) Some example of fair prediction anomaly, manual price change
  • 12. N-billion rows/day machine learning architecture using DataBricks
  • 13. N-billion rows/day machine learning architecture using DataBricks Avro nanobatches compressed with Snappy DataWarehouse Parquet Permanent Storage (blob storage) Temporary Storage (message broker)
  • 14. Training models in parallel in Spark Streaming get existing models update models store models back in the DB * we already know the clusters
  • 15. Grouping in Spark DataFrames with collect_list()
  • 17. Wrapping model training in UDF Prepare the Matrix with inputs for model training Normalize the data using normalization defined for this particular cluster Generate sample weights which enable controlling the learning rate Get the current model from DB (use in memory DB for fast response) Preform partial fit using the sample weights Important trick: Converts numpy.ndarray[numpy.float64] into native python list[float] which then can be autoconverted to Spark List[Double] Register UDF which returns Spark List[Double]
  • 18. Time Series Prediction as a Service • Provided the labels identify time series and lookup the model • Get historical data (performance is the key) • Recursively predict next price, shifting the window for the desired length • The same workflow for any model: SGDClassifier, SGDRegressor, ARIMA, Kalman, Particle Predictor 𝑓 𝑥 𝑛−𝑙 ⋮ 𝑥 𝑛 = 𝑥 𝑛+1 ⇒ 𝑓 𝑥 𝑛−𝑙+1 ⋮ 𝑥 𝑛+1 = 𝑥 𝑛+2 ⇒ …
  • 19. Summary • Spark + Python is AWESOME for DataScience  • Large scale DataScience needs correct infrastructure (Kafka-Kinesis, Spark Streaming, in memory DB, Notebooks) • It is much easier to work with large volumes of models, then very few ones • Gaussian Mixture is great for fuzzy clustering, has very mature and fast implementations • Spark DataFrames with UDF can be used to efficiently palletize the model training in tens and even hundreds of thousands models
  • 20. Senior Data Scientist, working with Apache Spark + Python doing Airfare/price forecasting Senior Big Data Engineer/Senior Backend Developer, working with Apache Spark/S3/MemSql/Scala + MicroAPIs job@infare.com http://www.infare.com/jobs/ Want to work with cutting edge 100% Apache Spark + Python projects? We are hiring!!!
  • 21. THANK YOU! Q/ARemember, we are hiring! Josef Habdank Lead Data Scientist & Data Platform Architect at INFARE • jha@infare.com www.infare.com • www.linkedin.com/in/jahabdank • @jahabdank