Deploying Data Science Engines to Production
Comparing Options + Code Examples
Mostafa Majidpour Senior Data Scientist at Meredith Corp.
October 20, 2018
IDEAS SoCal
About 140 million U.S. monthly unique visitors
• #1 network for women and millennials
https://www.comscore.com/Insights/Rankings
Motivating Example
• Scenario:
• A user is browsing a website. We have access to the user’s cookie and/or past browsing behavior.
• Requirements:
• Involves predictive modeling
• Real-time or near-real-time scoring
Machine Learning Pipeline
Creation to Deployment
Deployment Wall!
• https://speakerdeck.com/szilard/machine-learning-software-in-practice-quo-vadis-invited-talk-kdd-conference-applied-data-science-track-august-2017-halifax-canada
Deployment: To be or not to be?
• According to the Rexer Data Science Survey:
• 37% of surveyed data scientists reported their models are sometimes/rarely deployed.
• 12% of surveyed data scientists reported their models are always deployed.
• http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf
Approach 1: Look-up table
● Pre-compute the scores for all possible inputs (or a subset of them)
● Store the scores in a look-up table
+ No need for a complex scoring environment
- Table size grows fast with high-cardinality features (~50K zip codes x …)
- Scores are pre-computed for some permutations that are never requested
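The look-up table approach can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the feature names, their values, and `toy_score` are hypothetical stand-ins for a trained model. Only the ~50K zip code cardinality comes from the slide.

```python
from itertools import product
from math import prod

# Hypothetical feature domains (assumption; the slide only names zip code).
FEATURES = {
    "device": ["mobile", "desktop", "tablet"],
    "daypart": ["morning", "afternoon", "evening", "night"],
    "returning": [False, True],
}


def toy_score(device, daypart, returning):
    """Stand-in for a trained model; any deterministic scorer works here."""
    score = 0.1
    score += 0.2 if device == "mobile" else 0.0
    score += 0.1 if daypart == "evening" else 0.0
    score += 0.3 if returning else 0.0
    return round(score, 3)


def build_lookup_table(features, scorer):
    """Pre-compute a score for every combination of feature values."""
    names = list(features)
    return {
        combo: scorer(**dict(zip(names, combo)))
        for combo in product(*features.values())
    }


def table_size(cardinalities):
    """Number of rows needed to cover every input permutation."""
    return prod(cardinalities)


TABLE = build_lookup_table(FEATURES, toy_score)

# Serving is a single dictionary lookup; no model runtime is required.
print(TABLE[("mobile", "evening", True)])

# Adding one ~50K-value feature (e.g. zip code) multiplies the row count.
print(len(TABLE), table_size([3, 4, 2, 50_000]))
```

The last line shows why the table grows fast: every added feature multiplies the number of rows, whether or not those combinations ever occur in traffic.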
Approach 2: Code re-write for deployment
- Time-consuming
- Error-prone
- Comparable packages already exist
- Slows the data science team's impact on the business!
+ Ensures higher-quality code
Approach 3: Deployable Data Science outcome
What if the DS’s outcome (the ML pipeline) were readily deployable?
+ DS develops with more familiar tools (e.g., Python and R)
+ DE/SWE does not have to rewrite the DS outcome (avoids code duplication)
+ Ensures higher-quality code
The ML pipeline includes pre-transformation + ML algorithm + post-transformation.
[Diagram: Raw Input → Scoring Engine (ML Pipeline: String Indexer → Normalizer → PCA → Logistic Regression) → Output]
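To make "the whole pipeline is the deployable unit" concrete, here is a plain-Python sketch of a scoring engine that chains the stages named in the diagram. It does not use Spark or MLeap. The fitted parameters (`CATEGORY_INDEX`, `PCA_COMPONENTS`, `WEIGHTS`, `BIAS`, `THRESHOLD`) and the input fields are hypothetical values chosen for illustration.

```python
import math

# Hypothetical fitted parameters; in practice these come from training.
CATEGORY_INDEX = {"news": 0.0, "food": 1.0, "home": 2.0}
PCA_COMPONENTS = [[0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]  # 3 features -> 2
WEIGHTS, BIAS = [1.5, -0.5], 0.1
THRESHOLD = 0.5


def string_indexer(row):
    """Pre-transformation: map a category string to a numeric index."""
    return [CATEGORY_INDEX[row["category"]], row["visits"], row["minutes"]]


def normalizer(vec):
    """Pre-transformation: scale the vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def pca(vec):
    """Pre-transformation: project onto the fitted components."""
    return [sum(c * v for c, v in zip(comp, vec)) for comp in PCA_COMPONENTS]


def logistic_regression(vec):
    """ML algorithm: probability from a linear score."""
    z = sum(w * v for w, v in zip(WEIGHTS, vec)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def to_label(prob):
    """Post-transformation: turn the probability into a decision."""
    return {"probability": prob, "recommend": prob >= THRESHOLD}


PIPELINE = [string_indexer, normalizer, pca, logistic_regression, to_label]


def score(raw_input):
    """Scoring engine: run the raw input through every stage in order."""
    value = raw_input
    for stage in PIPELINE:
        value = stage(value)
    return value


result = score({"category": "food", "visits": 3.0, "minutes": 4.0})
print(result)
```

If only `logistic_regression` were exported, the production side would have to re-implement the three pre-transformations and the post-transformation by hand, which is the duplication this approach tries to avoid.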
Deployable Data Science outcome
Available Solutions
Decision Criteria
● Financial cost
● Supported languages in pipeline creation and runtime
● Ability to score multiple data points simultaneously (Dataframe vs. Row)
● Support for pre- and post-transformations (ML pipeline vs. ML model)
● SparkML support
● Scoring latency
● Active community
● Good documentation
Investigated Technologies
● PMML, jPMML
● PFA
● H2O
● Aloha
● Embedded Spark
● mllib-local (Spark)
● MLeap
For detailed comparison: https://www.slideshare.net/formulatedby/a-journey-of-deploying-a-data-science-engine-to-production
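Callouts from the slide (their pairing with the technologies listed above is not recoverable from this transcript):
● Scoring one data point at a time
● No support for pre- and post-transformations
● Slower than MLeap
● Only works in Scala
● Not fast enough
● Satisfies our main requirements
● Not mature enough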
MLeap
● Model creation: Python and Scala; scoring: Scala (integrates well with Java)
● Supports many transformations and ML models from SparkML, scikit-learn, TensorFlow, and XGBoost
● Active community
● Fast (0.11 ms vs. 22 ms for Spark)
● Custom transformers
● Even Databricks recommends it: “Databricks recommends MLeap, which is a common serialization format and execution engine for machine learning pipelines. It supports serializing Apache Spark, scikit-learn, and TensorFlow pipelines into a bundle, so you can load and deploy your trained models to make predictions with new data.”
○ https://docs.databricks.com/spark/latest/mllib/index.html#model-export-label
- Inconsistent documentation
https://www.drivenbycode.com/mleap-quickly-release-spark-ml-pipelines/
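A quick back-of-the-envelope calculation shows what the quoted latencies mean for real-time scoring. The two latencies are the figures from the slide. The 50 ms response budget, and the simplification of one thread scoring records one after another with no network or queuing overhead, are assumptions made only for this sketch.

```python
# Per-record scoring latencies quoted on the slide, in milliseconds.
MLEAP_MS = 0.11
SPARK_MS = 22.0

# Hypothetical end-to-end response budget (assumption, not from the talk).
BUDGET_MS = 50.0


def speedup(slow_ms, fast_ms):
    """How many times faster the fast engine is."""
    return slow_ms / fast_ms


def max_sequential_rate(latency_ms):
    """Upper bound on records per second for one thread, one record at a time."""
    return 1000.0 / latency_ms


def scores_within_budget(latency_ms, budget_ms):
    """How many sequential scores fit inside a response-time budget."""
    return int(budget_ms // latency_ms)


print(round(speedup(SPARK_MS, MLEAP_MS)))
print(round(max_sequential_rate(MLEAP_MS)), round(max_sequential_rate(SPARK_MS)))
print(scores_within_budget(MLEAP_MS, BUDGET_MS),
      scores_within_budget(SPARK_MS, BUDGET_MS))
```

Under these assumptions the gap is roughly two orders of magnitude, which matters when several candidate items have to be scored within a single page request.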
Deployable DS outcome with MLeap
[Diagram: in the Data Science Playground (Python, R, Scala, or Java), a Spark MLlib pipeline (String Indexer → Normalizer → PCA → Logistic Regression) is exported as an MLeap bundle. In the Production Environment (Java or Scala), the Scoring Engine runs that bundle on the MLeap runtime (JVM): Raw Input → Scoring Engine → Output.]
MLeap sample code
Spark ML + MLeap bundle + Scoring
Build Spark ML pipeline
Export as MLeap Bundle
Use Scala version! ;)
Score in JVM (1)
Score in JVM (2)
Use Case at Meredith
• Recommend products to online users
• Legacy system: reduced-dimension lookup table with simple predictive models
• Proposed system with SparkML and MLeap: boosted conversion rate by around 20% in different releases
Summary
● Batch scoring? Do it in the DS environment! No deployment needed
● Real-time scoring? Relatively small number of input permutations?
○ Yes: look-up table! Simple deployment
○ No: check out MLeap and the like (you now have sample MLeap code, simple enough to start with!)
● Consider a deployment solution that exports the whole ML pipeline
● MLeap worked for us! It still needs lots of attention from the community
● Not discussed because of cost: Databricks MLflow, Amazon SageMaker, ScienceOps (Yhat), Anaconda Enterprise, NStack, …
○ Big enterprise solutions are very recent
● Open source possibility: dbml-local (Databricks)
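The first three summary bullets amount to a small decision rule, sketched below. The function name, its parameters, and the `MAX_TABLE_ROWS` cap are hypothetical. The talk gives no numeric threshold for a "relatively small" number of permutations, so the cap is only a placeholder to tune.

```python
from math import prod

# Hypothetical cap on look-up table rows; tune to your storage and refresh cost.
MAX_TABLE_ROWS = 1_000_000


def choose_deployment(real_time, feature_cardinalities, max_rows=MAX_TABLE_ROWS):
    """Return the option the summary suggests for a scoring workload."""
    if not real_time:
        # Batch scoring can stay where the model was built.
        return "batch scoring in the DS environment"
    if prod(feature_cardinalities) <= max_rows:
        # Few enough input permutations to pre-compute every score.
        return "look-up table"
    # Too many permutations: ship the whole pipeline to a scoring runtime.
    return "exported ML pipeline (e.g. MLeap)"


print(choose_deployment(False, [50_000, 10]))
print(choose_deployment(True, [3, 4, 2]))
print(choose_deployment(True, [50_000, 3, 4, 2]))
```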
Thanks to my colleagues at Meredith!
Thank you!
Questions?
22

Más contenido relacionado

La actualidad más candente

Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...Databricks
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Deep Learning for Large-Scale Online Fraud Detection—Fighting Fraudsters Amon...
Deep Learning for Large-Scale Online Fraud Detection—Fighting Fraudsters Amon...Deep Learning for Large-Scale Online Fraud Detection—Fighting Fraudsters Amon...
Deep Learning for Large-Scale Online Fraud Detection—Fighting Fraudsters Amon...Databricks
 
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...Databricks
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaDatabricks
 
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Databricks
 
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...Databricks
 
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...Databricks
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Databricks
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operationsStepan Pushkarev
 
Code Once Use Often with Declarative Data Pipelines
Code Once Use Often with Declarative Data PipelinesCode Once Use Often with Declarative Data Pipelines
Code Once Use Often with Declarative Data PipelinesDatabricks
 
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...Databricks
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science PlatformQAware GmbH
 
Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal ...
Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal ...Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal ...
Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal ...Databricks
 
Model Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on DatabricksModel Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on DatabricksDatabricks
 
Using Apache Spark for Predicting Degrading and Failing Parts in Aviation
Using Apache Spark for Predicting Degrading and Failing Parts in AviationUsing Apache Spark for Predicting Degrading and Failing Parts in Aviation
Using Apache Spark for Predicting Degrading and Failing Parts in AviationDatabricks
 
Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runta...
Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runta...Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runta...
Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runta...Databricks
 
Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Sparkcarl_pulley
 
KFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature StoreKFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature StoreDatabricks
 

La actualidad más candente (20)

Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
Lessons Learned from Using Spark for Evaluating Road Detection at BMW Autonom...
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Deep Learning for Large-Scale Online Fraud Detection—Fighting Fraudsters Amon...
Deep Learning for Large-Scale Online Fraud Detection—Fighting Fraudsters Amon...Deep Learning for Large-Scale Online Fraud Detection—Fighting Fraudsters Amon...
Deep Learning for Large-Scale Online Fraud Detection—Fighting Fraudsters Amon...
 
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
 
Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
 
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
 
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operations
 
Code Once Use Often with Declarative Data Pipelines
Code Once Use Often with Declarative Data PipelinesCode Once Use Often with Declarative Data Pipelines
Code Once Use Often with Declarative Data Pipelines
 
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science Platform
 
Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal ...
Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal ...Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal ...
Leveraging Spark ML for Real-Time Credit Card Approvals with Anand Venugopal ...
 
Model Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on DatabricksModel Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on Databricks
 
Using Apache Spark for Predicting Degrading and Failing Parts in Aviation
Using Apache Spark for Predicting Degrading and Failing Parts in AviationUsing Apache Spark for Predicting Degrading and Failing Parts in Aviation
Using Apache Spark for Predicting Degrading and Failing Parts in Aviation
 
Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runta...
Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runta...Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runta...
Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runta...
 
Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Spark
 
KFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature StoreKFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature Store
 

Similar a Deploying Data Science Engines to Production

Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionFormulatedby
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexApache Apex
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APIshareddatamsft
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_VeriticalsPeyman Mohajerian
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 editionDavid Talby
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowDatabricks
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark TrainingSpark Summit
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...Deepak Chandramouli
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL ServerŁukasz Grala
 

Similar a Deploying Data Science Engines to Production (20)

Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache Apex
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael Moore
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflow
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
 

Último

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 

Último (20)

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf

Deploying Data Science Engines to Production

  • 1. Deploying Data Science Engines to Production: Comparing Options + Code Examples. Mostafa Majidpour, Senior Data Scientist at Meredith Corp. October 20, 2018, IDEAS SoCal
  • 2. About 140 million U.S. monthly unique visitors; #1 network for women and millennials (https://www.comscore.com/Insights/Rankings)
  • 3. Motivating Example
      ○ Scenario: a user is browsing a website, and we have access to the user’s cookie and/or past browsing behavior.
      ○ Requirements: involves predictive modeling; real-time or near-real-time scoring.
  • 6. Deployment: To be or not to be? According to the Rexer Data Science Survey:
      ○ 37% of surveyed data scientists reported their models are sometimes/rarely deployed.
      ○ 12% of surveyed data scientists reported their models are always deployed.
      ○ http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf
  • 7. Approach 1: Look-up table
      ○ Pre-compute the scores for all possible inputs (or a subset of them).
      ○ Store the scores in a look-up table.
      + No need for a complex scoring environment.
      - Table size grows fast with high-cardinality features (~50K zip codes x …).
      - Scores are computed for some permutations that are never used.
  • 8. Approach 2: Code re-write for deployment
      - Time consuming
      - Prone to errors
      - Existence of comparable packages
      - Slows the impact of the data science team on the business!
      + Ensures higher quality code
  • 9. Approach 3: Deployable Data Science outcome. What if the DS’s outcome (the ML pipeline) was readily deployable?
      + DS develops with more familiar tools (e.g. Python & R)
      + DE/SWE does not have to re-write the DS outcome (avoiding code duplication)
      + Ensures higher quality code
      ○ ML pipeline includes <Pre-transformation + ML Algorithm + Post-transformation>
      ○ Diagram: Raw Input → Scoring Engine (ML pipeline: String Indexer, Normalizer, PCA, Logistic Regression) → Output
  • 10. Deployable Data Science outcome: Available Solutions
  • 11. Decision Criteria
      ○ Financial cost
      ○ Supported languages in pipeline creation and runtime
      ○ Ability to score multiple data points simultaneously (Dataframe vs. Row)
      ○ Support for pre and post transformations (ML pipeline vs. ML model)
      ○ SparkML support
      ○ Scoring latency
      ○ Active community
      ○ Good documentation
  • 12. Investigated Technologies
      ○ PMML, jPMML
      ○ PFA
      ○ H2O
      ○ Aloha
      ○ Embedded Spark
      ○ mllib-local (Spark)
      ○ MLeap
      ○ Remarks on the slide, in transcript order: scoring one data point at a time; no support for pre and post transformations; slower than MLeap; only works in Scala; not fast enough; satisfies our main requirements; not mature enough
      ○ For detailed comparison: https://www.slideshare.net/formulatedby/a-journey-of-deploying-a-data-science-engine-to-production
  • 13. MLeap
      ○ Model creation: Python and Scala; scoring: Scala (integrates well with Java)
      ○ Supports many transformations and ML models from SparkML, sklearn, TensorFlow, and xgboost
      ○ Active community
      ○ Fast (0.11 ms vs. 22 ms for Spark)
      ○ Custom transformers
      ○ Even Databricks recommends it: “Databricks recommends MLeap, which is a common serialization format and execution engine for machine learning pipelines. It supports serializing Apache Spark, scikit-learn, and TensorFlow pipelines into a bundle, so you can load and deploy your trained models to make predictions with new data.” (https://docs.databricks.com/spark/latest/mllib/index.html#model-export-label)
      - Inconsistent documentation
      ○ https://www.drivenbycode.com/mleap-quickly-release-spark-ml-pipelines/
  • 14. Deployable DS outcome with MLeap
      ○ Data Science Playground (Python, R, Scala, or Java): Spark MLlib pipeline (String Indexer, Normalizer, PCA, Logistic Regression), exported as an MLeap bundle
      ○ Production Environment (Java or Scala): Scoring Engine on the MLeap runtime (JVM) loads the Spark MLlib pipeline as an MLeap bundle
      ○ Raw Input → Scoring Engine → Output
  • 15. MLeap sample code: Spark ML + MLeap bundle + Scoring
  • 16. Build Spark ML pipeline
  • 17. Export as MLeap Bundle. Use the Scala version! ;)
  • 18. Score in JVM (1)
  • 19. Score in JVM (2)
  • 20. Use Case at Meredith
      ○ Recommend products to online users
      ○ Legacy system: reduced-dimension look-up table with simple predictive models
      ○ Proposed system with SparkML and MLeap: boosted conversion rate by around 20% in different releases
  • 21. Summary
      ○ Batch scoring? Do it in the DS environment! No deployment needed.
      ○ Real-time scoring? Is the number of input permutations relatively small?
          ○ Yes: look-up table! Simple deployment.
          ○ No: check out MLeap and the like (you do have sample MLeap code, simple enough to start!).
      ○ Consider a deployment solution that exports the whole ML pipeline.
      ○ MLeap worked for us! It still needs lots of attention from the community.
      ○ Not discussed because of cost: Databricks MLflow, Amazon SageMaker, ScienceOPS (yhat), Anaconda Enterprise, NStack, …
          ○ Big enterprise solutions are very recent.
      ○ Open source possibility: dbml-local (Databricks)
  • 22. Thanks to my colleagues at Meredith! Thank you! Questions?
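
The look-up table approach from slide 7 can be sketched in a few lines of Python. This is only an illustration under assumptions, not the speaker's code: the feature names, their values, and the `score` function are hypothetical stand-ins for a trained model. The sketch pre-computes one score per combination of inputs offline, serves scores with a plain dictionary look-up, and uses `table_size` to show how quickly the table grows once a high-cardinality feature such as zip code is added.

```python
import math
from itertools import product

# Hypothetical categorical inputs; the real features are not given in the slides.
FEATURES = {
    "device": ["desktop", "mobile", "tablet"],
    "daypart": ["morning", "afternoon", "evening", "night"],
    "segment": ["new", "returning"],
}


def score(device, daypart, segment):
    """Hypothetical stand-in for a trained model: any deterministic function works."""
    base = {"desktop": 0.2, "mobile": 0.5, "tablet": 0.3}[device]
    evening_bonus = 0.1 if daypart == "evening" else 0.0
    returning_bonus = 0.2 if segment == "returning" else 0.0
    return round(base + evening_bonus + returning_bonus, 3)


def build_lookup_table(features, score_fn):
    """Offline step: pre-compute a score for every combination of feature values."""
    names = list(features)
    return {
        combo: score_fn(**dict(zip(names, combo)))
        for combo in product(*(features[name] for name in names))
    }


def table_size(cardinalities):
    """Number of rows the table needs: the product of the feature cardinalities."""
    return math.prod(cardinalities)


TABLE = build_lookup_table(FEATURES, score)


def serve(device, daypart, segment):
    """Online step: no model runtime is needed, only a dictionary look-up."""
    return TABLE[(device, daypart, segment)]


if __name__ == "__main__":
    print(len(TABLE), "rows for the three small features")
    print(table_size([50_000, 3, 4, 2]), "rows once ~50K zip codes are added")
```

With these made-up features the table has 24 rows; adding a 50,000-value zip code feature multiplies that by 50,000, which is the growth problem the slide points out.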
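
Slide 9 describes the deployable outcome as a whole ML pipeline (pre-transformation + ML algorithm + post-transformation) wrapped in a scoring engine. The pure-Python sketch below mirrors the stages named in the diagram so the idea can be seen end to end. It is not MLeap or Spark code: every label, weight, projection matrix, and threshold is a hypothetical value chosen for illustration, and the `pca` stage is just a fixed linear projection standing in for a fitted PCA.

```python
import math


def string_indexer(labels, column):
    """Pre-transformation: replace a string category with its numeric index."""
    index = {label: float(i) for i, label in enumerate(labels)}

    def stage(row):
        out = dict(row)
        out[column] = index[row[column]]
        return out

    return stage


def assemble(columns):
    """Turn the selected columns of a row into a feature vector."""
    return lambda row: [float(row[c]) for c in columns]


def normalizer(vec):
    """Scale the vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def pca(components):
    """Project the vector onto fixed components (hypothetical, not fitted here)."""
    return lambda vec: [sum(c * v for c, v in zip(comp, vec)) for comp in components]


def logistic_regression(weights, bias):
    """ML algorithm: probability from a linear score passed through a sigmoid."""

    def stage(vec):
        z = bias + sum(w * v for w, v in zip(weights, vec))
        return 1.0 / (1.0 + math.exp(-z))

    return stage


def to_decision(threshold):
    """Post-transformation: turn the probability into a business decision."""
    return lambda p: {"probability": p, "recommend": p >= threshold}


class ScoringEngine:
    """Runs raw input through every pipeline stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def score(self, raw_input):
        value = raw_input
        for stage in self.stages:
            value = stage(value)
        return value


# All parameters below are made up for the example.
ENGINE = ScoringEngine([
    string_indexer(["news", "food", "home"], column="section"),
    assemble(["section", "visits", "minutes"]),
    normalizer,
    pca([[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]),
    logistic_regression(weights=[0.5, 2.0], bias=-1.0),
    to_decision(threshold=0.5),
])

if __name__ == "__main__":
    print(ENGINE.score({"section": "food", "visits": 3, "minutes": 4}))
```

The point of the design is that the caller hands over raw input and gets the final output back, so nothing between the two has to be re-implemented by the engineering team.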
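
The latency figures on slide 13 (0.11 ms for MLeap vs. 22 ms for Spark) are easier to compare as a ratio and as a rough throughput ceiling. The arithmetic below assumes scores are computed one after another on a single thread and that nothing else (network, serialization, queuing) adds time; those assumptions are mine, not the slide's.

```python
# Per-score latencies quoted on slide 13, in milliseconds.
MLEAP_MS = 0.11
SPARK_MS = 22.0


def speedup(slow_ms, fast_ms):
    """How many times faster the quicker option is."""
    return slow_ms / fast_ms


def max_scores_per_second(latency_ms):
    """Upper bound on sequential scores per second at a given per-score latency."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    print(f"speedup: about {speedup(SPARK_MS, MLEAP_MS):.0f}x")
    print(f"MLeap ceiling: about {max_scores_per_second(MLEAP_MS):.0f} scores/s")
    print(f"Spark ceiling: about {max_scores_per_second(SPARK_MS):.0f} scores/s")
```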
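
The decision flow in the summary (slide 21) can be written down as a small function. The slide only says "relatively small number of input permutations", so the `max_table_rows` cut-off below is a hypothetical parameter that each team would have to pick for itself.

```python
def choose_deployment(real_time, input_permutations, max_table_rows=1_000_000):
    """Follow the summary slide's questions in order and return the suggestion.

    max_table_rows is an assumed cut-off for "relatively small"; the talk
    gives no number.
    """
    if not real_time:
        return "batch scoring in the DS environment (no deployment needed)"
    if input_permutations <= max_table_rows:
        return "look-up table"
    return "exported ML pipeline (MLeap and the like)"


if __name__ == "__main__":
    print(choose_deployment(real_time=False, input_permutations=24))
    print(choose_deployment(real_time=True, input_permutations=24))
    print(choose_deployment(real_time=True, input_permutations=50_000 * 10_000))
```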

Editor's notes

  1. Available Technologies/Solutions & Decision Factors