Deploying Data Science Engines to Production

Comparing Options + Code Examples
Mostafa Majidpour Senior Data Scientist at Meredith Corp.
October 20, 2018
IDEAS SoCal

About 140 million U.S. monthly unique visitors
• #1 network for women and millennials
https://www.comscore.com/Insights/Rankings

Motivating Example
• Scenario:
  • A user is browsing a website. We have access to the user’s cookie and/or past browsing behavior
• Requirements:
  • Involves predictive modeling
  • Real-time or near-real-time scoring

Machine Learning Pipeline
Creation to Deployment

Deployment Wall!
• https://speakerdeck.com/szilard/machine-learning-software-in-practice-quo-vadis-invited-talk-kdd-conference-applied-data-science-track-august-2017-halifax-canada

Deployment: To be or not to be?
• According to the Rexer Data Science Survey:
  • 37% of surveyed data scientists reported that their models are sometimes/rarely deployed.
  • 12% of surveyed data scientists reported that their models are always deployed.
• http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf

Approach 1: Look-up table
● Pre-compute the scores for all possible inputs (or a subset of them)
● Store the scores in a look-up table
+ No need for a complex scoring environment
- Table size grows fast with high-cardinality features (~50K zip codes x …)
- Scores are computed for some permutations that are never used
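A minimal sketch of the look-up table idea, assuming a small set of categorical inputs. The feature names, their values, and `toy_model` are hypothetical stand-ins for illustration, not the features or model used in the talk. The last helper shows why the table grows quickly once a high-cardinality feature is included.

```python
from itertools import product

# Hypothetical feature domains; real ones would come from the training data.
FEATURE_VALUES = {
    "device": ["desktop", "mobile", "tablet"],
    "daypart": ["morning", "afternoon", "evening", "night"],
    "segment": ["new", "returning"],
}


def toy_model(device, daypart, segment):
    """Stand-in for a trained model; returns a made-up score."""
    score = 0.2
    score += 0.3 if device == "mobile" else 0.0
    score += 0.1 if daypart == "evening" else 0.0
    score += 0.25 if segment == "returning" else 0.0
    return round(score, 2)


def build_lookup_table(feature_values, model):
    """Offline step: score every combination of feature values."""
    names = list(feature_values)
    return {
        combo: model(**dict(zip(names, combo)))
        for combo in product(*(feature_values[n] for n in names))
    }


def table_size(cardinalities):
    """Number of rows the table needs: the product of all cardinalities."""
    size = 1
    for c in cardinalities:
        size *= c
    return size


LOOKUP = build_lookup_table(FEATURE_VALUES, toy_model)


def score(device, daypart, segment, default=0.0):
    """Online step: scoring is a plain dictionary lookup."""
    return LOOKUP.get((device, daypart, segment), default)
```

With these assumed features the table has 3 x 4 x 2 = 24 rows; adding a feature with about 50K values (`table_size([50000, 3, 4, 2])`) already requires 1.2 million rows.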

Approach 2: Code re-write for deployment
- Time-consuming
- Error-prone
- Depends on comparable packages existing in the production language
- Slows the data science team’s impact on the business!
+ Ensures higher-quality code

Approach 3: Deployable Data Science outcome
What if the DS’s outcome (the ML pipeline) were readily deployable?
+ DS develops with more familiar tools (e.g. Python & R)
+ DE/SWE does not have to re-write the DS outcome (avoids code duplication)
+ Ensures higher-quality code
ML pipeline includes <Pre-transformation + ML Algorithm + Post-transformation>
[Diagram: Raw Input -> Scoring Engine (ML Pipeline: String Indexer -> Normalizer -> PCA -> Logistic Regression) -> Output]
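The idea of shipping the whole pipeline as one artifact can be sketched in plain Python. This is an illustration of the concept only, not the SparkML or MLeap API: the stage kinds, parameter names, and weights below are all hypothetical. Every stage is described as data, so the pre-transformations, the model, and the post-transformation travel together in a single serialized bundle.

```python
import json
import math


def apply_stage(kind, params, row):
    """Apply one pipeline stage to a row (a dict) and return a new row."""
    row = dict(row)
    if kind == "string_indexer":  # pre-transformation: category -> index
        row[params["output"]] = float(params["labels"].index(row[params["input"]]))
    elif kind == "scaler":  # pre-transformation: standardize a numeric feature
        row[params["output"]] = (row[params["input"]] - params["mean"]) / params["std"]
    elif kind == "logistic_regression":  # the ML algorithm
        z = params["intercept"] + sum(w * row[f] for f, w in params["weights"].items())
        row["probability"] = 1.0 / (1.0 + math.exp(-z))
    elif kind == "threshold":  # post-transformation: probability -> decision
        row["recommend"] = row["probability"] >= params["cutoff"]
    else:
        raise ValueError(f"unknown stage kind: {kind}")
    return row


def score(pipeline, raw_row):
    """Scoring engine: run the raw input through every stage in order."""
    row = raw_row
    for stage in pipeline:
        row = apply_stage(stage["kind"], stage["params"], row)
    return row


# Hypothetical fitted pipeline; all parameters are made up for illustration.
PIPELINE = [
    {"kind": "string_indexer",
     "params": {"input": "device", "output": "device_idx", "labels": ["desktop", "mobile"]}},
    {"kind": "scaler",
     "params": {"input": "visits", "output": "visits_scaled", "mean": 5.0, "std": 2.0}},
    {"kind": "logistic_regression",
     "params": {"intercept": -1.0, "weights": {"device_idx": 1.0, "visits_scaled": 0.5}}},
    {"kind": "threshold", "params": {"cutoff": 0.5}},
]

# "Export" in the data science environment, "load" in production.
bundle = json.dumps(PIPELINE)
restored = json.loads(bundle)
result = score(restored, {"device": "mobile", "visits": 9})
```

Because the production side only needs `score` and the bundle, nothing in the pipeline has to be re-implemented by hand when the data scientist changes a stage.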

Deployable Data Science outcome
Available Solutions

Decision Criteria
● Financial cost
● Supported languages in pipeline creation and runtime
● Ability to score multiple data points simultaneously (Dataframe vs. Row)
● Support for pre and post transformations (ML pipeline vs. ML model)
● SparkML support
● Scoring latency
● Active community
● Good documentation
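One way the scoring-latency criterion could be checked is a small timing harness around a single-row scoring call. This is a sketch under assumptions, not the measurement method used in the talk: `toy_score` is a hypothetical stand-in for a real scoring engine, and the warm-up count is arbitrary.

```python
import statistics
import time


def measure_latency_ms(score_fn, rows, warmup=100):
    """Time score_fn on each row and summarize the latencies in milliseconds."""
    # Warm-up calls so one-time costs do not distort the measurements.
    for row in rows[:warmup]:
        score_fn(row)

    timings = []
    for row in rows:
        start = time.perf_counter()
        score_fn(row)
        timings.append((time.perf_counter() - start) * 1000.0)

    timings.sort()
    p99_index = min(len(timings) - 1, int(0.99 * len(timings)))
    return {
        "median_ms": statistics.median(timings),
        "p99_ms": timings[p99_index],
    }


def toy_score(row):
    """Hypothetical scoring function; replace with a call to the real engine."""
    return 0.1 * row["visits"]


rows = [{"visits": i % 10} for i in range(1000)]
stats = measure_latency_ms(toy_score, rows)
```

Reporting a high percentile next to the median helps when the requirement is real-time scoring, since occasional slow calls matter as much as the typical one.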

Investigated Technologies
● PMML, jPMML
● PFA
● H2O
● Aloha
● Embedded Spark
● mllib-local (Spark)
● MLeap
For detailed comparison: https://www.slideshare.net/formulatedby/a-journey-of-deploying-a-data-science-engine-to-production
Scoring one data point at a time
No support for pre and post transformations
Slower than MLeap
Only works in Scala
Not fast enough
Satisfies our main requirements
Not mature enough

MLeap
● Model creation: Python and Scala; Scoring: Scala (Integrates well with Java)
● Supports many transformations and ML models from SparkML, sklearn, TensorFlow, and xgboost
● Active community
● Fast (0.11ms vs. 22ms for Spark)
● Custom transformers
● Even Databricks recommends it: “Databricks recommends MLeap, which is a common serialization format and execution engine for machine learning pipelines. It supports serializing Apache Spark, scikit-learn, and TensorFlow pipelines into a bundle, so you can load and deploy your trained models to make predictions with new data.”
○ https://docs.databricks.com/spark/latest/mllib/index.html#model-export-label
- Inconsistent documentation
https://www.drivenbycode.com/mleap-quickly-release-spark-ml-pipelines/

Deployable DS outcome with MLeap
[Diagram: Data Science Playground (Python, R, Scala, or Java): Spark MLlib pipeline (String Indexer -> Normalizer -> PCA -> Logistic Regression) -> export as MLeap bundle -> Production Environment (Java or Scala): Scoring Engine on the MLeap runtime (JVM), Raw Input -> Output]

MLeap sample code
Spark ML + MLeap bundle + Scoring

Export as MLeap Bundle
Use Scala version! ;)

Use Case at Meredith
• Recommend products to online users
• Legacy system: reduced-dimension lookup table with simple predictive models
• Proposed system with SparkML and MLeap: boosted conversion rate by around 20% in different releases

Summary
● Batch scoring? Do it in the DS environment! No deployment needed
● Real-time scoring? Relatively small number of input permutations?
○ Yes: look-up table! Simple deployment
○ No: check out MLeap and similar tools (you now have sample MLeap code, simple enough to start with!)
● Consider a deployment solution that exports the whole ML pipeline
● MLeap worked for us! It still needs a lot of attention from the community
● Not discussed because of cost: Databricks MLflow, Amazon SageMaker, ScienceOPS (yhat), Anaconda Enterprise, NStack, …
○ Big enterprise solutions are very recent
● Open source possibility: dbml-local (Databricks)
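The decision flow in this summary can be written down as a small helper. It is a sketch only: the function name and the `max_table_rows` cutoff are hypothetical, since the talk does not give a number for "relatively small".

```python
def choose_deployment(real_time, n_input_permutations, max_table_rows=1_000_000):
    """Map the two summary questions to a deployment approach.

    max_table_rows is an assumed cutoff for a "relatively small" table;
    pick a value that fits your own storage and refresh constraints.
    """
    if not real_time:
        return "batch scoring in the DS environment"
    if n_input_permutations <= max_table_rows:
        return "look-up table"
    return "exported ML pipeline (e.g. MLeap)"
```

For example, `choose_deployment(True, 24)` returns `"look-up table"`, while a real-time use case with billions of input permutations falls through to the exported-pipeline option.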

Thanks to my colleagues at Meredith!
Thank you!
Questions?
