Presented at IDEAS SoCal on Oct 20, 2018. I discuss the main approaches to deploying data science engines to production and provide sample code for a comprehensive approach: real-time scoring with MLeap and Spark ML.
1. Deploying Data Science Engines to Production
Comparing Options + Code Examples
Mostafa Majidpour, Senior Data Scientist at Meredith Corp.
October 20, 2018
IDEAS SoCal
2. About 140 million U.S. monthly unique visitors
• #1 network for women and millennials
https://www.comscore.com/Insights/Rankings
3. Motivating Example
• Scenario:
• A user is browsing a website. We have access to the user’s cookie and/or past browsing behavior
• Requirements:
• Involves predictive modeling
• Real-time or near-real-time scoring
6. Deployment: To be or not to be?
• According to Rexer Data Science Survey:
• 37% of surveyed data scientists reported their models are sometimes/rarely deployed.
• 12% of surveyed data scientists reported their models are always deployed.
• http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf
7. Approach 1: Look-up table
● Pre-compute the scores for all possible inputs (or a subset of them)
● Store the scores in a look-up table
+No need for a complex scoring environment
- Table size grows fast with high-cardinality features (~50K zip codes x …)
- Some pre-computed scores are never used
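A minimal sketch of the look-up table approach, assuming a hypothetical `score` function and small made-up feature domains (none of these names or values come from the talk). It pre-computes a score for every input combination and shows that the table size is the product of the feature cardinalities.

```python
from itertools import product
from math import prod

# Hypothetical feature domains (assumptions for illustration only).
FEATURE_VALUES = {
    "device": ["desktop", "mobile", "tablet"],
    "daypart": ["morning", "afternoon", "evening", "night"],
    "returning": [False, True],
}


def score(device, daypart, returning):
    """Stand-in for a trained model; returns a made-up score."""
    base = {"desktop": 0.2, "mobile": 0.3, "tablet": 0.25}[device]
    bump = 0.1 if daypart == "evening" else 0.0
    return round(base + bump + (0.15 if returning else 0.0), 2)


def build_lookup_table(feature_values, score_fn):
    """Pre-compute the score for every combination of feature values."""
    names = list(feature_values)
    return {
        combo: score_fn(**dict(zip(names, combo)))
        for combo in product(*(feature_values[n] for n in names))
    }


table = build_lookup_table(FEATURE_VALUES, score)

# At serving time, scoring is a dictionary read; no model runtime is needed.
print(table[("mobile", "evening", True)])

# Table size is the product of the cardinalities, so it grows quickly.
print(len(table))               # 3 * 4 * 2 combinations
print(prod([50_000, 3, 4, 2]))  # same features plus a ~50K-value zip code
```

The last line illustrates the drawback listed on the slide: adding a single high-cardinality feature multiplies the number of rows that must be pre-computed and stored.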
8. Approach 2: Code re-write for deployment
- Time consuming
- Prone to errors
- Depends on comparable packages existing in the production language
- Slows the data science team's impact on the business!
+Ensures higher quality code
9. Approach 3: Deployable Data Science outcome
What if the DS’s outcome (the ML pipeline) was readily deployable?
+DS develops with more familiar tools (e.g. python & R)
+DE/SWE does not have to re-write the DS outcome (Avoiding code duplication)
+Ensures higher quality code
ML pipeline includes <Pre-transformation + ML Algorithm + Post-transformation>
[Diagram: Raw Input -> Scoring Engine containing the ML Pipeline (String Indexer -> Normalizer -> PCA -> Logistic Regression) -> Output]
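The pipeline structure described on this slide (pre-transformation + ML algorithm + post-transformation) can be sketched in plain Python as a single object that is shipped and scored as one unit. This is only an illustration of the idea, not Spark ML or MLeap code; the class name, the category index, and the weights are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ScoringPipeline:
    """Pre-transformation + model + post-transformation, shipped as one unit."""
    pre: Callable[[Dict[str, str]], List[float]]
    model: Callable[[List[float]], float]
    post: Callable[[float], str]

    def score(self, raw: Dict[str, str]) -> str:
        return self.post(self.model(self.pre(raw)))


# Hypothetical category index, playing the role of a fitted string indexer.
DEVICE_INDEX = {"desktop": 0.0, "mobile": 1.0}


def pre_transform(raw):
    """Turn raw string input into a numeric feature vector."""
    return [DEVICE_INDEX[raw["device"]], float(raw["visits"])]


def logistic_model(features):
    """Logistic regression with made-up weights (not a trained model)."""
    weights, bias = [0.8, 0.3], -1.0
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def post_transform(probability):
    """Map the probability to a business-level decision."""
    return "recommend" if probability >= 0.5 else "skip"


pipeline = ScoringPipeline(pre_transform, logistic_model, post_transform)
print(pipeline.score({"device": "mobile", "visits": "3"}))
```

Because the caller passes raw input and receives the final output, the production side never has to re-implement the transformations around the model.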
11. Decision Criteria
● Financial cost
● Supported languages in pipeline creation and runtime
● Ability to score multiple data points simultaneously (Dataframe vs. Row)
● Support for pre and post transformations (ML pipeline vs. ML model)
● SparkML support
● Scoring latency
● Active community
● Good documentation
12. Investigated Technologies
● PMML, jPMML
● PFA
● H2O
● Aloha
● Embedded Spark
● mllib-local (Spark)
● MLeap
For detailed comparison: https://www.slideshare.net/formulatedby/a-journey-of-deploying-a-data-science-engine-to-production
Findings noted on the slide for these technologies:
● Scoring one data point at a time
● No support for pre and post transformations
● Slower than MLeap
● Only works in Scala
● Not fast enough
● Satisfies our main requirements
● Not mature enough
13. MLeap
● Model creation: Python and Scala; Scoring: Scala (Integrates well with Java)
● Supports many transformations and ML models from SparkML, sklearn, TensorFlow, and xgboost
● Active community
● Fast (0.11ms vs. 22ms for Spark)
● Custom transformers
● Even Databricks recommends it: “Databricks recommends MLeap, which is a common serialization format and execution engine for machine learning pipelines. It supports serializing Apache Spark, scikit-learn, and TensorFlow pipelines into a bundle, so you can load and deploy your trained models to make predictions with new data.”
○ https://docs.databricks.com/spark/latest/mllib/index.html#model-export-label
- Inconsistent documentation
https://www.drivenbycode.com/mleap-quickly-release-spark-ml-pipelines/
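Latency figures like the ones quoted on this slide depend on how they are measured. The following is a minimal sketch of a per-call latency measurement, assuming a hypothetical `dummy_score` function in place of a real scoring engine call; it is not the benchmark behind the numbers above.

```python
import statistics
import time


def measure_latency_ms(score_fn, sample_input, runs=1000, warmup=100):
    """Return the median and 99th-percentile latency of score_fn in ms."""
    for _ in range(warmup):  # warm-up calls are not timed
        score_fn(sample_input)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        score_fn(sample_input)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p99 = timings[min(len(timings) - 1, int(0.99 * len(timings)))]
    return statistics.median(timings), p99


def dummy_score(row):
    """Hypothetical stand-in for a call into a scoring engine."""
    return sum(row) / len(row)


median_ms, p99_ms = measure_latency_ms(dummy_score, [0.1, 0.5, 0.9])
print(f"median={median_ms:.4f} ms, p99={p99_ms:.4f} ms")
```

Reporting a high percentile alongside the median is a design choice here: for real-time scoring, occasional slow calls matter as much as the typical one.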
14. Deployable DS outcome with MLeap
[Diagram: In the Data Science Playground (Python, R, Scala, or Java), the Spark MLlib pipeline (String Indexer -> Normalizer -> PCA -> Logistic Regression) is exported as an MLeap bundle. In the Production Environment (Java or Scala), the Scoring Engine runs the bundle on the MLeap runtime (JVM), mapping Raw Input to Output.]
20. Use Case at Meredith
• Recommend products to online users
• Legacy system: reduced dimension lookup table with simple predictive models
• Proposed system with SparkML and MLeap: boosted conversion rate by around 20% in different releases
21. Summary
● Batch scoring? Do it in the DS environment! No deployment needed
● Real time scoring? Relatively small number of input permutations?
○ Yes! Use a look-up table: simple deployment
○ No! Check out MLeap and the like (the sample MLeap code from this talk is simple enough to start with!)
● Consider a deployment solution that exports the whole ML pipeline
● MLeap worked for us! It still needs a lot of attention from the community
● Not discussed because of cost: Databricks mlflow, Amazon SageMaker, ScienceOPS (yhat), Anaconda Enterprise, NStack, …
○ Big enterprise solutions are very recent
● Open source possibility: dbml-local (Databricks)
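The questions in this summary can be read as a small decision procedure. The sketch below encodes them in Python; the function name and the `max_table_rows` cut-off for a "relatively small" number of permutations are assumptions, since the talk gives no number.

```python
def choose_deployment(real_time, input_combinations=None, max_table_rows=1_000_000):
    """Map the summary's questions to a suggested deployment approach.

    max_table_rows is a hypothetical cut-off for a "relatively small"
    number of input permutations; the talk does not give a number.
    """
    if not real_time:
        return "batch scoring in the DS environment"
    if input_combinations is not None and input_combinations <= max_table_rows:
        return "look-up table"
    return "exported ML pipeline (e.g. MLeap)"


print(choose_deployment(real_time=False))
print(choose_deployment(real_time=True, input_combinations=24))
print(choose_deployment(real_time=True, input_combinations=50_000 * 10_000))
```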
22. Thanks to my colleagues at Meredith!
Thank you!
Questions?
Editor's notes
Available Technologies/Solutions & Decision Factors