Presented by Mostafa Majidpour, Senior Data Scientist at Time Inc.
Reducing the gap between R&D and production is still a challenge for data science/machine learning engineering groups in many companies. Typically, data scientists develop data-driven models in a research-oriented programming environment (such as R or Python). Next, the data/machine learning engineers rewrite the code (typically in another programming language) in a way that is easy to integrate with production services.
This process has several disadvantages: 1) it is time-consuming; 2) it slows the impact of the data science team on the business; 3) code rewriting is prone to errors.
A possible solution to overcome these disadvantages is a deployment strategy that directly embeds or transforms the model created by data scientists. Formats and packages such as PMML, jPMML, PFA, and MLeap, among others, have been developed for this purpose.
In this talk, we review some of these packages, motivated by a project at Time Inc. The project involves the development of a near real-time recommender system, which includes a prediction engine paired with a set of business rules.
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
1. A Journey of Deploying a Data Science
Engine to Production
Mostafa Majidpour, Senior Data Scientist at Time Inc
December 14, 2017
Los Angeles
2. Motivating Example
Scenario:
● A user is browsing a website. We have access to the user’s cookie and/or past browsing behavior
Requirements:
● Involves Predictive Modeling
● Real-time / near real-time scoring
5. Deployment: To be or not to be?
● According to Rexer Data Science Survey:
○ 37% of surveyed data scientists reported their models are sometimes/rarely deployed.
○ 12% of surveyed data scientists reported their models are always deployed.
http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf
6. Approach 1: Look-up table
● No need for a complex scoring environment
● Pre-compute the scores for all possible inputs (or a subset of them)
● Store the scores in a look-up table
- Table size grows quickly with high-cardinality features (~50K zip codes x …)
- Scores are computed for many permutations that are never requested
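The look-up table idea can be sketched in a few lines of Python; the model and features below are hypothetical stand-ins, chosen only to show how the table size multiplies with feature cardinality:

```python
from itertools import product

# Hypothetical trained model: here just a deterministic stand-in scorer.
def model_score(zip_code, device):
    return hash((zip_code, device)) % 100 / 100.0

# Pre-compute scores for every combination of feature values.
zip_codes = [f"{z:05d}" for z in range(100)]   # real data: ~50K zip codes
devices = ["mobile", "desktop", "tablet"]

lookup = {
    (z, d): model_score(z, d)
    for z, d in product(zip_codes, devices)
}

# Serving is now a plain dictionary lookup: no model runtime needed.
def serve(zip_code, device):
    return lookup[(zip_code, device)]

# The table grows multiplicatively with feature cardinality:
print(len(lookup))  # 100 zips x 3 devices = 300 entries
```

With the real ~50K zip codes, this single feature already forces 150K rows per device, which is why the approach breaks down for high-cardinality inputs.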
7. Approach 2: Code re-write for deployment
- Time consuming
- Prone to errors
- Existence of comparable packages
- Slows the impact of data science team on the business!
8. Approach 3: Deployable Data Science outcome
What if the DS’s outcome (the ML pipeline) were readily deployable?
● DS develops with more familiar tools (e.g. python & R)
● DE/SWE does not need to rewrite the DS outcome (avoids code duplication)
ML pipeline includes <Pre-transformation + ML Algorithm + Post-transformation>
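The three-stage pipeline above can be pictured with a pure-Python sketch; all three stages here are made-up illustrations, not the actual Time Inc. pipeline:

```python
# A minimal sketch of <pre-transformation + ML algorithm + post-transformation>.
# Every function below is a hypothetical stand-in for illustration only.

def pre_transform(raw):
    # e.g. feature engineering: scaling and a simple binary encoding
    return {"age_scaled": raw["age"] / 100.0,
            "is_mobile": 1.0 if raw["device"] == "mobile" else 0.0}

def ml_model(features):
    # e.g. a fitted linear model (coefficients made up for illustration)
    return 0.3 * features["age_scaled"] + 0.5 * features["is_mobile"]

def post_transform(score):
    # e.g. business rules applied to the raw model output
    return "recommend" if score > 0.4 else "skip"

def pipeline(raw):
    return post_transform(ml_model(pre_transform(raw)))

print(pipeline({"age": 35, "device": "mobile"}))  # prints "recommend"
```

The point of a deployable pipeline format is that this whole composition, not just the middle `ml_model` step, ships to production as one unit.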
10. Decision Criteria
● Money
● Supported languages in pipeline creation and runtime
● Ability to score multiple data points simultaneously (Dataframe vs. Row)
● Support for pre and post transformations (ML pipeline vs. ML model)
● SparkML support
● Scoring Latency
● Active community
● Good documentation
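The dataframe-vs-row criterion can be illustrated with a toy scorer (the linear function is made up; real engines differ in how much per-call overhead a batch call amortizes):

```python
# Row-at-a-time vs. batch (dataframe-style) scoring interfaces.

def score_row(x):
    # One network/JVM round-trip per data point in a real serving engine.
    return 2.0 * x + 1.0

def score_batch(xs):
    # One call scores many points, amortizing per-call overhead
    # (serialization, process boundaries, etc.).
    return [2.0 * x + 1.0 for x in xs]

rows = [0.0, 1.0, 2.0]
print(score_batch(rows))  # [1.0, 3.0, 5.0]
```

Engines that only expose the `score_row` style (e.g. PMML scorers, as noted below) were a poor fit for our batch-friendly use case.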
12. PMML (Predictive Model Markup Language)
● Independent of programming language
- Not suitable for our use case: scores only one data point at a time
- We had a bunch of business rules that needed to be applied to the output of the ML model
- Only KMeans, LASSO, and SVM supported for Spark
● Used by IBM, FICO, and KNIME, among others
13. jPMML (Java PMML)
● Java libraries that convert models to PMML and score them in Java
● Model creation in Java, Python, or R; scoring environment in Java
● Covers many transformations/models from SparkML, sklearn, R, xgboost
- Scores only one data point at a time
- We had a bunch of business rules that needed to be applied to the output of the ML model
● Active community of users
14. PFA (Portable Format for Analytics)
● Pipeline definitions more expressive (and more complex) than PMML’s
● Mixture of transformations and ML models
- Almost no connection with Spark
- Small community
15. mllib-local (Spark)
- Not mature enough
- Almost zero documentation
- No consensus on its purpose
○ scikit-learn for Scala?
○ model serving tool?
16. H2O
● Ability to export ML engines as POJO or MOJO
● Could be integrated well with Java environment
● SparkML and H2O transformers can be mixed together
- Does not cover other elements of pipeline (pre and post transformations)
● Active community
17. Aloha
● Pipeline creation and scoring both in Scala
- No support for other languages
- “Academic Oriented” documentation + lack of enough examples
18. MLeap
● Creation: Python and Scala; scoring: Scala (integrates well with Java)
● Supports many transformations and ML models from SparkML, sklearn, TensorFlow, and xgboost
● Active community
● Fast (0.11ms vs. 22ms for Spark)
● Custom transformers
- Inconsistent documentation
https://www.drivenbycode.com/mleap-quickly-release-spark-ml-pipelines/
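The custom-transformer idea — each pipeline step carrying a serializable spec so the fitted pipeline can be exported and reloaded in another runtime, as MLeap does with its bundle format — can be sketched in pure Python. The classes and JSON layout below are illustrative inventions, not MLeap’s actual API:

```python
import json

# Illustrative sketch only: a transformer whose parameters serialize to a
# spec, so a fitted pipeline can be saved and rebuilt elsewhere.

class ScaleTransformer:
    def __init__(self, factor):
        self.factor = factor

    def transform(self, x):
        return x * self.factor

    def spec(self):
        return {"op": "scale", "factor": self.factor}

def save_pipeline(steps):
    # Serialize the whole pipeline, not just one model, as a JSON payload.
    return json.dumps([s.spec() for s in steps])

def load_pipeline(payload):
    # Registry mapping op names back to constructors.
    ops = {"scale": lambda spec: ScaleTransformer(spec["factor"])}
    return [ops[spec["op"]](spec) for spec in json.loads(payload)]

steps = [ScaleTransformer(2.0), ScaleTransformer(0.5)]
restored = load_pipeline(save_pipeline(steps))

x = 10.0
for step in restored:
    x = step.transform(x)
print(x)  # 10.0 * 2.0 * 0.5 = 10.0
```

The registry pattern is what lets a scoring runtime in another language reconstruct user-defined steps, which is the capability the “custom transformers” bullet refers to.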
19. Investigated Technologies
● PMML, jPMML — scoring one data point at a time
● PFA — most transformations do not exist
● H2O — no support for pre and post transformations
● Aloha — only works in Scala
● Embedded Spark — not fast enough
● MLeap — satisfies our main requirements
21. Use Case at Time Inc
● Recommend products to online users
● Legacy system: reduced-dimension lookup table with simple predictive models
● Proposed system with SparkML and MLeap: boosted the conversion rate by 7% in phase I and, with more features added, by 12% in phase II
22. Conclusion and Future
● MLeap worked for us!
● Not discussed because of cost: ScienceOps (yhat), Anaconda Enterprise, Databricks, NStack, Amazon SageMaker, …
● Open source possibility: dbml-local (Databricks)
23. Thanks to my colleagues at Time Inc!
Thank you!
Questions?
Editor's notes
Available Technologies/Solutions & Decision Factors