Presented by Mostafa Majidpour, Senior Data Scientist at Time Inc.
Reducing the gap between R&D and production is still a challenge for data science/machine learning engineering groups in many companies. Typically, data scientists develop data-driven models in a research-oriented programming environment (such as R or Python). Next, the data/machine learning engineers rewrite the code (typically in another programming language) in a way that is easy to integrate with production services.
This process has several disadvantages: 1) it is time-consuming; 2) it slows the impact of the data science team on the business; 3) code rewriting is prone to errors.
A possible solution to overcome these disadvantages is a deployment strategy that directly embeds or transforms the model created by data scientists. Formats and packages such as PMML, jPMML, PFA, and MLeap, among others, have been developed for this purpose.
In this talk, we review some of these packages, motivated by a project at Time Inc. The project involves the development of a near real-time recommender system, which includes a prediction engine paired with a set of business rules.
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
1. A Journey of Deploying a Data Science
Engine to Production
Mostafa Majidpour, Senior Data Scientist at Time Inc
December 14, 2017
Los Angeles
2. Motivating Example
Scenario:
● A user is browsing a website. We have access to the user’s cookie and/or past browsing behavior
Requirements:
● Involves Predictive Modeling
● Real-time / near real-time scoring
5. Deployment: To be or not to be?
● According to Rexer Data Science Survey:
○ 37% of surveyed data scientists reported their models are sometimes/rarely deployed.
○ 12% of surveyed data scientists reported their models are always deployed.
http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf
6. Approach 1: Look-up table
● No need for a complex scoring environment
● Pre-compute the scores for all possible inputs (or a subset of them)
● Store the scores in a look-up table
- Table size grows quickly with high-cardinality features (~50K zip codes x …)
- Scores are computed for many permutations that are never requested
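The look-up table idea can be sketched in a few lines of Python; the model and features below are hypothetical stand-ins, chosen only to show how the table size multiplies with feature cardinality:

```python
from itertools import product

# Hypothetical trained model: here just a deterministic stand-in scorer.
def model_score(zip_code, device):
    return hash((zip_code, device)) % 100 / 100.0

# Pre-compute scores for every combination of feature values.
zip_codes = [f"{z:05d}" for z in range(100)]   # real data: ~50K zip codes
devices = ["mobile", "desktop", "tablet"]

lookup = {
    (z, d): model_score(z, d)
    for z, d in product(zip_codes, devices)
}

# Serving is now a plain dictionary lookup: no model runtime needed.
def serve(zip_code, device):
    return lookup[(zip_code, device)]

# The table grows multiplicatively with feature cardinality:
print(len(lookup))  # 100 zips x 3 devices = 300 entries
```

With the real ~50K zip codes, this single feature already forces 150K rows per device, which is why the approach breaks down for high-cardinality inputs.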
7. Approach 2: Code re-write for deployment
- Time consuming
- Prone to errors
- Existence of comparable packages
- Slows the impact of data science team on the business!
8. Approach 3: Deployable Data Science outcome
What if the DS’s outcome (the ML pipeline) were readily deployable?
● DS develops with more familiar tools (e.g. python & R)
● DE/SWE does not need to rewrite the DS outcome (avoids code duplication)
ML pipeline includes <Pre-transformation + ML Algorithm + Post-transformation>
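The three-stage pipeline above can be pictured with a pure-Python sketch; all three stages here are made-up illustrations, not the actual Time Inc. pipeline:

```python
# A minimal sketch of <pre-transformation + ML algorithm + post-transformation>.
# Every function below is a hypothetical stand-in for illustration only.

def pre_transform(raw):
    # e.g. feature engineering: scaling and a simple binary encoding
    return {"age_scaled": raw["age"] / 100.0,
            "is_mobile": 1.0 if raw["device"] == "mobile" else 0.0}

def ml_model(features):
    # e.g. a fitted linear model (coefficients made up for illustration)
    return 0.3 * features["age_scaled"] + 0.5 * features["is_mobile"]

def post_transform(score):
    # e.g. business rules applied to the raw model output
    return "recommend" if score > 0.4 else "skip"

def pipeline(raw):
    return post_transform(ml_model(pre_transform(raw)))

print(pipeline({"age": 35, "device": "mobile"}))  # prints "recommend"
```

The point of a deployable pipeline format is that this whole composition, not just the middle `ml_model` step, ships to production as one unit.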
10. Decision Criteria
● Money
● Supported languages in pipeline creation and runtime
● Ability to score multiple data points simultaneously (Dataframe vs. Row)
● Support for pre and post transformations (ML pipeline vs. ML model)
● SparkML support
● Scoring Latency
● Active community
● Good documentation
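The dataframe-vs-row criterion can be illustrated with a toy scorer (the linear function is made up; real engines differ in how much per-call overhead a batch call amortizes):

```python
# Row-at-a-time vs. batch (dataframe-style) scoring interfaces.

def score_row(x):
    # One network/JVM round-trip per data point in a real serving engine.
    return 2.0 * x + 1.0

def score_batch(xs):
    # One call scores many points, amortizing per-call overhead
    # (serialization, process boundaries, etc.).
    return [2.0 * x + 1.0 for x in xs]

rows = [0.0, 1.0, 2.0]
print(score_batch(rows))  # [1.0, 3.0, 5.0]
```

Engines that only expose the `score_row` style (e.g. PMML scorers, as noted below) were a poor fit for our batch-friendly use case.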
12. PMML (Predictive Model Markup Language)
● Independent of programming language
- Not suitable for our use case: scores only one data point at a time
- We had a bunch of business rules that needed to be applied to the output of the ML model
- Only KMeans, LASSO, and SVM supported for Spark
● Used by IBM, FICO, and KNIME, among others
13. jPMML (Java PMML)
● Java libraries that convert models to PMML and score them in Java
● Model creation in Java, Python, or R; scoring environment in Java
● Covers many transformations/models from SparkML, sklearn, R, xgboost
- Scores only one data point at a time
- We had a bunch of business rules that needed to be applied to the output of the ML model
● Active community of users
14. PFA (Portable Format for Analytics)
● Pipeline definitions more expressive (and more complex) than PMML’s
● Mixture of transformations and ML models
- Almost no connection with Spark
- Small community
15. mllib-local (Spark)
- Not mature enough
- Almost zero documentation
- No consensus on its purpose
○ scikit-learn for Scala?
○ model serving tool?
16. H2O
● Ability to export ML engines as POJO or MOJO
● Could be integrated well with Java environment
● SparkML and H2O transformers can be mixed together
- Does not cover other elements of pipeline (pre and post transformations)
● Active community
17. Aloha
● Pipeline creation and scoring both in Scala
- No support for other languages
- “Academic Oriented” documentation + lack of enough examples
18. MLeap
● Creation: Python and Scala; scoring: Scala (integrates well with Java)
● Supports many transformations and ML models from SparkML, sklearn, TensorFlow, and xgboost
● Active community
● Fast (0.11ms vs. 22ms for Spark)
● Custom transformers
- Inconsistent documentation
https://www.drivenbycode.com/mleap-quickly-release-spark-ml-pipelines/
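The custom-transformer idea — each pipeline step carrying a serializable spec so the fitted pipeline can be exported and reloaded in another runtime, as MLeap does with its bundle format — can be sketched in pure Python. The classes and JSON layout below are illustrative inventions, not MLeap’s actual API:

```python
import json

# Illustrative sketch only: a transformer whose parameters serialize to a
# spec, so a fitted pipeline can be saved and rebuilt elsewhere.

class ScaleTransformer:
    def __init__(self, factor):
        self.factor = factor

    def transform(self, x):
        return x * self.factor

    def spec(self):
        return {"op": "scale", "factor": self.factor}

def save_pipeline(steps):
    # Serialize the whole pipeline, not just one model, as a JSON payload.
    return json.dumps([s.spec() for s in steps])

def load_pipeline(payload):
    # Registry mapping op names back to constructors.
    ops = {"scale": lambda spec: ScaleTransformer(spec["factor"])}
    return [ops[spec["op"]](spec) for spec in json.loads(payload)]

steps = [ScaleTransformer(2.0), ScaleTransformer(0.5)]
restored = load_pipeline(save_pipeline(steps))

x = 10.0
for step in restored:
    x = step.transform(x)
print(x)  # 10.0 * 2.0 * 0.5 = 10.0
```

The registry pattern is what lets a scoring runtime in another language reconstruct user-defined steps, which is the capability the “custom transformers” bullet refers to.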
19. Investigated Technologies
● PMML, jPMML — scoring one data point at a time
● PFA — most transformations do not exist
● H2O — no support for pre and post transformations
● Aloha — only works in Scala
● Embedded Spark — not fast enough
● MLeap — satisfies our main requirements
21. Use Case at Time Inc
● Recommend products to online users
● Legacy system: reduced-dimension lookup table with simple predictive models
● Proposed system with SparkML and MLeap: boosted the conversion rate by 7% in phase I and, with more features added, by 12% in phase II
22. Conclusion and Future
● MLeap worked for us!
● Not discussed because of cost: ScienceOps (yhat), Anaconda Enterprise, Databricks, NStack, Amazon SageMaker, …
● Open source possibility: dbml-local (Databricks)
23. Thanks to my colleagues at Time Inc!
Thank you!
Questions?
Editor's notes
Available Technologies/Solutions & Decision Factors