1. Bucharest Big Data Meetup
Tech talks, use cases, all big data related topics
June 5th meetup
6:30 PM - 7:00 PM: Getting together
7:00 PM - 7:40 PM: Productionizing Machine Learning, Cosmin Pintoiu & Costina Batica @ Lentiq
7:40 PM - 8:15 PM: Technology showdown: database vs blockchain, Felix Crisan @ Blockchain Romania
8:15 PM - 8:45 PM: Pizza and drinks sponsored by Netopia
Sponsored by
Organizer: Valentina Crisan
3. Agenda
1. Machine Learning made easy
2. From prototype to production (motivation for workflow and RCB)
3. Reusable code blocks (implementation)
4. Workflow Manager
5. Model Server
6. Demo time
7. Roadmap
8. Conclusions and Q&A
6. The truth is
• Productionizing ML is hard.
• Only 5% of machine learning models / data science projects will be used in a production environment.
• Most of the models developed will be “deployed” on PowerPoint slides.
• If we want to be part of that 5%, we have to understand the problem of ML deployment and how to solve it.
7. Challenges of making the DS process successful
Small to medium companies
• Lack of resources
• Lack of skills and knowledge
• Difficult to set up their own environment
• Need additional developers for moving models into production
• Need DevOps for maintaining the environment
• Struggle to define the most valuable business problem
Enterprises
• Over-centralized -> smaller and localized teams cannot be agile
• Lack of collaboration
• Difficult to integrate new technologies into the enterprise stack
• Lack of visibility
• Difficult to scale and put models in production
• Centralized data ownership
8. A complete cycle
• Gathering data
• Preparing that data
• Choosing a model
• Training
• Evaluation
• Hyperparameter tuning
• Prediction
We now have a model. What’s next?!
• Serialize
• Deploy
• Serving / scoring
• Scaling
• Update
• Monitor
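The first half of the cycle (train, evaluate, predict) can be sketched in plain Python with a toy one-variable linear regression; this is an illustrative stand-in, not the framework code used in the talk:

```python
def train(xs, ys):
    """Fit y = a*x + b by ordinary least squares on a toy dataset,
    standing in for the 'choosing a model / training' steps."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def evaluate(model, xs, ys):
    """Mean squared error on held-out data (the 'evaluation' step)."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def predict(model, x):
    """The 'prediction' step: apply the trained model to new input."""
    a, b = model
    return a * x + b
```

The remaining steps (serialize, deploy, serve, scale) are what the rest of the talk is about.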
9. [Diagram-only slide]
10. From prototype to production
[Diagram: a Spark ML feature pipeline. Continuous features pass through a Vector Assembler and a Scaler to produce a scaled continuous feature vector; categorical features pass through a String Indexer and a One-Hot Encoder to produce a categorical feature vector; a final Vector Assembler combines both into the final feature vector fed to a Linear Regression model. The trained model is persisted to object storage, loaded by multiple Model Server replicas behind a Load Balancer, and queried via an Inference API.]
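The feature-pipeline stages in the diagram above can be sketched in plain Python; this is a toy stand-in for Spark ML's StringIndexer, OneHotEncoder, StandardScaler and VectorAssembler, not the actual implementation:

```python
def string_indexer(values):
    """Map each distinct category to an integer index (like Spark's StringIndexer)."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index

def one_hot(index, value):
    """One-hot encode a single category (like Spark's OneHotEncoder)."""
    vec = [0.0] * len(index)
    vec[index[value]] = 1.0
    return vec

def scale(xs, means, stds):
    """Standard-scale continuous features (like Spark's StandardScaler)."""
    return [(x - m) / s for x, m, s in zip(xs, means, stds)]

def assemble(*parts):
    """Concatenate feature vectors into one (like Spark's VectorAssembler)."""
    return [x for part in parts for x in part]
```

The final assembled vector is what the Linear Regression model consumes.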
11. From prototype to production
[Same pipeline diagram as slide 10, now showing the end-user application sending requests to the Inference API and receiving predictions back; the Model Server and the RCB & Workflow Manager are called out as the components covered next.]
13. What is a Reusable Code Block?
You can think of a Reusable Code Block as a template for creating tasks. Users typically package frequent tasks such as cleaning data, anonymizing data, training models, etc.
Reusable Code Blocks are shared with the entire data lake and are stored in Lentiq's global registry, meaning that any user with access to it can reuse the code block to perform similar tasks.
There are two possible sources for an RCB:
• Custom Docker image: the image needs to be uploaded to a public repository
• A Jupyter Notebook (using Kaniko): one can choose the notebook from a list of available published notebooks, which are shared with all the users with access to the notebook’s data lake. This makes cooperation and sharing knowledge between departments easy.
15. Dockerless containers: RCB and Kaniko
Kaniko is an open source tool created by Google for building container images from a Dockerfile and pushing them to a remote registry, without having root access to a Docker daemon.
Kaniko enables building container images in environments that cannot easily or securely run a Docker daemon, like a Kubernetes cluster or a container.
It executes each command within a Dockerfile completely in user-space, so the build does not require privileges. Privileged mode should be avoided at all costs to ensure a secure environment.
16. How does it work?
Kaniko builds as a root user within a container in an unprivileged environment. The Kaniko executor then fetches and extracts the base-image file system to root (the base image is the image in the FROM line of the Dockerfile).
It executes each command in order, and takes a snapshot of the file system after each command. This snapshot is created in user-space by walking the filesystem and comparing it to the prior state that was stored in memory.
It appends any modifications to the filesystem as a new layer to the base image, and makes any relevant changes to image metadata. After executing every command in the Dockerfile, the executor pushes the newly built image to the desired registry.
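The user-space snapshotting described above can be sketched in a few lines of Python: hash every file, then diff against the previous state to get the new layer. This is a simplified illustration of the idea, not Kaniko's actual implementation (which also tracks metadata, ownership and deletions):

```python
import hashlib
import os

def snapshot(root):
    """Walk the filesystem under `root` and record a digest per file,
    mimicking how Kaniko snapshots the filesystem in user-space."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            state[os.path.relpath(path, root)] = digest
    return state

def layer_diff(before, after):
    """Files added or changed since the previous snapshot form the new layer."""
    return {p: h for p, h in after.items() if before.get(p) != h}
```

After each Dockerfile command, only the `layer_diff` output needs to be appended as a new image layer.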
18. Running Kaniko in a Kubernetes cluster
Kaniko is run as a container in the cluster. The Job spec needs three arguments:
• --dockerfile: the path to the Dockerfile within the build context
• --context: the build context. This can be:
  - a GitHub repository (cloned using an init container)
  - a place Kaniko has access to, like a GCS or S3 storage bucket (compressed tar file)
  - a local directory (specified with an emptyDir volume)
• --destination: the repository where Kaniko pushes the image (any registry supported by Docker credential helpers)
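Putting the three arguments together, a minimal Kaniko Job spec might look like the following sketch; the repository placeholders and the secret name `regcred` are assumptions for illustration, not Lentiq's actual configuration:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: kaniko-build
spec:
  template:
    spec:
      containers:
      - name: kaniko
        image: gcr.io/kaniko-project/executor:latest
        args:
        - "--dockerfile=Dockerfile"
        - "--context=git://github.com/<user>/<repo>.git"
        - "--destination=<registry>/<repo>:<tag>"
        volumeMounts:
        # Registry credentials, read by Kaniko from /kaniko/.docker/config.json
        - name: docker-config
          mountPath: /kaniko/.docker
      restartPolicy: Never
      volumes:
      - name: docker-config
        secret:
          secretName: regcred
          items:
          - key: .dockerconfigjson
            path: config.json
```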
19. Building inside a Kubernetes cluster
Besides the Kubernetes Job definition, we have some additional requirements:
• A Kubernetes cluster …
• A Kubernetes secret mounted as a data volume under /kaniko/.docker/config.json, containing the registry credentials required for pushing the final image (for example, your Docker Hub credentials when pushing to a Docker Hub repository)
• A ConfigMap to store the Jupyter Notebook
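The registry-credentials secret can be created with kubectl's built-in `docker-registry` secret type; a hedged example, where the secret name `regcred` and the credential values are placeholders:

```shell
# Create a registry-credentials secret; Kaniko reads it as
# /kaniko/.docker/config.json when pushing the final image.
kubectl create secret docker-registry regcred \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email>
```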
20. From prototype to production
[Repeat of the slide 11 diagram: the end-to-end pipeline from end-user request to prediction, with the Model Server and the RCB & Workflow Manager highlighted.]
21. Why do we need workflows?!
How it used to be:
• Train a model
• Run a pipeline using scripts
• Set manual triggers
• Wait for jobs, ETL
• Monitor
What happened:
• The need for more iterations
• More experimentation
• More work for ops
• Tedious and repetitive tasks
• Reduced productivity
We needed a tool to automate, schedule, and share machine learning pipelines.
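At its core, such a tool runs pipeline steps in dependency order. A minimal sketch of that scheduling logic, using Python's standard-library topological sorter (an illustration of the concept, not Lentiq's Workflow Manager):

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_workflow(tasks, dependencies):
    """Run tasks in dependency order, the way a workflow manager
    schedules ML pipeline steps (clean -> train -> evaluate ...).

    tasks: dict mapping task name -> callable
    dependencies: dict mapping task name -> set of prerequisite task names
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = tasks[name]()
    return results
```

A real workflow engine adds what the plain scripts lacked: scheduled triggers, retries, parallel execution of independent branches, and monitoring.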
23. Model serialization / persistence
• Develop: create a model in any framework (scikit-learn, SparkML, Tensorflow, etc.)
• Serialize: save it in a format that can be stored and transmitted over the network
• Serving: use the model for online / batch inference (prediction or scoring)
[Diagram: models (regression, clustering, random forest, k-means, XGBoost, neural networks) built with SparkMLlib, scikit-learn or Tensorflow are serialized to object storage and executed by one runtime across multiple Model Server replicas behind a Load Balancer; the end-user application sends requests to the Inference API and receives predictions.]
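The develop / serialize / serve flow can be illustrated with a toy model whose state round-trips through JSON; in the real pipeline, MLeap plays the role that `serialize`/`deserialize` play here, and the blob lands in object storage rather than memory:

```python
import json

class LinearModel:
    """A toy linear model standing in for a trained ML pipeline."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias

def serialize(model):
    # Turn the model into bytes that can be stored in object storage
    # and transmitted over the network.
    return json.dumps({"weights": model.weights, "bias": model.bias}).encode()

def deserialize(blob):
    # A model server loads the blob and reconstructs the model for scoring.
    state = json.loads(blob.decode())
    return LinearModel(state["weights"], state["bias"])
```

The key property is that the serving side needs only the blob, not the training code or data.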
24. Why MLeap?
• MLeap is a common serialization format and execution engine for machine learning pipelines.
• Minimizes the effort to serve models within a production environment
• MLeap provides simple interfaces to execute entire ML pipelines, from feature transformers to classifiers, regressions, clustering algorithms, and neural networks
25. What else is there?
• PMML (XML): Predictive Model Markup Language
• ONNX (protobuf, DL, tensors): Open Neural Network Exchange
• NNEF (DL, tensors): Neural Network Exchange Format
• PFA (JSON): Portable Format for Analytics
[Graphic from https://www.slideshare.net/JenAman/mleap-productionize-data-science-workflows-using-spark: compares hard-coded models (SQL, Java, Ruby), PMML, emerging solutions (yHat, DataRobot) and enterprise solutions (Microsoft, IBM, SAS) along axes such as quick to implement, open sourced, committed to Spark/Hadoop, API server, and infrastructure.]
28. Roadmap: Auto-scaling ML serving
• One of the hardest problems to solve is scaling in a cost-effective way
• We might have hundreds of API calls for predictions at 1:00 PM but 100,000 calls at 7:00 PM (the time most users use the app)
• We need:
  - a target metric
  - min-max capacity
  - a cool-down period
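Target-metric autoscaling can be sketched as follows, in the style of the Kubernetes Horizontal Pod Autoscaler's scaling rule (an illustrative formula, not Lentiq's implementation; the cool-down period would simply rate-limit how often this is applied):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Scale the replica count by the ratio of the observed metric
    (e.g. requests per second per replica) to the target, then clamp
    to the configured min-max capacity."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

So the 1:00 PM trough scales the model servers down toward the minimum, and the 7:00 PM spike scales them up toward the maximum.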
30. Roadmap: Hyperparameter tuning
• Multiple trials, run in parallel or sequentially
• Hyperparameter tuning using ParamGrid
• Bayesian methods: ideally, learn from previous runs to guide future experiments
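Grid-based tuning amounts to expanding the cross-product of parameter values and evaluating each combination; a minimal sketch of what Spark ML's ParamGridBuilder automates (Bayesian methods would replace the exhaustive loop with a model that proposes the next trial):

```python
from itertools import product

def param_grid(grid):
    """Expand a dict of candidate-value lists into every combination."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def grid_search(grid, evaluate):
    """Run one trial per combination and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for params in param_grid(grid):
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The independent trials are exactly what a workflow engine can fan out in parallel.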
31. Roadmap: Distributed training (Tensorflow & SparkML)
• It can take a very long time to train
• 1,000s of CPUs, GPUs, TPUs
• HorovodRunner: distributed deep learning
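The core step of synchronous data-parallel training, which Horovod's ring all-reduce computes efficiently across workers, is just averaging per-worker gradients; a toy sketch of that step (the communication pattern and frameworks are abstracted away):

```python
def allreduce_average(worker_gradients):
    """Average the gradient vectors computed by each worker on its data
    shard; every worker then applies the same averaged update, keeping
    model replicas in sync."""
    n = len(worker_gradients)
    dims = len(worker_gradients[0])
    return [sum(g[i] for g in worker_gradients) / n for i in range(dims)]
```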
32. Conclusions
• Jupyter notebooks / code can be encapsulated inside Docker containers to be shared and reused
• A workflow engine automates and schedules machine learning pipelines
• Machine learning models are queried via REST APIs
• Scalable model serving / inference using the Model Server
35. References:
• https://github.com/mlflow/mlflow-example/
• https://towardsdatascience.com/the-7-steps-of-machine-learning-2877d7e5548e
• https://www.anaconda.com/productionizing-and-deploying-data-science-projects/
• https://events.linuxfoundation.org/wp-content/uploads/2017/12/Productionizing-ML-Pipelines-with-the-Portable-Format-for-Analytics-Nick-Pentreath-IBM.pdf
• https://hackernoon.com/a-guide-to-scaling-machine-learning-models-in-production-aa8831163846
• https://blog.algorithmia.com/deploying-machine-learning-at-scale/
• https://medium.freecodecamp.org/a-beginners-guide-to-training-and-deploying-machine-learning-models-using-python-48a313502e5a
• https://towardsdatascience.com/how-to-train-your-neural-networks-in-parallel-with-keras-and-apache-spark-ea8a3f48cae6
• https://hydrosphere.io/serving-docs/latest/components/runtimes.html