This document discusses machine learning on Kubernetes and Google Cloud Platform. It provides an overview of machine learning, discusses machine learning in production environments, and describes how to use Google Cloud Platform and Google Kubernetes Engine (GKE) for machine learning workloads. It also introduces Kubeflow, an open source project for deploying and managing machine learning pipelines on Kubernetes. The document outlines best practices for designing machine learning systems on Kubernetes and provides additional resources.
2. Agenda
● Machine Learning Overview
● Machine Learning in Production
● Machine Learning on Google Cloud Platform (GCP)
● Kubernetes Overview
● Google Kubernetes Engine (GKE) Overview
● Kubeflow
● Design & Best Practices
3. About Me
● Google Developer Expert (on Google Cloud Platform Category)
● 11 years of experience in IT Industry
● Worked with various clients like Sabre/Citi Bank/Goldman Sachs/L&T
Infotech etc.
● Currently working as an Independent Consultant (Technical
Adviser/Architect) & Tech Evangelist
4. What this Talk is (about/not about)
● About:
○ ML System Understanding
○ ML & Kubernetes Integration / Design
● Not About:
○ ML Code Syntax/Structure
○ ML Algorithms
5. Machine Learning Overview
● Teaching Computers to recognize patterns in the same way as our brains
do
● Model Building ---> Model Training ---> Model Serving
6. Machine Learning Overview
● Machine Learning Lifecycle:
○ Build Machine Learning Model:
■ Write Machine Learning Code in any supported framework e.g. TensorFlow, scikit-learn,
XGBoost, PyTorch
○ Input Data:
■ You divide Input Data into Training & Testing Data
■ At Inference/Serving time you pass Inference Input Data
■ Data may or may not have labels
○ Train the Model with Input Data:
■ Training generates Model (some kind of Graph e.g. TensorFlow Graph/DAG)
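The lifecycle above (split the input data, train, then check the model on held-out data) can be sketched in a few lines. This is only an illustration in plain Python, not framework code: the toy dataset, the one-variable linear model, and the helper names `train_test_split`, `train` and `predict` are assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(rows):
    """Fit y = w * x + b by ordinary least squares; the "model" is (w, b)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    w = cov_xy / var_x
    return w, mean_y - w * mean_x


def predict(model, x):
    """Apply the trained model to one input value."""
    w, b = model
    return w * x + b


# Labeled toy data (hypothetical): y = 2x + 1
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
train_rows, test_rows = train_test_split(data)
model = train(train_rows)

# Evaluate on the testing data the model never saw during training.
max_error = max(abs(predict(model, x) - y) for x, y in test_rows)
```

In a real project the `train` step would be replaced by a framework such as TensorFlow or scikit-learn, and the resulting model would be saved for serving.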
7. Machine Learning Overview
● Machine Learning Lifecycle:
○ Serve/Inference:
■ You can take the model & serve it as a REST API endpoint
○ Predictions:
■ You use these REST API endpoints for Online/Batch Prediction (Confidence Value)
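Serving a model behind a REST endpoint that returns a confidence value might look roughly like the sketch below. It uses only the Python standard library; the `/predict` path, the `instances` / `predictions` JSON fields, port 8080 and the keyword-weight "model" are all assumptions for illustration, not the API of any particular serving system.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: keyword weights for a "spam" label.
WEIGHTS = {"free": 0.4, "winner": 0.3, "offer": 0.2}


def predict(text):
    """Return a label and a confidence value in [0, 1] for one input."""
    lowered = text.lower()
    # Substring match keeps the example short; a real model would do far more.
    score = sum(w for word, w in WEIGHTS.items() if word in lowered)
    spam_probability = min(score, 1.0)
    if spam_probability >= 0.5:
        return {"label": "spam", "confidence": spam_probability}
    return {"label": "not_spam", "confidence": 1.0 - spam_probability}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            # One instance is an online prediction; many instances are a batch.
            results = [predict(str(item)) for item in payload["instances"]]
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected a JSON body with an 'instances' list")
            return
        body = json.dumps({"predictions": results}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the server; blocks until interrupted."""
    HTTPServer(("", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint; a client would then POST a body such as `{"instances": ["free offer"]}` to `/predict`.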
9. At what Stage are you with ML today?
● Experimenting / Learning
● Building Proof of Concepts (POCs) / Prototyping
● Designing (Deployment/Workflows/Scaling/Management) for Production
10. Machine Learning In Production
● A few extra things to take care of:
○ Collaborative Environment with folks in different roles e.g. Data Scientists / Platform
Engineers / DevOps / Researchers
○ Production ML Applications are designed to run 24/7/365
○ Input Data (Training/Testing & Inference) flows in continuously - Streaming/Batch
○ You can use different kinds of frameworks for building ML models e.g. TensorFlow, scikit-learn,
XGBoost, PyTorch etc.
○ These models are constantly updated, improved upon & deployed
○ Repetitive ML Tasks like Feature Engineering, Hyperparameter Tuning, Data Cleansing &
Validations
11. Machine Learning In Production
● A few extra things to take care of:
○ Config Separation across different environments
○ RBAC (Role Based Access Control)
○ Different Deployment/Hosting Options : Cloud (e.g. GCP) or Private Data Centers/Cloud
(e.g. VMware Based)
○ Different Hardware/Accelerators for Compute intensive workloads e.g. GPUs/TPUs
○ Scaling Requirements:
■ Distributed Processing (Training or Serving)
■ Accelerator Sharing (e.g. one Model running on multiple GPUs/TPUs, or one
GPU running multiple Models)
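One common form of distributed training is data parallelism: each worker computes a gradient on its own shard of the data, and the gradients are averaged before the model is updated. The sketch below simulates that in a single process with plain Python; the toy dataset, the one-parameter model and the function names are assumptions for illustration, and a real setup would run each worker on its own GPU/TPU or Pod.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def split_into_shards(rows, num_workers):
    """Round-robin split; assumes len(rows) is a multiple of num_workers."""
    return [rows[i::num_workers] for i in range(num_workers)]


def data_parallel_step(w, rows, num_workers, learning_rate):
    """One training step: local gradients per worker, then an average."""
    shards = split_into_shards(rows, num_workers)
    # Each entry would be computed by a separate worker in a real cluster.
    local_grads = [gradient(w, shard) for shard in shards]
    # The averaging step plays the role of an all-reduce across workers.
    avg_grad = sum(local_grads) / num_workers
    return w - learning_rate * avg_grad


# Toy data (hypothetical): y = 3x, 8 rows shared by 4 simulated workers.
rows = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, rows, num_workers=4, learning_rate=0.01)
```

With equal-sized shards, the averaged gradient equals the gradient over the full batch, so the result matches single-worker training while the work is spread out.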
12. Machine Learning on GCP
● 3 ways:
○ ML as an API (Cloud Vision API, Cloud Video Intelligence API, Cloud Speech API, Cloud
Natural Language API, Cloud Translation API)
○ AutoML
○ Custom Models
■ With Cloud ML Engine
■ With Kubernetes / GKE / Kubeflow etc.
13. Kubernetes Overview
● Kubernetes is an Open Source system for Container Orchestration
(Deployment/Management/Scaling)
● Features:
○ Scheduling
○ Self Healing / Auto Repairing
○ Scaling (Manual / Auto Scaling / Scaling Out / Scaling In)
○ ...
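To make the Auto Scaling feature concrete: the Kubernetes Horizontal Pod Autoscaler documentation describes its core rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with a small tolerance (0.1 by default) inside which no scaling happens. The sketch below reproduces only that rule; the function name and the `min_replicas` / `max_replicas` defaults are assumptions for the example, and the real controller applies additional logic (readiness, stabilization windows) not shown here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified version of the Horizontal Pod Autoscaler scaling rule."""
    ratio = current_metric / target_metric
    # Close enough to the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (scale out / scale in limits).
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 90% average CPU against a 60% target: scale out.
scale_out = desired_replicas(4, current_metric=90, target_metric=60)
# 4 replicas at 30% average CPU against a 60% target: scale in.
scale_in = desired_replicas(4, current_metric=30, target_metric=60)
```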
14. Google Kubernetes Engine (GKE) Overview
● Managed Service for Kubernetes on Google Cloud (focused on
Deployment/Management/Scaling)
● Provides a Reliable, Efficient & Secure way to run Kubernetes Clusters (on
GCP)
● GKE On-Prem
15. Google Kubernetes Engine (GKE) Overview
● Features:
○ Fully Managed
○ Auto Scaling / Auto Upgrade / Auto Repair
○ Integration : IAM / Stackdriver / VPC
○ Security, Compliance, Runs on Container-Optimized OS (COS)
○ Accelerators Support : GPUs/TPUs
○ Various Cluster Topologies : Zonal Clusters / Regional Clusters
○ Workload Portability : On-Premises / Cloud
○ ...
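As an example of the accelerator support listed above, a Pod asks for GPUs through the `nvidia.com/gpu` resource limit, and on GKE a node selector on the `cloud.google.com/gke-accelerator` label picks the GPU type. The sketch below builds such a manifest as a Python dict and prints it as JSON; the Pod name, container name, image path and the `nvidia-tesla-t4` default are placeholders assumed for the example.

```python
import json


def gpu_pod_manifest(name, image, gpu_count=1, accelerator="nvidia-tesla-t4"):
    """Build a Pod manifest that requests GPUs on a GKE GPU node pool."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # GKE labels GPU nodes with the accelerator type.
            "nodeSelector": {"cloud.google.com/gke-accelerator": accelerator},
            "containers": [{
                "name": "trainer",  # hypothetical container name
                "image": image,
                # GPUs are requested under limits, in whole units.
                "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
            }],
        },
    }


# Hypothetical image path; replace with a real training image.
manifest = gpu_pod_manifest("train-job", "gcr.io/my-project/trainer:latest")
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON as well as YAML.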
16. Kubeflow:
● Focused on Deployment of ML Workflows on Kubernetes (Simple,
Portable & Scalable)
● Goal: support deployment of Best-of-breed Open Source Systems for
ML on diverse Infrastructure
● Anywhere you are running Kubernetes, you can run Kubeflow
17. Kubeflow:
● Features:
○ Pipelines: for deploying & managing End to End ML Workflows.
○ Integration:
■ Jupyter Notebooks
■ TensorFlow Model Training Controller
■ Seldon Core : for Model Serving
○ Multi-Framework Support: TensorFlow, PyTorch, Apache MXNet
○ Share/Reuse using AI Hub
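The idea behind Pipelines, an end-to-end ML workflow expressed as steps with dependencies, can be illustrated without any Kubeflow code. The sketch below is plain Python (3.9+, for `graphlib`) and does not use the Kubeflow Pipelines SDK; the step names, the shared `context` dict and the toy data are assumptions for the example. In Kubeflow each step would typically run in its own container rather than as a local function.

```python
from graphlib import TopologicalSorter


def ingest(ctx):
    ctx["rows"] = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data


def validate(ctx):
    ctx["rows"] = [(x, y) for x, y in ctx["rows"] if x > 0]


def train(ctx):
    rows = ctx["rows"]
    # Least-squares fit of y = w * x (line through the origin).
    ctx["w"] = sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)


def deploy(ctx):
    ctx["endpoint"] = {"path": "/predict", "w": ctx["w"]}


def run_pipeline(steps, dependencies):
    """Run every step after the steps it depends on; return order and state."""
    context = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        steps[name](context)
    return order, context


steps = {"ingest": ingest, "validate": validate, "train": train, "deploy": deploy}
# Each key lists the steps that must finish before it can start.
dependencies = {"validate": {"ingest"}, "train": {"validate"}, "deploy": {"train"}}
order, context = run_pipeline(steps, dependencies)
```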
18. Design & Best Practices:
● Separate out Compute & Storage
● Scaling & Self Healing Capabilities
● Cloud & GKE Topologies
● Docker Best Practices
● Kubernetes Best Practices
● ML Framework Best Practices
20. Google Cloud Platform - Resources
● Google Cloud Platform 101 (Cloud Next ’19):
https://www.youtube.com/watch?v=vmOMataJZWw
● Google Cloud Developer Cheat Sheet:
https://raw.githubusercontent.com/gregsramblings/google-cloud-4-words/master/Poster-medres.png
● 100+ announcements from Google Cloud Next ’19:
https://cloud.google.com/blog/topics/inside-google-cloud/100-plus-announcements-from-google-cloud-next19
21. Google Cloud Platform - Resources
● Google Cloud Next ’19 Sessions:
https://www.youtube.com/playlist?list=PLIivdWyY5sqIXvUGVrFuZibCUdKVzEoUw
● GCP Certification Resources:
https://github.com/ddneves/awesome-gcp-certifications