Michelangelo
Jeremy Hermann, Machine Learning Platform @ Uber
MISSION
Enable engineers and data scientists across the
company to easily build and deploy machine learning
solutions at scale.
AGENDA
○ ML at Uber
○ Why Build an ML Platform?
○ Key Platform Components
○ System Architecture
○ ML as Software Engineering
○ What’s Next?
ML at Uber
○ Uber Eats
○ ETAs
○ Autonomous Cars
○ Customer Support
○ Dispatch
○ Personalization
○ Demand Modeling
○ Dynamic Pricing
○ Forecasting
○ Maps
○ Fraud
○ Destination Predictions
○ Anomaly Detection
○ Capacity Planning
○ And many more...
ML at Uber - ETAs
○ ETAs are core to customer experience and used
by myriad internal systems
○ ETAs are generated by a route-based algorithm
called Garafu
○ Garafu is often incorrect - but it’s incorrect in
predictable ways
○ ML model predicts the Garafu error
○ Use the predicted error to correct the ETA
○ ETAs now dramatically more accurate
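The correction step described above can be sketched as follows. This is a minimal illustration, assuming the model predicts an additive error in seconds (actual minus route-based estimate). The names `corrected_eta`, `predict_error`, and `toy_error_model` are hypothetical and do not come from the talk.

```python
from typing import Callable, Mapping


def corrected_eta(base_eta_s: float,
                  features: Mapping[str, float],
                  predict_error: Callable[[Mapping[str, float]], float]) -> float:
    """Adjust a route-based ETA by the error an ML model predicts for it."""
    # Assumption: error = actual - base estimate, so the correction is additive.
    predicted_error_s = predict_error(features)
    # An ETA cannot be negative, so clamp at zero.
    return max(0.0, base_eta_s + predicted_error_s)


def toy_error_model(features: Mapping[str, float]) -> float:
    """Hypothetical stand-in for a trained regression model."""
    return 60.0 if features.get("rush_hour", 0.0) > 0.5 else -15.0


if __name__ == "__main__":
    print(corrected_eta(600.0, {"rush_hour": 1.0}, toy_error_model))
```

Learning the residual rather than the ETA itself lets the route-based estimate carry most of the signal, while the model only has to capture the systematic part of the error.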
ML at Uber - Eats
○ Models used for
○ Ranking of restaurants and
dishes
○ Delivery times
○ Search ranking
○ 100s of ML models called to render
Eats homepage
ML at Uber - Autonomous Cars
ML at Uber - Dispatch
○ Optimize matching of rider and
driver
○ Predict if open rider app will
make trip request
ML at Uber - Map Making
ML at Uber - Destination Prediction
ML at Uber - Spatiotemporal Forecasting
Supply
○ Available Drivers
Demand
○ Open Apps
Other
○ Request Times
○ Arrival Times
○ Airport Demand
ML at Uber - Customer Support
○ 5 customer-agent communication
channels
○ Hundreds of thousands of tickets
surfacing daily on the platform across
400+ cities
○ NLP models classify tickets and
suggest response templates
○ Reduce ticket resolution time by 10%+
with same or higher CSAT
Why build an ML platform?
Early challenges with machine learning
○ Limited scale with Python and R
○ Pipelines not reliable or reproducible
○ Many one-off production systems for serving
Goals of platform
○ Standardize workflows and tools
○ Provide scalable support for end-to-end ML workflow
○ Democratize and accelerate machine learning through ease of use
Motivation for Platform
Same basic ML workflow & system requirements for
○ Traditional ML & deep learning
○ Supervised, unsupervised, & semi-supervised
learning
○ Online or continuous learning
○ Batch, online, & mobile deployments
○ Time-series forecasting
Machine Learning Workflow
MANAGE DATA
TRAIN MODELS
EVALUATE MODELS
DEPLOY MODELS
MAKE PREDICTIONS
MONITOR PREDICTIONS
Key Platform Components
Key Components:
Feature Store & Feature Engineering
Problem
○ Hardest part of ML is finding good features
○ Same features are often used by different models built by different teams
Solution
○ Centralized feature store for collecting and sharing features
○ Platform team curates core set of widely applicable features
○ Modellers contribute more features as part of ongoing model building
○ Metadata for each feature to track ownership, how computed, where used, etc.
○ Modellers select features by name & join key. Offline & online pipelines
auto-configured
Feature Store (aka Palette)
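A shared catalog that tracks per-feature metadata and lets modellers select features by name and join key might look roughly like the sketch below. It is a toy in-memory illustration only. `FeatureMeta`, `FeatureRegistry`, and all field names are assumptions made for the example, not Palette's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FeatureMeta:
    """Metadata tracked per feature (field names are illustrative)."""
    name: str
    join_key: str
    owner: str
    computed_by: str
    used_by: List[str] = field(default_factory=list)


class FeatureRegistry:
    """Toy in-memory catalog keyed by (feature name, join key)."""

    def __init__(self) -> None:
        self._features: Dict[Tuple[str, str], FeatureMeta] = {}

    def register(self, meta: FeatureMeta) -> None:
        """Add a feature; refuse to overwrite an existing entry."""
        key = (meta.name, meta.join_key)
        if key in self._features:
            raise ValueError(f"feature already registered: {key}")
        self._features[key] = meta

    def select(self, name: str, join_key: str, model: str) -> FeatureMeta:
        """Look up a feature and record which model uses it."""
        meta = self._features[(name, join_key)]
        if model not in meta.used_by:
            meta.used_by.append(model)
        return meta
```

Recording usage at selection time is one simple way to keep the "where used" metadata current without asking modellers to maintain it by hand.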
DSL for Feature Engineering
Diagram: the same DSL block appears in both jobs
○ Batch Training Job (Spark): DSL + Training Algo
○ Batch Prediction Job (Spark): DSL + Trained Model
Pure function expressions for
○ Feature selection
○ Feature transformations (for derived & composite features)
Standard set of accessor functions for
○ Feature store
○ Basis features
○ Column stats (min, max, mean, std-dev, etc)
Standard transformation functions + UDFs
Examples
○ @palette:store:orders:prep_time_avg_1week:rs_uuid
○ nFill(@basis:distance, mean(@basis:distance))
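To make the two example expressions concrete, here is a small Python sketch of what they could mean. It is not the platform's DSL implementation. `parse_ref`, `column_mean`, and `n_fill` are hypothetical names, and the reading of `nFill` as "replace missing values with a default" is an assumption based on the example.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class FeatureRef:
    """A parsed '@source:part:part:...' feature reference."""
    source: str
    path: Tuple[str, ...]


def parse_ref(expr: str) -> FeatureRef:
    """Split a reference such as '@basis:distance' into source and path."""
    if not expr.startswith("@"):
        raise ValueError(f"not a feature reference: {expr!r}")
    source, *path = expr[1:].split(":")
    return FeatureRef(source, tuple(path))


def column_mean(column: Sequence[Optional[float]]) -> float:
    """Column statistic: mean over the values that are present."""
    return fmean(v for v in column if v is not None)


def n_fill(column: Sequence[Optional[float]], default: float) -> List[float]:
    """Pure function: return a copy with missing values replaced by default."""
    return [default if v is None else v for v in column]
```

For example, `n_fill(distance, column_mean(distance))` mirrors the second expression. Because both functions are pure, the same expression can be evaluated in a batch job and at serving time.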
Pipeline for Offline Training with Feature Store
Diagram:
○ Stages: SPARK or SQL → FEATURE DSL → TRAINING ALGO
○ Data: RAW DATA → BASIS FEATURES → TRANSFORMED FEATURES → MODEL
○ Stores: HIVE DATA LAKE, HIVE FEATURE STORE (supplies FEATURE STORE FEATURES)
Pipeline for Online Serving with Feature Store
Diagram:
○ Stages: FEATURE DSL → SERVABLE MODEL
○ Data: BASIS FEATURES → TRANSFORMED FEATURES
○ Sources: CLIENT SERVICE, CASSANDRA FEATURE STORE (supplies FEATURE STORE FEATURES)
Key Components:
Scalable Model Training
Distributed Training of Non-DL Models
Large-scale distributed training (billions of samples)
○ Decision trees
○ Linear and logistic models
○ Unsupervised learning
○ Time series forecasting
○ Hyperparameter search for all model types
Smart pipeline management to balance speed and reliability
○ Fuse operators into a single job for speed
○ Break operators into separate jobs for reliability
Distributed Training of Deep Learning Models with Horovod
○ Data-parallelism works best
when model is small enough to
fit on each GPU
○ Ring-allreduce is more efficient
than parameter servers for
averaging weights
○ Faster training and better GPU
utilization
○ Much simpler training scripts
○ More details at
http://eng.uber.com/horovod
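The ring-allreduce idea can be illustrated with a single-process simulation. This sketch is not Horovod's API and performs no real communication. As a simplifying assumption, each worker's vector has exactly one element per worker, so that one element plays the role of one chunk. The function name `ring_allreduce_mean` is hypothetical.

```python
from typing import List, Sequence


def ring_allreduce_mean(worker_values: Sequence[Sequence[float]]) -> List[List[float]]:
    """Average vectors across workers arranged in a ring.

    Worker i only ever sends to worker (i + 1) % n, one chunk per step.
    """
    n = len(worker_values)
    if any(len(v) != n for v in worker_values):
        raise ValueError("this sketch expects one element (chunk) per worker")
    data = [list(v) for v in worker_values]

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the full
    # sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing values so all sends in a step are simultaneous.
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, chunk, value in sends:
            data[(i + 1) % n][chunk] += value

    # Phase 2, all-gather: circulate the finished sums around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, chunk, value in sends:
            data[(i + 1) % n][chunk] = value

    # Turn sums into averages on every worker.
    return [[x / n for x in row] for row in data]
```

Each worker sends and receives 2 × (n - 1) chunks in total, and no single node has to receive every worker's full vector, which is the bottleneck a central parameter server faces.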
Key Components:
Partitioned Models
Problem
○ Often want to train a model per city or per product
○ Hard to train and deploy 100s or 1000s of individual models
Solution
○ Let users define hierarchical partitioning scheme
○ Automatically train model per partition
○ Manage and deploy as single logical model
Partitioned Models
Partition hierarchy: GLOBAL → COUNTRY → CITY
1. Define partition scheme
2. Make train / test split
3. Keep same split and partition for each level
4. Train model for every node
5. Prune bad models
6. At serving time, route to best model for each node
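The pruning and routing steps can be sketched as below. The slides do not state the pruning criterion, so this example assumes one: a node's model is kept only if its test error is lower than that of its nearest kept ancestor. `prune`, `route`, and the path encoding are hypothetical.

```python
from typing import Dict, Sequence, Tuple

# A partition path: () is GLOBAL, ("US",) a country, ("US", "SF") a city.
Path = Tuple[str, ...]


def prune(test_error: Dict[Path, float]) -> Dict[Path, float]:
    """Keep a node's model only if it beats its nearest kept ancestor.

    Assumed criterion for illustration; lower error is better.
    """
    kept: Dict[Path, float] = {}
    for node in sorted(test_error, key=len):  # visit parents before children
        ancestor = node[:-1]
        while ancestor and ancestor not in kept:
            ancestor = ancestor[:-1]
        if node == () or ancestor not in kept or test_error[node] < kept[ancestor]:
            kept[node] = test_error[node]
    return kept


def route(kept: Dict[Path, float], path: Sequence[str]) -> Path:
    """Return the most specific kept node for a request's partition path."""
    node = tuple(path)
    while node not in kept:
        if not node:
            raise KeyError("no global model available")
        node = node[:-1]  # fall back one level up the hierarchy
    return node
```

With this fallback, a city whose model was pruned is served by its country's model, or by the global model if the country's model was pruned as well.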
Key Components:
Model Visualization
Evaluate Models
Problem
○ It takes many iterations to produce a good model
○ Keeping track of how a model was built is important
○ Evaluating and comparing models is hard
With every trained model, we capture standard metadata and reports
○ Full model configuration, including train and test datasets
○ Training job metrics
○ Model accuracy metrics
○ Performance of model after deployment
Model Visualization - Regression Model
Model Visualization - Classification Model
Model Visualization - Feature Report
Model Visualization - Decision Tree
Key Components:
Sharded Deployment and Serving
Prediction Service
○ Thrift service container for one or more models
○ Scale out in Docker on Mesos
○ Single or multi-tenant deployments
○ Connection management and batched / parallelized queries to Cassandra
○ Monitoring & alerting
Deployment
○ Model & DSL packaged as JAR file
○ One click deploy across DCs via standard Uber deployment infrastructure
○ Health checks and rollback
Online Prediction Service
Diagram (request flow in steps 1 to 5) with components:
○ Client Service
○ Routing Infra
○ Realtime Predict Service: Model Manager and multiple Deployed Models (each with DSL and Model)
○ Cassandra Feature Store
Online Prediction Service
Typical p95 latency from client service
○ ~5ms when all features from client service
○ ~10ms when joining pre-computed features from Cassandra
Peak prediction volume across current online deployments
○ 600k+ QPS
Problem
○ Prediction service can serve as many models as will fit into memory
○ Easy to run out of memory with large deployments of complex models
Solution
○ Organize serving cluster into number of physical shards
○ Introduce client facing concept of ‘virtual shard’ that is specified at deploy time
○ Virtual shards are mapped by system to physical shards
○ Models are loaded by service instances in the correct physical shard(s)
○ Gateway service routes to correct physical shard based on request header
Sharded Deployment
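The virtual-to-physical shard mapping and header-based routing can be sketched as follows. This is an illustration under assumptions: the header name `x-virtual-shard`, the `Gateway` class, and round-robin instance selection are all hypothetical, and the talk does not specify them.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Mapping

SHARD_HEADER = "x-virtual-shard"  # hypothetical header name


class Gateway:
    """Routes a request to a predict-service instance in the right shard."""

    def __init__(self,
                 routing_table: Mapping[str, str],
                 instances: Mapping[str, List[str]]) -> None:
        # routing_table: virtual shard (chosen at deploy time) -> physical shard.
        self._routing_table: Dict[str, str] = dict(routing_table)
        # Round-robin over the instances of each physical shard (assumption).
        self._pools: Dict[str, Iterator[str]] = {
            shard: cycle(hosts) for shard, hosts in instances.items()
        }

    def route(self, headers: Mapping[str, str]) -> str:
        """Pick an instance based on the virtual shard named in the header."""
        virtual_shard = headers[SHARD_HEADER]
        physical_shard = self._routing_table[virtual_shard]
        return next(self._pools[physical_shard])
```

Because clients only name a virtual shard, the platform can remap virtual shards to different physical shards by updating the routing table, without any client change.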
Unsharded Deployment
Diagram: Client Service calls a pool of Predict Service instances; every instance loads all models (A B C D E F G H I)
Sharded Deployment
Diagram: Client Service calls a Gateway with a Routing Table, which forwards to one of two pools of Predict Service instances
○ Shard 1: models A B C D
○ Shard 2: models E F G H I
Key Components:
Deployment Labels
Problem
○ Multiple models per container (entirely different or multiple versions of same)
○ Support experimentation
○ Support automated retrain / redeploy
○ Cumbersome to have client service manage routing
Solution
○ Models deployed to 'label'
○ Labels can be used for experimentation or different use cases
○ Predict service routes request to most recent model w/ specified label
○ Labels have schema so deploys won't break
Deployment Labels
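Label-based routing can be sketched as below. The example assumes that a label's schema is the set of input feature names and that a deploy is accepted only on an exact match. Both are simplifying assumptions, and `Deployment` and `LabelRouter` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Deployment:
    model_id: str
    input_features: FrozenSet[str]


class LabelRouter:
    """Resolves a label to the most recently deployed model under it."""

    def __init__(self) -> None:
        self._schemas: Dict[str, FrozenSet[str]] = {}
        self._deployed: Dict[str, List[Deployment]] = {}

    def create_label(self, label: str, schema: FrozenSet[str]) -> None:
        """Declare a label and the input schema its models must accept."""
        self._schemas[label] = frozenset(schema)
        self._deployed[label] = []

    def deploy(self, label: str, deployment: Deployment) -> None:
        """Reject deploys whose inputs differ from the label's schema."""
        if deployment.input_features != self._schemas[label]:
            raise ValueError(f"schema mismatch for label {label!r}")
        self._deployed[label].append(deployment)

    def resolve(self, label: str) -> str:
        """Return the id of the latest model deployed to the label."""
        history = self._deployed[label]
        if not history:
            raise LookupError(f"no model deployed to label {label!r}")
        return history[-1].model_id
```

Clients keep sending requests to the same label, so an automated retrain and redeploy changes which model answers without any client-side routing change.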
Key Components:
Live Model Performance Monitoring
Monitor Predictions
Problem
○ Models trained and evaluated against
historical data
○ Need to ensure deployed model is
making good predictions going forward
Solution
○ Log predictions & join to actual
outcomes
○ Publish metrics on feature and prediction
distributions over time
○ Dashboards and alerts
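Joining logged predictions to observed outcomes can be sketched as follows. The dictionaries stand in for a prediction log and an outcomes table keyed by a shared identifier. The function names and the choice of mean absolute error as the metric are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def join_predictions(predictions: Dict[str, float],
                     outcomes: Dict[str, float]) -> List[Tuple[float, float]]:
    """Pair each logged prediction with its actual outcome, when one exists."""
    # Outcomes often arrive later, so unmatched predictions are skipped.
    return [(predictions[key], outcomes[key])
            for key in predictions if key in outcomes]


def mean_absolute_error(pairs: List[Tuple[float, float]]) -> float:
    """Accuracy metric over joined (predicted, actual) pairs."""
    if not pairs:
        raise ValueError("no joined pairs to evaluate")
    return sum(abs(predicted - actual) for predicted, actual in pairs) / len(pairs)
```

Computing such a metric per time window and publishing it to a dashboard gives the "going forward" accuracy signal that offline evaluation on historical data cannot provide.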
System Architecture
Diagram labels: Management, Monitor, API (Python / Java)
What’s Next?
○ Python ML for ease of use and broader algorithm support
○ Notebook-centered model building workflow
○ Online / continuous learning
○ AutoML to automate more of the modelling work
Thank you!
eng.uber.com/michelangelo
eng.uber.com/horovod
uber.com/careers
Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to
any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.

Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 

Último (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 

Machine Learning Platform at Uber

  • 1. Michelangelo Jeremy Hermann, Machine Learning Platform @ Uber
  • 2. MISSION Enable engineers and data scientists across the company to easily build and deploy machine learning solutions at scale.
  • 3. AGENDA ○ ML at Uber ○ Why Build an ML Platform? ○ Key Platform Components ○ System Architecture ○ ML as Software Engineering ○ What’s Next?
  • 5. ML at Uber ○ Uber Eats ○ ETAs ○ Autonomous Cars ○ Customer Support ○ Dispatch ○ Personalization ○ Demand Modeling ○ Dynamic Pricing ○ Forecasting ○ Maps ○ Fraud ○ Destination Predictions ○ Anomaly Detection ○ Capacity Planning ○ And many more...
  • 6. ML at Uber - ETAs ○ ETAs are core to customer experience and used by myriad internal systems ○ ETAs are generated by a route-based algorithm called Garafu ○ Garafu is often incorrect, but it is incorrect in predictable ways ○ An ML model predicts the Garafu error ○ The predicted error is used to correct the ETA ○ ETAs are now dramatically more accurate
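The correction step on slide 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not Uber's implementation: the function names, the single input feature (hour of day), and the per-hour mean error used in place of a real ML model are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_error_model(history):
    """Learn the mean error (actual minus routed ETA) per hour of day.

    `history` is a list of (hour, routed_eta_min, actual_min) tuples.
    A per-hour mean is a stand-in for a real ML model (assumption).
    """
    errors = defaultdict(list)
    for hour, routed, actual in history:
        errors[hour].append(actual - routed)
    return {hour: mean(vals) for hour, vals in errors.items()}


def corrected_eta(model, hour, routed_eta_min):
    """Add the predicted error to the routing engine's ETA.

    Hours never seen in training get no correction.
    """
    return routed_eta_min + model.get(hour, 0.0)


# Hypothetical trips: the routed ETA runs short during the morning rush.
history = [(8, 10.0, 14.0), (8, 20.0, 26.0), (14, 10.0, 10.0)]
model = fit_error_model(history)
print(corrected_eta(model, 8, 12.0))
```

The routing engine stays in place; the learned model only has to capture its systematic error, which is the "incorrect in predictable ways" point made on the slide.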
  • 7. ML at Uber - Eats ○ Models used for ○ Ranking of restaurants and dishes ○ Delivery times ○ Search ranking ○ 100s of ML models called to render Eats homepage
  • 8. ML at Uber - Autonomous Cars
  • 9. ML at Uber - Dispatch ○ Optimize matching of rider and driver ○ Predict if open rider app will make trip request
  • 10. ML at Uber - Map Making
  • 11. ML at Uber - Map Making
  • 12. ML at Uber - Map Making
  • 13. ML at Uber - Destination Prediction
  • 14. ML at Uber - Spatiotemporal Forecasting Supply ○ Available Drivers Demand ○ Open Apps Other ○ Request Times ○ Arrival Times ○ Airport Demand
  • 15. ML at Uber - Customer Support ○ 5 customer-agent communication channels ○ Hundreds of thousands of tickets surfacing daily on the platform across 400+ cities ○ NLP models classify tickets and suggest response templates ○ Reduce ticket resolution time by 10%+ with same or higher CSAT
  • 16. Why build an ML platform?
  • 17. Motivation for Platform: Early challenges with machine learning ○ Limited scale with Python and R ○ Pipelines not reliable or reproducible ○ Many one-off production systems for serving Goals of platform ○ Standardize workflows and tools ○ Provide scalable support for end-to-end ML workflow ○ Democratize and accelerate machine learning through ease of use
  • 18. Machine Learning Workflow: MANAGE DATA → TRAIN MODELS → EVALUATE MODELS → DEPLOY MODELS → MAKE PREDICTIONS → MONITOR PREDICTIONS. Same basic ML workflow & system requirements for ○ Traditional ML & deep learning ○ Supervised, unsupervised, & semi-supervised learning ○ Online or continuous learning ○ Batch, online, & mobile deployments ○ Time-series forecasting
  • 20. Key Components: Feature Store & Feature Engineering
  • 21. Feature Store (aka Palette): Problem ○ Hardest part of ML is finding good features ○ Same features are often used by different models built by different teams Solution ○ Centralized feature store for collecting and sharing features ○ Platform team curates core set of widely applicable features ○ Modellers contribute more features as part of ongoing model building ○ Metadata for each feature to track ownership, how computed, where used, etc. ○ Modellers select features by name & join key; offline & online pipelines are auto-configured
  • 22. DSL for Feature Engineering Batch Training Job (Spark) Training Algo DSL Batch Prediction Job (Spark) Trained Model DSL Pure function expressions for ○ Feature selection ○ Feature transformations (for derived & composite features) Standard set of accessor functions for ○ Feature store ○ Basis features ○ Column stats (min, max, mean, std-dev, etc) Standard transformation functions + UDFs Examples ○ @palette:store:orders:prep_time_avg_1week:rs_uuid ○ nFill(@basis:distance, mean(@basis:distance))
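The second DSL example on slide 22, `nFill(@basis:distance, mean(@basis:distance))`, reads as "use the distance feature, falling back to the column mean when it is missing". A minimal Python sketch of that behaviour follows. The names `n_fill` and `column_mean` are hypothetical stand-ins, and the real DSL is evaluated inside the platform's own jobs rather than in plain Python.

```python
def column_mean(values):
    """Mean of the non-missing entries in a column (a column statistic)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def n_fill(value, default):
    """Return `value`, or `default` when the value is missing."""
    return default if value is None else value


# Hypothetical basis feature column with two missing entries.
distance = [2.0, None, 4.0, None]
fill = column_mean(distance)
filled = [n_fill(v, fill) for v in distance]
print(filled)
```

Both helpers are pure functions of their inputs, which matches the slide's point that DSL expressions are pure and can therefore be applied identically at training and prediction time.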
  • 23. Pipeline for Offline Training with Feature Store SPARK or SQL FEATURE DSL TRAINING ALGO RAW DATA BASIS FEATURES TRANSFORMED FEATURES MODEL HIVE FEATURE STORE FEATURE STORE FEATURES HIVE DATA LAKE
  • 24. Pipeline for Online Serving with Feature Store FEATURE DSL SERVABLE MODEL BASIS FEATURES TRANSFORMED FEATURES CASSANDRA FEATURE STORE FEATURE STORE FEATURES CLIENT SERVICE
  • 26. Distributed Training of Non-DL Models: Large-scale distributed training (billions of samples) ○ Decision trees ○ Linear and logistic models ○ Unsupervised learning ○ Time series forecasting ○ Hyperparameter search for all model types Smart pipeline management to balance speed and reliability ○ Fuse operators into single job for speed ○ Break operators into separate jobs for reliability
  • 27. Distributed Training of Deep Learning Models with Horovod ○ Data-parallelism works best when model is small enough to fit on each GPU ○ Ring-allreduce is more efficient than parameter servers for averaging weights ○ Faster training and better GPU utilization ○ Much simpler training scripts ○ More details at http://eng.uber.com/horovod
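To make the ring-allreduce bullet concrete, here is a single-process simulation of the algorithm. It is a sketch for intuition only and does not use or reproduce Horovod: the function name is hypothetical, each worker's gradient is pre-split into one scalar chunk per worker, and "sending" is just a list update.

```python
def ring_allreduce(workers):
    """Sum the vectors held by n workers by passing chunks around a ring.

    `workers[i]` is worker i's vector, split into n scalar chunks.
    Returns one summed copy per worker.
    """
    n = len(workers)
    data = [list(w) for w in workers]  # copy so inputs are not mutated

    # Phase 1, scatter-reduce: in each step worker i sends one chunk to
    # worker i+1, which adds it to its own copy. After n-1 steps worker i
    # holds the complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n) for i in range(n)]
        payloads = [(i, c, data[i][c]) for i, c in sends]  # snapshot first
        for i, c, value in payloads:
            data[(i + 1) % n][c] += value

    # Phase 2, allgather: completed chunks travel around the ring and
    # overwrite the partial sums on every other worker.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        payloads = [(i, c, data[i][c]) for i, c in sends]
        for i, c, value in payloads:
            data[(i + 1) % n][c] = value

    return data


grads = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
summed = ring_allreduce(grads)
averaged = [[v / len(grads) for v in row] for row in summed]
print(averaged[0])
```

Each worker only ever talks to its ring neighbour and sends 2(n-1) chunks in total, so no single node has to receive every worker's full gradient the way a central parameter server does.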
  • 29. Partitioned Models: Problem ○ Often want to train a model per city or per product ○ Hard to train and deploy 100s or 1000s of individual models Solution ○ Let users define a hierarchical partitioning scheme ○ Automatically train a model per partition ○ Manage and deploy as a single logical model
  • 32. Step 3: Keep same split and partition for each level (diagram levels: GLOBAL, COUNTRY, CITY)
  • 33. Step 4: Train model for every node (diagram: 9 models across GLOBAL, COUNTRY, CITY)
  • 34. Step 5: Prune bad models (diagram: 8 models remain)
  • 35. Step 6: At serving time, route to best model for each node
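The serving-time routing for partitioned models can be sketched as a lookup that walks up the hierarchy. The slides do not spell out how the "best" model is chosen, so the rule used here (most specific model that survived pruning, falling back from city to country to global) is an assumption, as are the key layout and model names.

```python
def route(models, city, country):
    """Return the most specific surviving model for a request.

    `models` maps partition keys to model names; pruned partitions are
    simply absent from the mapping (assumption for this sketch).
    """
    for key in (("city", city), ("country", country), ("global", None)):
        if key in models:
            return models[key]
    raise LookupError("no model available, including the global fallback")


# Hypothetical partition tree after pruning.
models = {
    ("global", None): "global-model",
    ("country", "US"): "us-model",
    ("city", "San Francisco"): "sf-model",
    # ("city", "Fresno") was pruned, so Fresno falls back to the US model.
}

print(route(models, "San Francisco", "US"))
print(route(models, "Fresno", "US"))
print(route(models, "Lyon", "FR"))
```

From the client's point of view there is still one logical model; the partition lookup is an internal detail of serving.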
  • 37. Evaluate Models Problem ○ It takes many iterations to produce a good model ○ Keeping track of how a model was built is important ○ Evaluating and comparing models is hard With every trained model, we capture standard metadata and reports ○ Full model configuration, including train and test datasets ○ Training job metrics ○ Model accuracy metrics ○ Performance of model after deployment
  • 38. Model Visualization - Regression Model
  • 39. Model Visualization - Classification Model
  • 40. Model Visualization - Feature Report
  • 41. Model Visualization - Decision Tree
  • 43. Online Prediction Service: Prediction Service ○ Thrift service container for one or more models ○ Scale out in Docker on Mesos ○ Single or multi-tenant deployments ○ Connection management and batched / parallelized queries to Cassandra ○ Monitoring & alerting Deployment ○ Model & DSL packaged as JAR file ○ One-click deploy across DCs via standard Uber deployment infrastructure ○ Health checks and rollback
  • 44. Online Prediction Service (diagram: Client Service, Routing Infra, Realtime Predict Service hosting Deployed Models, each a Model plus DSL, Model Manager, Cassandra Feature Store; request flow numbered 1 to 5)
  • 45. Online Prediction Service Typical p95 latency from client service ○ ~5ms when all features from client service ○ ~10ms when joining pre-computed features from Cassandra Peak prediction volume across current online deployments ○ 600k+ QPS
  • 46. Sharded Deployment: Problem ○ Prediction service can serve as many models as will fit into memory ○ Easy to run out of memory with large deployments of complex models Solution ○ Organize serving cluster into a number of physical shards ○ Introduce client-facing concept of ‘virtual shard’ that is specified at deploy time ○ Virtual shards are mapped by the system to physical shards ○ Models are loaded by service instances in the correct physical shard(s) ○ Gateway service routes to the correct physical shard based on request header
  • 47. Client Service Predict Service Predict Service Predict Service Predict Service Predict Service A B C D E F G H I Unsharded Deployment
  • 48. Client Service Gateway Routing Table Predict Service Predict Service Predict Service Predict Service Predict Service (Shard 2) E F G H I Predict Service Predict Service Predict Service Predict Service Predict Service (Shard 1) A B C D Sharded Deployment
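The virtual-to-physical shard indirection can be sketched as two lookup tables held by the gateway. Everything named here is hypothetical: the class, the `model-id` header, and the shard names are illustrative, and the slides do not describe the real routing table format.

```python
class Gateway:
    """Routes prediction requests to the physical shard hosting a model."""

    def __init__(self, virtual_to_physical):
        # System-owned mapping; clients only ever name virtual shards.
        self.virtual_to_physical = dict(virtual_to_physical)
        self.model_to_virtual = {}

    def deploy(self, model_id, virtual_shard):
        """Record the virtual shard chosen by the client at deploy time."""
        if virtual_shard not in self.virtual_to_physical:
            raise KeyError(f"unknown virtual shard: {virtual_shard}")
        self.model_to_virtual[model_id] = virtual_shard

    def route(self, headers):
        """Pick the physical shard from a request header (assumed name)."""
        virtual = self.model_to_virtual[headers["model-id"]]
        return self.virtual_to_physical[virtual]


gateway = Gateway({"eats": "shard-1", "rides": "shard-2"})
gateway.deploy("A", "eats")
gateway.deploy("E", "rides")
print(gateway.route({"model-id": "A"}))

# Operators can rebalance without clients changing anything.
gateway.virtual_to_physical["eats"] = "shard-2"
print(gateway.route({"model-id": "A"}))
```

Because clients only reference virtual shards, the platform is free to remap them onto different physical shards as memory pressure changes.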
  • 50. Deployment Labels: Problem ○ Multiple models per container (entirely different or multiple versions of same) ○ Support experimentation ○ Support automated retrain / redeploy ○ Cumbersome to have client service manage routing Solution ○ Models deployed to 'label' ○ Labels can be used for experimentation or different use cases ○ Predict service routes request to most recent model w/ specified label ○ Labels have schema so deploys won't break
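A minimal sketch of label-based routing follows. The class and method names are hypothetical, and the schema check (every model deployed to a label must accept the same set of input feature names) is one plausible reading of "labels have schema", not a description of the actual mechanism.

```python
class LabelRegistry:
    """Tracks which models were deployed to which label, oldest first."""

    def __init__(self):
        self._deployments = {}  # label -> list of (schema, model_id)

    def deploy(self, label, model_id, schema):
        """Deploy a model to a label, rejecting incompatible schemas."""
        schema = frozenset(schema)
        history = self._deployments.setdefault(label, [])
        if history and history[0][0] != schema:
            raise ValueError(f"schema mismatch for label {label!r}")
        history.append((schema, model_id))

    def resolve(self, label):
        """Return the most recently deployed model for the label."""
        return self._deployments[label][-1][1]


registry = LabelRegistry()
registry.deploy("eta-prod", "eta-v1", ["distance", "hour"])
registry.deploy("eta-prod", "eta-v2", ["distance", "hour"])  # automated retrain
registry.deploy("eta-experiment", "eta-v3", ["distance", "hour", "weather"])
print(registry.resolve("eta-prod"))
```

Clients keep sending requests to `eta-prod`; a retrained model takes over as soon as it is deployed to that label, and a model with a different input schema cannot be swapped in by accident.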
  • 51. Key Components: Live Model Performance Monitoring
  • 52. Monitor Predictions Problem ○ Models trained and evaluated against historical data ○ Need to ensure deployed model is making good predictions going forward Solution ○ Log predictions & join to actual outcomes ○ Publish metrics, feature distributions, and prediction distributions over time ○ Dashboards and alerts
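The "log predictions & join to actual outcomes" step can be sketched as a keyed join followed by a metric. The request ids, the in-memory dictionaries, and the choice of mean absolute error are assumptions made for illustration; the slides do not describe the production join or the published metrics in detail.

```python
def join_outcomes(predictions, outcomes):
    """Pair each logged prediction with its actual outcome by request id.

    Predictions whose outcome has not arrived yet are skipped.
    """
    return [(pred, outcomes[rid]) for rid, pred in predictions.items()
            if rid in outcomes]


def mean_absolute_error(pairs):
    """Average absolute gap between predicted and actual values."""
    return sum(abs(pred - actual) for pred, actual in pairs) / len(pairs)


# Hypothetical logs: predicted vs. observed delivery minutes.
predictions = {"r1": 10.0, "r2": 20.0, "r3": 30.0}
outcomes = {"r1": 12.0, "r2": 19.0}  # r3 has no outcome yet

pairs = join_outcomes(predictions, outcomes)
mae = mean_absolute_error(pairs)
print(len(pairs), mae)
```

Computed on a rolling window and published as a time series, a metric like this is what a dashboard or alert threshold would watch.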
  • 61. What’s Next? ○ Python ML for ease of use and broader algorithm support ○ Notebook-centered model building workflow ○ Online / continuous learning ○ AutoML to automate more of the modelling work
  • 63. Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.