SlideShare una empresa de Scribd logo
1 de 63
Descargar para leer sin conexión
Michelangelo
Jeremy Hermann, Machine Learning Platform @ Uber
MISSION
Enable engineers and data scientists across the
company to easily build and deploy machine learning
solutions at scale.
AGENDA
○ ML at Uber
○ Why Build an ML Platform?
○ Key Platform Components
○ System Architecture
○ ML as Software Engineering
○ What’s Next?
ML at Uber
ML at Uber
○ Uber Eats
○ ETAs
○ Autonomous Cars
○ Customer Support
○ Dispatch
○ Personalization
○ Demand Modeling
○ Dynamic Pricing
○ Forecasting
○ Maps
○ Fraud
○ Destination Predictions
○ Anomaly Detection
○ Capacity Planning
○ And many more...
ML at Uber - ETAs
○ ETAs are core to customer experience and used
by myriad internal systems
○ ETA are generated by route-based algorithm
called Garafu
○ Garafu is often incorrect - but it’s incorrect in
predictable ways
○ ML model predicts the Garafu error
○ Use the predicted error to correct the ETA
○ ETAs now dramatically more accurate
ML at Uber - Eats
○ Models used for
○ Ranking of restaurants and
dishes
○ Delivery times
○ Search ranking
○ 100s of ML models called to render
Eats homepage
ML at Uber - Autonomous Cars
ML at Uber - Dispatch
○ Optimize matching of rider and
driver
○ Predict if open rider app will
make trip request
ML at Uber - Map Making
ML at Uber - Map Making
ML at Uber - Map Making
ML at Uber - Destination Prediction
ML at Uber - Spatiotemporal Forecasting
Supply
○ Available Drivers
Demand
○ Open Apps
Other
○ Request Times
○ Arrival Times
○ Airport Demand
ML at Uber - Customer Support
○ 5 customer-agent communication
channels
○ Hundreds of thousands of tickets
surfacing daily on the platform across
400+ cities
○ NLP models classify tickets and
suggest response templates
○ Reduce ticket resolution time by 10%+
with same or higher CSAT
Why build an ML platform?
Early challenges with machine learning
○ Limited scale with Python and R
○ Pipelines not reliable or reproducible
○ Many one-off production systems for serving
Goals of platform
○ Standardize workflows and tools
○ Provide scalable support for end-to-end ML workflow
○ Democratize and accelerate machine learning through ease of use
Motivation for Platform
Same basic ML workflow & system requirements for
○ Traditional ML & deep learning
○ Supervised, unsupervised, & semi-supervised
learning
○ Online or continuous learning
○ Batch, online, & mobile deployments
○ Time-series forecasting
Machine Learning Workflow
MANAGE DATA
TRAIN MODELS
EVALUATE MODELS
DEPLOY MODELS
MAKE PREDICTIONS
MONITOR PREDICTIONS
Key Platform Components
Key Components:
Feature Store & Feature Engineering
Problem
○ Hardest part of ML is finding good features
○ Same features are often used by different models built by different teams
Solution
○ Centralized feature store for collecting and sharing features
○ Platform team curates core set of widely applicable features
○ Modellers contribute more features as part of ongoing model building
○ Meta-data for each feature to track ownership, how computed, where used, etc
○ Modellers select features by name & join key. Offline & online pipelines
auto-configured
Feature Store (aka Palette)
DSL for Feature Engineering
Batch Training Job
(Spark)
Training
Algo
DSL
Batch Prediction Job
(Spark)
Trained
Model
DSL
Pure function expressions for
○ Feature selection
○ Feature transformations (for derived & composite features)
Standard set of accessor functions for
○ Feature store
○ Basis features
○ Column stats (min, max, mean, std-dev, etc)
Standard transformation functions + UDFs
Examples
○ @palette:store:orders:prep_time_avg_1week:rs_uuid
○ nFill(@basis:distance, mean(@basis:distance))
Pipeline for Offline Training with Feature Store
SPARK or SQL FEATURE DSL TRAINING ALGO
RAW DATA
BASIS
FEATURES
TRANSFORMED
FEATURES
MODEL
HIVE
FEATURE STORE
FEATURE STORE
FEATURES
HIVE
DATA LAKE
Pipeline for Online Serving with Feature Store
FEATURE DSL
SERVABLE
MODEL
BASIS
FEATURES
TRANSFORMED
FEATURES
CASSANDRA
FEATURE STORE
FEATURE STORE
FEATURES
CLIENT
SERVICE
Key Components:
Scalable Model Training
Large-scale distributed training (billions of samples)
○ Decision trees
○ Linear and logistic models
○ Unsupervised learning
○ Time series forecasting
○ Hyperparameter search for all model types
Smart pipeline management to balance speed and reliability
○ Fuse operators into single job for speed
○ Break operators into separate jobs to reliability
Distributed Training of Non-DL Models
○ Data-parallelism works best
when model is small enough to
fit on each GPU
○ Ring-allreduce is more efficient
than parameter servers for
averaging weights
○ Faster training and better GPU
utilization
○ Much simpler training scripts
○ More details at
http://eng.uber.com/horovod
Distributed Training of Deep Learning Models with Horovod
Key Components:
Partitioned Models
Problem
○ Often want to train a model per city or per product
○ Hard to train and deploy 100s or 1000s of individual models
Solution
○ Let users define hierarchical partitioning scheme
○ Automatically train model per partition
○ Manage and deploy as single logical model
Partitioned Models
GLOBAL
COUNTRY
CITY
Define partition scheme1
GLOBAL
COUNTRY
CITY
Make train / test split2
GLOBAL
COUNTRY
CITY
Keep same split and partition for each level3
M
M
M M M
M
M M M
GLOBAL
COUNTRY
CITY
Train model for every node4
M
M
M M
M
M M M
GLOBAL
COUNTRY
CITY
Prune bad models5
M
M
M M
M
M M M
GLOBAL
COUNTRY
CITY
At serving time, route to best model for each node6
Key Components:
Model Visualization
Evaluate Models
Problem
○ It takes many iterations to produce a good model
○ Keeping track of how a model was built is important
○ Evaluating and comparing models is hard
With every trained model, we capture standard metadata and reports
○ Full model configuration, including train and test datasets
○ Training job metrics
○ Model accuracy metrics
○ Performance of model after deployment
Model Visualization - Regression Model
Model Visualization - Classification Model
Model Visualization - Feature Report
Model Visualization - Decision Tree
Key Components:
Sharded Deployment and Serving
Prediction Service
○ Thrift service container for one or more models
○ Scale out in Docker on Mesos
○ Single or multi-tenant deployments
○ Connection management and batched / parallelized queries to Cassandra
○ Monitoring & alerting
Deployment
○ Model & DSL packaged as JAR file
○ One click deploy across DCs via standard Uber deployment infrastructure
○ Health checks and rollback
Online Prediction Service
Realtime Predict Service
Deployed ModelDeployed Model
Client
Service
Deployed Model
Model
Cassandra Feature StoreRouting
Infra
DSLModel Manager1
2
3
4
5
Online Prediction Service
Online Prediction Service
Typical p95 latency from client service
○ ~5ms when all features from client service
○ ~10ms when joining pre-computed features from Cassandra
Peak prediction volume across current online deployments
○ 600k+ QPS
Problem
○ Prediction service can serve as many models as will fit into memory
○ Easy to run out of memory with large deployments of complex models
Solution
○ Organize serving cluster into number of physical shards
○ Introduce client facing concept of ‘virtual shard’ that is specified at deploy time
○ Virtual shards are mapped by system to physical shards
○ Models are loaded by service instances in the correct physical shard(s)
○ Gateway service routes to correct physical shard based on request header
Sharded Deployment
Client
Service
Predict Service
Predict Service
Predict Service
Predict Service
Predict Service
A B C
D E F
G H I
Unsharded Deployment
Client
Service
Gateway
Routing Table
Predict Service
Predict Service
Predict Service
Predict Service
Predict Service
(Shard 2)
E F
G H I
Predict Service
Predict Service
Predict Service
Predict Service
Predict Service
(Shard 1)
A B C
D
Sharded Deployment
Key Components:
Deployment Labels
Problem
○ Multiple models per container (entirely different or multiple versions of same)
○ Support experimentation
○ Support automated retrain / redeploy
○ Cumbersome to have client service manage routing
Solution
○ Models deployed to 'label'
○ Labels can be used for experimentation or different use cases
○ Predict service routes request to most recent model w/ specified label
○ Labels have schema so deploys won't break
Deployment Labels
Key Components:
Live Model Performance Monitoring
Monitor Predictions
Problem
○ Models trained and evaluated against
historical data
○ Need to ensure deployed model is
making good predictions going forward
Solution
○ Log predictions & join to actual
outcomes
○ Publish metrics feature and prediction
distributions over time
○ Dashboards and alerts
System Architecture
Management
Monitor
API
Python / Java
What’s Next?
What’s Next?
○ Python ML for ease of use and broader algorithm support
○ Notebook-centered model building workflow
○ Online / continuous learning
○ AutoML to automate more of the modelling work
Thank you!
eng.uber.com/michelangelo
eng.uber.com/horovod
uber.com/careers
Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to
any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.

Más contenido relacionado

La actualidad más candente

データ連携の新しいカタチ - 変更データキャプチャ/プラットフォームイベントを MuleSoft Anypoint Platform と組み合わせて試してみよう
データ連携の新しいカタチ - 変更データキャプチャ/プラットフォームイベントを MuleSoft Anypoint Platform と組み合わせて試してみようデータ連携の新しいカタチ - 変更データキャプチャ/プラットフォームイベントを MuleSoft Anypoint Platform と組み合わせて試してみよう
データ連携の新しいカタチ - 変更データキャプチャ/プラットフォームイベントを MuleSoft Anypoint Platform と組み合わせて試してみよう
Salesforce Developers Japan
 

La actualidad más candente (20)

Modern Data Flow
Modern Data FlowModern Data Flow
Modern Data Flow
 
Wix's ML Platform
Wix's ML PlatformWix's ML Platform
Wix's ML Platform
 
Power BI Report Server Enterprise Architecture, Tools to Publish reports and ...
Power BI Report Server Enterprise Architecture, Tools to Publish reports and ...Power BI Report Server Enterprise Architecture, Tools to Publish reports and ...
Power BI Report Server Enterprise Architecture, Tools to Publish reports and ...
 
Power BI Made Simple
Power BI Made SimplePower BI Made Simple
Power BI Made Simple
 
Microsoft power bi
Microsoft power biMicrosoft power bi
Microsoft power bi
 
Power BI Architecture
Power BI ArchitecturePower BI Architecture
Power BI Architecture
 
Spark & Zeppelin을 활용한 머신러닝 실전 적용기
Spark & Zeppelin을 활용한 머신러닝 실전 적용기Spark & Zeppelin을 활용한 머신러닝 실전 적용기
Spark & Zeppelin을 활용한 머신러닝 실전 적용기
 
データ連携の新しいカタチ - 変更データキャプチャ/プラットフォームイベントを MuleSoft Anypoint Platform と組み合わせて試してみよう
データ連携の新しいカタチ - 変更データキャプチャ/プラットフォームイベントを MuleSoft Anypoint Platform と組み合わせて試してみようデータ連携の新しいカタチ - 変更データキャプチャ/プラットフォームイベントを MuleSoft Anypoint Platform と組み合わせて試してみよう
データ連携の新しいカタチ - 変更データキャプチャ/プラットフォームイベントを MuleSoft Anypoint Platform と組み合わせて試してみよう
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conference
 
Microsoft Power BI Overview
Microsoft Power BI OverviewMicrosoft Power BI Overview
Microsoft Power BI Overview
 
Drone Data Flowing Through Apache NiFi
Drone Data Flowing Through Apache NiFiDrone Data Flowing Through Apache NiFi
Drone Data Flowing Through Apache NiFi
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
PART 1: Intro To JasperReports IO And How To Build Your First Report
PART 1: Intro To JasperReports IO And How To Build Your First ReportPART 1: Intro To JasperReports IO And How To Build Your First Report
PART 1: Intro To JasperReports IO And How To Build Your First Report
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습
 
Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1
Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1
Spring Boot Actuator 2.0 & Micrometer #jjug_ccc #ccc_a1
 

Similar a Michelangelo - Machine Learning Platform - 2018

Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprise
doppenhe
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
vitm11
 

Similar a Michelangelo - Machine Learning Platform - 2018 (20)

Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprise
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with Airflow
 
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
 
Scaling machinelearning as a service at uber li Erran li - 2016
Scaling machinelearning as a service at uber li Erran li - 2016Scaling machinelearning as a service at uber li Erran li - 2016
Scaling machinelearning as a service at uber li Erran li - 2016
 
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
 
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
 
Machine Learning Platform @Flipkart - Slash N Conference 2018
Machine Learning Platform @Flipkart - Slash N Conference 2018Machine Learning Platform @Flipkart - Slash N Conference 2018
Machine Learning Platform @Flipkart - Slash N Conference 2018
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflow
 
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with Kubeflow
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with Kubeflow
 
Microservices patterns
Microservices patternsMicroservices patterns
Microservices patterns
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
C2_W1---.pdf
C2_W1---.pdfC2_W1---.pdf
C2_W1---.pdf
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018
 
GenAI at UBER: Scaling Infrastructure
GenAI at UBER: Scaling InfrastructureGenAI at UBER: Scaling Infrastructure
GenAI at UBER: Scaling Infrastructure
 

Más de Karthik Murugesan

BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
Karthik Murugesan
 
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Karthik Murugesan
 

Más de Karthik Murugesan (20)

Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
 
Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slides
 
Free servers to build Big Data Systems on: Bing's Approach
Free servers to build Big Data Systems on: Bing's  Approach Free servers to build Big Data Systems on: Bing's  Approach
Free servers to build Big Data Systems on: Bing's Approach
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
 
Microsoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionMicrosoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER Introduction
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019
 
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
Unifying Twitter around a single ML platform  - Twitter AI Platform 2019Unifying Twitter around a single ML platform  - Twitter AI Platform 2019
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...
 
The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019
 
Developing a ML model using TF Estimator
Developing a ML model using TF EstimatorDeveloping a ML model using TF Estimator
Developing a ML model using TF Estimator
 
Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018
 
Netflix factstore for recommendations - 2018
Netflix factstore  for recommendations - 2018Netflix factstore  for recommendations - 2018
Netflix factstore for recommendations - 2018
 
Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Trends in Music Recommendations 2018
Trends in Music Recommendations 2018
 
Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017
 
State Of AI 2018
State Of AI 2018State Of AI 2018
State Of AI 2018
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music Discovery
 
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
 
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

Michelangelo - Machine Learning Platform - 2018

  • 1. Michelangelo Jeremy Hermann, Machine Learning Platform @ Uber
  • 2. MISSION Enable engineers and data scientists across the company to easily build and deploy machine learning solutions at scale.
  • 3. AGENDA ○ ML at Uber ○ Why Build an ML Platform? ○ Key Platform Components ○ System Architecture ○ ML as Software Engineering ○ What’s Next?
  • 5. ML at Uber ○ Uber Eats ○ ETAs ○ Autonomous Cars ○ Customer Support ○ Dispatch ○ Personalization ○ Demand Modeling ○ Dynamic Pricing ○ Forecasting ○ Maps ○ Fraud ○ Destination Predictions ○ Anomaly Detection ○ Capacity Planning ○ And many more...
  • 6. ML at Uber - ETAs ○ ETAs are core to customer experience and used by myriad internal systems ○ ETA are generated by route-based algorithm called Garafu ○ Garafu is often incorrect - but it’s incorrect in predictable ways ○ ML model predicts the Garafu error ○ Use the predicted error to correct the ETA ○ ETAs now dramatically more accurate
  • 7. ML at Uber - Eats ○ Models used for ○ Ranking of restaurants and dishes ○ Delivery times ○ Search ranking ○ 100s of ML models called to render Eats homepage
  • 8. ML at Uber - Autonomous Cars
  • 9. ML at Uber - Dispatch ○ Optimize matching of rider and driver ○ Predict if open rider app will make trip request
  • 10. ML at Uber - Map Making
  • 11. ML at Uber - Map Making
  • 12. ML at Uber - Map Making
  • 13. ML at Uber - Destination Prediction
  • 14. ML at Uber - Spatiotemporal Forecasting Supply ○ Available Drivers Demand ○ Open Apps Other ○ Request Times ○ Arrival Times ○ Airport Demand
  • 15. ML at Uber - Customer Support ○ 5 customer-agent communication channels ○ Hundreds of thousands of tickets surfacing daily on the platform across 400+ cities ○ NLP models classify tickets and suggest response templates ○ Reduce ticket resolution time by 10%+ with same or higher CSAT
  • 16. Why build an ML platform?
  • 17. Early challenges with machine learning ○ Limited scale with Python and R ○ Pipelines not reliable or reproducible ○ Many one-off production systems for serving Goals of platform ○ Standardize workflows and tools ○ Provide scalable support for end-to-end ML workflow ○ Democratize and accelerate machine learning through ease of use Motivation for Platform
  • 18. Same basic ML workflow & system requirements for ○ Traditional ML & deep learning ○ Supervised, unsupervised, & semi-supervised learning ○ Online or continuous learning ○ Batch, online, & mobile deployments ○ Time-series forecasting Machine Learning Workflow MANAGE DATA TRAIN MODELS EVALUATE MODELS DEPLOY MODELS MAKE PREDICTIONS MONITOR PREDICTIONS
  • 20. Key Components: Feature Store & Feature Engineering
  • 21. Problem ○ Hardest part of ML is finding good features ○ Same features are often used by different models built by different teams Solution ○ Centralized feature store for collecting and sharing features ○ Platform team curates core set of widely applicable features ○ Modellers contribute more features as part of ongoing model building ○ Meta-data for each feature to track ownership, how computed, where used, etc ○ Modellers select features by name & join key. Offline & online pipelines auto-configured Feature Store (aka Palette)
  • 22. DSL for Feature Engineering Batch Training Job (Spark) Training Algo DSL Batch Prediction Job (Spark) Trained Model DSL Pure function expressions for ○ Feature selection ○ Feature transformations (for derived & composite features) Standard set of accessor functions for ○ Feature store ○ Basis features ○ Column stats (min, max, mean, std-dev, etc) Standard transformation functions + UDFs Examples ○ @palette:store:orders:prep_time_avg_1week:rs_uuid ○ nFill(@basis:distance, mean(@basis:distance))
  • 23. Pipeline for Offline Training with Feature Store SPARK or SQL FEATURE DSL TRAINING ALGO RAW DATA BASIS FEATURES TRANSFORMED FEATURES MODEL HIVE FEATURE STORE FEATURE STORE FEATURES HIVE DATA LAKE
  • 24. Pipeline for Online Serving with Feature Store FEATURE DSL SERVABLE MODEL BASIS FEATURES TRANSFORMED FEATURES CASSANDRA FEATURE STORE FEATURE STORE FEATURES CLIENT SERVICE
  • 26. Large-scale distributed training (billions of samples) ○ Decision trees ○ Linear and logistic models ○ Unsupervised learning ○ Time series forecasting ○ Hyperparameter search for all model types Smart pipeline management to balance speed and reliability ○ Fuse operators into single job for speed ○ Break operators into separate jobs to reliability Distributed Training of Non-DL Models
  • 27. ○ Data-parallelism works best when model is small enough to fit on each GPU ○ Ring-allreduce is more efficient than parameter servers for averaging weights ○ Faster training and better GPU utilization ○ Much simpler training scripts ○ More details at http://eng.uber.com/horovod Distributed Training of Deep Learning Models with Horovod
  • 29. Problem ○ Often want to train a model per city or per product ○ Hard to train and deploy 100s or 1000s of individual models Solution ○ Let users define hierarchical partitioning scheme ○ Automatically train model per partition ○ Manage and deploy as single logical model Partitioned Models
  • 32. GLOBAL COUNTRY CITY Keep same split and partition for each level3
  • 33. M M M M M M M M M GLOBAL COUNTRY CITY Train model for every node4
  • 34. M M M M M M M M GLOBAL COUNTRY CITY Prune bad models5
  • 35. M M M M M M M M GLOBAL COUNTRY CITY At serving time, route to best model for each node6
  • 37. Evaluate Models Problem ○ It takes many iterations to produce a good model ○ Keeping track of how a model was built is important ○ Evaluating and comparing models is hard With every trained model, we capture standard metadata and reports ○ Full model configuration, including train and test datasets ○ Training job metrics ○ Model accuracy metrics ○ Performance of model after deployment
  • 38. Model Visualization - Regression Model
  • 39. Model Visualization - Classification Model
  • 40. Model Visualization - Feature Report
  • 41. Model Visualization - Decision Tree
  • 43. Prediction Service ○ Thrift service container for one or more models ○ Scale out in Docker on Mesos ○ Single or multi-tenant deployments ○ Connection management and batched / parallelized queries to Cassandra ○ Monitoring & alerting Deployment ○ Model & DSL packaged as JAR file ○ One click deploy across DCs via standard Uber deployment infrastructure ○ Health checks and rollback Online Prediction Service
  • 44. Realtime Predict Service Deployed ModelDeployed Model Client Service Deployed Model Model Cassandra Feature StoreRouting Infra DSLModel Manager1 2 3 4 5 Online Prediction Service
  • 45. Online Prediction Service Typical p95 latency from client service ○ ~5ms when all features from client service ○ ~10ms when joining pre-computed features from Cassandra Peak prediction volume across current online deployments ○ 600k+ QPS
  • 46. Problem ○ Prediction service can serve as many models as will fit into memory ○ Easy to run out of memory with large deployments of complex models Solution ○ Organize serving cluster into number of physical shards ○ Introduce client facing concept of ‘virtual shard’ that is specified at deploy time ○ Virtual shards are mapped by system to physical shards ○ Models are loaded by service instances in the correct physical shard(s) ○ Gateway service routes to correct physical shard based on request header Sharded Deployment
  • 47. Client Service Predict Service Predict Service Predict Service Predict Service Predict Service A B C D E F G H I Unsharded Deployment
  • 48. Client Service Gateway Routing Table Predict Service Predict Service Predict Service Predict Service Predict Service (Shard 2) E F G H I Predict Service Predict Service Predict Service Predict Service Predict Service (Shard 1) A B C D Sharded Deployment
  • 50. Problem ○ Multiple models per container (entirely different or multiple versions of same) ○ Support experimentation ○ Support automated retrain / redeploy ○ Cumbersome to have client service manage routing Solution ○ Models deployed to 'label' ○ Labels can be used for experimentation or different use cases ○ Predict service routes request to most recent model w/ specified label ○ Labels have schema so deploys won't break Deployment Labels
  • 51. Key Components: Live Model Performance Monitoring
  • 52. Monitor Predictions Problem ○ Models trained and evaluated against historical data ○ Need to ensure deployed model is making good predictions going forward Solution ○ Log predictions & join to actual outcomes ○ Publish metrics feature and prediction distributions over time ○ Dashboards and alerts
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 61. What’s Next? ○ Python ML for ease of use and broader algorithm support ○ Notebook-centered model building workflow ○ Online / continuous learning ○ AutoML to automate more of the modelling work
  • 63. Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.