SlideShare una empresa de Scribd logo
1 de 22
Descargar para leer sin conexión
Michelangelo Palette
Feature Engineering @ Uber
Amit Nene
Staff Engineer,
Michelangelo ML
Platform
Eric Chen
Engineering Manager,
Michelangelo ML
Platform
Enable engineers and data scientists across the
company to easily build and deploy machine learning
solutions at scale.
ML-as-a-service
○ Managing Data/Features
○ Tools for managing, end-to-end, heterogenous
training workflows
○ Batch, online & mobile serving
○ Feature and Model drift monitoring
Michelangelo @ Uber
MANAGE DATA
TRAIN MODELS
EVALUATE MODELS
DEPLOY MODELS
MAKE PREDICTIONS
MONITOR PREDICTIONS
Feature Engineering @ Uber
○ Example: ETA for EATS order
○ Key ML features
○ How large is the order?
○ How busy is the restaurant?
○ How quick is the restaurant?
○ How busy is the traffic?
Managing Features
One of the hardest problems in ML
○ Finding good Features & labels
○ Data in production: reliability, scale, low latency
○ Data parity: training/serving skew
○ Real-time features: traditional tools don’t work
Palette Feature Store
Uber-specific curated and crowd-sourced feature database that is easy to use with machine
learning projects.
One stop shop
○ Search for features in single catalog/spec: rider, driver, restaurant, trip, eaters, etc.
○ Define new features + create production pipelines from spec
○ Share features across Uber: cut redundancy, use consistent data
○ Enable tooling: Data Drift Detection, Auto Feature Selection, etc.
Feature Store Organization
Organized as <entity>:<feature-group>:<feature-name>:<join-key>
Eg. @palette:restaurant:realtime_group:orders_last_30min:restaurant_uuid
Backed by a dual datastore system: similarities to lambda
○ Offline
○ Offline (Hive based) store for bulk access of features
○ Bulk retrieval of features across time
○ Online
○ KV store (Cassandra) for serving latest known value
○ Supports lookup/join of latest feature values in real time
○ Data synced between online & offline
○ Key to avoiding training/serving skew
EATS Features revisited
○ How large is the order? ← Input
○ How busy is the restaurant?
○ How quick is the restaurant?
○ How busy is the traffic?
Creating Batch Features
Offline Batch jobs
Features join
Model
Training
job
Online
Store
(Serving)
Offline
Store
(Training)
Features join
Model
Scoring
Service
Data
dispersal
Feature
Store
Palette
Feature
spec
General trends, not sensitive to
exact time of event
Ingested from Hive queries or
Spark jobs
How quick is the restaurant ?
○ Aggregate trends
○ Use Hive QL from warehouse
○ @palette:restaurant:batch_aggr:
prepTime:rId
Hive QL
Apache Hive and Apache Spark are either registered trademarks or
trademarks of the Apache Software Foundation in the United
States and/or other countries. No endorsement by The Apache
Software Foundation is implied by the use of this mark.
Creating Real-time Features
Flink-as-service
Streaming jobs
Features join
Model
Scoring
Service
Offline
Store
(Training)
Online
Store
(Serving)
Features join
Model
Training
job
Log +
Backfill
Feature
Store
Features reflecting the latest
state of the world
Ingest from streaming jobs
How busy is the restaurant ?
○ kafka topic with events
○ perform realtime aggregations
○ @palette:restaurant:rt_aggr:nMeal:
rId
Palette
Feature
spec
Apache Flink is either a registered trademark or trademark of the
Apache Software Foundation in the United States and/or other
countries. No endorsement by The Apache Software Foundation is
implied by the use of this mark.
Flink SQL
Bring Your Own Features
Feature maintained by customers
Mechanisms for hooking
serving/training endpoints
Users maintain data parity
How busy is the region ?
○ RPC: external traffic feed
○ Log RPCs for training
○ @palette:region:traffic:nBusy:regionId
Features join
Model
Scoring
Service
Features join
Model
Training
job
Palette
Feature
spec
Offline
Proxy
Online
Proxy
Custom
store
Batch
API
Service
endpoint
RPC
Palette Feature Joins
Join @basis features
with supplied
@palette features into
single feature vector
Join billion+ rows at
points-in-time:
dominates overhead
Join/Lookup 10s of
tables at serving time
at low latency
order_i
d
nOrder restaur
ant_uui
d
latlong Label
ETA
(trainin
g)
timesta
mp
1 4 uuid1 (10,20) 40m t1
2 3 uuid2 (30,40) 35m t2
rId prepTime timestamp
uuid1 20m t1
uuid2 15m t2
join_key = rId
@basis features
training/scoring feature vector
Time + Key
join
order
_id
rId latlong Label
ETA
(trainin
g)
prepT
ime
..
1 uuid1 (10,20) 40m 20m ..
2 uuid2 (30,40) 35m 15m ..
@palette:restaurant:agg_stats:prepTime:
restaurant_uuid
@palette:restaurant:re
altime_stats:nBusy:rId
@palette:region:stats:n
Busy:regionId
Done with Feature Engineering ?
○ Feature Store Features
○ nOrder: How large is the order? (basis)
○ nMeal: How busy is the restaurant? (near real-time)
○ prepTime: How quick is the restaurant? (batch feature)
○ nBusy: How busy is the traffic? (external feature)
○ Ready to use ?
○ Model specific feature transformations
○ Chaining of features
○ Feature Transformers
Feature Consumption
Feature Store Features
○ nOrder: input feature
○ nMeal: consume directly
○ prepTime: needs transformation before use
○ nBusy: input latlong but need regionId
Setting up consumption pipelines
○ nMeal: r_id -> nMeal
○ prepTime: r_id -> prepTime -> featureImpute
○ nBusy: r_id -> lat, log -> regionId(lat, log) -> nBusy
In arbitrary order
Michelangelo Transformers
Transformer: Given a record defined a set of fields, add/modify/remove fields in the record
PipelineModel: A sequence of transformers
Spark ML: Transformer/PipelineModel on DataFrames
Michelangelo Transformers: extended transformer framework for both Apache Spark and
Apache Spark-less environments
Estimator: Analyze the data and produce a transformer
Defining a Pipeline Model
Join Palette Features
Apply Feature Eng Rules
String Indexing
One-Hot Encoding
DL Inferencing
Result Retrieval
Feature consumption
○ Feature extraction: Palette feature retrieval expressed as a transform
○ Feature mutation: Scala-like DSL for simple transforms
○ Model-centric Feature Engineering: string indexer, one-hot encoder,
threshold decision
○ Result retrieval
Modeling
○ Model inferencing (also Michelangelo Transformer)
Michelangelo Transformers Example
class MyModel (override val uid: String) extends Model[MyModel] with
MyModelParam with MLWritable with MATransformer {
...
override def transform(dataset: Dataset[_]): DataFrame = ...
override def scoreInstance(instance: util.Map[String, Object]): util.Map[String,
Object] = ...
}
class MyEstimator(override val uid: String) extends
Estimator[MyEstimator] with Params with DefaultParamsWritable {
...
override def fit(dataset: Dataset[_]): MyModel = ...
}
Palette retrieval as a Transformer
Palette Feature Transformer
Feature Meta Store
RPC Feature Proxy
Cassandra Access
Hive Access
tx_p1 = PaletteTransformer([
"@palette:restaurant:realtime_feature:nMeal:r_id",
"@palette:restaurant:batch_feature:prepTime:r_id",
"@palette:restaurant:property:lat:r_id",
"@palette:restaurant:property:log:r_id"
])
tx_p2 = PaletteTransformer([
"@palette:region:service_feature:nBusy:region_id"
])
DSL Estimator / Transformer
DSL Estimator
Code Gen / Compiler
DSL Transformer
Online classloader Offline classloader
es_dsl1 = DSLEstimator(lambdas = [
["region_id", "regionId(@palette:restaurant:property:lat:r_id,
@palette:restaurant:property:r_id"]
])
es_dsl2 = DSLEstimator(lambdas = [
["prepTime": nFill(nVal("@palette:restaurant:batch_feature:prepTime:r_id"),
avg("@palette:restaurant:batch_feature:prepTime:r_id")))"],
["nMeal": nVal("@palette:restaurant:realtime_feature:nMean:r_id")],
["nOrder": nVal("@basis:nOrder")],
["nBusy": nVal("@palette:region:service_feature:nBusy:region_id")]
])
Uber Eats Example Cont.
Computation order
○ nMeal: rId -> nMeal
○ prepTime: rId -> prepTime -> featureImpute
○ busyScale: rId -> lat, log -> regionId(lat, log) -> busyScale
Palette Transformer
id -> nMeal
id -> prepTime
id -> lat, log
DSL Transformer
lag, log -> regionID
Palette Transformer
regionID -> nBusy
DSL Transformer
impute(nMeal)
impute(prepTime)
Dev Tools: Authoring and Debugging a Pipeline
Palette feature generation
● Apache Hive QL, Apache Flink SQL
Interactive authoring
● PySpark + iPython Jupyter notebook
Centralized model store
● Serialization / Deserialization (Spark ML,
MLReadable/Writeable)
● Online and offline accessibility
basis_feature_sql = "..."
df = spark.sql(basis_feature_sql)
pipeline = Pipeline(stages=[tx_p1, es_dsl1, tx_p2, es_dsl2t, vec_asm, l_r)
pipeline_model = pipeline.fit(df)
scored_def = pipeline_model.transform(df)
model_id = MA_store.save_model(pipeline_model)
draft_id = MA_store.save_pipeline(basis_feature_sql, pipeline)
retrain_job = MA_API.train(draft_id, new_basis_feature_sql)
Takeaways
Feature Store: Batch, Realtime and External Features with online and offline parity
Offline scalability: Joins across billions of rows
Online serving latency: Parallel IO, fast storage with caching
Feature Transformers: Setup chains of transformations at training/serving time
Pipeline reliability and monitoring out-of-the-box
Thank you
https://eng.uber.com/michelangelo/
https://www.uber.com/careers/

Más contenido relacionado

La actualidad más candente

Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 BasicFlink Forward
 
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Databricks
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudMárton Kodok
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPDatabricks
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup Omid Vahdaty
 
NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer GuideDeon Huang
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheusKasper Nissen
 

La actualidad más candente (20)

Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 Basic
 
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer Guide
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheus
 

Similar a 2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber

Building a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodBuilding a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodDatabricks
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...Chetan Khatri
 
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...Piyush Kumar
 
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...Data Con LA
 
Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseFeatureByte
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
Sprint 49 review
Sprint 49 reviewSprint 49 review
Sprint 49 reviewManageIQ
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Karthik Murugesan
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkFaisal Siddiqi
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkWenrui Meng
 
Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastDatabricks
 
Scaling machinelearning as a service at uber li Erran li - 2016
Scaling machinelearning as a service at uber li Erran li - 2016Scaling machinelearning as a service at uber li Erran li - 2016
Scaling machinelearning as a service at uber li Erran li - 2016Karthik Murugesan
 
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016PAPIs.io
 
Ecommerce as an Engine
Ecommerce as an EngineEcommerce as an Engine
Ecommerce as an Enginestephskardal
 

Similar a 2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber (20)

Building a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodBuilding a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFood
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
 
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
 
KFServing and Feast
KFServing and FeastKFServing and Feast
KFServing and Feast
 
Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data Warehouse
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Sprint 49 review
Sprint 49 reviewSprint 49 review
Sprint 49 review
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
Sprint 59
Sprint 59Sprint 59
Sprint 59
 
Sprint 54
Sprint 54Sprint 54
Sprint 54
 
Sprint 53
Sprint 53Sprint 53
Sprint 53
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache Flink
 
Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and Feast
 
Scaling machinelearning as a service at uber li Erran li - 2016
Scaling machinelearning as a service at uber li Erran li - 2016Scaling machinelearning as a service at uber li Erran li - 2016
Scaling machinelearning as a service at uber li Erran li - 2016
 
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
 
Ecommerce as an Engine
Ecommerce as an EngineEcommerce as an Engine
Ecommerce as an Engine
 

Más de Karthik Murugesan

Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation PlatformKarthik Murugesan
 
Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesKarthik Murugesan
 
Free servers to build Big Data Systems on: Bing's Approach
Free servers to build Big Data Systems on: Bing's  Approach Free servers to build Big Data Systems on: Bing's  Approach
Free servers to build Big Data Systems on: Bing's Approach Karthik Murugesan
 
Microsoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionMicrosoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionKarthik Murugesan
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2Karthik Murugesan
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesKarthik Murugesan
 
The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019Karthik Murugesan
 
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
Unifying Twitter around a single ML platform  - Twitter AI Platform 2019Unifying Twitter around a single ML platform  - Twitter AI Platform 2019
Unifying Twitter around a single ML platform - Twitter AI Platform 2019Karthik Murugesan
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...Karthik Murugesan
 
The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019Karthik Murugesan
 
Developing a ML model using TF Estimator
Developing a ML model using TF EstimatorDeveloping a ML model using TF Estimator
Developing a ML model using TF EstimatorKarthik Murugesan
 
Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Karthik Murugesan
 
Netflix factstore for recommendations - 2018
Netflix factstore  for recommendations - 2018Netflix factstore  for recommendations - 2018
Netflix factstore for recommendations - 2018Karthik Murugesan
 
Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Karthik Murugesan
 
Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Karthik Murugesan
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoveryKarthik Murugesan
 
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform Karthik Murugesan
 
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Karthik Murugesan
 

Más de Karthik Murugesan (20)

Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
 
Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slides
 
Free servers to build Big Data Systems on: Bing's Approach
Free servers to build Big Data Systems on: Bing's  Approach Free servers to build Big Data Systems on: Bing's  Approach
Free servers to build Big Data Systems on: Bing's Approach
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
 
Microsoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionMicrosoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER Introduction
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019
 
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
Unifying Twitter around a single ML platform  - Twitter AI Platform 2019Unifying Twitter around a single ML platform  - Twitter AI Platform 2019
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...
 
The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019
 
Developing a ML model using TF Estimator
Developing a ML model using TF EstimatorDeveloping a ML model using TF Estimator
Developing a ML model using TF Estimator
 
Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018
 
Netflix factstore for recommendations - 2018
Netflix factstore  for recommendations - 2018Netflix factstore  for recommendations - 2018
Netflix factstore for recommendations - 2018
 
Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Trends in Music Recommendations 2018
Trends in Music Recommendations 2018
 
Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017
 
State Of AI 2018
State Of AI 2018State Of AI 2018
State Of AI 2018
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music Discovery
 
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
 
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
 

Último

定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Lucknow
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 

Último (20)

定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 

2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber

  • 1. Michelangelo Palette Feature Engineering @ Uber Amit Nene Staff Engineer, Michelangelo ML Platform Eric Chen Engineering Manager, Michelangelo ML Platform
  • 2. Enable engineers and data scientists across the company to easily build and deploy machine learning solutions at scale. ML-as-a-service ○ Managing Data/Features ○ Tools for managing, end-to-end, heterogenous training workflows ○ Batch, online & mobile serving ○ Feature and Model drift monitoring Michelangelo @ Uber MANAGE DATA TRAIN MODELS EVALUATE MODELS DEPLOY MODELS MAKE PREDICTIONS MONITOR PREDICTIONS
  • 3. Feature Engineering @ Uber ○ Example: ETA for EATS order ○ Key ML features ○ How large is the order? ○ How busy is the restaurant? ○ How quick is the restaurant? ○ How busy is the traffic?
  • 4. Managing Features One of the hardest problems in ML ○ Finding good Features & labels ○ Data in production: reliability, scale, low latency ○ Data parity: training/serving skew ○ Real-time features: traditional tools don’t work
  • 5. Palette Feature Store Uber-specific curated and crowd-sourced feature database that is easy to use with machine learning projects. One stop shop ○ Search for features in single catalog/spec: rider, driver, restaurant, trip, eaters, etc. ○ Define new features + create production pipelines from spec ○ Share features across Uber: cut redundancy, use consistent data ○ Enable tooling: Data Drift Detection, Auto Feature Selection, etc.
  • 6. Feature Store Organization Organized as <entity>:<feature-group>:<feature-name>:<join-key> Eg. @palette:restaurant:realtime_group:orders_last_30min:restaurant_uuid Backed by a dual datastore system: similarities to lambda ○ Offline ○ Offline (Hive based) store for bulk access of features ○ Bulk retrieval of features across time ○ Online ○ KV store (Cassandra) for serving latest known value ○ Supports lookup/join of latest feature values in real time ○ Data synced between online & offline ○ Key to avoiding training/serving skew
  • 7. EATS Features revisited ○ How large is the order? ← Input ○ How busy is the restaurant? ○ How quick is the restaurant? ○ How busy is the traffic?
  • 8. Creating Batch Features Offline Batch jobs Features join Model Training job Online Store (Serving) Offline Store (Training) Features join Model Scoring Service Data dispersal Feature Store Palette Feature spec General trends, not sensitive to exact time of event Ingested from Hive queries or Spark jobs How quick is the restaurant ? ○ Aggregate trends ○ Use Hive QL from warehouse ○ @palette:restaurant:batch_aggr: prepTime:rId Hive QL Apache Hive and Apache Spark are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of this mark.
  • 9. Creating Real-time Features Flink-as-service Streaming jobs Features join Model Scoring Service Offline Store (Training) Online Store (Serving) Features join Model Training job Log + Backfill Feature Store Features reflecting the latest state of the world Ingest from streaming jobs How busy is the restaurant ? ○ kafka topic with events ○ perform realtime aggregations ○ @palette:restaurant:rt_aggr:nMeal: rId Palette Feature spec Apache Flink is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of this mark. Flink SQL
  • 10. Bring Your Own Features Feature maintained by customers Mechanisms for hooking serving/training endpoints Users maintain data parity How busy is the region ? ○ RPC: external traffic feed ○ Log RPCs for training ○ @palette:region:traffic:nBusy:regionId Features join Model Scoring Service Features join Model Training job Palette Feature spec Offline Proxy Online Proxy Custom store Batch API Service endpoint RPC
  • 11. Palette Feature Joins Join @basis features with supplied @palette features into single feature vector Join billion+ rows at points-in-time: dominates overhead Join/Lookup 10s of tables at serving time at low latency order_i d nOrder restaur ant_uui d latlong Label ETA (trainin g) timesta mp 1 4 uuid1 (10,20) 40m t1 2 3 uuid2 (30,40) 35m t2 rId prepTime timestamp uuid1 20m t1 uuid2 15m t2 join_key = rId @basis features training/scoring feature vector Time + Key join order _id rId latlong Label ETA (trainin g) prepT ime .. 1 uuid1 (10,20) 40m 20m .. 2 uuid2 (30,40) 35m 15m .. @palette:restaurant:agg_stats:prepTime: restaurant_uuid @palette:restaurant:re altime_stats:nBusy:rId @palette:region:stats:n Busy:regionId
  • 12. Done with Feature Engineering ? ○ Feature Store Features ○ nOrder: How large is the order? (basis) ○ nMeal: How busy is the restaurant? (near real-time) ○ prepTime: How quick is the restaurant? (batch feature) ○ nBusy: How busy is the traffic? (external feature) ○ Ready to use ? ○ Model specific feature transformations ○ Chaining of features ○ Feature Transformers
  • 13. Feature Consumption Feature Store Features ○ nOrder: input feature ○ nMeal: consume directly ○ prepTime: needs transformation before use ○ nBusy: input latlong but need regionId Setting up consumption pipelines ○ nMeal: r_id -> nMeal ○ prepTime: r_id -> prepTime -> featureImpute ○ nBusy: r_id -> lat, log -> regionId(lat, log) -> nBusy In arbitrary order
  • 14. Michelangelo Transformers Transformer: Given a record defined a set of fields, add/modify/remove fields in the record PipelineModel: A sequence of transformers Spark ML: Transformer/PipelineModel on DataFrames Michelangelo Transformers: extended transformer framework for both Apache Spark and Apache Spark-less environments Estimator: Analyze the data and produce a transformer
  • 15. Defining a Pipeline Model Join Palette Features Apply Feature Eng Rules String Indexing One-Hot Encoding DL Inferencing Result Retrieval Feature consumption ○ Feature extraction: Palette feature retrieval expressed as a transform ○ Feature mutation: Scala-like DSL for simple transforms ○ Model-centric Feature Engineering: string indexer, one-hot encoder, threshold decision ○ Result retrieval Modeling ○ Model inferencing (also Michelangelo Transformer)
  • 16. Michelangelo Transformers Example class MyModel (override val uid: String) extends Model[MyModel] with MyModelParam with MLWritable with MATransformer { ... override def transform(dataset: Dataset[_]): DataFrame = ... override def scoreInstance(instance: util.Map[String, Object]): util.Map[String, Object] = ... } class MyEstimator(override val uid: String) extends Estimator[MyEstimator] with Params with DefaultParamsWritable { ... override def fit(dataset: Dataset[_]): MyModel = ... }
  • 17. Palette retrieval as a Transformer Palette Feature Transformer Feature Meta Store RPC Feature Proxy Cassandra Access Hive Access tx_p1 = PaletteTransformer([ "@palette:restaurant:realtime_feature:nMeal:r_id", "@palette:restaurant:batch_feature:prepTime:r_id", "@palette:restaurant:property:lat:r_id", "@palette:restaurant:property:log:r_id" ]) tx_p2 = PaletteTransformer([ "@palette:region:service_feature:nBusy:region_id" ])
  • 18. DSL Estimator / Transformer DSL Estimator Code Gen / Compiler DSL Transformer Online classloader Offline classloader es_dsl1 = DSLEstimator(lambdas = [ ["region_id", "regionId(@palette:restaurant:property:lat:r_id, @palette:restaurant:property:r_id"] ]) es_dsl2 = DSLEstimator(lambdas = [ ["prepTime": nFill(nVal("@palette:restaurant:batch_feature:prepTime:r_id"), avg("@palette:restaurant:batch_feature:prepTime:r_id")))"], ["nMeal": nVal("@palette:restaurant:realtime_feature:nMean:r_id")], ["nOrder": nVal("@basis:nOrder")], ["nBusy": nVal("@palette:region:service_feature:nBusy:region_id")] ])
  • 19. Uber Eats Example Cont. Computation order ○ nMeal: rId -> nMeal ○ prepTime: rId -> prepTime -> featureImpute ○ busyScale: rId -> lat, log -> regionId(lat, log) -> busyScale Palette Transformer id -> nMeal id -> prepTime id -> lat, log DSL Transformer lag, log -> regionID Palette Transformer regionID -> nBusy DSL Transformer impute(nMeal) impute(prepTime)
  • 20. Dev Tools: Authoring and Debugging a Pipeline Palette feature generation ● Apache Hive QL, Apache Flink SQL Interactive authoring ● PySpark + iPython Jupyter notebook Centralized model store ● Serialization / Deserialization (Spark ML, MLReadable/Writeable) ● Online and offline accessibility basis_feature_sql = "..." df = spark.sql(basis_feature_sql) pipeline = Pipeline(stages=[tx_p1, es_dsl1, tx_p2, es_dsl2t, vec_asm, l_r) pipeline_model = pipeline.fit(df) scored_def = pipeline_model.transform(df) model_id = MA_store.save_model(pipeline_model) draft_id = MA_store.save_pipeline(basis_feature_sql, pipeline) retrain_job = MA_API.train(draft_id, new_basis_feature_sql)
  • 21. Takeaways Feature Store: Batch, Realtime and External Features with online and offline parity Offline scalability: Joins across billions of rows Online serving latency: Parallel IO, fast storage with caching Feature Transformers: Setup chains of transformations at training/serving time Pipeline reliability and monitoring out-of-the-box