MetaConfig driven
FeatureStore@MakeMyTrip
~/Piyush
Head Data Platform Engineering
Namasté
About MakeMyTrip
Deliverables of this presentation:
- Why common feature store?
- Productionizing ML via standardization
- Machine Learning Life Cycle
- Prediction Serving + Challenges
- FeatureStore Components
- Architecture
- Tools
- Next Steps
- References
Motivation
Developing a unified personalization platform to improve the customer experience of millions of Indian
travellers
Business Goal: Through Hyper Personalization
● Raise Engagement
● Drive Conversions + Boost Revenue
● Migrate Business Rule Engines to ML Models (across different LOBs @MakeMyTrip)
Tech Goal:
● Machine Learning models are only as good as the data they are trained on, so they need good data management.
● ML systems are trained on a set of features. A feature is an input to a model: a column in a
dataset, a complex computed metric, or the output of another model.
● Feature Store is a central, common repository of highly curated features, each described through
well-structured configuration. It enables us to scale machine learning workflows @MakeMyTrip.
Before Feature Store : state of data platform
● Siloed data sets + serving APIs created per use case / project, leading to complex
data pipelines | Machine Learning, if not implemented in the right manner, creates high tech debt
○ Personalization : Cosmos
○ Customer Segmentation : HYDRA
○ Hotel Ranking / Sequencing + Intendo
○ DP : Dynamic / Differential Pricing : Hotel & Flights
○ Anomaly Detection, Destination trends, Demand Anomalies
● RealTime Features require Data Engineering support from Data Scientists
● Lack of standardization & discovery : The same feature definition is duplicated across
different data pipelines and computed multiple times, so a change to a
definition means fixing it in every pipeline.
● Features used in training and serving were inconsistent
Productionizing ML via Standardization
● MetaConfigs & Feature Catalog : Documentation
● Reusability of features across projects / teams
● Standardized access of features between Training &
Serving | Data Governance + Data Quality
● More Self-serve : Reduces Data Scientist Time on DE
Tasks
● Reduce Time to get to Production for ML Projects
● Reduce Data Tech-Debt & Improved Feature Quality
[Diagram labels: Raw Data (Data Store 1, Data Store 2 ... Data Store N); Structured Data (Data Sets 1 ... Data Sets N); Feature Engineering; Feature Store : Online + Historical; MODEL : TRAINING + DEPLOY]
Machine Learning Life Cycle
ML LifeCycle Image source : UCB RISE LABs
Addition : FEATURE PIPELINES
Prediction serving
- ASK : 10 -30 ms / < 30 ms
- Challenges : DNN : Complex models
- Hardware : GPUs / TPUs
- SageMaker provides an abstraction / middle layer between applications and complex
models through Docker containers
- Online : SageMaker Endpoints
- Batch : Scoring : Pre-materialize predictions into a low latency store (like a Redis
cluster / BoulderDB)
- Problems :
- Requires substantial computation and space
- Example doing the scoring for all customers
- Costly update -> rescore everything
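The batch scoring trade-off can be illustrated with a minimal sketch. The dict below stands in for the low latency store (Redis cluster / BoulderDB), and `score_v1` / `score_v2` are hypothetical placeholders for a trained model; none of these names come from the talk.

```python
from typing import Callable, Dict, Iterable


def materialize_predictions(
    customer_ids: Iterable[str],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Score every customer up front so serving becomes a key lookup."""
    return {cid: score(cid) for cid in customer_ids}


def score_v1(customer_id: str) -> float:
    # Hypothetical stand-in for a trained model.
    return len(customer_id) / 10.0


def score_v2(customer_id: str) -> float:
    # Hypothetical retrained model: every stored prediction is now stale.
    return len(customer_id) / 5.0


customers = ["u1", "u22", "u333"]

# Batch job: compute cost and storage grow with the number of customers.
store = materialize_predictions(customers, score_v1)

# Serving path: a single lookup, no model call at request time.
prediction = store["u22"]

# Model update: the whole population has to be rescored.
store = materialize_predictions(customers, score_v2)
```

Serving stays cheap because the request path never calls the model, but every model change pays the full rescoring cost again.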
FeatureStore Glossary
Feature : a measurable property of a phenomenon
under observation defined in FSConfig
FSConfig: used for storing config/ DSL + code to
compute features, feature version information,
feature analysis data and feature documentation
FSCompute: Computation Engine developed over
SPARK, supports most of the Spark APIs for historical
(batch) and online (streaming) computation
FeatureStore : serves as a repository of features that
can be used for training and evaluation of machine
learning models.
FeatureGroup: internal to the system, to group
common compute jobs of related features having the
same entity, input data sources and filter conditions,
thereby optimizing the compute process.
FSScheduler: Internal service to create a feature
DAG(with Dependency Resolution) and trigger their
execution while handling retries and back pressure.
FS-DSA : Data Science Automation for Model Training
+ Deployment integrated with Feature Store |
Enables versioned and reproducible experiments.
FSBrokerAPI : Online Serving RESTful API endpoint for
consumer applications
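The dependency resolution that FSScheduler performs on a feature DAG can be sketched with Python's standard `graphlib` module (Python 3.9+). This is an illustration under assumptions, not the FSScheduler implementation: the feature names and their dependencies are hypothetical, and retries and back pressure are not modelled.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependencies: feature -> features it must be computed after.
feature_deps: Dict[str, Set[str]] = {
    "listing_view_count": set(),
    "listing_booking_count": set(),
    "listing_conversion_rate": {"listing_view_count", "listing_booking_count"},
    "listing_conversion_rank": {"listing_conversion_rate"},
}


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return features ordered so that every dependency comes first.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(feature_deps)
```

A scheduler would trigger compute jobs in this order, or run independent features in parallel once their dependencies have finished.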
FeatureStore Components & Data Flow
[Diagram labels, grouped by stage: DATA CAPTURE | COMPUTE + FSConfig | SERVING + STORAGE]
DATA CAPTURE
- User Funnel Activity Streams : Client-Side, Server-Side
- Transactional Data : Booking Master
- Master Datastore : Product Master, User Master, Device Master
- New Data Streams
COMPUTE + FSConfig
- FSConfig : Feature Catalog, Feature Definitions
- BT-Compute : BATCH Feature Compute Jobs
- RT-Compute : Feature Compute
- Job Scheduler
- ML Automation : Sagemaker | TRAIN : Training + HPO | Deploy : Docker / Batch Transform
SERVING + STORAGE
- SERVING API : Offline Models, Online Models
- Batch BULK API (DataFrame)
- Feature Storage : BoulderDB, REDIS
FSConfig : Feature Definitions & Metadata
Feature Name : <Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>
Entity : <UserID>_<profileType>
Short Name : listing_conversion_rank
Versioning : v2
Process : RT/BT
FeatureGroup : (System Generated ID) 8fda73d1_2eee_4cfc_a20f_e9afb178fbc3
Entity : ["uuid", "profile_type"]
Features [Array]
Time Window (Refresh / Data - Time duration) : (ISO Time Interval) P1D
Data Source [Array] : [user_master, txn_search]
- Data Store : GLUE/S3
- Database Name : blueshift
- Table Name : [user_master, txn_search]
Data Sink : Serving [Array]
- Data Store : GLUE Catalog/S3/Redis/BoulderDb
- Database Name : rocksDB_<WAL Dir Path>
- Table Name : rocksDB_<columnFamily>
Compute Logic
- DSL + Spark SQL : metric_expr, group_by_expr, filter_expr, window_function, window_function_alias
- Code (Python/Scala/Java) : GIT/Gerrit URI
- Model (sagemaker) / Embedding
Environment : Production
Workspace : Dev/Staging/Production
Namespace : <Project Name>
Apache LIVY + Databricks JOBs API Config
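The `::`-separated feature name format can be handled with a small parser. This is a minimal sketch: the `FeatureName` type and helper functions are hypothetical names, and the sample name is the one shown in the historical output schema of this deck. The interval fields are kept as raw strings rather than interpreted.

```python
from typing import NamedTuple

SEPARATOR = "::"


class FeatureName(NamedTuple):
    entity: str
    short_name: str
    data_interval: str      # e.g. "P30D", kept as the raw string
    refresh_frequency: str  # e.g. "P15M", kept as the raw string
    version: str


def parse_feature_name(name: str) -> FeatureName:
    """Split a fully qualified feature name into its five components."""
    parts = name.split(SEPARATOR)
    if len(parts) != 5:
        raise ValueError(f"expected 5 '::'-separated parts, got {len(parts)}: {name!r}")
    return FeatureName(*parts)


def build_feature_name(feature: FeatureName) -> str:
    """Join the components back into the fully qualified name."""
    return SEPARATOR.join(feature)


parsed = parse_feature_name("uuid_profileType::listing_conv_rank::P30D::P15M::v1")
```

Keeping the version inside the name means two versions of the same feature can coexist in one store without a key collision.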
FS Store | online + historical
Output Schema (internal to the system)
● Historical Feature Data schema on S3 Parquet
|-- entity: string (nullable = false)
|-- uuid_profileType::listing_conv_rank::P30D::P15M::v1: long (nullable = false)
|-- uuid_profileType::listing_view_rank::P30D::P15M::v1: long (nullable = false)
|-- uuid_profileType::cnt_distinct_bk_bankid::P30D::P15M::v1: map (nullable = false)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
..
..
All features in that feature group
● Online Serving Data Schema on REDIS + BoulderDB
○ Serving at Feature Group level
Key -> <Entity_id>#<Feature_group_id>/<Feature_split>
Value -> Hashes
key -> Feature_name
Value -> Feature_value
TimeStamp -> Compute_Processed_Time
○ Serving at Feature Level
Key -> <Entity_id>#<Feature_name>
Value -> Hashes
key -> Feature_name
Value -> Feature_value
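The two online key layouts can be sketched as follows. An in-memory dict stands in for Redis hashes here, and the entity id, feature group id, split and timestamp values are hypothetical. Storing `TimeStamp` as one more field of the group-level hash is an assumption about how the schema above is laid out.

```python
from typing import Dict, Optional

# In-memory stand-in for Redis hashes: key -> {field -> value}.
online_store: Dict[str, Dict[str, str]] = {}


def feature_group_key(entity_id: str, feature_group_id: str, feature_split: str) -> str:
    """Key for serving at feature group level."""
    return f"{entity_id}#{feature_group_id}/{feature_split}"


def feature_key(entity_id: str, feature_name: str) -> str:
    """Key for serving at feature level."""
    return f"{entity_id}#{feature_name}"


def write_feature_group(
    entity_id: str,
    feature_group_id: str,
    feature_split: str,
    features: Dict[str, str],
    processed_time: str,
) -> str:
    """Store all features of a group under one key, plus the compute timestamp."""
    key = feature_group_key(entity_id, feature_group_id, feature_split)
    fields = dict(features)
    fields["TimeStamp"] = processed_time  # assumed placement of Compute_Processed_Time
    online_store[key] = fields
    return key


def read_feature(key: str, feature_name: str) -> Optional[str]:
    """Return one field of the hash, or None if the key or field is missing."""
    return online_store.get(key, {}).get(feature_name)


name = "uuid_profileType::listing_conv_rank::P30D::P15M::v1"
group_key = write_feature_group("u42_personal", "fg_demo", "0", {name: "7"}, "2019-08-17T10:00:00Z")
value = read_feature(group_key, name)
```

Group-level keys let one lookup return every feature computed together, while feature-level keys suit consumers that need a single value.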
SERVING Config
- lambda (batch_feature_name
linkage for RT features)
- Support for linear QUERY DAGs
- MVEL based post-processing on any
feature per service/model if needed
Feature backfill (back_fill_required,
back_fill_duration)
FS-BrokerAPI : Online Feature Serving Framework
[Diagram labels, grouped by layer: REQUEST HANDLER | Orchestration Layer | Data Access Layer]
- REQUEST HANDLER : <uri>/v1/getFeatures (POST Request), AKKA (Actors), Request Validations, Feature Definition
- Orchestration Layer : Orchestration + Broker, Business Logics + MVEL
- Data Access Layer : Extractors, Transport, REDIS, BoulderDB
- Lookups : FeaturesbyName, FeaturesbyModel, FeaturesbyService
BoulderDB : Online Serving Store
- Built on top of RocksDB (an embedded data store developed by Facebook), reducing
the distance to data on the serving layer.
- Post-processing steps added to the compute layer:
- After processing data through SPARK (distributed), the BT-Compute layer writes SST files from the
various executors into shared object storage : S3
- The Spark dataframe is split into non-overlapping ranges; each split is sorted by KEY, then ingested
into an SST file per partition / executor
- Cluster coordinator : Consul
- Atomic switching of DB snapshots
- Data is sharded (helps with proximity by Namespace) and replicated(RF=2)
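The range-splitting step can be sketched in plain Python. This only illustrates the partitioning and per-split sorting; it does not use Spark or write SST files, and the boundary keys and rows are hypothetical. Sorting matters because RocksDB's SST file writer expects keys to be added in sorted order.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

Row = Tuple[str, int]


def split_into_ranges(rows: Sequence[Row], boundaries: List[str]) -> List[List[Row]]:
    """Assign rows to non-overlapping key ranges and sort each split by key.

    `boundaries` must be sorted. With boundaries [b0, b1] the ranges are
    key < b0, b0 <= key < b1, and key >= b1.
    """
    splits: List[List[Row]] = [[] for _ in range(len(boundaries) + 1)]
    for key, value in rows:
        splits[bisect_right(boundaries, key)].append((key, value))
    return [sorted(split, key=lambda row: row[0]) for split in splits]


rows = [("m", 3), ("a", 1), ("z", 5), ("g", 2), ("p", 4)]
splits = split_into_ranges(rows, ["g", "p"])
```

Because the ranges do not overlap, each split can become its own file and the files can be ingested independently.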
Tools
Next Steps
- Feature Stats Visualization / Analytics & Monitoring // Feature
Catalog
- Seamless integration with Experimentation Framework
- Per User Databases on top of feature-store for Personalization
- Notebook integration : Better data science tools for Data
Scientists with Python libraries
- Perf Tools : Query Optimization & Analysis
References
- https://www.logicalclocks.com/feature-store/
- https://eng.uber.com/scaling-michelangelo/
- Airbnb : Zipline
- HopsML + Hopsworks
- Go-JEK : FEAST
- The Design of Systems for Real-time Prediction Serving | DataEngConf SF '18
- https://medium.com/makemytrip-engineering
Piyush Kumar
E : piyush.kumar@makemytrip.com
W : www.makemytrip.com
T : https://twitter.com/piykumar
Thank you !!

Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serving Platform powering Machine Learning @MakeMyTrip by Piyush Kumar

  • 3. Deliverables of this presentation: - Why common feature store? - Productionizating ML via standardization - Machine Learning Life Cycle - Prediction Serving + Challenges - FeatureStore Components - Architecture - Tools - Next Steps - References
  • 4. Motivation Developing Unified Personalization platform for improving customer experience of millions of Indian travellers Business Goal: Through Hyper Personalization ● Raise Engagement ● Drive Conversions + Boost Revenue ● Migrating Business Rule Engines to ML Models ( across different LOBs @MakeMyTrip) Tech Goal: ● Machine Learning Models are as good as the data they are trained on. Needs good Data Management. ● ML Systems are trained on set of features, a feature is a input to model which can be a column in a dataset or complex computed metric or some other model output too ● Feature Store is a central common repository for highly curated features which are described through well structured configuration. Enables us to scale machine learning workflows @MakeMyTrip.
  • 5. Before Feature Store : state of data platform ● Siloed Data Sets + Serving APIs created per use-case / projects leading to complex data pipelines | Machine Learning if not implemented in right manner creates high tech debt ○ Personalization : Cosmos ○ Customer Segmentation : HYDRA ○ Hotel Ranking / Sequencing + Intendo ○ DP : Dynamic / Differential Pricing : Hotel & Flights ○ Anomaly Detection, Destination trends, Demand Anomalies ● RealTime Features require Data Engineering support from Data Scientists ● Lack of standardization & discovery : Feature definitions are duplicated into the different data pipelines even if it is same / computed multiple times and change to definitions means fixing across different pipelines. ● Features used in training and serving were inconsistent
  • 6. Productionizing ML via Standardization ● MetaConfigs & Feature Catalog : Documentation ● Reusability of features across projects / teams ● Standardized access of features between Training & Serving | Data Governance + Data Quality ● More Self-serve : Reduces Data Scientist Time on DE Tasks ● Reduce Time to get to Production for ML Projects ● Reduce Data Tech-Debt & Improved Feature Quality Feature Store : Online + Historical Data Store 1 Data Store 2 Data Store N Raw Data Data Sets 1 Data Sets N Structured Data Feature Engineering MODEL : TRAINING + DEPLOY
  • 7. Machine Learning Life Cycle
    [Figure: ML life cycle. Image source : UCB RISE Lab. Addition : FEATURE PIPELINES]
  • 8. Prediction serving
    - Ask : 10-30 ms (< 30 ms) response time
    - Challenges : DNNs : complex models
    - Hardware : GPUs / TPUs
    - SageMaker provides an abstraction / middle layer between applications and complex models through Docker containers
    - Online : SageMaker Endpoints
    - Batch scoring : pre-materialize predictions into a low-latency store (like a Redis cluster / BoulderDB)
    - Problems :
      - Requires substantial computation and space, for example when scoring all customers
      - Costly updates -> rescore everything
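The batch-scoring trade-off above can be illustrated with a minimal sketch. The scoring function, customer IDs, and the plain `dict` standing in for the low-latency store are all hypothetical; the slides do not describe the actual scoring job.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def materialize_predictions(
    score_fn: Callable[[Features], float],
    customer_features: Dict[str, Features],
) -> Dict[str, float]:
    """Score every customer up front and return a lookup table of predictions.

    The returned dict stands in for a low-latency store (an assumption).
    Cost grows with the number of customers, and any change to score_fn
    means calling this again for everyone.
    """
    return {cid: score_fn(feats) for cid, feats in customer_features.items()}


# Hypothetical model: a fixed linear score over two made-up features.
def toy_model(feats: Features) -> float:
    return 0.5 * feats["views"] + 2.0 * feats["bookings"]


customers = {
    "u1": {"views": 10.0, "bookings": 1.0},
    "u2": {"views": 4.0, "bookings": 0.0},
}

predictions = materialize_predictions(toy_model, customers)
```

At serving time a request only needs `predictions["u1"]`, which is what keeps latency low; the price is the full rescoring pass whenever the model or its inputs change.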
  • 9. FeatureStore Glossary
    Feature : a measurable property of a phenomenon under observation, defined in FSConfig.
    FSConfig : stores the config / DSL + code used to compute features, feature version information, feature analysis data, and feature documentation.
    FSCompute : computation engine built on Spark; supports most of the Spark APIs for historical and online (streaming) computation.
    FeatureStore : a repository of features that can be used for training and evaluating machine learning models.
    FeatureGroup : internal to the system; groups the compute jobs of related features that share the same entity, input data sources, and filter conditions, thereby optimizing the compute process.
    FSScheduler : internal service that creates a feature DAG (with dependency resolution) and triggers execution while handling retries and back pressure.
    FS-DSA : Data Science Automation for model training + deployment, integrated with the Feature Store | enables versioned and reproducible experiments.
    FSBrokerAPI : online serving RESTful API endpoint for consumer applications.
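The dependency resolution and retry handling attributed to FSScheduler can be sketched as follows. This is only an illustration using Python's standard `graphlib`; the feature names, the dependency map, and the retry policy are assumptions, not the scheduler's real implementation (back pressure is not modelled).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return features ordered so that every dependency runs before its dependents.

    Raises graphlib.CycleError if the dependency graph contains a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


def run_with_retries(task: Callable[[], object], max_attempts: int = 3) -> object:
    """Call task until it succeeds, re-raising the error after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise


# Hypothetical dependency map: feature -> set of features it depends on.
feature_deps = {
    "listing_view_rank": set(),
    "listing_conv_rank": {"listing_view_rank"},
    "conv_score": {"listing_conv_rank", "listing_view_rank"},
}

order = execution_order(feature_deps)
```

A scheduler built this way would walk `order` and wrap each compute job in `run_with_retries`.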
  • 10. FeatureStore Components & Data Flow
    [Architecture diagram; labels recovered from the slide]
    DATA CAPTURE : User Funnel Activity Streams (Client-Side, Server-Side); Transactional Data (Booking Master); Master Datastore (Product Master, User Master, Device Master); New Data Streams
    COMPUTE + FSConfig : FSConfig : Feature Catalog (Feature Definitions); BT-Compute (BATCH Feature Compute Jobs); RT-Compute (Feature Compute); Job Scheduler; ML Automation; Sagemaker TRAIN (Training + HPO), Deploy (Docker / Batch Transform)
    SERVING + STORAGE : Feature Storage (BoulderDB, REDIS); SERVING API (Offline Models, Online Models); Batch BULK API (DataFrame)
  • 11. FSConfig : Feature Definitions & Metadata
    Feature Name : <Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>
      Entity : <UserID>_<profileType>
      Short Name : listing_conversion_rank
      Versioning : v2
      Process : RT/BT
    FeatureGroup : (System Generated ID) 8fda73d1_2eee_4cfc_a20f_e9afb178fbc3
      Entity : ["uuid", "profile_type"]
      Features [Array]
      Time Window (Refresh / Data - Time duration) : (ISO Time Interval) P1D
    Data Source [Array] : [user_master, txn_search]
      Data Store : GLUE/S3
      Database Name : blueshift
      Table Name : [user_master, txn_search]
    Data Sink : Serving [Array]
      Data Store : GLUE Catalog/S3/Redis/BoulderDb
      Database Name : rocksDB_<WAL Dir Path>
      Table Name : rocksDB_<columnFamily>
    Compute Logic
      DSL + Spark SQL : metric_expr, group_by_expr, filter_expr, window_function, window_function_alias
      Code (Python/Scala/Java) : GIT/Gerrit URI
      Model (sagemaker) / Embedding
    Environment : Production
      Workspace : Dev/Staging/Production
      Namespace : <Project Name>
    Apache LIVY + Databricks JOBs API Config
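The five-part feature naming convention above can be sketched as a small parser. The `FeatureName` class and its field names are hypothetical; only the `::`-separated layout comes from the slide, and the example name is taken from the output schema shown in the transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureName:
    """Parsed form of <Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>."""

    entity: str
    short_name: str
    data_interval: str
    refresh_frequency: str
    version: str

    @classmethod
    def parse(cls, name: str) -> "FeatureName":
        """Split a fully qualified feature name into its five components."""
        parts = name.split("::")
        if len(parts) != 5 or any(not p for p in parts):
            raise ValueError(f"expected 5 non-empty '::'-separated parts, got: {name!r}")
        return cls(*parts)

    def __str__(self) -> str:
        """Rebuild the fully qualified name from its components."""
        return "::".join(
            [self.entity, self.short_name, self.data_interval, self.refresh_frequency, self.version]
        )


parsed = FeatureName.parse("uuid_profileType::listing_conv_rank::P30D::P15M::v1")
```

Keeping the interval, refresh frequency, and version inside the name means two variants of the same metric never collide in the catalog, at the cost of long column names.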
  • 12. FS Store | online + historical
    Output Schema (internal to the system)
    ● Historical Feature Data schema on S3 Parquet
        |-- entity: string (nullable = false)
        |-- uuid_profileType::listing_conv_rank::P30D::P15M::v1: long (nullable = false)
        |-- uuid_profileType::listing_view_rank::P30D::P15M::v1: long (nullable = false)
        |-- uuid_profileType::cnt_distinct_bk_bankid::P30D::P15M::v1: map (nullable = false)
        |    |-- key: string
        |    |-- value: integer (valueContainsNull = true)
        .. .. all features in that feature group
    ● Online Serving Data Schema on REDIS + BoulderDB
      ○ Serving at Feature Group level
          Key -> <Entity_id>#<Feature_group_id>/<Feature_split>
          Value -> Hashes
            key -> Feature_name
            Value -> Feature_value
            TimeStamp -> Compute_Processed_Time
      ○ Serving at Feature level
          Key -> <Entity_id>#<Feature_name>
          Value -> Hashes
            key -> Feature_name
            Value -> Feature_value
    SERVING Config
    - lambda (batch_feature_name linkage for RT features)
    - Support for linear QUERY DAGs
    - MVEL-based post-processing on any feature per service/model if needed
    Feature backfill (back_fill_required, back_fill_duration)
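The online key layout above can be sketched with a plain dictionary standing in for Redis hashes. The helper names, the example IDs, and the choice to store the processed time as an extra `TimeStamp` field in the same hash are assumptions made for illustration; only the key patterns come from the slide.

```python
from typing import Dict, List, Optional

# Stand-in for the online store: key -> hash (field -> value).
Store = Dict[str, Dict[str, object]]


def group_key(entity_id: str, feature_group_id: str, feature_split: str) -> str:
    """Key for feature-group-level serving: <Entity_id>#<Feature_group_id>/<Feature_split>."""
    return f"{entity_id}#{feature_group_id}/{feature_split}"


def feature_key(entity_id: str, feature_name: str) -> str:
    """Key for feature-level serving: <Entity_id>#<Feature_name>."""
    return f"{entity_id}#{feature_name}"


def write_group(store: Store, key: str, features: Dict[str, object], processed_time: str) -> None:
    """Write all features of a group under one key, plus the compute timestamp."""
    store[key] = {**features, "TimeStamp": processed_time}


def read_features(store: Store, key: str, names: List[str]) -> Dict[str, Optional[object]]:
    """Fetch selected fields from one hash; missing features come back as None."""
    row = store.get(key, {})
    return {name: row.get(name) for name in names}


store: Store = {}
key = group_key("user42_personal", "8fda73d1", "0")  # hypothetical IDs
write_group(store, key, {"listing_conv_rank": 3, "listing_view_rank": 7}, "2019-01-01T00:00:00Z")
result = read_features(store, key, ["listing_conv_rank", "missing_feature"])
```

Group-level keys let one lookup return every feature a model needs for an entity, while feature-level keys suit callers that want a single value.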
  • 13. FS-BrokerAPI : Online Feature Serving Framework
    [Architecture diagram; labels recovered from the slide]
    Endpoint : <uri>/v1/getFeatures (POST request)
    Request Handler : AKKA (Actors), Request Validations, Feature Definition
    Orchestration Layer : Orchestration + Broker, Business Logic + MVEL, Extractors, Transport
    Data Access Layer : REDIS, BoulderDB
    Lookups : FeaturesbyName, FeaturesbyModel, FeaturesbyService
  • 14. BoulderDB : Online Serving Store
    - Built on top of RocksDB (an embedded data store developed by Facebook), reducing the distance to data on the serving layer.
    - Post-processing steps added to the compute layer:
      - After processing data through Spark (distributed), the BT-Compute layer writes SST files from the various executors into shared object storage (S3).
      - The Spark DataFrame is split into non-overlapping ranges; each split is sorted by key and then ingested into an SST file per partition / executor.
    - Cluster coordinator : Consul
    - Atomic switching of DB snapshots
    - Data is sharded (helps with proximity by namespace) and replicated (RF=2)
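The range-splitting step above can be sketched in plain Python. This is not the Spark job or the RocksDB SST writer itself; the function name and sample rows are hypothetical, and keys are assumed to be unique so that no key straddles two splits.

```python
from typing import List, Tuple

Row = Tuple[str, bytes]  # (key, serialized value)


def split_into_sorted_ranges(rows: List[Row], num_splits: int) -> List[List[Row]]:
    """Sort rows by key and cut them into contiguous, non-overlapping chunks.

    Each chunk is internally sorted and every key in chunk i is smaller than
    every key in chunk i + 1, which is the property the slide describes for
    per-partition files. Assumes unique keys.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be >= 1")
    ordered = sorted(rows, key=lambda kv: kv[0])
    if not ordered:
        return []
    size = -(-len(ordered) // num_splits)  # ceiling division
    return [ordered[i : i + size] for i in range(0, len(ordered), size)]


rows = [("u3", b"c"), ("u1", b"a"), ("u5", b"e"), ("u2", b"b"), ("u4", b"d")]
splits = split_into_sorted_ranges(rows, 2)
```

Because the ranges do not overlap, each chunk could be written as its own sorted file and loaded independently, which is what makes a bulk load followed by a snapshot switch practical.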
  • 15. Tools
  • 16. Next Steps
    - Feature stats visualization / analytics & monitoring // Feature Catalog
    - Seamless integration with the Experimentation Framework
    - Per-user databases on top of the feature store for personalization
    - Notebook integration : better data science tools for Data Scientists, with Python libraries
    - Perf tools : query optimization & analysis
  • 17. References
    - https://www.logicalclocks.com/feature-store/
    - https://eng.uber.com/scaling-michelangelo/
    - Airbnb : Zipline
    - HopsML + Hopsworks
    - Go-JEK : FEAST
    - The Design of Systems for Real-time Prediction Serving | DataEngConf SF '18
    - https://medium.com/makemytrip-engineering
  • 18. Piyush Kumar
    E : piyush.kumar@makemytrip.com
    W : www.makemytrip.com
    T : https://twitter.com/piykumar
    Thank you!