MLOps and Data Quality:
Deploying Reliable ML Models in
Production
Presented by:
Stepan Pushkarev, CTO @ Provectus
Rinat Gareev, ML Solutions Architect @ Provectus
Webinar Objectives
1. Explore best practices of building and deploying reliable Machine Learning
models
2. Review existing open source tools and reference architectures for
implementation of Data Quality components as part of your MLOps
pipelines
3. Get qualified for Provectus ML Infrastructure Acceleration Program – A
fully funded discovery workshop
Agenda
● Introduction and Why
● How: Common Practical Challenges and Solutions
○ Data Testing
○ Model Testing
● MLOps: Wiring Things Together
● Provectus ML Infrastructure Acceleration Program
Introductions
Stepan Pushkarev
Chief Technology
Officer, Provectus
Rinat Gareev
ML Solutions Architect,
Provectus
AI-First Consultancy & Solutions Provider
Clients ranging from
fast-growing startups to
large enterprises
450 employees and
growing
Established in 2010
HQ in Palo Alto
Offices across the US,
Canada, and Europe
We are obsessed with leveraging cloud, data, and AI to reimagine the way
businesses operate, compete, and deliver customer value
Our Clients
Innovative Tech Vendors
Seeking niche expertise to differentiate
and win the market
Midsize to Large Enterprises
Seeking to accelerate innovation and achieve
operational excellence
Why Does Data Quality Matter?
Approach                   Accuracy
After Data Cleaning        0.91
TF-IDF, PoS, Stop Words    0.695
scikit-learn Default       0.69
Python Hyperopt            0.73
Source: SIGMOD 2016 tutorial,
Sanjay Krishnan (UC Berkeley)
and Jiannan Wang (Simon Fraser U.)
https://sigmod2016.org/sigmod_tutorial1.shtml
GoCheck Kids Case Study
End-to-end deep learning image classification
models to detect child gaze, strabismus,
crescent, and dark iris/pupil population.
Metric      Before Data QA   After Data QA
Precision   32%              40%
Recall      89%              91%
FPR         19%              17%
PR AUC      57%              76%
Machine Learning Lifecycle
● Data Preparation: Data Ingestion, Data Cleaning, Data Merging, Data Labeling, Feature Engineering (produces the Versioned Dataset)
● ML Engineering: Model Training, Experimentation, Model Packaging (produces the Model Candidate)
● Delivery & Operations: Regression Testing, Model Selection, Production Deployment, Monitoring
All Stages of ML Lifecycle Require QA
(Same lifecycle diagram as above, annotated with Data Tests, Code Tests, and
Model Tests across the Data Preparation, ML Engineering, and Delivery &
Operations phases.)
Error Cascades
* From "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI,
N. Sambasivan et al., SIGCHI, ACM (2021)
How: Practical Challenges and
Solutions
Common Challenge #1:
How to find & access the data I trust?
1. Data is scattered across multiple data sources and
technologies: RDBMS, DWH, Data Lakes, Blobs
2. Data ownership is not clear
3. Data requirements and SLAs are not clear
4. Metadata is not discoverable
5. As a result, investments in Data and ML are undermined by
data access and discoverability issues
Solution: Migrate to Data Mesh
Data Mesh sits at the convergence of
Distributed Domain-Driven Architecture, Self-Serve
Platform Design, and Product Thinking
with Data
● Brings data closer to Domain Context
● Introduces the concept of Data as a
Product and all appropriate data
contracts
● Sorts out data ownership issues
https://martinfowler.com/articles/data-monolith-to-mesh.html
Invest in a Global Data Catalog
A data catalog answers questions like:
● Does this data exist? Where is it?
● What is the source of truth of the data?
● Who and/or which team is the owner?
● Who are the users of the data?
● Are there existing assets I can reuse?
● Can I trust this data?
* There are no established leaders
* Commercial vendors are not listed
Common Challenge #2:
How to get started with QA for Data and ML?
1. What exactly to test?
2. Who should test (Traditional QA, Data Engs, ML Engs,
Analysts)?
3. What tools to use?
4. As a result, ML Engineers lose productivity dealing
with data quality issues.
Data: What to Test
Default data quality checks:
● Duplicates
● Missing values
● Syntax errors
● Format errors
● Semantic errors
● Integrity
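The default checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended implementation: the records, the column names (`id`, `email`, `age`), the e-mail pattern, and the 0 to 120 age range are all hypothetical, and it assumes every non-empty `age` value is numeric.

```python
import re
from collections import Counter

# Hypothetical records; column names and values are illustrative only.
ROWS = [
    {"id": "1", "email": "a@example.com", "age": "34"},
    {"id": "2", "email": "not-an-email", "age": "29"},
    {"id": "2", "email": "b@example.com", "age": ""},
    {"id": "4", "email": "c@example.com", "age": "-5"},
]

# Deliberately loose pattern: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_rows(rows):
    """Return a dict mapping check name -> indexes of offending rows."""
    issues = {
        "duplicate_id": [],
        "missing_age": [],
        "bad_email_format": [],
        "age_out_of_range": [],
    }
    id_counts = Counter(row["id"] for row in rows)
    for i, row in enumerate(rows):
        if id_counts[row["id"]] > 1:            # duplicates
            issues["duplicate_id"].append(i)
        if row["age"] == "":                    # missing values
            issues["missing_age"].append(i)
        elif not (0 <= int(row["age"]) <= 120):  # semantic error (assumed range)
            issues["age_out_of_range"].append(i)
        if not EMAIL_RE.match(row["email"]):    # format error
            issues["bad_email_format"].append(i)
    return issues


REPORT = check_rows(ROWS)
```

Returning row indexes rather than a single pass/fail flag keeps the report usable for triage; the tools listed later in the deck express the same kinds of rules declaratively.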
Advanced unsupervised methods:
● Distribution tests
● KS, Chi-squared tests
● Outlier detection with AutoML
● Auto Constraints suggestion
● Data Profiling for Complex
Dependencies
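As an illustration of the KS distribution test mentioned above, here is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic. The function names and the `alpha=0.05` default are assumptions made for this example, and the critical value uses the standard large-sample approximation, so treat the drift decision as a rough signal rather than an exact test.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drifted(reference, current, alpha=0.05):
    """True if the KS statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


# Hypothetical feature values: a reference batch and a shifted batch.
REFERENCE = list(range(100))
SHIFTED = [x + 50 for x in range(100)]
```

In a pipeline, `reference` would typically be the training snapshot of a feature and `current` a fresh batch; a library implementation is preferable when one is available.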
Unsupervised Constraints Generation
Use cases:
● existing data with poor
documentation or
schema
● rapidly evolving data
● rich structure
● starting from scratch
1. Compute data
profiles/summaries
2. Generate checks on:
● types
● completeness
● ranges
● uniqueness
● distributions
Extensible:
● e.g., conventions on
column naming
3. Evaluate on
holdout subset
4. Review and add to
test suites
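The four steps above (profile, generate checks, evaluate on a holdout, review) can be sketched as follows. This is a simplified illustration for a single non-empty numeric column; the function names and the sample values are hypothetical, and real tools derive far richer constraints.

```python
def profile(values):
    """Step 1: summarise a column (completeness, uniqueness, range)."""
    present = [v for v in values if v is not None]
    return {
        "completeness": len(present) / len(values),
        "unique": len(set(present)) == len(present),
        "min": min(present),
        "max": max(present),
    }


def suggest_constraints(column_profile):
    """Step 2: turn a profile into candidate checks (name -> predicate)."""
    lo, hi = column_profile["min"], column_profile["max"]
    checks = {
        "in_range": lambda col: all(lo <= v <= hi for v in col if v is not None),
    }
    if column_profile["completeness"] == 1.0:
        checks["complete"] = lambda col: all(v is not None for v in col)
    if column_profile["unique"]:
        checks["unique"] = lambda col: len(set(col)) == len(col)
    return checks


def evaluate(checks, holdout):
    """Step 3: run every suggested check against a holdout subset."""
    return {name: check(holdout) for name, check in checks.items()}


# Hypothetical column values used to derive and then evaluate the checks.
TRAIN = [18, 25, 40, 31, 67]
HOLDOUT = [22, 25, 25, 71]
RESULTS = evaluate(suggest_constraints(profile(TRAIN)), HOLDOUT)
```

Checks that fail on the holdout (here the range and uniqueness suggestions) are the ones to look at in step 4: either the data is wrong, or the suggested constraint is too strict to keep in the test suite.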
Data Testing: Available Tools
● Deequ
● Great Expectations
● TensorFlow Data Validation
● dbt
* Commercial vendors are not listed
Model Testing
Model Testing: Analyzing Input and
Output Datasets
Model Testing: Datasets Are Test
Suites with Test Cases
● Golden UAT datasets
● Security datasets
● Production traffic replay
● Regression datasets
● Datasets for bias
● Datasets for edge cases
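One way to read "datasets are test suites" is that each named dataset carries its own pass threshold and a model candidate is scored against all of them. The sketch below assumes accuracy as the metric; the toy model, the dataset contents, and the thresholds are all hypothetical.

```python
def accuracy(model, dataset):
    """Fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for features, label in dataset if model(features) == label)
    return correct / len(dataset)


def run_suite(model, suite):
    """suite: name -> (dataset, minimum accuracy). Returns name -> (score, passed)."""
    results = {}
    for name, (dataset, minimum) in suite.items():
        score = accuracy(model, dataset)
        results[name] = (score, score >= minimum)
    return results


def threshold_model(x):
    """Hypothetical stand-in for a trained model."""
    return 1 if x >= 0.5 else 0


# Hypothetical datasets; each one acts as a test case collection.
SUITE = {
    "golden_uat": ([(0.9, 1), (0.1, 0), (0.7, 1), (0.2, 0)], 1.0),
    "edge_cases": ([(0.5, 1), (0.49, 0), (0.51, 0), (0.0, 0)], 0.75),
}

RESULTS = run_suite(threshold_model, SUITE)
```

Keeping thresholds next to the datasets means a regression on any single suite is visible by name instead of being averaged away in one global metric.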
Model Testing: Bias
Bias is considered to be a disproportionate inclination or prejudice for or against an idea or thing.
10+ Bias Types
● Selection Bias — The selection of data in such
a way that the sample is not representative of
the population
● The Framing Effect — Annotation questions
that are constructed with a particular slant
● Systematic Bias — Consistent and repeatable
error.
● Outlier Data, Missing Values, Filtering Data
● Bias/Variance Trade-off
● Personal Perception Bias
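A bias test on model output can be as simple as comparing positive-prediction rates across groups. The sketch below computes that gap, often called the demographic parity difference; the slides do not name a specific metric, so this choice, the group labels, and the predictions are assumptions for illustration.

```python
from collections import defaultdict


def positive_rates(predictions, groups):
    """Share of positive (1) predictions within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def parity_gap(predictions, groups):
    """Difference between the highest and lowest group positive rate."""
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical binary predictions and the group of each example.
PREDICTIONS = [1, 1, 1, 0, 1, 0, 0, 0]
GROUPS = ["a", "a", "a", "a", "b", "b", "b", "b"]
```

A gap near zero means the groups receive positive predictions at similar rates. Which threshold counts as acceptable is a product and policy decision, and dedicated fairness toolkits cover many more metrics than this one.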
Model Testing: Available Tools
Adversarial Testing & Model Robustness:
1. CleverHans by Ian Goodfellow & Nicolas Papernot
2. Adversarial Robustness Toolbox (ART), originally by IBM Research (development partly supported by DARPA)
Bias and Fairness:
1. Amazon SageMaker Clarify
2. AIF360 by IBM
3. Aequitas by University of Chicago
MLOps: Wiring Things
Together
The Core of MLOps Pipelines
Model Code
ML Pipeline Code
Infrastructure as
Code
Versioned Dataset
Production Metrics &
Alerts
Model Artifacts
Prediction Service
ML Metrics
Automated Pipeline Execution
Pipeline Metadata
Alerts Reports
Feature Store
Orchestration: Idempotent Execution
Feedback Loop for Production Data
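"Idempotent Execution" is only a label on the diagram; one common reading is that re-running a pipeline step with the same inputs returns the same result without repeating the work. The sketch below illustrates that reading with an in-memory cache keyed by a hash of the step name and its parameters. The `StepCache` class and the `scale` step are hypothetical, and parameters are assumed to be JSON-serialisable.

```python
import hashlib
import json


class StepCache:
    """Runs each (step name, parameters) combination at most once."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # how many times a step body actually ran

    def run(self, step_name, func, params):
        # Deterministic key: same step and same parameters -> same hash.
        payload = json.dumps([step_name, params], sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()
        if key not in self._results:
            self._results[key] = func(**params)
            self.executions += 1
        return self._results[key]


def scale(values, factor):
    """Hypothetical pipeline step."""
    return [v * factor for v in values]


CACHE = StepCache()
FIRST = CACHE.run("scale", scale, {"values": [1, 2], "factor": 3})
SECOND = CACHE.run("scale", scale, {"values": [1, 2], "factor": 3})
```

A real orchestrator would persist results and also hash the code and dataset versions, but the principle is the same: a retry after a partial failure must not produce duplicate or divergent artifacts.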
The Core of MLOps Pipelines, with Data Quality
(Same components as the previous slide, plus:)
Data Quality Checks
Expanding Validation Pipelines
Feature Store ML Model
Versioned Dataset
Batch Quality
Checkpoints
Dataset Rules
Validation
Dataset
Bias Checker
Statistical Assertions
Outlier Detector
Deployed Model
Model
Validation
Model
Test for Bias
Model
Security Test
Regression
Test
Business
Acceptance
Traffic
Replay
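The validation stages in this diagram can be thought of as an ordered list of gates that must all pass before deployment. The sketch below shows that idea only; the gate names echo the diagram, while the metrics dictionary and every threshold are hypothetical.

```python
def run_gates(gates, artifact):
    """Run named checks in order and stop at the first failure."""
    passed = []
    for name, check in gates:
        if not check(artifact):
            return {"deployable": False, "passed": passed, "failed": name}
        passed.append(name)
    return {"deployable": True, "passed": passed, "failed": None}


# Hypothetical metrics collected for a dataset and model candidate.
CANDIDATE = {"rows": 1000, "null_fraction": 0.02, "parity_gap": 0.3}

# Hypothetical thresholds; each gate is a (name, predicate) pair.
GATES = [
    ("dataset_rules", lambda a: a["rows"] > 0),
    ("statistical_assertions", lambda a: a["null_fraction"] < 0.05),
    ("bias_test", lambda a: a["parity_gap"] < 0.2),
]

OUTCOME = run_gates(GATES, CANDIDATE)
```

Stopping at the first failing gate keeps expensive later stages, such as traffic replay, from running on a candidate that is already known to be unfit.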
Final Recommendations
1. You cannot deploy ML models to production without a clear
Data QA Strategy in place.
2. As a leader, focus on organizing data teams around product
features, to make them fully responsible for Data as a Product.
3. Design Data QA components as an essential part of your MLOps
foundation.
125 University Avenue
Suite 295, Palo Alto
California, 94301
provectus.com
Questions, details?
We would be happy to answer!
