SlideShare a Scribd company logo
1 of 25
Automated Data Quality
Assurance
with Machine Learning
Milica Petrović
InCube Group
2
Talk Outline
2
What’s Wrong with Data Quality?
3
Error Detection using ML
Demo & Features
1
4 Error Remediation using RPA
3
2
What’s Wrong with Data Quality?
3
Error Detection using ML
Demo & Features
1
4 Error Remediation using RPA
4
Data Quality Today
What’s wrong with it?
5
Data Quality Today
Our Take at a Solution
Data Quality Today
 Manually coded SQL
rules
 Uni-/bi-variate checks
 Too much data
 Too many error types
 Too narrow focus
 Too late
 Automate: simultaneous error detection & faster process
 Reusability: tailored ML algorithms reused for fields of similar type
 Deep dive: discovery of new error types based on multivariate dependencies
 Unsupervised
 Capture multivariate
relationships
Challenges
Solutions with Machine Learning
Autoencoders
6
2
What’s Wrong with Data Quality?
3
Error Detection using ML
Demo & Features
1
4 Error Remediation using RPA
7
Autoencoders for Data Quality
Use and Architecture
Target: Reconstruct input
Bottleneck: Ensures network learns structure of input data
For good data only
INPUTINPUT OUTPUT
Differentiate between
good and bad data
Reconstruction: Good reconstruction only for data with structure similar to
training data
8
Autoencoders for Data Quality
Assumptions and Training
Imperfect data: Training requires large share of good data
Limited potency of the network: More layers not always better
INPUT OUTPUT INPUT
For good data only
9
Discriminating Good and Bad Data Records
Clustering the Reconstruction Errors
MeanSquaredError
Individual Data Records
Challenge: Where to draw the separating line?
Kernel Density Estimate
10
Discriminating Good and Bad Data Records
Clustering the Reconstruction Errors
MeanSquaredError
Individual Data Records
Challenge: Where to draw the separating line?
Kernel Density Estimate
11
Discriminating Good and Bad Data Records
Sequence of Autoencoders
1st iteration
Keep Rest of Data
Challenge: Magnitude of reconstruction error varies across
data error types
Remove Detected
Anomalies
12
Discriminating Good and Bad Data Records
Sequence of Autoencoders
13
Discriminating Good and Bad Data Records
Sequence of Autoencoders
Across iterations: Increase model complexity
Stopping: When threshold separates large chunk of data
14
2
What’s Wrong with Data Quality?
3
Error Detection using ML
Demo & Features
1
4 Error Remediation using RPA
15
Video Demo
Birth date
16
Video Demo
First name
17
Video Demo
Revenue
18
variational
autoencoder
complete
autoencoder with
regularization
undercomplete
autoencoder with
custom loss
one-hot
encoding
character
fields
categorical
fields
numerical
feature
extraction date
fields
normalization
numerical
fields
Reusability of Pre-Processing and Model Setup
Setup per feature type
Pre-processing
Models
19
2
What’s Wrong with Data Quality?
3
Error Detection using ML
Demo & Features
1
4 Error Remediation using RPA
20
Identifying Where DQ Exception Occurred
Searching for Correct Data using the ID
Customer Number Name Street Number City Zip Code
65465456 Jane Doe Riverstreet 32 Zurich 8004
12564321 John Doe Central Town i Zurich 8001
56543651 Peter Smith Quiet Street 2 Zurich 8045
36364448 Lisa Miller Busy Street 3 Zurich 8022
AUTOENCODER
Example: CRM data at a bank
21
Account Opening Form
Fictitious
Bank of
Zurich
Sender:
Customer Number: 12564321
John Doe (leave empty, filled by the bank)
Central Town 1
8001 Zurich Application Date: 1.9.2019
Opening Date: 1.1.2020
Finding Correct Data via “Intelligent” Robots
Computer Vision or NLP Help Find and Read Correct Data
Customer Number Name Street Number City Zip Code
65465456 Jane Doe Riverstreet 32 Zurich 8004
12564321 John Doe Central Town i Zurich 8001
56543651 Peter Smith Quiet Street 2 Zurich 8045
36364448 Lisa Miller Busy Street 3 Zurich 8022
22
Remediation Proposal
RPA Robot Proposes DQ Exception Correction
Customer Number Name Street Number City Zip Code
65465456 Jane Doe Riverstreet 32 Zurich 8004
12564321 John Doe Central Town i Zurich 8001
56543651 Peter Smith Quiet Street 2 Zurich 8045
36364448 Lisa Miller Busy Street 3 Zurich 8022
Customer Number Name Street Number City Zip Code
65465456 Jane Doe Riverstreet 32 Zurich 8004
12564321 John Doe Central Town 1 Zurich 8001
56543651 Peter Smith Quiet Street 2 Zurich 8045
36364448 Lisa Miller Busy Street 3 Zurich 8022
SUGGESTS THE CHANGE
BASED ON
23
Error Detection
via ML
Client Data
Repository
Extract DQ
exceptions using
autoencoders
+
RPA robot finds correct data using
NLP and computer vision, and
proposes remediation of DQ
exceptions
Cognitive
Services
SME reviews, and
approves or rejects
DQ resolutions are used to improve error detection
and correction
RPA robot
corrects
DQ exceptions
Corrected data
flows back into repository
Automation of Data Quality Remediation
Process Overview
24
Key Findings from Projects and Experience
High reusability: Only one-time customization effort per data type
Extension: ML can replicate rule-based DQ checks and find new errors
High scalability: Operations can be performed in parallel
Multivariate relationships: Detection of interdependencies
Unsupervised learning! Training data quality matters
Competency: Subject Matter Experts can focus on value-adding tasks
AutoencodersRPA
Cost savings: Automate interactions with existing IT infrastructure
Automated Data Quality Assurance with Machine Learning and Autoencoders

More Related Content

Similar to Automated Data Quality Assurance with Machine Learning and Autoencoders

Data Lineage, Property Based Testing & Neo4j
Data Lineage, Property Based Testing & Neo4j Data Lineage, Property Based Testing & Neo4j
Data Lineage, Property Based Testing & Neo4j Neo4j
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Databricks
 
Machine Learning for Malware Classification and Clustering
Machine Learning for Malware Classification and ClusteringMachine Learning for Malware Classification and Clustering
Machine Learning for Malware Classification and ClusteringAshwini Almad
 
Machine Learning for Malware Classification and Clustering
Machine Learning for Malware Classification and ClusteringMachine Learning for Malware Classification and Clustering
Machine Learning for Malware Classification and ClusteringEndgameInc
 
Lessons learned validating 60,000 pages of api documentation
Lessons learned validating 60,000 pages of api documentationLessons learned validating 60,000 pages of api documentation
Lessons learned validating 60,000 pages of api documentationBob Binder
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AINing Jiang
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineMichael Gerke
 
Web Crawler Detection Model
Web Crawler Detection ModelWeb Crawler Detection Model
Web Crawler Detection ModelMatinZivdar
 
Streaming - Why Should I Care?
Streaming - Why Should I Care?Streaming - Why Should I Care?
Streaming - Why Should I Care?Christian Trebing
 
Icse2013 shang
Icse2013 shangIcse2013 shang
Icse2013 shangSAIL_QU
 
"Implementing data quality automation with open source stack" - Max Martynov,...
"Implementing data quality automation with open source stack" - Max Martynov,..."Implementing data quality automation with open source stack" - Max Martynov,...
"Implementing data quality automation with open source stack" - Max Martynov,...Grid Dynamics
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowDatabricks
 
Black_Friday_Sales_Trushita
Black_Friday_Sales_TrushitaBlack_Friday_Sales_Trushita
Black_Friday_Sales_TrushitaTrushita Redij
 
BlueHat v18 || Improving security posture through increased agility with meas...
BlueHat v18 || Improving security posture through increased agility with meas...BlueHat v18 || Improving security posture through increased agility with meas...
BlueHat v18 || Improving security posture through increased agility with meas...BlueHat Security Conference
 
[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...PAPIs.io
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Clean & Dirty Acceptance Tests with Cucumber & Watir
Clean & Dirty Acceptance Tests with Cucumber & WatirClean & Dirty Acceptance Tests with Cucumber & Watir
Clean & Dirty Acceptance Tests with Cucumber & WatirDanny Smith
 
ANPR based Security System Using ALR
ANPR based Security System Using ALRANPR based Security System Using ALR
ANPR based Security System Using ALRAshok Basnet
 

Similar to Automated Data Quality Assurance with Machine Learning and Autoencoders (20)

Data Lineage, Property Based Testing & Neo4j
Data Lineage, Property Based Testing & Neo4j Data Lineage, Property Based Testing & Neo4j
Data Lineage, Property Based Testing & Neo4j
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
 
Machine Learning for Malware Classification and Clustering
Machine Learning for Malware Classification and ClusteringMachine Learning for Malware Classification and Clustering
Machine Learning for Malware Classification and Clustering
 
Machine Learning for Malware Classification and Clustering
Machine Learning for Malware Classification and ClusteringMachine Learning for Malware Classification and Clustering
Machine Learning for Malware Classification and Clustering
 
Lessons learned validating 60,000 pages of api documentation
Lessons learned validating 60,000 pages of api documentationLessons learned validating 60,000 pages of api documentation
Lessons learned validating 60,000 pages of api documentation
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AI
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
 
Web Crawler Detection Model
Web Crawler Detection ModelWeb Crawler Detection Model
Web Crawler Detection Model
 
Streaming - Why Should I Care?
Streaming - Why Should I Care?Streaming - Why Should I Care?
Streaming - Why Should I Care?
 
Validation Is (Not) Easy
Validation Is (Not) EasyValidation Is (Not) Easy
Validation Is (Not) Easy
 
Icse2013 shang
Icse2013 shangIcse2013 shang
Icse2013 shang
 
"Implementing data quality automation with open source stack" - Max Martynov,...
"Implementing data quality automation with open source stack" - Max Martynov,..."Implementing data quality automation with open source stack" - Max Martynov,...
"Implementing data quality automation with open source stack" - Max Martynov,...
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
 
Black_Friday_Sales_Trushita
Black_Friday_Sales_TrushitaBlack_Friday_Sales_Trushita
Black_Friday_Sales_Trushita
 
BlueHat v18 || Improving security posture through increased agility with meas...
BlueHat v18 || Improving security posture through increased agility with meas...BlueHat v18 || Improving security posture through increased agility with meas...
BlueHat v18 || Improving security posture through increased agility with meas...
 
[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Clean & Dirty Acceptance Tests with Cucumber & Watir
Clean & Dirty Acceptance Tests with Cucumber & WatirClean & Dirty Acceptance Tests with Cucumber & Watir
Clean & Dirty Acceptance Tests with Cucumber & Watir
 
Vdsg /Craftworks Industrial-AI
Vdsg /Craftworks Industrial-AIVdsg /Craftworks Industrial-AI
Vdsg /Craftworks Industrial-AI
 
ANPR based Security System Using ALR
ANPR based Security System Using ALRANPR based Security System Using ALR
ANPR based Security System Using ALR
 

More from Institute of Contemporary Sciences

Building valuable (online and offline) Data Science communities - Experience ...
Building valuable (online and offline) Data Science communities - Experience ...Building valuable (online and offline) Data Science communities - Experience ...
Building valuable (online and offline) Data Science communities - Experience ...Institute of Contemporary Sciences
 
Data Science Master 4.0 on Belgrade University - Drazen Draskovic
Data Science Master 4.0 on Belgrade University - Drazen DraskovicData Science Master 4.0 on Belgrade University - Drazen Draskovic
Data Science Master 4.0 on Belgrade University - Drazen DraskovicInstitute of Contemporary Sciences
 
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Institute of Contemporary Sciences
 
Solving churn challenge in Big Data environment - Jelena Pekez
Solving churn challenge in Big Data environment  - Jelena PekezSolving churn challenge in Big Data environment  - Jelena Pekez
Solving churn challenge in Big Data environment - Jelena PekezInstitute of Contemporary Sciences
 
Application of Business Intelligence in bank risk management - Dimitar Dilov
Application of Business Intelligence in bank risk management - Dimitar DilovApplication of Business Intelligence in bank risk management - Dimitar Dilov
Application of Business Intelligence in bank risk management - Dimitar DilovInstitute of Contemporary Sciences
 
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...Institute of Contemporary Sciences
 
Recommender systems for personalized financial advice from concept to product...
Recommender systems for personalized financial advice from concept to product...Recommender systems for personalized financial advice from concept to product...
Recommender systems for personalized financial advice from concept to product...Institute of Contemporary Sciences
 
Advanced tools in real time analytics and AI in customer support - Milan Sima...
Advanced tools in real time analytics and AI in customer support - Milan Sima...Advanced tools in real time analytics and AI in customer support - Milan Sima...
Advanced tools in real time analytics and AI in customer support - Milan Sima...Institute of Contemporary Sciences
 
Complex AI forecasting methods for investments portfolio optimization - Pawel...
Complex AI forecasting methods for investments portfolio optimization - Pawel...Complex AI forecasting methods for investments portfolio optimization - Pawel...
Complex AI forecasting methods for investments portfolio optimization - Pawel...Institute of Contemporary Sciences
 
Reality and traps of real time data engineering - Milos Solujic
Reality and traps of real time data engineering - Milos SolujicReality and traps of real time data engineering - Milos Solujic
Reality and traps of real time data engineering - Milos SolujicInstitute of Contemporary Sciences
 
Sensor networks for personalized health monitoring - Vladimir Brusic
Sensor networks for personalized health monitoring - Vladimir BrusicSensor networks for personalized health monitoring - Vladimir Brusic
Sensor networks for personalized health monitoring - Vladimir BrusicInstitute of Contemporary Sciences
 
Prediction of good patterns for future sales using image recognition
Prediction of good patterns for future sales using image recognitionPrediction of good patterns for future sales using image recognition
Prediction of good patterns for future sales using image recognitionInstitute of Contemporary Sciences
 
Using data to fight corruption: full budget transparency in local government
Using data to fight corruption: full budget transparency in local governmentUsing data to fight corruption: full budget transparency in local government
Using data to fight corruption: full budget transparency in local governmentInstitute of Contemporary Sciences
 

More from Institute of Contemporary Sciences (20)

First 5 years of PSI:ML - Filip Panjevic
First 5 years of PSI:ML - Filip PanjevicFirst 5 years of PSI:ML - Filip Panjevic
First 5 years of PSI:ML - Filip Panjevic
 
Building valuable (online and offline) Data Science communities - Experience ...
Building valuable (online and offline) Data Science communities - Experience ...Building valuable (online and offline) Data Science communities - Experience ...
Building valuable (online and offline) Data Science communities - Experience ...
 
Data Science Master 4.0 on Belgrade University - Drazen Draskovic
Data Science Master 4.0 on Belgrade University - Drazen DraskovicData Science Master 4.0 on Belgrade University - Drazen Draskovic
Data Science Master 4.0 on Belgrade University - Drazen Draskovic
 
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
 
Solving churn challenge in Big Data environment - Jelena Pekez
Solving churn challenge in Big Data environment  - Jelena PekezSolving churn challenge in Big Data environment  - Jelena Pekez
Solving churn challenge in Big Data environment - Jelena Pekez
 
Application of Business Intelligence in bank risk management - Dimitar Dilov
Application of Business Intelligence in bank risk management - Dimitar DilovApplication of Business Intelligence in bank risk management - Dimitar Dilov
Application of Business Intelligence in bank risk management - Dimitar Dilov
 
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
 
Recommender systems for personalized financial advice from concept to product...
Recommender systems for personalized financial advice from concept to product...Recommender systems for personalized financial advice from concept to product...
Recommender systems for personalized financial advice from concept to product...
 
Advanced tools in real time analytics and AI in customer support - Milan Sima...
Advanced tools in real time analytics and AI in customer support - Milan Sima...Advanced tools in real time analytics and AI in customer support - Milan Sima...
Advanced tools in real time analytics and AI in customer support - Milan Sima...
 
Complex AI forecasting methods for investments portfolio optimization - Pawel...
Complex AI forecasting methods for investments portfolio optimization - Pawel...Complex AI forecasting methods for investments portfolio optimization - Pawel...
Complex AI forecasting methods for investments portfolio optimization - Pawel...
 
From Zero to ML Hero for Underdogs - Amir Tabakovic
From Zero to ML Hero for Underdogs  - Amir TabakovicFrom Zero to ML Hero for Underdogs  - Amir Tabakovic
From Zero to ML Hero for Underdogs - Amir Tabakovic
 
Data and data scientists are not equal to money david hoyle
Data and data scientists are not equal to money   david hoyleData and data scientists are not equal to money   david hoyle
Data and data scientists are not equal to money david hoyle
 
The price is right - Tomislav Krizan
The price is right - Tomislav KrizanThe price is right - Tomislav Krizan
The price is right - Tomislav Krizan
 
When it's raining gold, bring a bucket - Andjela Culibrk
When it's raining gold, bring a bucket - Andjela CulibrkWhen it's raining gold, bring a bucket - Andjela Culibrk
When it's raining gold, bring a bucket - Andjela Culibrk
 
Reality and traps of real time data engineering - Milos Solujic
Reality and traps of real time data engineering - Milos SolujicReality and traps of real time data engineering - Milos Solujic
Reality and traps of real time data engineering - Milos Solujic
 
Sensor networks for personalized health monitoring - Vladimir Brusic
Sensor networks for personalized health monitoring - Vladimir BrusicSensor networks for personalized health monitoring - Vladimir Brusic
Sensor networks for personalized health monitoring - Vladimir Brusic
 
Improving Data Quality with Product Similarity Search
Improving Data Quality with Product Similarity SearchImproving Data Quality with Product Similarity Search
Improving Data Quality with Product Similarity Search
 
Prediction of good patterns for future sales using image recognition
Prediction of good patterns for future sales using image recognitionPrediction of good patterns for future sales using image recognition
Prediction of good patterns for future sales using image recognition
 
Using data to fight corruption: full budget transparency in local government
Using data to fight corruption: full budget transparency in local governmentUsing data to fight corruption: full budget transparency in local government
Using data to fight corruption: full budget transparency in local government
 
Geospatial Analysis and Open Data - Forest and Climate
Geospatial Analysis and Open Data - Forest and ClimateGeospatial Analysis and Open Data - Forest and Climate
Geospatial Analysis and Open Data - Forest and Climate
 

Recently uploaded

Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 

Recently uploaded (20)

Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 

Automated Data Quality Assurance with Machine Learning and Autoencoders

  • 1. Automated Data Quality Assurance with Machine Learning Milica Petrović InCube Group
  • 2. 2 Talk Outline 2 What’s Wrong with Data Quality? 3 Error Detection using ML Demo & Features 1 4 Error Remediation using RPA
  • 3. 3 2 What’s Wrong with Data Quality? 3 Error Detection using ML Demo & Features 1 4 Error Remediation using RPA
  • 5. 5 Data Quality Today Our Take at a Solution Data Quality Today  Manually coded SQL rules  Uni-/bi-variate checks  Too much data  Too many error types  Too narrow focus  Too late  Automate: simultaneous error detection & faster process  Reusability: tailored ML algorithms reused for fields of similar type  Deep dive: discovery of new error types based on multivariate dependencies  Unsupervised  Capture multivariate relationships Challenges Solutions with Machine Learning Autoencoders
  • 6. 6 2 What’s Wrong with Data Quality? 3 Error Detection using ML Demo & Features 1 4 Error Remediation using RPA
  • 7. 7 Autoencoders for Data Quality Use and Architecture Target: Reconstruct input Bottleneck: Ensures network learns structure of input data For good data only INPUTINPUT OUTPUT Differentiate between good and bad data Reconstruction: Good reconstruction only for data with structure similar to training data
  • 8. 8 Autoencoders for Data Quality Assumptions and Training Imperfect data: Training requires large share of good data Limited potency of the network: More layers not always better INPUT OUTPUT INPUT For good data only
  • 9. 9 Discriminating Good and Bad Data Records Clustering the Reconstruction Errors MeanSquaredError Individual Data Records Challenge: Where to draw the separating line? Kernel Density Estimate
  • 10. 10 Discriminating Good and Bad Data Records Clustering the Reconstruction Errors MeanSquaredError Individual Data Records Challenge: Where to draw the separating line? Kernel Density Estimate
  • 11. 11 Discriminating Good and Bad Data Records Sequence of Autoencoders 1st iteration Keep Rest of Data Challenge: Magnitude of reconstruction error varies across data error types Remove Detected Anomalies
  • 12. 12 Discriminating Good and Bad Data Records Sequence of Autoencoders
  • 13. 13 Discriminating Good and Bad Data Records Sequence of Autoencoders Across iterations: Increase model complexity Stopping: When threshold separates large chunk of data
  • 14. 14 2 What’s Wrong with Data Quality? 3 Error Detection using ML Demo & Features 1 4 Error Remediation using RPA
  • 18. 18 variational autoencoder complete autoencoder with regularization undercomplete autoencoder with custom loss one-hot encoding character fields categorical fields numerical feature extraction date fields normalization numerical fields Reusability of Pre-Processing and Model Setup Setup per feature type Pre-processing Models
  • 19. 19 2 What’s Wrong with Data Quality? 3 Error Detection using ML Demo & Features 1 4 Error Remediation using RPA
  • 20. 20 Identifying Where DQ Exception Occurred Searching for Correct Data using the ID Customer Number Name Street Number City Zip Code 65465456 Jane Doe Riverstreet 32 Zurich 8004 12564321 John Doe Central Town i Zurich 8001 56543651 Peter Smith Quiet Street 2 Zurich 8045 36364448 Lisa Miller Busy Street 3 Zurich 8022 AUTOENCODER Example: CRM data at a bank
  • 21. 21 Account Opening Form Fictitious Bank of Zurich Sender: Customer Number: 12564321 John Doe (leave empty, filled by the bank) Central Town 1 8001 Zurich Application Date: 1.9.2019 Opening Date: 1.1.2020 Finding Correct Data via “Intelligent” Robots Computer Vision or NLP Help Find and Read Correct Data Customer Number Name Street Number City Zip Code 65465456 Jane Doe Riverstreet 32 Zurich 8004 12564321 John Doe Central Town i Zurich 8001 56543651 Peter Smith Quiet Street 2 Zurich 8045 36364448 Lisa Miller Busy Street 3 Zurich 8022
  • 22. 22 Remediation Proposal RPA Robot Proposes DQ Exception Correction Customer Number Name Street Number City Zip Code 65465456 Jane Doe Riverstreet 32 Zurich 8004 12564321 John Doe Central Town i Zurich 8001 56543651 Peter Smith Quiet Street 2 Zurich 8045 36364448 Lisa Miller Busy Street 3 Zurich 8022 Customer Number Name Street Number City Zip Code 65465456 Jane Doe Riverstreet 32 Zurich 8004 12564321 John Doe Central Town 1 Zurich 8001 56543651 Peter Smith Quiet Street 2 Zurich 8045 36364448 Lisa Miller Busy Street 3 Zurich 8022 SUGGESTS THE CHANGE BASED ON
  • 23. 23 Error Detection via ML Client Data Repository Extract DQ exceptions using autoencoders + RPA robot finds correct data using NLP and computer vision, and proposes remediation of DQ exceptions Cognitive Services SME reviews, and approves or rejects DQ resolutions are used to improve error detection and correction RPA robot corrects DQ exceptions Corrected data flows back into repository Automation of Data Quality Remediation Process Overview
  • 24. 24 Key Findings from Projects and Experience High reusability: Only one-time customization effort per data type Extension: ML can replicate rule-based DQ checks and find new errors High scalability: Operations can be performed in parallel Multivariate relationships: Detection of interdependencies Unsupervised learning! Training data quality matters Competency: Subject Matter Experts can focus on value-adding tasks AutoencodersRPA Cost savings: Automate interactions with existing IT infrastructure