SlideShare una empresa de Scribd logo
1 de 26
Descargar para leer sin conexión
#datapopupseattle
Correctness in Data Science
Benjamin Skrainka
Principal Data Scientist + Lead Instructor, Galvanize
bskrainka galvanize
#datapopupseattle
UNSTRUCTURED
Data Science POP-UP in Seattle
www.dominodatalab.com
D
Produced by Domino Data Lab
Domino’s enterprise data science platform is used
by leading analytical organizations to increase
productivity, enable collaboration, and publish
models into production faster.
Correctness in Data Science
Benjamin S. Skrainka
October 7, 2015
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 1 / 24
The correctness problem
A lot of (data) science is unscientific:
“My code runs, so the answer must be correct”
“It passed Explain Plan, so the answer is correct”
“This model is too complex to have a design document”
“It is impossible to unit test scientific code”
“The lift from the direct mail campaign is 10%”"
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 2 / 24
Correctness matters
Bad (data) science:
Costs real money and can kill people
Will eventually damage your reputation and career
Could expose you to litigation
An issue of basic integrity and sleeping at night
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 3 / 24
Objectives
Today’s goals:
Introduce VV&UQ framework to evaluate correctness of scientific
models
Survey good habits to improve quality of your work
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 4 / 24
Verification, Validation, & Uncertainty Quantification
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 5 / 24
Introduction to VV&UQ
Verification, Validation, & Uncertainty Quantification provides
epistemological framework to evaluate correctness of scientific models:
Evidence of correctness should accompany any prediction
In absence of evidence, assume predictions are wrong
Popper: can only disprove or fail to disprove a model
VV&UQ is inductive whereas science is deductive
Reference: Verification and Validation in Scientific Computing by
Oberkampf & Roy
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 6 / 24
Definitions of VV&UQ
Definitions of terms (Oberkampf & Roy):
Verification:
I “solving equations right”
I I.e., code implements the model correctly
Validation:
I “solving right equations”
I I.e., model has high fidelity to reality
Definitions of VV&UQ will vary depending on source . . .
æ Most organizations do not even practice verification. . .
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 7 / 24
Definition of UQ
Definition of Uncertainty Quantification (Oberkampf & Roy):
Process of identifying, characterizing, and quantifying those
factors in an analysis which could a ect accuracy of
computational results
Do your assumptions hold? When do they fail?
Does your model apply to the data/situation?
Where does your model break down? What are its limits?
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 8 / 24
Verification of code
Does your code implement the model correctly?
Unit test everything you can:
I Scientific code can be unit tested
I Test special cases
I Test on cases with analytic solutions
I Test on synthetic data
Unit test framework will setup and tear-down fixtures
Should be able to recover parameters from Monte Carlo data
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 9 / 24
Verification of SQL
Passing Explain Plan doesn’t mean your SQL is correct:
Garbage in, garbage out
Check a simple case you can compute by hand
Check join plan is correct
Check aggregate statistics
Check answer is compatible with reality
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 10 / 24
Unit test
import unittest2 as unittest
import assignment as problems
class TestAssignment(unittest.TestCase):
def test_zero(self):
result = problems.question_zero()
self.assertEqual(result, 9198)
...
if __name__ == __main__ :
unittest.main()
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 11 / 24
Unit test
Figure 1:Benjamin S. Skrainka Correctness in Data Science October 7, 2015 12 / 24
Validation of model
Check your model is a good (enough) representation of reality:
“All models are wrong but some are useful” – George Box
Run an experiment
Perform specification testing
Test assumptions hold
Beware of endogenous features
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 13 / 24
Approaches to experimentation
Many ways to test:
A/B test
Multi-armed bandit
Bayesian A/B test
Wald sequential analysis
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 14 / 24
Uncertainty quantification
There are many types of uncertainty which a ect the robustness of your
model:
Parameter uncertainty
Structural uncertainty
Algorithmic uncertainty
Experimental uncertainty
Interpolation uncertainty
Classified as aleatoric (statistical) and epistemic (systematic)
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 15 / 24
Good habits
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 16 / 24
Act like a software engineer
Use best practices from software engineering:
Good design of code
Follow a sensible coding convention
Version control
Use same file structure for every project
Unit test
Use PEP8 or equivalent
Perform code reviews
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 17 / 24
Reproducible research
‘Document what you do and do what you document’:
Keep a journal!
Data provenance
How data was cleaned
Design document
Specification & requirements
Do you keep a journal? You should. Fermi taught me that. –
John A. Wheeler
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 18 / 24
Follow a workflow
Use a workflow like CRISP-DM:
1 Define business question and metric
2 Understand data
3 Prepare data
4 Build model
5 Evaluate
6 Deploy
Ensures you don’t forget any key steps
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 19 / 24
Automate your data pipeline
One-touch build of your application or paper:
Automate entire workflow from raw data to final result
Ensures you perform all steps
Ensures all steps are known – no one o manual adjustments
Avoids stupid human errors
Auto generate all tables and figures
Save time when handling new data . . . which always has subtle
changes in formatting
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 20 / 24
Write flexible code to handle data
Use constants/macros to access data fields:
Code will clearly show what data matters
Easier to understand code and data pipeline
Easier to debug data problems
Easier to handles changes in data formatting
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 21 / 24
Python example
# Setup indicators
ix_gdp = 7
...
# Load & clean data
m_raw = np.recfromcsv( bea_gdp.csv )
gdp = m_raw[:, ix_gdp]
...
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 22 / 24
Politics. . .
Often, there is political pressure to violate best practice:
Examples:
I 80% confidence intervals
I Absurd attribution window
I Two year forecast horizon but only three months of data
Hard to do right thing vs. senior management
Recruit a high-level scientist to advocate
Particularly common with forecasting:
I Often requested by management for CYA
I Insist on a ‘panel of experts’ for impossible decisions
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 23 / 24
Conclusion
Need to raise the quality of data science:
VV & UQ provides rigorous framework:
I Verification: solve the equations right
I Validation: solve the right equations
I Uncertainty quantification: how robust is model to unknowns?
Adopting good habits provides huge gains for minimal e ort
Benjamin S. Skrainka Correctness in Data Science October 7, 2015 24 / 24

Más contenido relacionado

La actualidad más candente

Trends on Pinterest
Trends on PinterestTrends on Pinterest
Trends on PinterestJune Andrews
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceLivePerson
 
Data Science Popup Austin: Conflict in Growing Data Science Organizations
Data Science Popup Austin: Conflict in Growing Data Science Organizations Data Science Popup Austin: Conflict in Growing Data Science Organizations
Data Science Popup Austin: Conflict in Growing Data Science Organizations Domino Data Lab
 
data scientist the sexiest job of the 21st century
data scientist the sexiest job of the 21st centurydata scientist the sexiest job of the 21st century
data scientist the sexiest job of the 21st centuryFrank Kienle
 
BioIT Webinar on AI and data methods for drug discovery
BioIT Webinar on AI and data methods for drug discoveryBioIT Webinar on AI and data methods for drug discovery
BioIT Webinar on AI and data methods for drug discoveryFernanda Foertter
 
Implementing Data Science
Implementing Data ScienceImplementing Data Science
Implementing Data ScienceNathan Watson
 
Data Science Popup Austin: Privilege and Supervised Machine Learning
Data Science Popup Austin: Privilege and Supervised Machine LearningData Science Popup Austin: Privilege and Supervised Machine Learning
Data Science Popup Austin: Privilege and Supervised Machine LearningDomino Data Lab
 
Exploring What a Typical Data Science Project Looks Like
Exploring What a Typical Data Science Project Looks LikeExploring What a Typical Data Science Project Looks Like
Exploring What a Typical Data Science Project Looks LikeProduct School
 
Agile Analytics: Delivering on Promises by Atif Abdul Rahman
Agile Analytics: Delivering on Promises by Atif Abdul RahmanAgile Analytics: Delivering on Promises by Atif Abdul Rahman
Agile Analytics: Delivering on Promises by Atif Abdul RahmanAgile ME
 
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...Edureka!
 
Is Data Scientist still the sexiest job of 21st century? Find Out!
Is Data Scientist still the sexiest job of 21st century? Find Out!Is Data Scientist still the sexiest job of 21st century? Find Out!
Is Data Scientist still the sexiest job of 21st century? Find Out!Edureka!
 
A data view of the data science process
A data view of the data science processA data view of the data science process
A data view of the data science processMathieu d'Aquin
 
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...Edureka!
 
Leland Lockhart - SXSW Intro to Data Science
Leland Lockhart -  SXSW Intro to Data ScienceLeland Lockhart -  SXSW Intro to Data Science
Leland Lockhart - SXSW Intro to Data ScienceLeland Lockhart, PhD
 
IT & Innovation - short summary
IT & Innovation - short summaryIT & Innovation - short summary
IT & Innovation - short summaryPerry Nouwens
 
Be Data Informed Without Being a Data Scientist
Be Data Informed Without Being a Data ScientistBe Data Informed Without Being a Data Scientist
Be Data Informed Without Being a Data ScientistPamela Pavliscak
 
Data Curation - Data probity in a time of COVID
Data Curation - Data probity in a time of COVIDData Curation - Data probity in a time of COVID
Data Curation - Data probity in a time of COVIDSIKM
 
What is a Data Scientist
What is a Data Scientist What is a Data Scientist
What is a Data Scientist Experian_US
 
Who is a data scientist
Who is a data scientist  Who is a data scientist
Who is a data scientist prateek kumar
 

La actualidad más candente (20)

Trends on Pinterest
Trends on PinterestTrends on Pinterest
Trends on Pinterest
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data Science Popup Austin: Conflict in Growing Data Science Organizations
Data Science Popup Austin: Conflict in Growing Data Science Organizations Data Science Popup Austin: Conflict in Growing Data Science Organizations
Data Science Popup Austin: Conflict in Growing Data Science Organizations
 
data scientist the sexiest job of the 21st century
data scientist the sexiest job of the 21st centurydata scientist the sexiest job of the 21st century
data scientist the sexiest job of the 21st century
 
BioIT Webinar on AI and data methods for drug discovery
BioIT Webinar on AI and data methods for drug discoveryBioIT Webinar on AI and data methods for drug discovery
BioIT Webinar on AI and data methods for drug discovery
 
Implementing Data Science
Implementing Data ScienceImplementing Data Science
Implementing Data Science
 
Data Science Popup Austin: Privilege and Supervised Machine Learning
Data Science Popup Austin: Privilege and Supervised Machine LearningData Science Popup Austin: Privilege and Supervised Machine Learning
Data Science Popup Austin: Privilege and Supervised Machine Learning
 
Exploring What a Typical Data Science Project Looks Like
Exploring What a Typical Data Science Project Looks LikeExploring What a Typical Data Science Project Looks Like
Exploring What a Typical Data Science Project Looks Like
 
Agile Analytics: Delivering on Promises by Atif Abdul Rahman
Agile Analytics: Delivering on Promises by Atif Abdul RahmanAgile Analytics: Delivering on Promises by Atif Abdul Rahman
Agile Analytics: Delivering on Promises by Atif Abdul Rahman
 
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
 
Is Data Scientist still the sexiest job of 21st century? Find Out!
Is Data Scientist still the sexiest job of 21st century? Find Out!Is Data Scientist still the sexiest job of 21st century? Find Out!
Is Data Scientist still the sexiest job of 21st century? Find Out!
 
A data view of the data science process
A data view of the data science processA data view of the data science process
A data view of the data science process
 
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
 
Leland Lockhart - SXSW Intro to Data Science
Leland Lockhart -  SXSW Intro to Data ScienceLeland Lockhart -  SXSW Intro to Data Science
Leland Lockhart - SXSW Intro to Data Science
 
IT & Innovation - short summary
IT & Innovation - short summaryIT & Innovation - short summary
IT & Innovation - short summary
 
Be Data Informed Without Being a Data Scientist
Be Data Informed Without Being a Data ScientistBe Data Informed Without Being a Data Scientist
Be Data Informed Without Being a Data Scientist
 
Data Curation - Data probity in a time of COVID
Data Curation - Data probity in a time of COVIDData Curation - Data probity in a time of COVID
Data Curation - Data probity in a time of COVID
 
What is a Data Scientist
What is a Data Scientist What is a Data Scientist
What is a Data Scientist
 
Who is a data scientist
Who is a data scientist  Who is a data scientist
Who is a data scientist
 
Agile Analytics
Agile AnalyticsAgile Analytics
Agile Analytics
 

Similar a Correctness in Data Science - Data Science Pop-up Seattle

Tune up your data science process
Tune up your data science processTune up your data science process
Tune up your data science processBenjamin Skrainka
 
No estimates - 10 new principles for testing
No estimates  - 10 new principles for testingNo estimates  - 10 new principles for testing
No estimates - 10 new principles for testingVasco Duarte
 
DataAnalyticsLC_20180410_public
DataAnalyticsLC_20180410_publicDataAnalyticsLC_20180410_public
DataAnalyticsLC_20180410_publicplka13
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI dayMohammed Barakat
 
Disrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDisrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDatabricks
 
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...Precisely
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionProvectus
 
Scale your Testing and Quality with Automation Engineering and ML - Carlos Ki...
Scale your Testing and Quality with Automation Engineering and ML - Carlos Ki...Scale your Testing and Quality with Automation Engineering and ML - Carlos Ki...
Scale your Testing and Quality with Automation Engineering and ML - Carlos Ki...QA or the Highway
 
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Institute of Contemporary Sciences
 
Data-Ed Online: Approaching Data Quality
Data-Ed Online: Approaching Data QualityData-Ed Online: Approaching Data Quality
Data-Ed Online: Approaching Data QualityDATAVERSITY
 
Advanced Project Data Analytics for Improved Project Delivery
Advanced Project Data Analytics for Improved Project DeliveryAdvanced Project Data Analytics for Improved Project Delivery
Advanced Project Data Analytics for Improved Project DeliveryMark Constable
 
Proposing an Interactive Audit Pipeline for Visual Privacy Research
Proposing an Interactive Audit Pipeline for Visual Privacy ResearchProposing an Interactive Audit Pipeline for Visual Privacy Research
Proposing an Interactive Audit Pipeline for Visual Privacy ResearchChristan Grant
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Greg Makowski
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2Roger Barga
 
How Data Scientists Make Reliable Decisions with Data
How Data Scientists Make Reliable Decisions with DataHow Data Scientists Make Reliable Decisions with Data
How Data Scientists Make Reliable Decisions with DataTa-Wei (David) Huang
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learningPramit Choudhary
 

Similar a Correctness in Data Science - Data Science Pop-up Seattle (20)

Tune up your data science process
Tune up your data science processTune up your data science process
Tune up your data science process
 
No estimates - 10 new principles for testing
No estimates  - 10 new principles for testingNo estimates  - 10 new principles for testing
No estimates - 10 new principles for testing
 
Challenges of Executing AI
Challenges of Executing AIChallenges of Executing AI
Challenges of Executing AI
 
DataAnalyticsLC_20180410_public
DataAnalyticsLC_20180410_publicDataAnalyticsLC_20180410_public
DataAnalyticsLC_20180410_public
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
 
Math in data
Math in dataMath in data
Math in data
 
Disrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDisrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging Technologies
 
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 
Scale your Testing and Quality with Automation Engineering and ML - Carlos Ki...
Scale your Testing and Quality with Automation Engineering and ML - Carlos Ki...Scale your Testing and Quality with Automation Engineering and ML - Carlos Ki...
Scale your Testing and Quality with Automation Engineering and ML - Carlos Ki...
 
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
 
Data-Ed Online: Approaching Data Quality
Data-Ed Online: Approaching Data QualityData-Ed Online: Approaching Data Quality
Data-Ed Online: Approaching Data Quality
 
Advanced Project Data Analytics for Improved Project Delivery
Advanced Project Data Analytics for Improved Project DeliveryAdvanced Project Data Analytics for Improved Project Delivery
Advanced Project Data Analytics for Improved Project Delivery
 
Big Data
Big DataBig Data
Big Data
 
Why am I doing this???
Why am I doing this???Why am I doing this???
Why am I doing this???
 
Proposing an Interactive Audit Pipeline for Visual Privacy Research
Proposing an Interactive Audit Pipeline for Visual Privacy ResearchProposing an Interactive Audit Pipeline for Visual Privacy Research
Proposing an Interactive Audit Pipeline for Visual Privacy Research
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2
 
How Data Scientists Make Reliable Decisions with Data
How Data Scientists Make Reliable Decisions with DataHow Data Scientists Make Reliable Decisions with Data
How Data Scientists Make Reliable Decisions with Data
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learning
 

Más de Domino Data Lab

What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...Domino Data Lab
 
The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...Domino Data Lab
 
Racial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops dataRacial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops dataDomino Data Lab
 
Data Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itData Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itDomino Data Lab
 
Supporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationSupporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationDomino Data Lab
 
Leveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryLeveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryDomino Data Lab
 
Summertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile VirusSummertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile VirusDomino Data Lab
 
Reproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterReproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterDomino Data Lab
 
GeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data ScienceGeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data ScienceDomino Data Lab
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Domino Data Lab
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Domino Data Lab
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at ScaleDomino Data Lab
 
How I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataHow I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataDomino Data Lab
 
Software Engineering for Data Scientists
Software Engineering for Data ScientistsSoftware Engineering for Data Scientists
Software Engineering for Data ScientistsDomino Data Lab
 
Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Domino Data Lab
 
Building Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyBuilding Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyDomino Data Lab
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsDomino Data Lab
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino Data Lab
 
The Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceThe Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceDomino Data Lab
 

Más de Domino Data Lab (20)

What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...
 
The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...
 
Racial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops dataRacial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops data
 
Data Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itData Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using it
 
Supporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationSupporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentation
 
Leveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryLeveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive Industry
 
Summertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile VirusSummertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile Virus
 
Reproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterReproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with Jupyter
 
GeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data ScienceGeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data Science
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at Scale
 
How I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataHow I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked Data
 
Software Engineering for Data Scientists
Software Engineering for Data ScientistsSoftware Engineering for Data Scientists
Software Engineering for Data Scientists
 
Making Big Data Smart
Making Big Data SmartMaking Big Data Smart
Making Big Data Smart
 
Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...
 
Building Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyBuilding Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technology
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...
 
The Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceThe Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data Science
 

Último

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 

Último (20)

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Correctness in Data Science - Data Science Pop-up Seattle

  • 1. #datapopupseattle Correctness in Data Science Benjamin Skrainka Principal Data Scientist + Lead Instructor, Galvanize bskrainka galvanize
  • 2. #datapopupseattle UNSTRUCTURED Data Science POP-UP in Seattle www.dominodatalab.com D Produced by Domino Data Lab Domino’s enterprise data science platform is used by leading analytical organizations to increase productivity, enable collaboration, and publish models into production faster.
  • 3. Correctness in Data Science Benjamin S. Skrainka October 7, 2015 Benjamin S. Skrainka Correctness in Data Science October 7, 2015 1 / 24
  • 4. The correctness problem A lot of (data) science is unscientific: “My code runs, so the answer must be correct” “It passed Explain Plan, so the answer is correct” “This model is too complex to have a design document” “It is impossible to unit test scientific code” “The lift from the direct mail campaign is 10%”" Benjamin S. Skrainka Correctness in Data Science October 7, 2015 2 / 24
  • 5. Correctness matters Bad (data) science: Costs real money and can kill people Will eventually damage your reputation and career Could expose you to litigation An issue of basic integrity and sleeping at night Benjamin S. Skrainka Correctness in Data Science October 7, 2015 3 / 24
  • 6. Objectives Today’s goals: Introduce VV&UQ framework to evaluate correctness of scientific models Survey good habits to improve quality of your work Benjamin S. Skrainka Correctness in Data Science October 7, 2015 4 / 24
  • 7. Verification, Validation, & Uncertainty Quantification Benjamin S. Skrainka Correctness in Data Science October 7, 2015 5 / 24
  • 8. Introduction to VV&UQ Verification, Validation, & Uncertainty Quantification provides epistemological framework to evaluate correctness of scientific models: Evidence of correctness should accompany any prediction In absence of evidence, assume predictions are wrong Popper: can only disprove or fail to disprove a model VV&UQ is inductive whereas science is deductive Reference: Verification and Validation in Scientific Computing by Oberkampf & Roy Benjamin S. Skrainka Correctness in Data Science October 7, 2015 6 / 24
  • 9. Definitions of VV&UQ Definitions of terms (Oberkampf & Roy): Verification: I “solving equations right” I I.e., code implements the model correctly Validation: I “solving right equations” I I.e., model has high fidelity to reality Definitions of VV&UQ will vary depending on source . . . æ Most organizations do not even practice verification. . . Benjamin S. Skrainka Correctness in Data Science October 7, 2015 7 / 24
  • 10. Definition of UQ Definition of Uncertainty Quantification (Oberkampf & Roy): Process of identifying, characterizing, and quantifying those factors in an analysis which could a ect accuracy of computational results Do your assumptions hold? When do they fail? Does your model apply to the data/situation? Where does your model break down? What are its limits? Benjamin S. Skrainka Correctness in Data Science October 7, 2015 8 / 24
  • 11. Verification of code Does your code implement the model correctly? Unit test everything you can: I Scientific code can be unit tested I Test special cases I Test on cases with analytic solutions I Test on synthetic data Unit test framework will setup and tear-down fixtures Should be able to recover parameters from Monte Carlo data Benjamin S. Skrainka Correctness in Data Science October 7, 2015 9 / 24
  • 12. Verification of SQL Passing Explain Plan doesn’t mean your SQL is correct: Garbage in, garbage out Check a simple case you can compute by hand Check join plan is correct Check aggregate statistics Check answer is compatible with reality Benjamin S. Skrainka Correctness in Data Science October 7, 2015 10 / 24
  • 13. Unit test import unittest2 as unittest import assignment as problems class TestAssignment(unittest.TestCase): def test_zero(self): result = problems.question_zero() self.assertEqual(result, 9198) ... if __name__ == __main__ : unittest.main() Benjamin S. Skrainka Correctness in Data Science October 7, 2015 11 / 24
  • 14. Unit test Figure 1:Benjamin S. Skrainka Correctness in Data Science October 7, 2015 12 / 24
  • 15. Validation of model Check your model is a good (enough) representation of reality: “All models are wrong but some are useful” – George Box Run an experiment Perform specification testing Test assumptions hold Beware of endogenous features Benjamin S. Skrainka Correctness in Data Science October 7, 2015 13 / 24
  • 16. Approaches to experimentation Many ways to test: A/B test Multi-armed bandit Bayesian A/B test Wald sequential analysis Benjamin S. Skrainka Correctness in Data Science October 7, 2015 14 / 24
  • 17. Uncertainty quantification There are many types of uncertainty which a ect the robustness of your model: Parameter uncertainty Structural uncertainty Algorithmic uncertainty Experimental uncertainty Interpolation uncertainty Classified as aleatoric (statistical) and epistemic (systematic) Benjamin S. Skrainka Correctness in Data Science October 7, 2015 15 / 24
  • 18. Good habits Benjamin S. Skrainka Correctness in Data Science October 7, 2015 16 / 24
  • 19. Act like a software engineer Use best practices from software engineering: Good design of code Follow a sensible coding convention Version control Use same file structure for every project Unit test Use PEP8 or equivalent Perform code reviews Benjamin S. Skrainka Correctness in Data Science October 7, 2015 17 / 24
  • 20. Reproducible research ‘Document what you do and do what you document’: Keep a journal! Data provenance How data was cleaned Design document Specification & requirements Do you keep a journal? You should. Fermi taught me that. – John A. Wheeler Benjamin S. Skrainka Correctness in Data Science October 7, 2015 18 / 24
  • 21. Follow a workflow Use a workflow like CRISP-DM: 1 Define business question and metric 2 Understand data 3 Prepare data 4 Build model 5 Evaluate 6 Deploy Ensures you don’t forget any key steps Benjamin S. Skrainka Correctness in Data Science October 7, 2015 19 / 24
  • 22. Automate your data pipeline One-touch build of your application or paper: Automate entire workflow from raw data to final result Ensures you perform all steps Ensures all steps are known – no one o manual adjustments Avoids stupid human errors Auto generate all tables and figures Save time when handling new data . . . which always has subtle changes in formatting Benjamin S. Skrainka Correctness in Data Science October 7, 2015 20 / 24
  • 23. Write flexible code to handle data Use constants/macros to access data fields: Code will clearly show what data matters Easier to understand code and data pipeline Easier to debug data problems Easier to handles changes in data formatting Benjamin S. Skrainka Correctness in Data Science October 7, 2015 21 / 24
  • 24. Python example # Setup indicators ix_gdp = 7 ... # Load & clean data m_raw = np.recfromcsv( bea_gdp.csv ) gdp = m_raw[:, ix_gdp] ... Benjamin S. Skrainka Correctness in Data Science October 7, 2015 22 / 24
  • 25. Politics. . . Often, there is political pressure to violate best practice: Examples: I 80% confidence intervals I Absurd attribution window I Two year forecast horizon but only three months of data Hard to do right thing vs. senior management Recruit a high-level scientist to advocate Particularly common with forecasting: I Often requested by management for CYA I Insist on a ‘panel of experts’ for impossible decisions Benjamin S. Skrainka Correctness in Data Science October 7, 2015 23 / 24
  • 26. Conclusion Need to raise the quality of data science: VV & UQ provides rigorous framework: I Verification: solve the equations right I Validation: solve the right equations I Uncertainty quantification: how robust is model to unknowns? Adopting good habits provides huge gains for minimal e ort Benjamin S. Skrainka Correctness in Data Science October 7, 2015 24 / 24