SlideShare una empresa de Scribd logo
1 de 32
Prof. Paolo Missier
School of Computing
Newcastle University, UK
May, 2021
Data Provenance for Data Science
In collaboration with:
Prof. Torlone, Giulia Simonelli, Luca Lauro – Universita’ RomaTre, Italy
Prof. Chapman -- University of Southampton, UK
2
Data  Model  Predictions
Model
pre-processing
Raw
datasets
features
Predicted you:
- Ranking
- Score
- Class
Data
collection
Instances
Key decisions are made during data selection and
processing:
- Where does the data come from?
- What’s in the dataset?
- What transformations were applied?
3
A concrete example
<event
name>
The classic ”Titanic” dataset: Can you predict survival probabilities?
• Approach: simple logistic regression analysis
Features:
Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name - Name
Sex - Sex
Age - Age
SibSp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare (British pound)
Cabin - Cabin
Embarked - Port of Embarkation (C = Cherbourg; Q =
Queenstown; S = Southampton)
Outcome:
Survived (0 = No; 1 = Yes)
4
<event
name>
Enable analysis of data pre-processing
Is the target class
balanced?
(down / upsample)
Data preparation workflow includes a number of decisions
Dropping
irrelevant
attributes
PassengerId',
'Name',
'Ticket',
'Cabin'
Managing
missing
values
Age missing in 714/891
records
“Pclass is a good
predictor for age”
Impute Age values using
average age for PClass
Dropping correlated
features (?)
Drop
“Fare”, “Pclass”
5
Example: missing values imputation
<event
name>
6
Also: script alludes to human decisions
<event
name>
How do we capture these decisions?
To what extent can they be inferred from code?
7
Correlation analysis
<event
name>
• Is Pclass really a good
predictor for Age?
• Why drop both PClass and
Fare?
1. Dropped Age only
(Nearly identical performance (F1=0.77, 0.76))
2. Use sex, Pclass only
Alternative pre-processing:
8
<event
name>
Also: exploring the effect of alternative pre-processing
D
P1 D1 Learn M1 Predict
x
y1
How can knowledge of P1, P2 help understand why y1 ≠ y2 ?
Ex. Alternative imputation methods for missing values
Ex. Boost minority class / downsample majority class
P2 D2 Learn M2 Predict y2
y1 ≠ y2
9
Some concrete questions
<event
name>
Appropriateness of training set, bias: Is training data fit to learn from?
Appropriateness of preprocessing: where best practices followed?
Debugging / Explaining: output value Y looks wrong, can you tell me how it was produced
Auditing:
• Who was responsible for generating output Y?
• Has any privacy agreement been violated in producing Y?
Access control: access to Y may be restricted based on the derivation history of Y
10
<event
name>
Traceability, explainability, transparency – EU regulations
“Why was my mortgage application refused?” The bias problem originates in the data and its pre-processing!
Article 12 Record-keeping
1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events
(‘logs’) while the high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or
common specifications.
Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels,
21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090
“AI systems that create a high risk to the health and safety or fundamental rights of natural persons/ […] the
classification as high-risk does not only depend on the function performed by the AI system, but also on the specific
purpose and modalities for which that system is used.
- used for the purpose of assessing students
- recruitment or selection of natural persons
- evaluate the eligibility of natural persons for public assistance benefits and services
- evaluate the creditworthiness of natural persons or establish their credit score
- used by law enforcement authorities for making individual risk assessments
12
<event
name>
Provenance
A possible approach to help answer some of the questions:
1. Automatically generate metadata that describes the flow of data through the pipeline as it occurs
2. Persistently store the metadata for each run of the pipeline
3. Map the questions to queries on the metadata store
Data provenance is a structured form of metadata that may fit the purpose
Article 12 Record-keeping
1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events (‘logs’) while the
high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or common specifications.
13
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.;
• a record of the passage of an item through its various owners: chain of custody
Magna Carta (‘the Great Charter’) was agreed
between King John and his barons on 15 June 1215.
14
The W3C PROV model (2013)
processing
Input 1
Input n
usage
usage
Output 1
Output m
generation
generation
(derivation)
(derivation)
15
The W3C PROV model (2013)
https://www.w3.org/TR/prov-dm/
18
M
Data
sources
Acquisition,
wrangling
Test
set
Training
set
Preparing for learning
Model
Selection
Training /
test split
Model
Testing
Model
Learning
Model
Validation
Predictions
Model
Usage
Decision points:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
Provenance
trace
M
Model
Learning
Training
set
Training /
test split
Imputation
Feature
selection
D’ D’’
…
Hyper
parameters
C1 C2
C3
Pipeline structure with provenance annotations
19
<event
name>
Can provenance help address the new EU regulations?
Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels,
21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090
Article 12 Record-keeping
2. The logging capabilities shall ensure a level of traceability of the AI system’s functioning throughout its lifecycle that
is appropriate to the intended purpose of the system.
3. In particular, logging capabilities shall enable the monitoring of the operation of the high-risk AI system with respect
to the occurrence of situations that may result in the AI system presenting a risk within the meaning of Article 65(1) or
lead to a substantial modification, and facilitate the post-market monitoring referred to in Article 61.
4. For high-risk AI systems referred to in paragraph 1, point (a) of Annex III, the logging capabilities shall provide, at a
minimum:
(a) recording of the period of each use of the system (start date and time and end date and time of each use);
(b) the reference database against which input data has been checked by the system;
(c) the input data for which the search has led to a match; EN 50 EN
(d) the identification of the natural persons involved in the verification of the results, as referred to in Article 14 (5).
20
<event
name>
Provenance of what?
- Transparent pipeline
- Fine-grained datasets
- Transparent program PT
- Fine-grained datasets
Base case:
- opaque program Po
- coarse-grained dataset
Default provenance:
- Every output depends on every input
- Transparent program PT
- coarse-grained datasets
23
Data Provenance for Data Science: technical insight
Technical approach [1]
- Formalisation of provenance patterns for pipeline operators
- Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines
- Demonstration of provenance queries
- Performance analysis
- Collecting provenance incurs space and time overhead
- Performance of provenance queries
[1]. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier,
P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507-520, January, 2021.
24
Pre-processing operators
<event
name>
[1] Berti-Equille L. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. In: The World Wide Web Conference on - WWW ’19. New York, New York, USA:
ACM Press; 2019. p. 2580–6.
[1]
[2] García S, Ramírez-Gallego S, Luengo J, Benítez JM, Herrera F. Big data preprocessing: methods and prospects. Big Data Anal. 2016 Dec 1;1(1):9.
[2]
25
Typical operators used in data prep
26
Operators
14/03/2021 03_ b _c .
:///U / 65/D a /03_ b _c . 1/1
14/03/2021 03_ b _c .
:///U / 65/D a /03_ b _c . 1/1
op
Data reduction
- Feature selection
- Instance selection
Data augmentation
- Space transformation
- Instance generation
- Encoding (eg one-hot…)
Data transformation
- Data repair
- Binarisation
- Normalisation
- Discretisation
- Imputation
Ex.: vertical augmentation  adding columns
27
Making your code provenance-aware
df = pd.DataFrame(…)
# Create a new provenance document
p = pr.Provenance(df, savepath)
# create provanance tracker
tracker=ProvenanceTracker.ProvenanceTracker(df, p)
# instance generation
tracker.df = tracker.df.append({'key2': 'K4'},
ignore_index=True)
# imputation
tracker.df = tracker.df.fillna('imputato')
# feature transformation of column D
tracker.df['D'] = tracker.df['D']*2
# Feature transformation of column key2
tracker.df['key2'] = tracker.df['key2']*2
Idea:
A python tracker object intercepts dataframe
operations
Operations that are channeled through the tracker
generate provenance fragments
28
Provenance patterns
29
Provenance templates
Template + binding rules = instantiated provenance fragment
+
14/03/2021 03_ b _c .
:///U / 65/D a /03_ b _c . 1/1
14/03/2021 03_ b _c .
:///U / 65/D a /03_ b _c . 1/1
op
{old values: F, I, V}  {new values: F’, J, V’}
30
This applies to all operators…
31
Putting it all together
32
Evaluation - performance
33
Evaluation: Provenance capture and query times
34
Scalability
35
Summary
Multiple hypotheses regarding Data Provenance for Data Science:
1. Is it practical to collect fine-grained provenance?
1. To what extent can it be done automatically?
2. How much does it cost?
2. Is it also useful?  does it help addressing the key questions on high-risk AI systems?
Questions?
<event
name>
37
<event
name>
SPARES

Más contenido relacionado

La actualidad más candente

The openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageThe openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query Language
Neo4j
 

La actualidad más candente (20)

Standards Metadata Management (system)
Standards Metadata Management (system)Standards Metadata Management (system)
Standards Metadata Management (system)
 
Neo4j Training Modeling
Neo4j Training ModelingNeo4j Training Modeling
Neo4j Training Modeling
 
Introduction to US Privacy and Data Security Regulations and Requirements (Se...
Introduction to US Privacy and Data Security Regulations and Requirements (Se...Introduction to US Privacy and Data Security Regulations and Requirements (Se...
Introduction to US Privacy and Data Security Regulations and Requirements (Se...
 
Urgensi RUU Perlindungan Data Pribadi
Urgensi RUU Perlindungan Data PribadiUrgensi RUU Perlindungan Data Pribadi
Urgensi RUU Perlindungan Data Pribadi
 
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
 
GREEND: An energy consumption dataset of households in Austria and Italy
GREEND: An energy consumption dataset of households in Austria and ItalyGREEND: An energy consumption dataset of households in Austria and Italy
GREEND: An energy consumption dataset of households in Austria and Italy
 
Metadata Strategies
Metadata StrategiesMetadata Strategies
Metadata Strategies
 
Danish Business Authority: Explainability and causality in relation to ML Ops
Danish Business Authority: Explainability and causality in relation to ML OpsDanish Business Authority: Explainability and causality in relation to ML Ops
Danish Business Authority: Explainability and causality in relation to ML Ops
 
Dublin Core, the DCMI Abstract Model & DC Application Profiles
Dublin Core, the DCMI Abstract Model & DC Application ProfilesDublin Core, the DCMI Abstract Model & DC Application Profiles
Dublin Core, the DCMI Abstract Model & DC Application Profiles
 
Data-Ed Webinar: Data Quality Engineering
Data-Ed Webinar: Data Quality EngineeringData-Ed Webinar: Data Quality Engineering
Data-Ed Webinar: Data Quality Engineering
 
2017 07 03_meetup_d
2017 07 03_meetup_d2017 07 03_meetup_d
2017 07 03_meetup_d
 
SysCon 2013 SysML & Requirements
SysCon 2013 SysML & RequirementsSysCon 2013 SysML & Requirements
SysCon 2013 SysML & Requirements
 
Conceptual vs. Logical vs. Physical Data Modeling
Conceptual vs. Logical vs. Physical Data ModelingConceptual vs. Logical vs. Physical Data Modeling
Conceptual vs. Logical vs. Physical Data Modeling
 
도서관 Linked Open Data의 필요성
도서관 Linked Open Data의 필요성도서관 Linked Open Data의 필요성
도서관 Linked Open Data의 필요성
 
Master Data Management
Master Data ManagementMaster Data Management
Master Data Management
 
Metadata Workshop
Metadata WorkshopMetadata Workshop
Metadata Workshop
 
Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...
Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...
Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...
 
The openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageThe openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query Language
 
Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...
 
ISCB 2023 Sources of uncertainty b.pptx
ISCB 2023 Sources of uncertainty b.pptxISCB 2023 Sources of uncertainty b.pptx
ISCB 2023 Sources of uncertainty b.pptx
 

Similar a Data Provenance for Data Science

Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper Presentation
Shubham Singh
 
Ingredients for Semantic Sensor Networks
Ingredients for Semantic Sensor NetworksIngredients for Semantic Sensor Networks
Ingredients for Semantic Sensor Networks
Oscar Corcho
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And Grid
Ian Foster
 
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Artificial Intelligence Institute at UofSC
 

Similar a Data Provenance for Data Science (20)

10probs.ppt
10probs.ppt10probs.ppt
10probs.ppt
 
Collecting and analyzing network-based evidence
Collecting and analyzing network-based evidenceCollecting and analyzing network-based evidence
Collecting and analyzing network-based evidence
 
Energy Databank in Nigeria: Management ,Technology and Security
Energy Databank in Nigeria:   Management ,Technology and SecurityEnergy Databank in Nigeria:   Management ,Technology and Security
Energy Databank in Nigeria: Management ,Technology and Security
 
WSO2 Big Data Platform and Applications
WSO2 Big Data Platform and ApplicationsWSO2 Big Data Platform and Applications
WSO2 Big Data Platform and Applications
 
Sinnott Paper
Sinnott PaperSinnott Paper
Sinnott Paper
 
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
 
Beyond Online PDFs
Beyond Online PDFs Beyond Online PDFs
Beyond Online PDFs
 
Role of Big Data Analytics in Power System Application Ravi v angadi asst. pr...
Role of Big Data Analytics in Power System Application Ravi v angadi asst. pr...Role of Big Data Analytics in Power System Application Ravi v angadi asst. pr...
Role of Big Data Analytics in Power System Application Ravi v angadi asst. pr...
 
Sensitive Data Exposure Incident Checklist
Sensitive Data Exposure Incident ChecklistSensitive Data Exposure Incident Checklist
Sensitive Data Exposure Incident Checklist
 
Network Security Data Visualization
Network Security Data VisualizationNetwork Security Data Visualization
Network Security Data Visualization
 
IRJET- Fault Detection and Prediction of Failure using Vibration Analysis
IRJET-	 Fault Detection and Prediction of Failure using Vibration AnalysisIRJET-	 Fault Detection and Prediction of Failure using Vibration Analysis
IRJET- Fault Detection and Prediction of Failure using Vibration Analysis
 
Big Data and IOT
Big Data and IOTBig Data and IOT
Big Data and IOT
 
Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper Presentation
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
 
Ingredients for Semantic Sensor Networks
Ingredients for Semantic Sensor NetworksIngredients for Semantic Sensor Networks
Ingredients for Semantic Sensor Networks
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And Grid
 
Growing Information Intensity of Energy 2014
Growing Information Intensity of Energy 2014Growing Information Intensity of Energy 2014
Growing Information Intensity of Energy 2014
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
Grid Projects In The US July 2008
Grid Projects In The US July 2008Grid Projects In The US July 2008
Grid Projects In The US July 2008
 
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
 

Más de Paolo Missier

Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
Paolo Missier
 

Más de Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
 
ReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for HealthReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for Health
 

Último

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Data Provenance for Data Science

  • 1. Prof. Paolo Missier School of Computing Newcastle University, UK May, 2021 Data Provenance for Data Science In collaboration with: Prof. Torlone, Giulia Simonelli, Luca Lauro – Universita’ RomaTre, Italy Prof. Chapman -- University of Southampton, UK
  • 2. 2 Data  Model  Predictions Model pre-processing Raw datasets features Predicted you: - Ranking - Score - Class Data collection Instances Key decisions are made during data selection and processing: - Where does the data come from? - What’s in the dataset? - What transformations were applied?
  • 3. 3 A concrete example <event name> The classic ”Titanic” dataset: Can you predict survival probabilities? • Approach: simple logistic regression analysis Features: Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) Name - Name Sex - Sex Age - Age SibSp - Number of Siblings/Spouses Aboard Parch - Number of Parents/Children Aboard Ticket - Ticket Number Fare - Passenger Fare (British pound) Cabin - Cabin Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) Outcome: Survived (0 = No; 1 = Yes)
  • 4. 4 <event name> Enable analysis of data pre-processing Is the target class balanced? (down / upsample) Data preparation workflow includes a number of decisions Dropping irrelevant attributes PassengerId', 'Name', 'Ticket', 'Cabin' Managing missing values Age missing in 714/891 records “Pclass is a good predictor for age” Impute Age values using average age for PClass Dropping correlated features (?) Drop “Fare”, “Pclass”
  • 5. 5 Example: missing values imputation <event name>
  • 6. 6 Also: script alludes to human decisions <event name> How do we capture these decisions? To what extent can they be inferred from code?
  • 7. 7 Correlation analysis <event name> • Is Pclass really a good predictor for Age? • Why drop both PClass and Fare? 1. Dropped Age only (Nearly identical performance (F1=0.77, 0.76)) 2. Use sex, Pclass only Alternative pre-processing:
  • 8. 8 <event name> Also: exploring the effect of alternative pre-processing D P1 D1 Learn M1 Predict x y1 How can knowledge of P1, P2 help understand why y1 ≠ y2 ? Ex. Alternative imputation methods for missing values Ex. Boost minority class / downsample majority class P2 D2 Learn M2 Predict y2 y1 ≠ y2
  • 9. 9 Some concrete questions <event name> Appropriateness of training set, bias: Is training data fit to learn from? Appropriateness of preprocessing: where best practices followed? Debugging / Explaining: output value Y looks wrong, can you tell me how it was produced Auditing: • Who was responsible for generating output Y? • Has any privacy agreement been violated in producing Y? Access control: access to Y may be restricted based on the derivation history of Y
  • 10. 10 <event name> Traceability, explainability, transparency – EU regulations “Why was my mortgage application refused?” The bias problem originates in the data and its pre-processing! Article 12 Record-keeping 1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events (‘logs’) while the high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or common specifications. Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels, 21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090 “AI systems that create a high risk to the health and safety or fundamental rights of natural persons/ […] the classification as high-risk does not only depend on the function performed by the AI system, but also on the specific purpose and modalities for which that system is used. - used for the purpose of assessing students - recruitment or selection of natural persons - evaluate the eligibility of natural persons for public assistance benefits and services - evaluate the creditworthiness of natural persons or establish their credit score - used by law enforcement authorities for making individual risk assessments
  • 11. 12 <event name> Provenance A possible approach to help answer some of the questions: 1. Automatically generate metadata that describes the flow of data through the pipeline as it occurs 2. Persistently store the metadata for each run of the pipeline 3. Map the questions to queries on the metadata store Data provenance is a structured form of metadata that may fit the purpose Article 12 Record-keeping 1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events (‘logs’) while the high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or common specifications.
  • 12. 13 What is provenance? Oxford English Dictionary: • the fact of coming from some particular source or quarter; origin, derivation • the history or pedigree of a work of art, manuscript, rare book, etc.; • a record of the passage of an item through its various owners: chain of custody Magna Carta (‘the Great Charter’) was agreed between King John and his barons on 15 June 1215.
  • 13. 14 The W3C PROV model (2013) processing Input 1 Input n usage usage Output 1 Output m generation generation (derivation) (derivation)
  • 14. 15 The W3C PROV model (2013) https://www.w3.org/TR/prov-dm/
  • 15. 18 M Data sources Acquisition, wrangling Test set Training set Preparing for learning Model Selection Training / test split Model Testing Model Learning Model Validation Predictions Model Usage Decision points: - Source selection - Sample / population shape - Cleaning - Integration Decision points: - Sampling / stratification - Feature selection - Feature engineering - Dimensionality reduction - Regularisation - Imputation - Class rebalancing - … Provenance trace M Model Learning Training set Training / test split Imputation Feature selection D’ D’’ … Hyper parameters C1 C2 C3 Pipeline structure with provenance annotations
  • 16. 19 <event name> Can provenance help address the new EU regulations? Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels, 21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090 Article 12 Record-keeping 2. The logging capabilities shall ensure a level of traceability of the AI system’s functioning throughout its lifecycle that is appropriate to the intended purpose of the system. 3. In particular, logging capabilities shall enable the monitoring of the operation of the high-risk AI system with respect to the occurrence of situations that may result in the AI system presenting a risk within the meaning of Article 65(1) or lead to a substantial modification, and facilitate the post-market monitoring referred to in Article 61. 4. For high-risk AI systems referred to in paragraph 1, point (a) of Annex III, the logging capabilities shall provide, at a minimum: (a) recording of the period of each use of the system (start date and time and end date and time of each use); (b) the reference database against which input data has been checked by the system; (c) the input data for which the search has led to a match; EN 50 EN (d) the identification of the natural persons involved in the verification of the results, as referred to in Article 14 (5).
  • 17. 20 <event name> Provenance of what? - Transparent pipeline - Fine-grained datasets - Transparent program PT - Fine-grained datasets Base case: - opaque program Po - coarse-grained dataset Default provenance: - Every output depends on every input - Transparent program PT - coarse-grained datasets
  • 18. 23 Data Provenance for Data Science: technical insight Technical approach [1] - Formalisation of provenance patterns for pipeline operators - Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines - Demonstration of provenance queries - Performance analysis - Collecting provenance incurs space and time overhead - Performance of provenance queries [1]. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507-520, January, 2021.
  • 19. 24 Pre-processing operators <event name> [1] Berti-Equille L. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. In: The World Wide Web Conference on - WWW ’19. New York, New York, USA: ACM Press; 2019. p. 2580–6. [1] [2] García S, Ramírez-Gallego S, Luengo J, Benítez JM, Herrera F. Big data preprocessing: methods and prospects. Big Data Anal. 2016 Dec 1;1(1):9. [2]
  • 21. 26 Operators 14/03/2021 03_ b _c . :///U / 65/D a /03_ b _c . 1/1 14/03/2021 03_ b _c . :///U / 65/D a /03_ b _c . 1/1 op Data reduction - Feature selection - Instance selection Data augmentation - Space transformation - Instance generation - Encoding (eg one-hot…) Data transformation - Data repair - Binarisation - Normalisation - Discretisation - Imputation Ex.: vertical augmentation  adding columns
  • 22. 27 Making your code provenance-aware df = pd.DataFrame(…) # Create a new provenance document p = pr.Provenance(df, savepath) # create provanance tracker tracker=ProvenanceTracker.ProvenanceTracker(df, p) # instance generation tracker.df = tracker.df.append({'key2': 'K4'}, ignore_index=True) # imputation tracker.df = tracker.df.fillna('imputato') # feature transformation of column D tracker.df['D'] = tracker.df['D']*2 # Feature transformation of column key2 tracker.df['key2'] = tracker.df['key2']*2 Idea: A python tracker object intercepts dataframe operations Operations that are channeled through the tracker generate provenance fragments
  • 24. 29 Provenance templates Template + binding rules = instantiated provenance fragment + 14/03/2021 03_ b _c . :///U / 65/D a /03_ b _c . 1/1 14/03/2021 03_ b _c . :///U / 65/D a /03_ b _c . 1/1 op {old values: F, I, V}  {new values: F’, J, V’}
  • 25. 30 This applies to all operators…
  • 26. 31 Putting it all together
  • 30. 35 Summary Multiple hypotheses regarding Data Provenance for Data Science: 1. Is it practical to collect fine-grained provenance? 1. To what extent can it be done automatically? 2. How much does it cost? 2. Is it also useful?  does it help addressing the key questions on high-risk AI systems?

Notas del editor

  1. How about the data used to train / build the model?
  2. baseline-noAgents.provn
  3. \newcommand{\f}{\textbf{a}} \text{features}~ X=[\f_1 \ldots \f_k] \text{new features}~ Y=[\f'_1 \ldots \f'_l] \noindent new values for each row are  obtained by applying $f$\\ to values in the $X$ features