The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Data Provenance for Data Science
1. Prof. Paolo Missier
School of Computing
Newcastle University, UK
May, 2021
In collaboration with:
Prof. Torlone, Giulia Simonelli, Luca Lauro – Università Roma Tre, Italy
Prof. Chapman – University of Southampton, UK
2. 2
Data → Model → Predictions
[Diagram: raw datasets are collected (data collection), pre-processed into feature instances, and fed to a model that predicts about you a ranking, a score, or a class.]
Key decisions are made during data selection and processing:
- Where does the data come from?
- What's in the dataset?
- What transformations were applied?
3. 3
A concrete example
The classic “Titanic” dataset: can you predict survival probabilities?
• Approach: simple logistic regression analysis
Features:
Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name - Name
Sex - Sex
Age - Age
SibSp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare (British pound)
Cabin - Cabin
Embarked - Port of Embarkation (C = Cherbourg; Q =
Queenstown; S = Southampton)
Outcome:
Survived (0 = No; 1 = Yes)
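The approach on this slide can be sketched end to end. The following is a minimal pure-Python sketch (no libraries): a logistic regression trained by batch gradient descent on a handful of illustrative rows, not the real Titanic data; the row values, learning rate, and epoch count are all assumptions for illustration.

```python
import math

# Hypothetical mini-sample of the Titanic data: (Pclass, Sex, Age, Survived).
# The column names follow the feature list above; the rows are illustrative only.
rows = [
    (1, "female", 38.0, 1), (3, "male", 22.0, 0), (3, "female", 26.0, 1),
    (1, "female", 35.0, 1), (3, "male", 35.0, 0), (2, "male", 54.0, 0),
    (3, "male", 2.0, 0),    (1, "male", 40.0, 0), (2, "female", 27.0, 1),
    (3, "female", 4.0, 1),
]

def encode(pclass, sex, age):
    """Numeric feature vector with a bias term; Sex encoded as 0/1, Age scaled."""
    return [1.0, float(pclass), 1.0 if sex == "female" else 0.0, age / 80.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, lr=0.5, epochs=2000):
    """Plain batch gradient descent for logistic regression."""
    w = [0.0] * 4
    for _ in range(epochs):
        grad = [0.0] * 4
        for pclass, sex, age, y in rows:
            x = encode(pclass, sex, age)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(4):
                grad[j] += (p - y) * x[j]
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

w = train(rows)
p_survive = sigmoid(sum(wi * xi for wi, xi in zip(w, encode(1, "female", 30.0))))
```

With these toy rows, Sex separates the classes perfectly, so the learned model predicts survival for a first-class female passenger and non-survival for a third-class male passenger.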
4. 4
Enable analysis of data pre-processing
The data preparation workflow includes a number of decisions:
- Is the target class balanced? (down- / up-sample)
- Dropping irrelevant attributes: 'PassengerId', 'Name', 'Ticket', 'Cabin'
- Managing missing values: Age is present in only 714/891 records; on the assumption that “Pclass is a good predictor for Age”, impute missing Age values using the average age for each Pclass
- Dropping correlated features (?): drop “Fare”, “Pclass”
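The imputation decision above can be sketched in a few lines of plain Python (the passenger values here are hypothetical, not rows of the real dataset): fill each missing Age with the average age of that passenger's Pclass.

```python
from statistics import mean

# Hypothetical records; None marks a missing Age value.
passengers = [
    {"Pclass": 1, "Age": 38.0}, {"Pclass": 1, "Age": 42.0},
    {"Pclass": 3, "Age": 22.0}, {"Pclass": 3, "Age": None},
    {"Pclass": 3, "Age": 26.0}, {"Pclass": 1, "Age": None},
]

def impute_age_by_pclass(passengers):
    """Replace missing Age values with the mean Age of the same Pclass."""
    by_class = {}
    for p in passengers:
        if p["Age"] is not None:
            by_class.setdefault(p["Pclass"], []).append(p["Age"])
    means = {c: mean(ages) for c, ages in by_class.items()}
    return [
        {**p, "Age": means[p["Pclass"]] if p["Age"] is None else p["Age"]}
        for p in passengers
    ]

filled = impute_age_by_pclass(passengers)
```

Note that this is exactly the kind of silent decision a provenance record would need to capture: which rows were imputed, and from which other rows the fill values were derived.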
6. 6
Also: the script alludes to human decisions
How do we capture these decisions?
To what extent can they be inferred from the code?
7. 7
Correlation analysis
• Is Pclass really a good predictor for Age?
• Why drop both Pclass and Fare?
Alternative pre-processing:
1. Drop Age only (nearly identical performance: F1 = 0.77 vs 0.76)
2. Use Sex and Pclass only
8. 8
Also: exploring the effect of alternative pre-processing
[Diagram: dataset D is transformed by pre-processing P1 into D1; a model M1 is learned from D1 and predicts y1 for input x. An alternative pre-processing P2 yields D2, model M2, and prediction y2, with y1 ≠ y2.]
Ex.: alternative imputation methods for missing values
Ex.: boost the minority class / downsample the majority class
How can knowledge of P1 and P2 help understand why y1 ≠ y2?
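A tiny sketch of the effect illustrated above, with made-up values: the same record pushed through two alternative imputation choices (P1 = mean, P2 = median) ends up with different feature values, which can flip a threshold-based prediction. The threshold "model" is purely illustrative.

```python
from statistics import mean, median

ages = [2.0, 22.0, 26.0, 35.0, 80.0]   # observed ages; one extra record has Age missing

def preprocess(ages, missing_fill):
    """Append one missing-Age record and fill it with the given value."""
    return [a if a is not None else missing_fill for a in ages + [None]]

d1 = preprocess(ages, mean(ages))      # P1: mean imputation   -> fill = 33.0
d2 = preprocess(ages, median(ages))    # P2: median imputation -> fill = 26.0

# A toy "model": predict survival if the (imputed) age is below 30.
y1 = d1[-1] < 30
y2 = d2[-1] < 30
```

Here y1 ≠ y2 purely because of the pre-processing choice; only knowledge of P1 and P2 (i.e., their provenance) explains the disagreement.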
9. 9
Some concrete questions
Appropriateness of training set, bias: is the training data fit to learn from?
Appropriateness of pre-processing: were best practices followed?
Debugging / explaining: output value Y looks wrong; can you tell me how it was produced?
Auditing:
• Who was responsible for generating output Y?
• Has any privacy agreement been violated in producing Y?
Access control: access to Y may be restricted based on the derivation history of Y
10. 10
Traceability, explainability, transparency – EU regulations
“Why was my mortgage application refused?” The bias problem originates in the data and its pre-processing!
Article 12 Record-keeping
1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events
(‘logs’) while the high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or
common specifications.
Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels,
21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090
“AI systems that create a high risk to the health and safety or fundamental rights of natural persons […] the classification as high-risk does not only depend on the function performed by the AI system, but also on the specific purpose and modalities for which that system is used.”
Examples of high-risk uses:
- used for the purpose of assessing students
- recruitment or selection of natural persons
- evaluate the eligibility of natural persons for public assistance benefits and services
- evaluate the creditworthiness of natural persons or establish their credit score
- used by law enforcement authorities for making individual risk assessments
11. 12
Provenance
A possible approach to help answer some of the questions:
1. Automatically generate metadata that describes the flow of data through the pipeline as it occurs
2. Persistently store the metadata for each run of the pipeline
3. Map the questions to queries on the metadata store
Data provenance is a structured form of metadata that may fit the purpose
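Steps 1-3 above can be sketched in a few lines; this is an illustrative design, not the authors' implementation, and all names (`traced`, `store`, the operator lambdas) are hypothetical. Each pipeline step is wrapped so that a record of its inputs, outputs, and operation is appended to a per-run metadata store, and a question is answered by querying that store.

```python
import uuid

run_id = str(uuid.uuid4())
store = []   # per-run metadata; an in-memory list standing in for a persistent DB

def traced(op_name, fn, inputs):
    """Step 1-2: run a pipeline step and persist a provenance entry for this run."""
    output = fn(inputs)
    store.append({"run": run_id, "op": op_name,
                  "inputs": inputs, "output": output})
    return output

d = traced("drop_nulls", lambda xs: [x for x in xs if x is not None],
           [22.0, None, 35.0])
d = traced("scale", lambda xs: [x / 80.0 for x in xs], d)

# Step 3: map a question ("what transformations were applied?") to a query.
ops_applied = [e["op"] for e in store if e["run"] == run_id]
```

The same store supports the earlier audit questions: filtering entries by run, by operation, or by the presence of a given value in `inputs`.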
12. 13
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.;
• a record of the passage of an item through its various owners: chain of custody
Magna Carta (‘the Great Charter’) was agreed
between King John and his barons on 15 June 1215.
13. 14
The W3C PROV model (2013)
[Diagram: a “processing” activity uses Input 1 … Input n (usage) and generates Output 1 … Output m (generation); each output is derived from the inputs (derivation).]
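The three PROV relations in the figure (usage, generation, derivation) can be sketched with stdlib dataclasses; the identifiers are hypothetical and this is a structural sketch, not a W3C PROV serialization.

```python
from dataclasses import dataclass, field

@dataclass
class Activity:
    name: str
    used: list = field(default_factory=list)        # usage: entities read
    generated: list = field(default_factory=list)   # generation: entities written

# wasDerivedFrom edges: (output_entity, input_entity) pairs
derivations = []

def record(activity, inputs, outputs):
    """Record that `activity` used `inputs`, generated `outputs`,
    and that each output was derived from every input."""
    activity.used.extend(inputs)
    activity.generated.extend(outputs)
    for o in outputs:
        for i in inputs:
            derivations.append((o, i))

proc = Activity("processing")
record(proc, ["input:1", "input:2"], ["output:1"])
```

In real deployments these records would be emitted in a standard PROV serialization (PROV-N, PROV-JSON, …) so that generic tooling can query them.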
15. 18
[Diagram: a typical ML pipeline with a provenance trace attached. Data sources → Acquisition, wrangling → Training / test split → Training set and Test set → Model Selection → Model Learning → Model Validation → Model Testing → Model M → Model Usage → Predictions.
Decision points during acquisition and wrangling:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points when preparing for learning:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
The provenance trace records the intermediate datasets (D', D'', …) produced by steps such as imputation and feature selection, the training / test split, the hyperparameters, and the configuration choices (C1, C2, C3) leading to the learned model M.]
Pipeline structure with provenance annotations
16. 19
Can provenance help address the new EU regulations?
Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) - Brussels,
21.4.2021: https://ec.europa.eu/newsroom/dae/items/709090
Article 12 Record-keeping
2. The logging capabilities shall ensure a level of traceability of the AI system’s functioning throughout its lifecycle that
is appropriate to the intended purpose of the system.
3. In particular, logging capabilities shall enable the monitoring of the operation of the high-risk AI system with respect
to the occurrence of situations that may result in the AI system presenting a risk within the meaning of Article 65(1) or
lead to a substantial modification, and facilitate the post-market monitoring referred to in Article 61.
4. For high-risk AI systems referred to in paragraph 1, point (a) of Annex III, the logging capabilities shall provide, at a
minimum:
(a) recording of the period of each use of the system (start date and time and end date and time of each use);
(b) the reference database against which input data has been checked by the system;
(c) the input data for which the search has led to a match;
(d) the identification of the natural persons involved in the verification of the results, as referred to in Article 14 (5).
17. 20
Provenance of what?
Base case: opaque program Po, coarse-grained dataset
- Default provenance: every output depends on every input
Better: transparent program PT, coarse-grained datasets
Better still: transparent program PT, fine-grained datasets
Best: transparent pipeline, fine-grained datasets
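The contrast between the base case and the fine-grained case can be made concrete; the functions and identifiers below are illustrative, not part of the authors' system. For an opaque program the only safe claim is all-to-all dependency, while a transparent row-wise operator can emit per-element dependencies.

```python
def default_provenance(inputs, outputs):
    """Opaque program Po: every output depends on every input."""
    return {o: set(inputs) for o in outputs}

def fine_grained_provenance(rows):
    """Transparent row-wise operator: output row i depends only on input row i."""
    return {f"out:{i}": {f"in:{i}"} for i in range(len(rows))}

coarse = default_provenance(["in:0", "in:1"], ["out:0", "out:1"])
fine = fine_grained_provenance([10, 20])
```

The coarse trace is cheap but nearly useless for debugging ("everything depends on everything"); the fine-grained trace supports the per-value questions of slide 9 at the cost of capture overhead.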
18. 23
Data Provenance for Data Science: technical insight
Technical approach [1]
- Formalisation of provenance patterns for pipeline operators
- Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines
- Demonstration of provenance queries
- Performance analysis
- Collecting provenance incurs space and time overhead
- Performance of provenance queries
[1]. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier,
P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507-520, January, 2021.
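The kind of provenance query demonstrated in [1] can be sketched as a graph traversal; the trace contents and identifiers below are hypothetical. Given stored derivation edges, an output item is traced back to the raw inputs it was derived from.

```python
# Hypothetical stored trace: item -> set of items it was derived from.
derived_from = {
    "pred:y":   {"feat:r1"},
    "feat:r1":  {"clean:r1"},
    "clean:r1": {"raw:r1", "raw:r2"},
}

def lineage(item):
    """All ancestors of `item` under transitive wasDerivedFrom."""
    seen, stack = set(), [item]
    while stack:
        for parent in derived_from.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

A query like "how was output value Y produced?" then reduces to `lineage("pred:y")`, whose cost depends on the trace size, which is where the space/time overhead measured in the performance analysis comes in.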
19. 24
Pre-processing operators
[1] Berti-Équille L. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. In: The World Wide Web Conference (WWW ’19). New York, NY, USA: ACM Press; 2019. p. 2580–6.
[2] García S, Ramírez-Gallego S, Luengo J, Benítez JM, Herrera F. Big data preprocessing: methods and prospects. Big Data Anal. 2016 Dec 1;1(1):9.
30. 35
Summary
Multiple hypotheses regarding Data Provenance for Data Science:
1. Is it practical to collect fine-grained provenance?
   a. To what extent can it be done automatically?
   b. How much does it cost?
2. Is it also useful? Does it help address the key questions on high-risk AI systems?
And what about the data used to train / build the model?
Feature-transformation pattern: given features $X=[\mathbf{a}_1 \ldots \mathbf{a}_k]$ and new features $Y=[\mathbf{a}'_1 \ldots \mathbf{a}'_l]$, the new values for each row are obtained by applying $f$ to the values in the $X$ features.