Analytics of analytics pipelines: from optimising re-execution to general Data Provenance for Data Science
1. Paolo Missier
School of Computing
Newcastle University, UK
March 2021
Analytics of analytics pipelines:
from optimising re-execution to general
Data Provenance for Data Science
paolo.missier@ncl.ac.uk
LinkedIn: paolomissier
Twitter: @PMissier
4. 4
What changes?
Life Sciences, Health care:
- Reference databases
- Algorithms and libraries
Simulation:
- Large parameter space
- Input conditions
Machine Learning:
- Evolving ground truth datasets
- Model re-training
5. 5
Motivating example: Genomics pipelines
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
https://www.genomicsengland.co.uk/the-100000-genomes-project/
Spark GATK tools on Azure:
45 mins / GB
@ 13GB / exome: about 10 hours
6. 6
Genomics: WES / WGS, variant calling → variant interpretation
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
SVI: Simple Variant Interpretation
Variant classification: pathogenic, benign, and unknown/uncertain
7. 7
Changes that affect variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
Evolution in the number of variants that affect patients: (a) with a specific phenotype; (b) across all phenotypes
8. 8
Blind reaction to change: a game of battleship
Sparsity issue:
• About 500 executions
• 33 patients
• Total runtime about 60 hours
• Only 14 relevant output changes detected
→ 4.2 hours of computation per detected change
Should we care about updates?
Evolving knowledge about gene variations
11. 11
f(.) unstable → heuristics
Impact: “Any variant with status moving from/to Red causes High impact on any patient who is affected by the variant”
Observation: variants v within output set y that are in scope for patient X remain in scope (monotonicity)
1. Variant v changes status:
- unknown → benign
- unknown → deleterious
2. Brand new variant
If in scope:
- compare status before / after → inexpensive
- recompute SVI on all inputs → expensive
Scope: which cases are affected? “A change in variant v can only have impact on a case X if v and X share the same phenotype”
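The scope rule and the impact heuristic above can be sketched as follows (a minimal sketch: the data layout and function names are illustrative, not the actual SVI code):

```python
def in_scope(variant, case):
    # Scope rule: a change to variant v can only impact case X
    # if v and X share the same phenotype.
    return variant["phenotype"] == case["phenotype"]

def impact(old_status, new_status):
    # Impact heuristic: any status transition from or to "red"
    # is treated as High impact for affected patients.
    if old_status != new_status and "red" in (old_status, new_status):
        return "High"
    return "Low"

cases = [{"id": "X1", "phenotype": "CMT"},
         {"id": "X2", "phenotype": "ALS"}]
change = {"variant": {"phenotype": "CMT"},
          "old": "unknown", "new": "red"}

# Only cases in scope need the cheap before/after comparison;
# all other cases skip re-execution entirely.
affected = [c["id"] for c in cases if in_scope(change["variant"], c)]
```

The cheap `impact` check replaces a full SVI re-run for every case outside the change's scope, which is where the savings come from.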
13. 13
ReComp
http://recomp.org.uk/
Outcome:
A framework for selective Re-computation
• Generic, Customisable
Scope:
expensive analysis +
frequent data changes +
not all data changes significant
Challenge:
Make re-computation efficient in
response to changes
Assumptions:
Processes are
• Observable
• Reproducible
Estimates are cheap
Insight: replace re-computation with change impact estimation
Using history of past executions
14. 14
Data Provenance in ReComp
Hypothesis:
collecting detailed {provenance, logs} from past executions helps optimise future executions
1. Identify the subset of executions that are potentially affected by the changes
2. Identify and re-execute the minimal fragments of workflow that have been affected
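Identifying the affected subset of executions can be read as a simple query over the history DB: select the past executions whose provenance records a usage of any changed data item. A sketch with an in-memory history (record fields are assumptions):

```python
# Each record notes which reference-data items an execution used,
# as captured in its provenance trace.
history = [
    {"exec": "e1", "used": {"ClinVar:v1", "GeneMap:g7"}},
    {"exec": "e2", "used": {"ClinVar:v9"}},
    {"exec": "e3", "used": {"GeneMap:g7"}},
]

changed = {"ClinVar:v1"}  # output of the data diff step

# Executions that used at least one changed item are candidates
# for (partial) re-execution; the rest are provably unaffected.
affected = [h["exec"] for h in history if h["used"] & changed]
```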
15. 15
Reproducibility
How?
Selective:
- Across a cohort of past executions: which subset of individuals?
- Within a single re-execution: which process fragments?
Example change events: ClinVar, GeneMap
Why, when, to what extent?
16. 16
The ReComp meta-process
Changes (change events):
• Reference datasets
• Inputs
Detect and quantify changes: data diff(d, d')
Record execution history: analytics process P → log / provenance → History DB
For each past instance: estimate impact of changes, Impact(d, d', o), via impact estimation functions
Scope: select relevant sub-processes (optimisation)
Partially re-execute: P(D) → P'(D')
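A diff function over two snapshots of a reference database can be sketched like this (illustrative only; it mirrors the diffCV definition in the editor's notes: variants whose class changed, plus variants added or removed between versions):

```python
def diff_cv(cv_old, cv_new):
    # Snapshots are maps: variant id -> classification.
    changed = {v for v in cv_old.keys() & cv_new.keys()
               if cv_old[v] != cv_new[v]}        # class changed
    added = cv_new.keys() - cv_old.keys()        # new variants
    removed = cv_old.keys() - cv_new.keys()      # retired variants
    return changed | added | removed

old = {"v1": "unknown", "v2": "benign"}
new = {"v1": "pathogenic", "v3": "benign"}
```

The diff output feeds the impact estimation functions, so re-computation is triggered only for executions touched by these variants.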
17. 17
How much do we know about the process?
Impact estimation and re-execution depend on how much is known about the process (its structure and execution trace), from black box (less) to white box (more):
- Black box: I/O provenance only → all-or-nothing re-execution (a monolithic or legacy process, a complex simulator)
- White box: step-by-step provenance → fine-grained impact, partial restart trees (*) (workflows, R / Python code, genomics analytics: the typical process)
(*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs.
IPAW 2018. London: Springer; 2018.
18. 18
Provenance of process executions
PROV example: the activity (a process or workflow run, e.g. cooking) wasAssociatedWith an agent (the chef) and a plan (the recipe); the data product (the finished dish) wasGeneratedBy the activity. A plan plays a role in an association.
20. 20
History DB: Workflow Provenance
Each invocation of a workflow generates a provenance trace.
The workflow WF (programs B1 → B2) is the “plan”; its execution WFexec is the “plan execution”. Block executions B1exec and B2exec are partOf WFexec; each is linked to its program by an association, to its input data and reference data (the db entity) by usage, and to its outputs by generation.
21. 21
SVI implemented using workflow
SVI as a three-stage workflow:
Patient variants + Phenotype → Phenotype-to-genes (uses GeneMap) → Variant selection → Variant classification (uses ClinVar) → Classified variants
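The three stages can be sketched as composable functions (signatures and data shapes are assumptions for illustration, not the actual SVI implementation):

```python
def phenotype_to_genes(phenotype, genemap):
    # GeneMap: phenotype -> genes known to be implicated.
    return genemap.get(phenotype, [])

def select_variants(patient_variants, genes):
    # Keep only the patient's variants that fall in the genes in scope.
    return [v for v in patient_variants if v["gene"] in genes]

def classify(variants, clinvar):
    # ClinVar: variant id -> class (pathogenic / benign / unknown).
    return [dict(v, cls=clinvar.get(v["id"], "unknown")) for v in variants]

genemap = {"CMT": ["MFN2"]}
clinvar = {"rs1": "pathogenic"}
patient = [{"id": "rs1", "gene": "MFN2"}, {"id": "rs2", "gene": "BRCA1"}]

classified = classify(
    select_variants(patient, phenotype_to_genes("CMT", genemap)), clinvar)
```

Factoring the pipeline into stages like this is what makes stage-level (partial) re-execution possible when only one reference database changes.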
22. 22
SVI – partial re-execution
Overhead: caching intermediate data

Time savings:
Change in   Partial re-exec (sec)   Complete re-exec (sec)   Time saving (%)
GeneMap     325                     455                      28.5
ClinVar     287                     455                      37
Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
23. 24
ReComp: Summary
Evaluation: case-by-case basis
- Cost savings
- Ease of customisation
Generic ReComp framework:
- Observe changes, Provenance DB (History), control re-exec
Customisation:
- Diff functions, impact functions
Fine-grained provenance + control max savings
25. 26
Data → Model → Predictions
Pipeline: data collection → raw datasets → pre-processing → features / instances → model → the predicted you (ranking, score, class)
Key decisions are made during data selection and processing:
- Where does the data come from?
- What’s in the dataset?
- What transformations were applied?
Complementing current ML approaches to model interpretability:
1. Can we explain these decisions?
2. Are these explanations useful?
26. 27
Explaining data preparation
Pipeline: data collection → raw datasets → population data pre-processing → features / instances → model → the predicted you (ranking, score, class)
Pre-processing steps:
- Integration
- Cleaning
- Outlier removal
- Normalisation
- Feature selection
- Class rebalancing
- Sampling
- Stratification
- …
Data acquisition and wrangling:
- How were datasets acquired?
- How recently?
- For what purpose?
- Are they being reused / repurposed?
- What is their quality?
Implementations:
- Scripts: Python / TensorFlow, Pandas, Spark
- Workflows: Knime, …
Provenance → Transparency
28. 35
Recent early results
A small grassroots project… [1]
- Formalisation of provenance patterns for pipeline operators
- Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines
- Reality check:
  - How much does it cost? (provenance volume)
  - Does it help? (queries against the provenance database)
[1] Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507-520, January 2021.
29. 36
Operators
Data reduction
- Feature selection
- Instance selection
Data augmentation
- Space transformation
- Instance generation
- Encoding (e.g. one-hot…)
Data transformation
- Data repair
- Binarisation
- Normalisation
- Discretisation
- Imputation
Ex.: vertical augmentation adding columns
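As an illustration of a vertical-augmentation operator, here is a plain-Python one-hot encoding that also records which columns it consumed and generated (the “provlet” layout is an assumption, not the paper's actual format):

```python
def one_hot(rows, column):
    # Replace a categorical column with one indicator column per value
    # (vertical augmentation: columns are added).
    values = sorted({r[column] for r in rows})
    out = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for val in values:
            new_row[f"{column}_{val}"] = int(r[column] == val)
        out.append(new_row)
    # Fine-grained provenance of the step: what went in, what came out.
    provlet = {"operator": "one-hot encoding",
               "consumed": [column],
               "generated": [f"{column}_{v}" for v in values]}
    return out, provlet

rows, provlet = one_hot([{"colour": "red"}, {"colour": "green"}], "colour")
```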
30. 37
Code instrumentation
- Initialize provenance capture
- Create a provlet for a specific transformation
…code injection is now being automated!
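One way to read “code injection” here is a wrapper around each transformation that emits a provlet automatically. A minimal sketch (function names and provlet fields are illustrative, not the project's API):

```python
import functools

PROVLETS = []  # stand-in for the provenance store

def capture(operator):
    # Decorator that records a provlet every time the wrapped
    # transformation runs: which operator, and input/output sizes.
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(rows, *args, **kwargs):
            result = fn(rows, *args, **kwargs)
            PROVLETS.append({"operator": operator,
                             "in_rows": len(rows),
                             "out_rows": len(result)})
            return result
        return wrapper
    return deco

@capture("drop_missing")
def drop_missing(rows):
    # Instance selection: drop rows with any missing value.
    return [r for r in rows if None not in r.values()]

clean = drop_missing([{"a": 1}, {"a": None}])
```

Automating the injection then amounts to applying such a wrapper to every pipeline operator instead of editing each call site by hand.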
38. 45
Summary
Multiple hypotheses regarding Data Provenance for Data Science:
1. Is it practical to collect fine-grained provenance?
   a. To what extent can it be done automatically?
   b. How much does it cost?
2. Is it also useful? What is the benefit to data analysts?
Work in progress! Interest? Ideas?
Editor's notes
We are going to use this smaller process as a testbed
Changes in the reference databases have an impact on the classification
returns updates in mappings to genes that have changed between the two versions (including possibly new mappings):
$\diffOM(\OM^t, \OM^{t'}) = \{\langle \dt, genes'(\dt) \rangle \mid genes(\dt) \neq genes'(\dt) \}$\\
where $genes'(\dt)$ is the new mapping for $\dt$ in $\OM^{t'}$.
\begin{align*}
\diffCV&(\CV^t, \CV^{t'}) = \\
&\{ \langle v, \varst'(v) \rangle \mid \varst(v) \neq \varst'(v) \} \\
& \cup \CV^{t'} \setminus \CV^t \cup \CV^t \setminus \CV^{t'}
\label{eq:diff-cv}
\end{align*}
where $\varst'(v)$ is the new class associated with $v$ in $\CV^{t'}$.
Threats: Will any of the changes invalidate prior findings?
Opportunities: Can the findings be improved over time?
Can we do better in a generic way?
We need to control re-computation on two dimensions
Across a population
Within a single process
f: X \rightarrow Y
\mathbf{y} = f(\mathbf{x}, D)
\mathbf{d} \rightsquigarrow \mathbf{d'}
\mathbf{x} = [x_1 \dots x_n]\\
\mathbf{y} = [y_1 \dots y_m]
\delta_Y(y,y') > \Delta_Y
&\text{simply compute } \mathbf{y'} = f(\mathbf{x'}, \mathbf{d'}) \\
&\text{inefficient if computing $f(.)$ is expensive, and}\\
&\text{$y, y'$ turn out to be very similar to each other}
&\text{find a new function } f'(.) \text{ that approximates } f(.) \\
&\text{return } f'(\mathbf{x'}, \mathbf{d'})
&\text{Define a distance metric }\delta_Y \text{ on }Y \\
&\text{try and estimate } \delta_Y(y,y') \text{ \emph{without explicitly computing} } y' \\
&\text{if } \delta_Y(y,y') > \Delta_Y \text{ for a set threshold } \Delta_Y \text{ then compute } f(\mathbf{x'}, \mathbf{d'})
&\text{This approach works well when: }\\
&\text{1. Distance metrics can be defined on both $X$ and $Y$: $\delta_X, \delta_Y$} \\
&\text{2. $f(.)$ is \emph{stable}:} \quad {\delta_X(\mathbf{x},\mathbf{x'}) < \epsilon_X \Rightarrow \delta_Y(\mathbf{y},\mathbf{y'}) < \epsilon_Y}
\text{let } v \in \diff{Y}(Y^t, Y^{t'}): \\
\text{for any $X$: } \impact_{P}(C,X) = \texttt{High} \text{ if }\\
v.\texttt{status:}
\begin{cases}
* \rightarrow \texttt{red} \\
\texttt{red} \rightarrow *
\end{cases}
Success criteria:
performance, but this is on a case-by-case basis
Ease of customization. The focus of this paper
The framework is a meta-process…
Changes can also occur to OS, libraries and other dependencies but these are out of scope
The black box case is illustrated here and is less interesting.
The more interesting SVI case is in the next slide
Shows Essential ProvONE fragment used by ReComp
This shows the good case of a “grey box” workflow and box-level provenance
SVI workflow with automated provenance recording
Cohort of about 100 exomes (neurological disorders)
Changes in ClinVar and OMIM GeneMap
How these two restart trees are discovered is explained in the two papers
IPAW
BDC
How about the data used to train / build the model?
Relatively easy to keep track of data pre-processing provenance
\newcommand{\f}{\textbf{a}}
\text{features}~ X=[\f_1 \ldots \f_k]
\text{new features}~ Y=[\f'_1 \ldots \f'_l]
\noindent new values for each row are obtained by applying $f$\\ to values in the $X$ features