Analytics of analytics pipelines: from optimising re-execution to general Data Provenance for Data Science
1. Paolo Missier
School of Computing
Newcastle University, UK
March 2021
Analytics of analytics pipelines:
from optimising re-execution to general
Data Provenance for Data Science
paolo.missier@ncl.ac.uk
LinkedIn: paolomissier
Twitter: @PMissier
4. 4
What changes?
Life Sciences, Health care:
- Reference databases
- Algorithms and libraries
Simulation:
- Large parameter space
- Input conditions
Machine Learning:
- Evolving ground truth datasets
- Model re-training
5. 5
Motivating example: Genomics pipelines
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
https://www.genomicsengland.co.uk/the-100000-genomes-project/
Spark GATK tools on Azure:
45 mins / GB
@ 13GB / exome: about 10 hours
6. 6
Genomics: WES / WGS, variant calling → variant interpretation
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
SVI: Simple Variant Interpretation
Variant classification: pathogenic, benign, and unknown/uncertain
7. 7
Changes that affect variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
Evolution in the number of variants that affect patients: (a) with a specific phenotype; (b) across all phenotypes
8. 8
Blind reaction to change: a game of battleship
Sparsity issue:
• About 500 executions
• 33 patients
• Total runtime about 60 hours
• Only 14 relevant output changes detected
→ 4.2 hours of computation per detected change
Should we care about updates?
Evolving knowledge about gene variations
11. 11
f(.) unstable → heuristics
Impact: “Any variant with status moving from/to Red causes High impact on any patient who is affected by the variant”
Observation: variants v within output set y that are in scope for patient X remain in scope (monotonicity)
1. Variant v changes status:
- unknown → benign
- unknown → deleterious
2. Brand new variant
If in scope:
- compare status before / after → inexpensive
- recompute SVI on all inputs → expensive
Scope: which cases are affected? “A change in variant v can only have impact on a case X if v and X share the same phenotype”
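The scope rule and the impact heuristic above can be sketched as follows (a minimal sketch: the data layout and function names are illustrative, not the actual SVI code):

```python
def in_scope(variant, case):
    # Scope rule: a change to variant v can only impact case X
    # if v and X share the same phenotype.
    return variant["phenotype"] == case["phenotype"]

def impact(old_status, new_status):
    # Impact heuristic: any status transition from or to "red"
    # is treated as High impact for affected patients.
    if old_status != new_status and "red" in (old_status, new_status):
        return "High"
    return "Low"

cases = [{"id": "X1", "phenotype": "CMT"},
         {"id": "X2", "phenotype": "ALS"}]
change = {"variant": {"phenotype": "CMT"},
          "old": "unknown", "new": "red"}

# Only cases in scope need the cheap before/after comparison;
# all other cases skip re-execution entirely.
affected = [c["id"] for c in cases if in_scope(change["variant"], c)]
```

The cheap `impact` check replaces a full SVI re-run for every case outside the change's scope, which is where the savings come from.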
13. 13
ReComp
http://recomp.org.uk/
Outcome:
A framework for selective Re-computation
• Generic, Customisable
Scope:
expensive analysis +
frequent data changes +
not all data changes significant
Challenge:
Make re-computation efficient in
response to changes
Assumptions:
Processes are
• Observable
• Reproducible
Estimates are cheap
Insight: replace re-computation with change impact estimation
Using history of past executions
14. 14
Data Provenance in ReComp
Hypothesis:
collecting detailed {provenance, logs} from past executions helps optimise future executions
1. Identify the subset of executions that are potentially affected by the changes
2. Identify and re-execute the minimal fragments of workflow that have been affected
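Identifying the affected subset of executions can be read as a simple query over the history DB: select the past executions whose provenance records a usage of any changed data item. A sketch with an in-memory history (record fields are assumptions):

```python
# Each record notes which reference-data items an execution used,
# as captured in its provenance trace.
history = [
    {"exec": "e1", "used": {"ClinVar:v1", "GeneMap:g7"}},
    {"exec": "e2", "used": {"ClinVar:v9"}},
    {"exec": "e3", "used": {"GeneMap:g7"}},
]

changed = {"ClinVar:v1"}  # output of the data diff step

# Executions that used at least one changed item are candidates
# for (partial) re-execution; the rest are provably unaffected.
affected = [h["exec"] for h in history if h["used"] & changed]
```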
15. 15
Reproducibility
How?
Selective:
- Across a cohort of past executions: which subset of individuals?
- Within a single re-execution: which process fragments?
Example change events: ClinVar, GeneMap
Why, when, to what extent?
16. 16
The ReComp meta-process
Changes (change events):
• Reference datasets
• Inputs
Detect and quantify changes: data diff(d, d')
Record execution history: analytics process P → log / provenance → History DB
For each past instance: estimate impact of changes, Impact(d, d', o), via impact estimation functions
Scope: select relevant sub-processes (optimisation)
Partially re-execute: P(D) → P'(D')
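A diff function over two snapshots of a reference database can be sketched like this (illustrative only; it mirrors the diffCV definition in the editor's notes: variants whose class changed, plus variants added or removed between versions):

```python
def diff_cv(cv_old, cv_new):
    # Snapshots are maps: variant id -> classification.
    changed = {v for v in cv_old.keys() & cv_new.keys()
               if cv_old[v] != cv_new[v]}        # class changed
    added = cv_new.keys() - cv_old.keys()        # new variants
    removed = cv_old.keys() - cv_new.keys()      # retired variants
    return changed | added | removed

old = {"v1": "unknown", "v2": "benign"}
new = {"v1": "pathogenic", "v3": "benign"}
```

The diff output feeds the impact estimation functions, so re-computation is triggered only for executions touched by these variants.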
17. 17
How much do we know about the process?
Impact estimation and re-execution depend on how much is known about the process (its structure and execution trace), from black box (less) to white box (more):
- Black box: I/O provenance only → all-or-nothing re-execution (a monolithic or legacy process, a complex simulator)
- White box: step-by-step provenance → fine-grained impact, partial restart trees (*) (workflows, R / Python code, genomics analytics: the typical process)
(*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs.
IPAW 2018. London: Springer; 2018.
18. 18
Provenance of process executions
PROV example: the activity (a process or workflow run, e.g. cooking) wasAssociatedWith an agent (the chef) and a plan (the recipe); the data product (the finished dish) wasGeneratedBy the activity. A plan plays a role in an association.
20. 20
History DB: Workflow Provenance
Each invocation of a workflow generates a provenance trace.
The workflow WF (programs B1 → B2) is the “plan”; its execution WFexec is the “plan execution”. Block executions B1exec and B2exec are partOf WFexec; each is linked to its program by an association, to its input data and reference data (the db entity) by usage, and to its outputs by generation.
21. 21
SVI implemented using workflow
SVI as a three-stage workflow:
Patient variants + Phenotype → Phenotype-to-genes (uses GeneMap) → Variant selection → Variant classification (uses ClinVar) → Classified variants
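The three stages can be sketched as composable functions (signatures and data shapes are assumptions for illustration, not the actual SVI implementation):

```python
def phenotype_to_genes(phenotype, genemap):
    # GeneMap: phenotype -> genes known to be implicated.
    return genemap.get(phenotype, [])

def select_variants(patient_variants, genes):
    # Keep only the patient's variants that fall in the genes in scope.
    return [v for v in patient_variants if v["gene"] in genes]

def classify(variants, clinvar):
    # ClinVar: variant id -> class (pathogenic / benign / unknown).
    return [dict(v, cls=clinvar.get(v["id"], "unknown")) for v in variants]

genemap = {"CMT": ["MFN2"]}
clinvar = {"rs1": "pathogenic"}
patient = [{"id": "rs1", "gene": "MFN2"}, {"id": "rs2", "gene": "BRCA1"}]

classified = classify(
    select_variants(patient, phenotype_to_genes("CMT", genemap)), clinvar)
```

Factoring the pipeline into stages like this is what makes stage-level (partial) re-execution possible when only one reference database changes.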
22. 22
SVI – partial re-execution
Overhead: caching intermediate data

Time savings:
Change in   Partial re-exec (sec)   Complete re-exec (sec)   Time saving (%)
GeneMap     325                     455                      28.5
ClinVar     287                     455                      37
Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
23. 24
ReComp: Summary
Evaluation: case-by-case basis
- Cost savings
- Ease of customisation
Generic ReComp framework:
- Observe changes, Provenance DB (History), control re-exec
Customisation:
- Diff functions, impact functions
Fine-grained provenance + control max savings
25. 26
Data → Model → Predictions
Pipeline: data collection → raw datasets → pre-processing → features / instances → model → the predicted you (ranking, score, class)
Key decisions are made during data selection and processing:
- Where does the data come from?
- What’s in the dataset?
- What transformations were applied?
Complementing current ML approaches to model interpretability:
1. Can we explain these decisions?
2. Are these explanations useful?
26. 27
Explaining data preparation
Pipeline: data collection → raw datasets → population data pre-processing → features / instances → model → the predicted you (ranking, score, class)
Pre-processing steps:
- Integration
- Cleaning
- Outlier removal
- Normalisation
- Feature selection
- Class rebalancing
- Sampling
- Stratification
- …
Data acquisition and wrangling:
- How were datasets acquired?
- How recently?
- For what purpose?
- Are they being reused / repurposed?
- What is their quality?
Implementations:
- Scripts: Python / TensorFlow, Pandas, Spark
- Workflows: Knime, …
Provenance → Transparency
28. 35
Recent early results
A small grassroots project… [1]
- Formalisation of provenance patterns for pipeline operators
- Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines
- Reality check:
  - How much does it cost? (provenance volume)
  - Does it help? (queries against the provenance database)
[1] Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507-520, January 2021.
29. 36
Operators
Data reduction
- Feature selection
- Instance selection
Data augmentation
- Space transformation
- Instance generation
- Encoding (e.g. one-hot…)
Data transformation
- Data repair
- Binarisation
- Normalisation
- Discretisation
- Imputation
Ex.: vertical augmentation adding columns
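As an illustration of a vertical-augmentation operator, here is a plain-Python one-hot encoding that also records which columns it consumed and generated (the “provlet” layout is an assumption, not the paper's actual format):

```python
def one_hot(rows, column):
    # Replace a categorical column with one indicator column per value
    # (vertical augmentation: columns are added).
    values = sorted({r[column] for r in rows})
    out = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for val in values:
            new_row[f"{column}_{val}"] = int(r[column] == val)
        out.append(new_row)
    # Fine-grained provenance of the step: what went in, what came out.
    provlet = {"operator": "one-hot encoding",
               "consumed": [column],
               "generated": [f"{column}_{v}" for v in values]}
    return out, provlet

rows, provlet = one_hot([{"colour": "red"}, {"colour": "green"}], "colour")
```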
30. 37
Code instrumentation
- Initialize provenance capture
- Create a provlet for a specific transformation
…code injection is now being automated!
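One way to read “code injection” here is a wrapper around each transformation that emits a provlet automatically. A minimal sketch (function names and provlet fields are illustrative, not the project's API):

```python
import functools

PROVLETS = []  # stand-in for the provenance store

def capture(operator):
    # Decorator that records a provlet every time the wrapped
    # transformation runs: which operator, and input/output sizes.
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(rows, *args, **kwargs):
            result = fn(rows, *args, **kwargs)
            PROVLETS.append({"operator": operator,
                             "in_rows": len(rows),
                             "out_rows": len(result)})
            return result
        return wrapper
    return deco

@capture("drop_missing")
def drop_missing(rows):
    # Instance selection: drop rows with any missing value.
    return [r for r in rows if None not in r.values()]

clean = drop_missing([{"a": 1}, {"a": None}])
```

Automating the injection then amounts to applying such a wrapper to every pipeline operator instead of editing each call site by hand.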
38. 45
Summary
Multiple hypotheses regarding Data Provenance for Data Science:
1. Is it practical to collect fine-grained provenance?
   a. To what extent can it be done automatically?
   b. How much does it cost?
2. Is it also useful? What is the benefit to data analysts?
Work in progress! Interest? Ideas?
Editor's notes
We are going to use this smaller process as a testbed
Changes in the reference databases have an impact on the classification
returns updates in mappings to genes that have changed between the two versions (including possibly new mappings):
$\diffOM(\OM^t, \OM^{t'}) = \{\langle \dt, genes'(\dt) \rangle \mid genes(\dt) \neq genes'(\dt) \}$\\
where $genes'(\dt)$ is the new mapping for $\dt$ in $\OM^{t'}$.
\begin{align*}
\diffCV&(\CV^t, \CV^{t'}) = \\
&\{ \langle v, \varst'(v) \rangle \mid \varst(v) \neq \varst'(v) \} \\
& \cup \CV^{t'} \setminus \CV^t \cup \CV^t \setminus \CV^{t'}
\label{eq:diff-cv}
\end{align*}
where $\varst'(v)$ is the new class associated with $v$ in $\CV^{t'}$.
Threats: Will any of the changes invalidate prior findings?
Opportunities: Can the findings be improved over time?
Can we do better in a generic way?
We need to control re-computation on two dimensions
Across a population
Within a single process
f: X \rightarrow Y
\mathbf{y} = f(\mathbf{x}, D)
\mathbf{d} \rightsquigarrow \mathbf{d'}
\mathbf{x} = [x_1 \dots x_n]\\
\mathbf{y} = [y_1 \dots y_m]
\delta_Y(y,y') > \Delta_Y
&\text{simply compute } \mathbf{y'} = f(\mathbf{x'}, \mathbf{d'}) \\
&\text{inefficient if computing $f(.)$ is expensive, and}\\
&\text{$y, y'$ turn out to be very similar to each other}
&\text{find a new function } f'(.) \text{ that approximates } f(.) \\
&\text{return } f'(\mathbf{x'}, \mathbf{d'})
&\text{Define a distance metric }\delta_Y \text{ on }Y \\
&\text{try and estimate } \delta_Y(y,y') \text{ \emph{without explicitly computing} } y' \\
&\text{if } \delta_Y(y,y') > \Delta_Y \text{ for a set threshold } \Delta_Y \text{ then compute } f(\mathbf{x'}, \mathbf{d'})
&\text{This approach works well when: }\\
&\text{1. Distance metrics can be defined on both $X$ and $Y$: $\delta_X, \delta_Y$} \\
&\text{2. $f(.)$ is \emph{stable}:} \quad {\delta_X(\mathbf{x},\mathbf{x'}) < \epsilon_X \Rightarrow \delta_Y(\mathbf{y},\mathbf{y'}) < \epsilon_Y}
\text{let } v \in \diff{Y}(Y^t, Y^{t'}): \\
\text{for any $X$: } \impact_{P}(C,X) = \texttt{High} \text{ if }\\
v.\texttt{status:}
\begin{cases}
* \rightarrow \texttt{red} \\
\texttt{red} \rightarrow *
\end{cases}
Success criteria:
performance, but this is on a case-by-case basis
Ease of customization. The focus of this paper
The framework is a meta-process…
Changes can also occur to OS, libraries and other dependencies but these are out of scope
The black box case is illustrated here and is less interesting.
The more interesting SVI case is in the next slide
Shows Essential ProvONE fragment used by ReComp
This shows the good case of a “grey box” workflow and box-level provenance
SVI workflow with automated provenance recording
Cohort of about 100 exomes (neurological disorders)
Changes in ClinVar and OMIM GeneMap
How these two restart trees are discovered is explained in the two papers
IPAW
BDC
How about the data used to train / build the model?
Relatively easy to keep track of data pre-processing provenance
\newcommand{\f}{\textbf{a}}
\text{features}~ X=[\f_1 \ldots \f_k]
\text{new features}~ Y=[\f'_1 \ldots \f'_l]
\noindent new values for each row are obtained by applying $f$\\ to values in the $X$ features