Selective and incremental re-computation in reaction to changes:
an exercise in metadata analytics
recomp.org.uk
Paolo Missier, Jacek Cala, Jannetta Steyn
School of Computing
Newcastle University, UK
Durham University
May 31st, 2018
Meta-*
In collaboration with
• Institute of Genetic Medicine, Newcastle University
• School of GeoSciences, Newcastle University
Data Science
[Diagram: Big Data feeds "The Big Analytics Machine", producing "Valuable Knowledge"; the machine rests on meta-knowledge: algorithms, tools, middleware, reference datasets]
Data Science over time
[Diagram: the same picture with time added: Big Data, the meta-knowledge (algorithms, tools, middleware, reference datasets), and the "Valuable Knowledge" each evolve along their own timelines, producing successive versions V1, V2, V3. Example: Life Science Analytics]
Understanding change
[Diagram: as before, with every element changing over time]
• Threats: Will any of the changes invalidate prior findings?
• Opportunities: Can the findings be improved over time?
ReComp space = expensive analysis + frequent changes + high impact
Analytics within the ReComp space:
C1: are resource-intensive and thus expensive when repeatedly executed over time, i.e., on a cloud or HPC cluster;
C2: require sophisticated implementations to run efficiently, such as workflows with a nested structure;
C3: depend on multiple reference datasets and software libraries and tools, some of which are versioned and evolve over time;
C4: apply to a possibly large population of input instances;
C5: deliver valuable knowledge.
Talk Outline
ReComp: selective re-computation to refresh outcomes in reaction to change
• Case study 1: Re-computation decisions for flood simulations
  • Learning useful estimators for the impact of change
  • Black-box computation, coarse-grained changes
• Case study 2: High-throughput genomics data processing
  • An exercise in provenance collection and analytics
  • White-box computation, fine-grained changes
• Open challenges
Case study 1: Flood modelling simulation
City Catchment Analysis Tool (CityCAT), Vassilis Glenis et al., School of Engineering, Newcastle University
Simulation characteristics:
• Part of Newcastle upon Tyne
• DTM: ≈2.3M cells, 2x2 m cell size
• Buildings and green areas from Nov 2017
• Rainfall event with a 50-year return period
• Simulation time: 60 mins
• 10–25 frames with water depth and velocity in each cell
• Output size: 23x65 MiB ≈ 1.5 GiB
[Figure: water depth heat map]
When should we repeat an expensive simulation?
[Diagram: two CityCAT flood-simulator runs, on the original and on an updated map, each turning an extreme rainfall event into a flood-diffusion time series]
Extreme weather event simulation (in Newcastle): new buildings / green areas may alter the flow.
Can we predict high-difference areas without re-running the simulation?
Running CityCAT is generally expensive:
• Processing for the Newcastle area: ≈3h on a 4-core i7 3.2GHz CPU
• A placeholder for more expensive simulations!
Map updates are infrequent (every ~6 months), but useful when simulating changes, e.g. for planning purposes.
Estimating the impact of a flood simulation
Suppose we are able to quantify:
- the difference between two inputs, M, M'
- the difference between two outputs, F, F'
Suppose also that we are only interested in large enough changes between two outputs, for some user-defined threshold parameter: condition (1), reconstructed below.
Problem statement: can we define an ideal ReComp decision function which
- operates on the two versions of the inputs, M, M', and the old output F, and
- returns true iff (1) would return true when F' is actually computed?
Can we predict when F' needs to be computed?
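The formulas on this slide were images in the original deck; the following is a plausible reconstruction, with the names diff_out, θ_O, and sim() taken from later slides rather than from this one:

```latex
% (1): only output changes larger than a user-defined threshold matter
\mathrm{diff}_{\mathrm{out}}(F, F') > \theta_O \qquad (1)

% Ideal (oracle) decision function: true iff (1) would hold
% once F' = sim(M') were actually computed
\mathrm{recomp}^{*}(M, M', F) = \top
  \iff \mathrm{diff}_{\mathrm{out}}\big(F, \mathrm{sim}(M')\big) > \theta_O
```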
Approach
1. Define input-diff and output-diff functions for the inputs M, M' and the outputs F, F'.
2. Define an impact function that estimates the output change from the input change.
3. Define the ReComp decision function, where θImp is a tunable parameter (see the reconstruction below).
ReComp approximates (1), so it is subject to errors:
- False positives: deciding to re-compute when (1) would not hold;
- False negatives: deciding not to re-compute when (1) would hold.
4. Use ground data to determine values for θImp as a function of FPR and FNR.
Note: the ReComp decision function should be much less expensive to compute than sim().
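Again a hedged reconstruction of the missing formulas, consistent with the θImp notation used on the next slides:

```latex
% Impact function: estimates the output change from the input change,
% without running sim()
\mathrm{imp}(M, M', F) \approx \mathrm{diff}_{\mathrm{out}}\big(F, \mathrm{sim}(M')\big)

% ReComp decision function, with tunable threshold theta_Imp:
\mathrm{recomp}(M, M', F) = \top \iff \mathrm{imp}(M, M', F) > \theta_{\mathrm{Imp}}

% Errors relative to the ideal decision (1):
%   FP:  recomp = true,  but  diff_out(F, F') <= theta_O
%   FN:  recomp = false, but  diff_out(F, F') >  theta_O
```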
Diff and impact functions
Land classes: B: buildings, L: other land, H: hard surface.
The input-diff function f() partitions polygon changes into 6 types (e.g. B–, L+, B– ∩ L+).
For each change type, compute the average water depth d within and around the footprint of the change; the impact function returns the max of the average water depths over all changes.
[Figure: water depth profiles around B– and L+ change footprints]
Output diff: the max of the differences between the spatially averaged F, F' over a window W.
Tuning the threshold parameter θImp
Ground data from all past re-computations, labelled as FP: <1,0> and FN: <0,1>.
Set the FNR to be close to 0; experimentally find the θImp that minimises the FPR (max specificity), as sketched below.
[Plots: Precision, Recall, Accuracy, and Specificity against θImp in the range 0.10–0.25; left: window size 20x20m, θO = 0.2m, all changes; right: window size 20x20m, θO = 0.2m, consecutive changes]
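A minimal sketch of this tuning step, assuming the ground data is available as parallel arrays of impact estimates and actual-change labels from past re-computations; the function names and the 200-point grid are illustrative, not from the deck.

```python
import numpy as np

def rates(theta, imp_scores, changed):
    """FPR and FNR of the decision rule imp > theta against ground truth."""
    pred = imp_scores > theta
    fp = np.sum(pred & ~changed)
    fn = np.sum(~pred & changed)
    tn = np.sum(~pred & ~changed)
    tp = np.sum(pred & changed)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

def tune_theta(imp_scores, changed, max_fnr=0.0):
    """Pick the theta_Imp that minimises FPR subject to FNR <= max_fnr."""
    imp_scores = np.asarray(imp_scores, dtype=float)
    changed = np.asarray(changed, dtype=bool)
    best = None
    for theta in np.linspace(imp_scores.min(), imp_scores.max(), 200):
        fpr, fnr = rates(theta, imp_scores, changed)
        if fnr <= max_fnr and (best is None or fpr < best[1]):
            best = (theta, fpr)
    return best  # (theta_Imp, achieved FPR), or None if the FNR target is unreachable
```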
Experimental results
Summary of the approach
[Diagram: the decision function takes M, M', and the old output F; True triggers computing F', False keeps the old result. Each actual re-computation adds a ground-data record <M, M', F, F'> to the historical data, which is used to tune θImp against a target FPR]
Talk Outline
ReComp: selective re-computation to refresh outcomes in reaction to change
• Case study 1: Re-computation decisions for flood simulations
  • Learning useful estimators for the impact of change
• Case study 2: High-throughput genomics data processing
  • An exercise in provenance collection and analytics
• Open challenges
Data Analytics enabled by Next-Gen Sequencing
Genomics: WES / WGS, variant calling, variant interpretation → diagnosis
- E.g. 100K Genome Project, Genomics England, GeCIP
[Diagram: submission of sequence data for archiving and analysis; data analysis using selected EBI and external software tools; data presentation and visualisation through a web interface. Per sample, a three-stage pipeline: Stage 1: align → clean → recalibrate alignments → calculate coverage; Stage 2: call variants → recalibrate variants → filter variants; Stage 3: annotate, yielding annotated variants plus coverage information]
Metagenomics: species identification
- E.g. the EBI Metagenomics portal
Whole-exome variant calling pipeline
[Pipeline tools: alignment with BWA, Bowtie, or Novoalign; de-duplication with Picard MarkDuplicates; GATK quality score recalibration; variant calling with GATK HaplotypeCaller, FreeBayes, or SamTools; variant recalibration; Annovar functional annotations (e.g. MAF, synonymy, SNPs, …) followed by in-house annotations]
Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., … DePristo, M. A. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43
Expensive
Data stats per sample:
• 4 files per sample (2-lane, paired-end reads)
• ≈15 GB of compressed text data (gz), ≈40 GB uncompressed (FASTQ)
Usually 30-40 input samples: 0.45-0.6 TB of compressed data, 1.2-1.6 TB uncompressed.
Most steps use 8-10 GB of reference data.
A small 6-sample run takes about 30h on the IGM HPC machine (Stage 1+2).
Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud. Cala, J.; Marei, E.; Yu, Y.; Takeda, K.; and Missier, P. Future Generation Computer Systems, Special Issue: Big Data in the Cloud, 2016.
SVI: Simple Variant Interpretation
Genomics: WES / WGS, variant calling, variant interpretation → diagnosis
- E.g. 100K Genome Project, Genomics England, GeCIP
[Diagram: the same three-stage pipeline as before (align / clean / recalibrate / coverage; call / recalibrate / filter variants; annotate)]
SVI filters, then classifies variants into three categories: pathogenic, benign, and unknown/uncertain.
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
Changes that affect variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
[Charts: evolution in the number of variants that affect patients, (a) with a specific phenotype and (b) across all phenotypes]
Baseline: blind re-computation
Sparsity issue:
• About 500 executions over 33 patients, with a total runtime of about 60 hours
• Only 14 relevant output changes detected: 4.2 hours of computation per change
• ≈7 minutes / patient (single-core VM)
Should we care about database updates?
Unstable
[The same pipeline as before: BWA / Bowtie / Novoalign alignment, Picard MarkDuplicates, GATK quality score recalibration, GATK HaplotypeCaller / FreeBayes / SamTools variant calling, variant recalibration, Annovar functional annotations followed by in-house annotations]
Any of these stages may change over time – semi-independently. For example:
dbSNP builds:
  150 – 2/17
  149 – 11/16
  148 – 6/16
  147 – 4/16
Human reference genome: hg19 → h37, h38, …
Van der Auwera, G. A., et al. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics. https://doi.org/10.1002/0471250953.bi1110s43
FreeBayes vs SamTools vs GATK-Haplotype Caller
GATK: McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … DePristo, M.
A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA
sequencing data. Genome Research, 20(9), 1297–303. https://doi.org/10.1101/gr.107524.110
FreeBayes: Garrison, Erik, and Gabor Marth. "Haplotype-based variant detection from short-read
sequencing." arXiv preprint arXiv:1207.3907 (2012).
GIAB: Zook, J. M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., & Salit, M. (2014).
Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype
calls. Nat Biotech, 32(3), 246–251. http://dx.doi.org/10.1038/nbt.2835
Adam Cornish and Chittibabu Guda, “A Comparison of Variant Calling Pipelines Using Genome in a
Bottle as a Reference,” BioMed Research International, vol. 2015, Article ID 456479, 11 pages, 2015.
doi:10.1155/2015/456479
Hwang, S., Kim, E., Lee, I., & Marcotte, E. M. (2015). Systematic comparison of variant calling
pipelines using gold standard personal exome variants. Scientific Reports, 5(December), 17875.
https://doi.org/10.1038/srep17875
Comparing three versions of FreeBayes
Should we care about changes in the pipeline?
• Tested three versions of the caller: 0.9.10 (Dec 2013), 1.0.2 (Dec 2015), 1.1 (Nov 2016)
• The Venn diagram shows a quantitative comparison (% and number) of filtered variants
• Phred quality score > 30
• 16 patient BAM files (7 AD, 9 FTD-ALS)
Impact on SVI classification
Patient phenotypes: 7 Alzheimer's (AD), 9 FTD-ALS.
The ONLY change in the pipeline is the version of FreeBayes used to call variants.
(R)ed – confirmed pathogenicity; (A)mber – uncertain pathogenicity.

Patient ID | Phenotype | 0.9.10 | 1.0.2 | 1.1
B_0190 | ALS-FTD | A | A | A
B_0191 | ALS-FTD | A | A | A
B_0192 | ALS-FTD | R | R | R
B_0193 | ALS-FTD | A | A | A
B_0195 | ALS-FTD | R | R | R
B_0196 | ALS-FTD | R | R | R
B_0198 | AD | R | A | A
B_0199 | ALS-FTD | R | A | A
B_0201 | AD | R | R | R
B_0202 | AD | A | A | A
B_0203 | AD | R | R | R
B_0208 | AD | R | A | A
B_0209 | AD | R | R | R
B_0211 | ALS-FTD | R | A | A
B_0213 | ALS-FTD | A | A | A
B_0214 | AD | R | R | R
Changes: frequency / impact / cost
[Chart: change frequency (Low → High) against change impact on a cohort (Low → High), positioning GATK, the variant caller, variant annotations (Annovar), the reference human genome, variant DBs (e.g. ClinVar), phenotype → disease mappings (e.g. OMIM GeneMap), and new sequences (the variant-calling "N+1 problem"), as well as variant interpretation]
Changes: frequency / impact / cost
[The same chart, with the region of frequent, high-impact changes highlighted as the ReComp space]
When is ReComp effective?
The ReComp meta-process
[Diagram: a control loop around process P. Change events, quantified by data diff(.,.) functions, are detected and measured; ReComp estimates the impact of the changes, selects and enacts re-computations, and records their execution history in a History DB, which feeds the next round of estimates]
Approach:
1. Quantify data-diff and the impact of changes on prior outcomes.
2. Collect and exploit process history metadata:
   - capture the history of past computations: process structure and dependencies, cost, and provenance of the outcomes;
   - metadata analytics: learn estimation models for impact, cost, and benefits from that history.
Changes:
• Algorithms and tools
• Accuracy of input sequences
• Reference databases (HGMD, ClinVar, OMIM GeneMap, …)
Changes, data diff, impact
1) Observed change events, affecting inputs, dependencies, or both.
2) Type-specific diff functions quantify each change.
3) Impact occurs to various degrees on multiple prior outcomes; the impact of a change C on the processing of a specific X is process- and data-specific. The definitions, reconstructed below, make this precise.
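These definitions were images in the original slides; a plausible reconstruction, where the dist(.,.) measure and the P(X, d) notation are assumptions carried over from the neighbouring slides:

```latex
% 1) A change event replaces one version of an input or a dependency:
C = \{ x \to x' \} \quad\text{or}\quad C = \{ d \to d' \}

% 2) Type-specific diff functions quantify the change:
\mathrm{diff}_X(x, x'), \qquad \mathrm{diff}_D(d, d')

% 3) Impact of change C on the processing of a specific X,
%    with Y = P(X, d) and Y' = P(X, d'):
\mathrm{imp}(C, X) = \mathrm{dist}(Y, Y')
```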
Impact
Given P (fixed), a change in one of the inputs to P, C = {x → x'}, affects a single output. However, a change in one of the dependencies, C = {d → d'}, affects all outputs yt in whose computation version d of D was used.
Impact: importance and scope
Scope: which cases are affected?
- Individual variants have an associated phenotype, and patient cases also have a phenotype:
  "a change in variant v can only have impact on a case X if v and X share the same phenotype"
Importance: "any variant whose status moves from/to Red causes high impact on any X that is affected by the variant"
Approach – a combination of techniques
1. Partial re-execution
• Identify and re-enact only the portion of a process that is affected by the change
2. Differential execution
• The input to the new execution consists of the differences between two versions of a changed dataset
• Only feasible if some algebraic properties of the process hold
3. Identifying the scope of change – loss-less
• Exclude instances of the population that are certainly not affected
Approach – a combination of techniques
1. Partial re-execution
2. Differential execution
3. Identifying the scope of change – Loss-less
Role of Workflow Provenance in partial re-run
[UML diagram of the workflow provenance model: Controller, Program, Workflow, Channel, and Port classes on the "plan" side, User and Execution on the "plan execution" side, connected by PROV relations («wasAssociatedWith», «used», «wasGeneratedBy», «wasDerivedFrom», «wasInformedBy», «hadPlan», qualified Usage / Generation / Association) and structural relations (hasSubProgram, hasInPort / hasOutPort, hadInPort / hadOutPort, connectsTo, controls / controlledBy), with multiplicities]
History DB: Workflow Provenance
Each invocation of an eSC (eScience Central) workflow generates a provenance trace.
[Provenance pattern: a workflow WF (the "plan", a Program/Workflow) contains blocks B1 and B2; its execution WFexec (the "plan execution") contains block executions B1exec and B2exec (partOf), each associated with its block; B1exec generates Data, which B2exec uses, and B2exec also uses a reference-data Entity db]
SVI as eScience Central workflow
[Workflow: inputs are the patient variants and the phenotype; the reference databases GeneMap and ClinVar feed three stages (phenotype-to-genes, variant selection, variant classification), which produce the classified variants]
1. Partial re-execution
1. Change detection: a provenance fact indicates that a new version Dnew of database db is available: wasDerivedFrom("db", Dnew)
2. Reacting to the change (example: db = "ClinVar v.x"):
2.1 Find the entry point(s) into the workflow, where db was used:
:- execution(WFexec), wasPartOf(Xexec, WFexec), used(Xexec, "db")
2.2 Discover the rest of the sub-workflow graph (execute recursively):
:- execution(WFexec), execution(B1exec), execution(B2exec), wasPartOf(B1exec, WFexec), wasPartOf(B2exec, WFexec), wasGeneratedBy(Data, B1exec), used(B2exec, Data)
[Provenance pattern: the WF / B1, B2 / WFexec / B1exec, B2exec / Data / db pattern from the previous slide]
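A minimal sketch of the traversal that the two queries above imply, assuming the provenance trace is available as in-memory (subject, relation, object) facts; the relation names mirror the PROV terms on this slide, while the function and variable names are illustrative.

```python
# Provenance facts as (subject, relation, object) triples, e.g.
#   ("B2exec", "used", "db"), ("Data", "wasGeneratedBy", "B1exec"),
#   ("B1exec", "wasPartOf", "WFexec"), ...
def affected_subworkflow(facts, wf_exec, changed_entity):
    """Block executions of wf_exec reachable from a changed entity (queries 2.1 + 2.2)."""
    used = {(s, o) for s, r, o in facts if r == "used"}
    gen = {(s, o) for s, r, o in facts if r == "wasGeneratedBy"}
    members = {s for s, r, o in facts if r == "wasPartOf" and o == wf_exec}

    # 2.1: entry points, i.e. block executions that used the changed entity
    frontier = {x for (x, e) in used if e == changed_entity and x in members}
    affected = set(frontier)

    # 2.2: recursively follow generation -> usage edges downstream
    while frontier:
        produced = {d for (d, b) in gen if b in frontier}
        frontier = {x for (x, d) in used if d in produced and x in members} - affected
        affected |= frontier
    return affected
```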
Minimal sub-graphs in SVI
[Figures: the minimal sub-workflows to re-execute after a change in ClinVar and after a change in GeneMap]
Overhead: cache the intermediate data required for partial re-execution
• 156 MB for GeneMap changes and 37 kB for ClinVar changes

Time savings | Partial re-execution (sec) | Complete re-execution (sec) | Time saving (%)
GeneMap | 325 | 455 | 28.5
ClinVar | 287 | 455 | 37
Approach – a combination of techniques
1. Partial re-execution
2. Differential execution
3. Identifying the scope of change – Loss-less
Diff functions: example
[Figure: the diff between ClinVar 1/2016 and ClinVar 1/2017; most records are unchanged]
Compute difference sets – ClinVar
The ClinVar dataset: 30 columns.
Changes: records 349,074 → 543,841 (added 200,746, removed 5,979, updated 27,662).
For tabular data, difference is just Select-Project
Key columns: {"#AlleleID", "Assembly", "Chromosome"}
"Where" columns: {"ClinicalSignificance"}
A sketch of this diff follows.
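A minimal pandas sketch of this Select-Project diff, keyed on the three columns above and reporting a record as updated when its ClinicalSignificance changed; the function name and the CSV loading in the usage comment are illustrative.

```python
import pandas as pd

KEY = ["#AlleleID", "Assembly", "Chromosome"]
WHERE = ["ClinicalSignificance"]

def table_diff(old: pd.DataFrame, new: pd.DataFrame):
    """Project to key + 'where' columns, then diff the two versions on the key."""
    old_p = old[KEY + WHERE].set_index(KEY)
    new_p = new[KEY + WHERE].set_index(KEY)
    added = new_p.loc[~new_p.index.isin(old_p.index)]
    removed = old_p.loc[~old_p.index.isin(new_p.index)]
    # Records present in both versions whose 'where' columns changed
    both = old_p.join(new_p, how="inner", lsuffix="_old", rsuffix="_new")
    changed = (both[[c + "_old" for c in WHERE]].values
               != both[[c + "_new" for c in WHERE]].values).any(axis=1)
    return added, removed, both[changed]

# e.g.: added, removed, updated = table_diff(
#     pd.read_csv("clinvar_2016.tsv", sep="\t"),
#     pd.read_csv("clinvar_2017.tsv", sep="\t"))
```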
Differential execution
[Figure: the ClinVar 1/2016 vs 1/2017 diff again: the new execution consumes the difference sets, while the (unchanged) portion is not reprocessed]
Differential execution
Suppose D is a relation (a table). Then diffD() can be expressed in terms of the sets of added, removed, and updated records between the two versions, and the new output can be computed as the combination of the old output with P applied to those difference sets. This is effective if the difference sets are much smaller than the new version of D, and it can be achieved as sketched below, provided P is distributive w.r.t. set union and difference.
Cf. F. McSherry, D. Murray, R. Isaacs, and M. Isard, “Differential dataflow,” in Proceedings of CIDR 2013, 2013.
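The formulas on this slide were images; the following is a plausible reconstruction from the surrounding text, with D+, D-, Du (added, removed, and updated records) as assumed names:

```latex
% Difference sets between versions D and D' of a relation:
\mathrm{diff}_D(D, D') = \langle D^{+}, D^{-}, D^{u} \rangle,
\qquad D' = \big( D \setminus (D^{-} \cup D^{u}_{\mathrm{old}}) \big)
            \cup D^{+} \cup D^{u}_{\mathrm{new}}

% If P distributes over set union and difference, the new output combines
% the old output with P applied to the (small) difference sets:
P(D') = \big( P(D) \setminus P(D^{-} \cup D^{u}_{\mathrm{old}}) \big)
        \cup P(D^{+} \cup D^{u}_{\mathrm{new}})

% Effective when:  |D^{+}| + |D^{-}| + |D^{u}| \ll |D'|
```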
Partial re-computation using input difference
Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2))
Works for SVI, but hard to generalise: it depends on the type of process.
Bigger gain: diff(CV1, CV2) is much smaller than CV2:

GeneMap versions (from → to) | To-version record count | Difference record count | Reduction
16-03-08 → 16-06-07 | 15910 | 1458 | 91%
16-03-08 → 16-04-28 | 15871 | 1386 | 91%
16-04-28 → 16-06-01 | 15897 | 78 | 99.5%
16-06-01 → 16-06-02 | 15897 | 2 | 99.99%
16-06-02 → 16-06-07 | 15910 | 33 | 99.8%

ClinVar versions (from → to) | To-version record count | Difference record count | Reduction
15-02 → 16-05 | 290815 | 38216 | 87%
15-02 → 16-02 | 285042 | 35550 | 88%
16-02 → 16-05 | 290815 | 3322 | 98.9%
Approach – a combination of techniques
1. Partial re-execution
2. Differential execution
3. Identifying the scope of change – Loss-less
3: precisely identify the scope of a change
[Figure: patient / DB-version impact matrix]
Strong scope: determined from fine-grained provenance.
Weak scope: "if CVi was used in the processing of pj then pj is in scope" (coarse-grained provenance; next slide).
Semantic scope: determined by domain-specific scoping rules.
A weak scoping algorithm
Candidate invocation (from coarse-grained provenance): any invocation I of P whose provenance contains statements of the form:
used(A, "db"), wasPartOf(A, I), wasAssociatedWith(I, _, WF)
Sketch of the algorithm (a runnable rendering follows):
- For each candidate invocation I of P:
  - partially re-execute using the difference sets as inputs (see previous slides)
  - find the minimal subgraph P' of P that needs re-computation (see above)
  - repeat: execute P' one step at a time
    until <empty output> or <P' completed>
  - if <P' completed> and not <empty output>, then execute P' on the full inputs
[Provenance pattern: the same WF / B1, B2 / WFexec / B1exec, B2exec / Data / db pattern]
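A minimal Python rendering of the sketch above, under stated assumptions: the candidate list, the minimal sub-workflow P' (modelled as an ordered list of step functions), and the per-invocation inputs are all supplied by the caller; none of these names come from the original deck.

```python
def weak_scope_recompute(candidates, subgraph_steps, diff_inputs, full_inputs):
    """candidates: invocation ids matching the provenance pattern above;
    subgraph_steps: ordered steps of the minimal sub-workflow P', each a
      function mapping a (possibly empty) record set to another;
    diff_inputs / full_inputs: per-invocation difference sets and full inputs."""
    in_scope = []
    for inv in candidates:
        data = diff_inputs[inv]
        for step in subgraph_steps:        # execute P' one step at a time on the diffs
            data = step(data)
            if not data:                   # <empty output>: the change is absorbed here
                break
        else:                              # <P' completed> with a non-empty output:
            in_scope.append(inv)           # the change reaches inv's output,
            data = full_inputs[inv]        # so re-execute P' on the full inputs
            for step in subgraph_steps:
                data = step(data)
    return in_scope
```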
Scoping: precision
• The approach avoids the majority of re-computations given a ClinVar change
• Reduction in number of complete re-executions from 495 down to 71
Summary of ReComp challenges
[Diagram: the ReComp loop again (change events, data diff(.,.) functions, History DB, process P, observed executions), annotated with the open challenges:]
• Diff functions are both type- and application-specific
• Learning useful estimators is hard: sensitivity analysis is unlikely to work well, as small input perturbations can have a potentially large impact on a diagnosis
• Not all runtime environments support provenance recording (specific → generic)
• Reproducibility: virtualisation
Come to our workshop during Provenance Week!
https://sites.google.com/view/incremental-recomp-workshop
July 12th (pm) and 13th (am), King’s College London
http://provenanceweek2018.org/
Questions?
http://recomp.org.uk/
Meta-*
The Metadata Analytics challenge:
Learning from a metadata DB of execution history to
support automated ReComp decisions
History Database
HDB: a metadata database containing records of past executions.
Example: consider only one type of change, the variant caller.
[Execution records: a matrix of inputs X1..X5 against caller versions GATK (HaplotypeCaller), FreeBayes 0.9, FreeBayes 1.0, and FreeBayes 1.1, with changes C1, C2, C3 between successive versions. Outputs Y11..Y51 exist for all five inputs under the first version; after later changes only some outputs are refreshed (Y12, Y52, then Y43, Y53)]
Impact (again)
Given P (fixed), a change in one of the inputs to P, C = {x → x'}, affects a single output, while a change in one of the dependencies, C = {d → d'}, affects all outputs yt in whose computation version d of D was used.
ReComp decisions
Given a population X of prior inputs and a change C, ReComp makes a yes/no decision for each X: it returns True if P is to be executed again on X, and False otherwise.
To decide, ReComp must estimate the impact of the change on each prior outcome (as well as estimate the re-computation cost).
Two possible approaches
1. Direct estimation of the impact function: the problem is to learn such a function for specific P, C, and data types Y.
2. Learning a surrogate (emulator) for P which is simpler to compute and provides a useful approximation, where a stochastic term ε accounts for the error in approximating f. Learning requires a training set {(xi, yi)}. If such a surrogate can be found, then we can hope to use it to approximate the impact of a change (see the reconstruction below).
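The formulas were images in the deck; a plausible reconstruction consistent with the slide text, where the dist and imp notation is carried over from the earlier impact slides:

```latex
% Surrogate (emulator) \hat{f} for P, learned from a training set {(x_i, y_i)}:
\hat{f}(x) = f(x) + \varepsilon

% If \hat{f} can be found, use it to approximate the impact of C = {d -> d'}:
\widehat{\mathrm{imp}}(C, X) = \mathrm{dist}\big(\hat{f}(X, d), \hat{f}(X, d')\big)
\quad\text{such that}\quad \widehat{\mathrm{imp}}(C, X) \approx \mathrm{imp}(C, X)
```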
History DB and Differences DB
Whenever P is re-computed on input X, a new execution record er' is added to the HDB for X. Using diff() functions, we produce a derived difference record dr, collected in a Differences database (DDB):
dr1 = imp(C1, Y11)
dr2 = imp(C12, Y41)
dr3 = imp(C1, Y51)
dr4 = imp(C2, Y52)
[The HDB matrix as before: X1..X5 against GATK (HaplotypeCaller) and FreeBayes 0.9 / 1.0 / 1.1 under changes C1, C2, C3; each re-computation (Y12, Y52, Y43, Y53) contributes a difference record. X4 skipped one version, so dr2 refers to the composed change C12]
ReComp algorithm
[Diagram: given a change C and the population X, ReComp consults the evidence E: HDB + DDB, emits re-computation decisions, and updates the evidence to E': HDB' + DDB']
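A minimal sketch of that loop under stated assumptions: `estimate_impact` and `run_process` are hypothetical stand-ins for the learned impact model and the workflow engine, and the evidence layout is illustrative.

```python
def recomp(change, population, estimate_impact, run_process, evidence, theta_imp):
    """One round of ReComp decisions over the prior-input population X.

    estimate_impact(change, x, evidence) -> float  # model learned from HDB + DDB
    run_process(x, change) -> new outcome          # actually re-executes P
    evidence: {"HDB": [...], "DDB": [...]}         # execution / difference records
    """
    decisions = {x: estimate_impact(change, x, evidence) > theta_imp
                 for x in population}
    for x, rerun in decisions.items():
        if rerun:
            y_new = run_process(x, change)
            evidence["HDB"].append((x, change, y_new))   # extend HDB -> HDB'
            evidence["DDB"].append((x, change, y_new))   # derived diff record -> DDB'
    return decisions, evidence
```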
Learning challenges
• Evidence is small and sparse: how can it be used for selecting from X?
• Learning a reliable imp() function is not feasible. What's the use of history? You never see the same change twice!
• We must somehow use evidence from related changes.
• A possible approach:
  • ReComp makes probabilistic decisions, takes chances
  • Associate a reward to each ReComp decision → reinforcement learning
  • Bayesian inference (use new evidence to update probabilities)
[The DDB difference records and HDB matrix from the previous slides]
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 

Último (20)

Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 

Selective and incremental re-computation in reaction to changes: an exercise in metadata analytics

  Δ_F(F,F') > θ_O   (1)
for some user-defined parameter θ_O.
Problem statement: can we define an ideal ReComp decision function which
- operates on two versions of the inputs, M, M', and the old output F, and
- returns True iff (1) would hold when F' is actually computed?
In other words, can we predict when F' needs to be computed?
  • 9. Approach
1. Define input diff and output diff functions: Δ_M(M,M') and Δ_F(F,F')
2. Define an impact function: imp(M,F,M') = f(Δ_M(M,M'), F)
3. Define the ReComp decision function:
   ReComp(M' | M,F) = True if imp(M,F,M') > θ, False otherwise
   where θ is a tunable parameter. ReComp approximates (1), so it is subject to errors:
   False positives: imp(M,F,M') > θ but Δ_F(F,F') < θ_O
   False negatives: imp(M,F,M') < θ but Δ_F(F,F') > θ_O
4. Use ground data to determine values for θ as a function of FPR and FNR
Note: the ReComp function should be much less expensive to compute than sim().
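The decision function and its error accounting are simple enough to sketch in code. The following is a minimal, illustrative Python sketch, not the ReComp implementation: imp() stands for the domain-specific impact function defined on the next slide, and history is assumed to hold ground data ⟨M, M', F, F', Δ_F⟩ from past re-computations.

    def recomp(M, M_new, F, imp, theta):
        # True iff the estimated impact of the map change exceeds the threshold
        return imp(M, M_new, F) > theta

    def error_rates(history, imp, theta, theta_O):
        # history: iterable of (M, M_new, F, F_new, delta_F) from past re-computations
        fp = fn = tp = tn = 0
        for M, M_new, F, F_new, delta_F in history:
            predicted = recomp(M, M_new, F, imp, theta)
            actual = delta_F > theta_O            # condition (1) on ground data
            if predicted and not actual:   fp += 1
            elif not predicted and actual: fn += 1
            elif predicted and actual:     tp += 1
            else:                          tn += 1
        fpr = fp / max(fp + tn, 1)                # false positive rate
        fnr = fn / max(fn + tp, 1)                # false negative rate
        return fpr, fnr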
  • 10. Diff and impact functions
B: buildings, L: other land, H: hard surface
Δ_M(M,M') = {B+, B−, L+, L−}, where B− = B_old \ B_new, B+ = B_new \ B_old (and similarly for L)
f() partitions polygon changes into 6 types (see the sketch below):
   (B→L) = B− ∩ L+   (B→H) = B− \ L+
   (L→B) = L− ∩ B+   (L→H) = L− \ B+
   (H→B) = B+ \ L−   (H→L) = L+ \ B−
For each type, compute the average water depth within and around the footprint of the change; imp() returns the max of the average water depth over all changes.
Δ_F(F,F'): max of the differences between spatially averaged F, F' over a window W
[Figure: water-depth maps illustrating B−, L+ and B− ∩ L+ changes]
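The six-way partition of map changes is plain set algebra. A small sketch, assuming buildings and green areas are represented as sets of polygon IDs and that anything which is neither B nor L counts as hard surface H:

    def partition_changes(B_old, B_new, L_old, L_new):
        # B-, B+, L-, L+ as defined on the slide
        Bm, Bp = B_old - B_new, B_new - B_old
        Lm, Lp = L_old - L_new, L_new - L_old
        return {
            "B->L": Bm & Lp, "B->H": Bm - Lp,
            "L->B": Lm & Bp, "L->H": Lm - Bp,
            "H->B": Bp - Lm, "H->L": Lp - Bm,
        }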
  • 11. Tuning the threshold parameter θ_Imp
Ground data from all past re-computations:
   ⟨M_i, M_j, F_i, F_j⟩ → ⟨Δ_F(F_i,F_j), imp(M_i, M_j, F_j)⟩, with FP encoded as ⟨1,0⟩ and FN as ⟨0,1⟩
Set FNR to be close to 0; experimentally find the θ_Imp that minimises FPR (max specificity).
[Plots: precision, recall, accuracy and specificity vs θ_Imp, for window size 20x20 m and θ_O = 0.2 m; one panel for all changes, one for consecutive changes]
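The tuning step can then reuse error_rates() from the earlier sketch: sweep candidate thresholds and keep the one that minimises FPR subject to a (near-)zero FNR. Again a hypothetical sketch, not the actual tuning code:

    def tune_theta(history, imp, theta_O, candidate_thetas, max_fnr=0.0):
        best = None
        for theta in candidate_thetas:
            fpr, fnr = error_rates(history, imp, theta, theta_O)
            # keep the theta with lowest FPR among those meeting the FNR target
            if fnr <= max_fnr and (best is None or fpr < best[1]):
                best = (theta, fpr)
        return best   # (theta, fpr), or None if no threshold meets the target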
  • 13. Summary of the approach
[Diagram: historical data ⟨M, M', F, F'⟩ provides ground data to tune θ against a target FPR; given a new M' with prior M and F, the tuned decision function returns True (re-run the simulation to obtain F') or False (keep F)]
  • 14. Talk Outline
ReComp: selective re-computation to refresh outcomes in reaction to change
• Case study 1: re-computation decisions for flood simulations (learning useful estimators for the impact of change)
• Case study 2: high-throughput genomics data processing (an exercise in provenance collection and analytics)
• Open challenges
  • 15. Data analytics enabled by next-gen sequencing
Genomics: WES/WGS, variant calling, variant interpretation → diagnosis
- e.g. 100K Genome Project, Genomics England, GeCIP
Metagenomics: species identification
- e.g. the EBI Metagenomics portal: submission of sequence data for archiving and analysis; data analysis using selected EBI and external software tools; data presentation and visualisation through a web interface
[Diagram: per-sample pipeline. Stage 1: align, clean, recalibrate alignments, calculate coverage; Stage 2: call variants, recalibrate variants, filter variants; Stage 3: annotate. Outputs: coverage information and annotated variants]
  • 16. Whole-exome variant calling pipeline
Alignment: BWA, Bowtie, Novoalign
Picard: MarkDuplicates
GATK quality score recalibration
Variant calling: GATK HaplotypeCaller, FreeBayes, SamTools
Variant recalibration
Annovar functional annotations (e.g. MAF, synonymity, SNPs…) followed by in-house annotations
Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., … DePristo, M. A. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics. John Wiley & Sons, Inc. https://doi.org/10.1002/0471250953.bi1110s43
  • 17. Expensive
Data stats per sample: 4 files per sample (2-lane, pair-end reads); ≈15 GB of compressed text data (gz); ≈40 GB uncompressed text data (FASTQ)
Usually 30–40 input samples: 0.45–0.6 TB compressed, 1.2–1.6 TB uncompressed
Most steps use 8–10 GB of reference data
A small 6-sample run takes about 30 h on the IGM HPC machine (Stages 1+2)
Cala, J.; Marei, E.; Yu, Y.; Takeda, K.; and Missier, P. Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud. Future Generation Computer Systems, Special Issue: Big Data in the Cloud, 2016
  • 18. SVI: Simple Variant Interpretation
Genomics: WES/WGS, variant calling, variant interpretation → diagnosis
- e.g. 100K Genome Project, Genomics England, GeCIP
SVI filters and then classifies variants into three categories: pathogenic, benign and unknown/uncertain.
Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer
  • 19. Changes that affect variant interpretation
What changes:
- improved sequencing / variant calling
- ClinVar and OMIM evolve rapidly
- new reference data sources
[Charts: evolution in the number of variants that affect patients, (a) with a specific phenotype and (b) across all phenotypes]
  • 20. Baseline: blind re-computation
Should we care about database updates? The sparsity issue:
• about 500 executions over 33 patients, total runtime about 60 hours
• only 14 relevant output changes detected: 4.2 hours of computation per change, against ≈7 minutes per patient for a single SVI run (single-core VM)
  • 21. Unstable
Any of the pipeline stages may change over time, semi-independently:
- alignment: BWA, Bowtie, Novoalign; Picard MarkDuplicates
- variant callers: GATK HaplotypeCaller, FreeBayes, SamTools; GATK quality score recalibration; variant recalibration
- Annovar functional annotations (e.g. MAF, synonymity, SNPs…) followed by in-house annotations
- dbSNP builds: 147 (4/16), 148 (6/16), 149 (11/16), 150 (2/17)
- human reference genome: hg19 → h37, h38, …
(Van der Auwera et al. 2013, cited above)
  • 22. FreeBayes vs SamTools vs GATK HaplotypeCaller
GATK: McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–303. https://doi.org/10.1101/gr.107524.110
FreeBayes: Garrison, E., and Marth, G. (2012). Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907
GIAB: Zook, J. M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., & Salit, M. (2014). Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotech, 32(3), 246–251. http://dx.doi.org/10.1038/nbt.2835
Cornish, A., and Guda, C. (2015). A Comparison of Variant Calling Pipelines Using Genome in a Bottle as a Reference. BioMed Research International, vol. 2015, Article ID 456479, 11 pages. doi:10.1155/2015/456479
Hwang, S., Kim, E., Lee, I., & Marcotte, E. M. (2015). Systematic comparison of variant calling pipelines using gold standard personal exome variants. Scientific Reports, 5, 17875. https://doi.org/10.1038/srep17875
  • 23. Comparing three versions of FreeBayes
Should we care about changes in the pipeline?
• Tested three versions of the caller: 0.9.10 (Dec 2013), 1.0.2 (Dec 2015), 1.1 (Nov 2016)
• [Venn diagram: quantitative comparison (% and number) of filtered variants]
• Phred quality score > 30
• 16 patient BAM files (7 AD, 9 FTD-ALS)
  • 24. Impact on SVI classification
Patient phenotypes: 7 Alzheimer's (AD), 9 FTD-ALS. The ONLY change in the pipeline is the version of FreeBayes used to call variants.
(R)ed: confirmed pathogenicity; (A)mber: uncertain pathogenicity

Patient ID   FreeBayes 0.9.10   FreeBayes 1.0.2   FreeBayes 1.1   Phenotype
B_0190       A                  A                 A               ALS-FTD
B_0191       A                  A                 A               ALS-FTD
B_0192       R                  R                 R               ALS-FTD
B_0193       A                  A                 A               ALS-FTD
B_0195       R                  R                 R               ALS-FTD
B_0196       R                  R                 R               ALS-FTD
B_0198       R                  A                 A               AD
B_0199       R                  A                 A               ALS-FTD
B_0201       R                  R                 R               AD
B_0202       A                  A                 A               AD
B_0203       R                  R                 R               AD
B_0208       R                  A                 A               AD
B_0209       R                  R                 R               AD
B_0211       R                  A                 A               ALS-FTD
B_0213       A                  A                 A               ALS-FTD
B_0214       R                  R                 R               AD

In four cases (B_0198, B_0199, B_0208, B_0211) the change of caller version changes the classification.
  • 25. Changes: frequency / impact / cost
[Chart: change frequency (high to low) vs change impact on a cohort (low to high), plotting GATK, variant callers, variant annotations (Annovar), the reference human genome, variant DBs (e.g. ClinVar), phenotype-to-disease mappings (e.g. OMIM GeneMap) and new sequences (the "N+1 problem"), grouped into variant calling and variant interpretation]
  • 26. Changes: frequency / impact / cost
[Same chart, with the ReComp space highlighted]
  • 28. The ReComp meta-process
Observe executions of process P; detect and measure changes (change events, data diff(.,.) functions); estimate the impact of changes; select and enact re-computations; record execution history in the History DB.
Approach:
1. Quantify data-diff and impact of changes on prior outcomes
2. Collect and exploit process history metadata
   - capture the history of past computations: process structure and dependencies, cost, provenance of the outcomes
   - metadata analytics: learn estimation models for impact, cost and benefits from history
Changes: algorithms and tools; accuracy of input sequences; reference databases (HGMD, ClinVar, OMIM GeneMap…)
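At a high level, the meta-process is a loop over change events. A schematic Python sketch with illustrative names only (diff, estimate_impact, cost_estimate, worth_recomputing, enact and the history_db interface are placeholders, not an actual API):

    def recomp_meta_process(history_db, change_feed):
        for change in change_feed:                        # detect changes
            delta = diff(change.old_version, change.new_version)   # measure them
            for run in history_db.runs_using(change.resource):
                impact_est = estimate_impact(delta, run)  # learned from history
                if worth_recomputing(impact_est, cost_estimate(run)):
                    outcome = enact(run.process, run.inputs, change.new_version)
                    history_db.record(run, outcome)       # grow the History DB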
  • 29. Changes, data diff, impact
1) Observed change events (inputs, dependencies, or both): C = { D^t → D^{t'}, X^t → X^{t'} }
2) Type-specific diff functions: diff_X(X^t, X^{t'}), diff_Y(Y^t, Y^{t'}), diff_D(D^t, D^{t'})
3) Impact occurs to various degrees on multiple prior outcomes. Impact of change C on the processing of a specific X:
   impact_P(C, y) = f_Y(diff_Y(y^t, y^{t'})), where y^{t'} = exec(P, x^{t'}, d^{t'})
Impact is process- and data-specific, e.g.: imp_SVI(C, X) = f_SVI(diff_Y(Y^t, Y^{t'})) ∈ {None, Low, High}
  • 30. Impact
Given P (fixed), a change in one of the inputs to P, C = {x → x'}, affects a single output:
   impact_P({x → x'}, y) = f_Y(diff_Y(y^t, exec(P, x^{t'}, d^t)))
However, a change in one of the dependencies, C = {d → d'}, affects all outputs y^t where version d of D was used:
   impact_P({d → d'}, y) = f_Y(diff_Y(y^t, exec(P, x^t, d^{t'})))
  • 31. Impact: importance and scope
Scope: which cases are affected?
- Individual variants have an associated phenotype; patient cases also have a phenotype.
- "A change in variant v can only have impact on a case X if v and X share the same phenotype."
Importance: "Any variant with status moving from/to Red causes High impact on any X that is affected by the variant."
  • 32. Approach: a combination of techniques
1. Partial re-execution: identify and re-enact only those portions of a process that are affected by the change
2. Differential execution: the input to the new execution consists of the differences between two versions of a changed dataset; only feasible if some algebraic properties of the process hold
3. Identifying the scope of change (loss-less): exclude instances of the population that are certainly not affected
  • 33. Approach: a combination of techniques (1. Partial re-execution; 2. Differential execution; 3. Identifying the scope of change, loss-less)
  • 34. Role of workflow provenance in partial re-run
[UML class diagram of the workflow provenance model: plan-side classes Program, Workflow, Channel, Port, Controller; execution-side classes User, Execution, Association, Usage, Generation; relations include wasPartOf, wasDerivedFrom, hasSubProgram, hadPlan, controlledBy/controls, hasInPort/hasOutPort, connectsTo, wasInformedBy, wasGeneratedBy, used, wasAssociatedWith, hasDefaultParam]
  • 35. History DB: workflow provenance
Each invocation of an eSC (eScience Central) workflow generates a provenance trace.
[Diagram: the "plan" (Workflow WF with blocks B1, B2: Program/Workflow) and a "plan execution" (WFexec with B1exec, B2exec: Execution), linked by association, partOf, usage and generation edges; Data and the reference database db are Entities]
  • 36. SVI as an eScience Central workflow
[Workflow diagram: inputs Phenotype and Patient variants; blocks Phenotype-to-genes, Variant selection, Variant classification; reference data GeneMap and ClinVar; output Classified variants]
  • 37. 1. Partial re-execution
1. Change detection: a provenance fact indicates that a new version Dnew of database db is available, e.g. db = "ClinVar v.x":
   wasDerivedFrom("db", Dnew)
2. Reacting to the change:
2.1 Find the entry point(s) into the workflow, where db was used:
   :- execution(WFexec), wasPartOf(Xexec, WFexec), used(Xexec, "db")
2.2 Discover the rest of the sub-workflow graph (execute recursively), using the plan / plan-execution provenance pattern:
   :- execution(WFexec), execution(B1exec), execution(B2exec), wasPartOf(B1exec, WFexec), wasPartOf(B2exec, WFexec), wasGeneratedBy(Data, B1exec), used(B2exec, Data)
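The recursive discovery in step 2.2 amounts to a forward traversal of the provenance graph from the entry points. A self-contained sketch over PROV-style fact triples (this fact encoding is an assumption for illustration; in ReComp the equivalent queries run against the recorded provenance DB):

    def downstream_executions(prov, db):
        # prov: set of (predicate, subject, object) facts, e.g.
        #   ("used", "B2exec", "Data"), ("wasGeneratedBy", "Data", "B1exec")
        used = {(a, e) for (p, a, e) in prov if p == "used"}
        gen  = {(e, a) for (p, e, a) in prov if p == "wasGeneratedBy"}
        frontier = {a for (a, e) in used if e == db}      # entry points (2.1)
        affected = set(frontier)
        while frontier:                                   # recursive step (2.2)
            produced = {e for (e, a) in gen if a in frontier}
            frontier = {a for (a, e) in used if e in produced} - affected
            affected |= frontier
        return affected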
  • 38. Minimal sub-graphs in SVI
[Diagrams: minimal sub-workflows to re-run for a change in ClinVar and for a change in GeneMap]
Overhead: cache intermediate data required for partial re-execution (156 MB for GeneMap changes, 37 kB for ClinVar changes)
Time savings:
            Partial re-execution (sec)   Complete re-execution (sec)   Time saving (%)
GeneMap     325                          455                           28.5
ClinVar     287                          455                           37
  • 39. Approach: a combination of techniques (1. Partial re-execution; 2. Differential execution; 3. Identifying the scope of change, loss-less)
  • 41. Compute difference sets: ClinVar
The ClinVar dataset: 30 columns.
Changes between versions: records 349,074 → 543,841; added 200,746; removed 5,979; updated 27,662
  • 42. For tabular data, difference is just Select-Project
Key columns: {"#AlleleID", "Assembly", "Chromosome"}
"where" columns: {"ClinicalSignificance"}
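With declared key and "where" columns, the diff of two table versions reduces to a few indexed set operations. A sketch using pandas, assuming the two ClinVar versions are loaded as DataFrames with the column names above:

    import pandas as pd

    KEY = ["#AlleleID", "Assembly", "Chromosome"]
    WHERE = ["ClinicalSignificance"]

    def table_diff(old: pd.DataFrame, new: pd.DataFrame):
        old_i, new_i = old.set_index(KEY), new.set_index(KEY)
        added   = new_i.loc[new_i.index.difference(old_i.index)]
        removed = old_i.loc[old_i.index.difference(new_i.index)]
        common  = old_i.index.intersection(new_i.index)
        # a shared record counts as updated only if a "where" column changed
        mask = (old_i.loc[common, WHERE] != new_i.loc[common, WHERE]).any(axis=1)
        updated = new_i.loc[common[mask]]
        return added, removed, updated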
  • 44. Differential execution
Suppose D is a relation (a table); diff_D(D^t, D^{t'}) can be expressed as difference sets, δ^+ (added/updated records) and δ^- (removed records).
Having computed y^t = exec(P, x, D^t), we compute the new output as the combination of y^t with y^{t'}_+ = exec(P, x, δ^+) (and, symmetrically, the effect of δ^-).
This is effective if the difference sets δ^- ∪ δ^+ are much smaller than D^{t'}, and it can be achieved provided P is distributive wrt set union and difference.
Cf. F. McSherry, D. Murray, R. Isaacs, and M. Isard, "Differential dataflow," in Proceedings of CIDR 2013, 2013.
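A toy illustration of why distributivity matters, with P reduced to a simple selection (selections distribute over set union and difference, so the differential result equals full re-execution):

    def P(records):
        # stand-in for the expensive step: select pathogenic records
        return {r for r in records if r[1] == "Pathogenic"}

    D_old   = {("v1", "Benign"), ("v2", "Pathogenic")}
    delta_p = {("v3", "Pathogenic")}     # added records
    delta_m = set()                      # removed records

    y_old = P(D_old)
    y_new = (y_old - P(delta_m)) | P(delta_p)          # differential update
    assert y_new == P((D_old - delta_m) | delta_p)     # same as full re-run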
  • 45. Partial re-computation using input difference
Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2))
Works for SVI, but hard to generalise: depends on the type of process.
Bigger gain: diff(CV1, CV2) is much smaller than CV2.

GeneMap versions (from → to)   ToVersion record count   Difference record count   Reduction
16-03-08 → 16-06-07            15910                    1458                      91%
16-03-08 → 16-04-28            15871                    1386                      91%
16-04-28 → 16-06-01            15897                    78                        99.5%
16-06-01 → 16-06-02            15897                    2                         99.99%
16-06-02 → 16-06-07            15910                    33                        99.8%

ClinVar versions (from → to)   ToVersion record count   Difference record count   Reduction
15-02 → 16-05                  290815                   38216                     87%
15-02 → 16-02                  285042                   35550                     88%
16-02 → 16-05                  290815                   3322                      98.9%
  • 46. Approach: a combination of techniques (1. Partial re-execution; 2. Differential execution; 3. Identifying the scope of change, loss-less)
  • 47. 3: Precisely identify the scope of a change
[Patient / DB-version impact matrix]
Strong scope (fine-grained provenance): v ∈ (δ^- ∪ δ^+) ∧ used(p_j, v) ⇒ p_j in scope
Weak scope (coarse-grained provenance, next slide): "if CVi was used in the processing of pj then pj is in scope"
Semantic scope (domain-specific scoping rules): v.phenotype == p_j.phenotype ⇒ p_j in scope
  • 48. A weak scoping algorithm (coarse-grained provenance)
Candidate invocation: any invocation I of P whose provenance contains statements of the form:
   used(A, "db"), wasPartOf(A, I), wasAssociatedWith(I, _, WF)
Sketch of the algorithm. For each candidate invocation I of P:
- partially re-execute using the difference sets as inputs (see previous slides)
- find the minimal subgraph P' of P that needs re-computation (see above)
- repeat: execute P' one step at a time, until <empty output> or <P' completed>
- if <P' completed> and not <empty output>, then execute P' on the full inputs
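A sketch of that loop in code, with minimal_subgraph(), step.run() and rerun() as hypothetical helpers standing in for the machinery of the previous slides:

    def weak_scope_recompute(candidate_invocations, deltas, full_inputs):
        in_scope = []
        for inv in candidate_invocations:
            P_prime = minimal_subgraph(inv)   # affected sub-workflow (see item 38)
            data, completed = deltas, True
            for step in P_prime:              # execute one step at a time
                data = step.run(data)
                if not data:                  # empty output: change filtered out
                    completed = False
                    break
            if completed and data:            # change propagates to the output
                in_scope.append(inv)
                rerun(inv, full_inputs[inv])  # full re-execution is warranted
        return in_scope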
  • 49. Scoping: precision
• The approach avoids the majority of re-computations given a ClinVar change
• Reduction in the number of complete re-executions: from 495 down to 71
  • 50. Summary of ReComp challenges
- Reproducibility: virtualisation
- Sensitivity analysis is unlikely to work well: small input perturbations → potentially large impact on diagnosis
- Learning useful estimators is hard
- Diff functions are both type- and application-specific (specific → generic)
- Not all runtime environments support provenance recording
[Diagram: change events, data diff(.,.) functions, process P, observed executions, History DB]
  • 51. Come to our workshop during Provenance Week!
https://sites.google.com/view/incremental-recomp-workshop
July 12th (pm) and 13th (am), King's College London
http://provenanceweek2018.org/
  • 53. The Metadata Analytics challenge: learning from a metadata DB of execution history to support automated ReComp decisions
  • 54. History Database
HDB: a metadata database containing records of past executions:
   er = ⟨P, X^t, D^t, Y^t, c^t, T⟩, HDB = {er_1, er_2, …, er_N}
Example: consider only one type of change, the variant caller.
[Diagram: inputs X1…X5 run under successive caller versions (GATK HaplotypeCaller, FreeBayes 0.9, 1.0, 1.1), with changes C1, C2, C3 between versions; the resulting outputs Y11…Y53 are recorded in HDB]
  • 55. Impact (again)
Given P (fixed), a change in one of the inputs to P, C = {x → x'}, affects a single output:
   impact_P({x → x'}, y) = f_Y(diff_Y(y^t, exec(P, x^{t'}, d^t)))
while a change in one of the dependencies, C = {d → d'}, affects all outputs y^t where version d of D was used:
   impact_P({d → d'}, y) = f_Y(diff_Y(y^t, exec(P, x^t, d^{t'})))
  • 56. ReComp decisions
Given a population 𝒳 of prior inputs and a change C = {D^t → D^{t'}, X^t → X^{t'}}, ReComp makes a yes/no decision for each X ∈ 𝒳:
   recomp_P(C, y) returns True if P is to be executed again on X, and False otherwise
To decide, ReComp must estimate the impact, împ_P(C, y), as well as the re-computation cost, ĉost(C, y).
Example: recomp_SVI(C, X) = True if împ_SVI(C, y) ≠ None, False otherwise
  • 57. Two possible approaches
1. A direct estimator of the impact function, împ_P(C, y). The problem here is to learn such a function for specific P, C, and data types Y.
2. Learning an emulator (surrogate) for P which is simpler to compute and provides a useful approximation: writing y = P(x) = f(x), find f̂ such that f̂(x) = f(x) + ε, where ε is a stochastic term that accounts for the error in approximating f. Learning requires a training set {(x_i, y_i)}.
If f̂ can be found, then for a change x → x' we can hope to use it to approximate diff_Y(f(x), f(x')) by diff_Y(f̂(x), f̂(x')).
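Approach 2 is essentially regression over the History DB. A hypothetical sketch with scikit-learn, where encode() and summarise() are assumed feature and output encodings, not part of ReComp:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    X_train = np.array([encode(x) for x in past_inputs])       # hypothetical encoding
    y_train = np.array([summarise(y) for y in past_outputs])   # hypothetical summary

    f_hat = RandomForestRegressor(n_estimators=200).fit(X_train, y_train)

    def estimated_output_diff(x_old, x_new):
        # diff_Y(f(x), f(x')) approximated through the surrogate f_hat
        pred = f_hat.predict(np.array([encode(x_old), encode(x_new)]))
        return abs(pred[1] - pred[0])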
  • 58. History DB and Differences DB
Whenever P is re-computed on input X, a new execution record er' = ⟨X^{t'}, D^{t'}, Y^{t'}, c^{t'}⟩ is added to HDB for X.
Using diff() functions we produce a derived difference record
   dr = ⟨diff_X(X^t, X^{t'}), diff_D(D^t, D^{t'}), diff_Y(Y^t, Y^{t'}), imp(C, X)⟩
collected in a Differences database, DDB = {dr_1, dr_2, …, dr_M}
e.g. dr1 = imp(C1, Y11), dr2 = imp(C12, Y41), dr3 = imp(C1, Y51), dr4 = imp(C2, Y52)
  • 60. Learning challenges
• Evidence is small and sparse: how can it be used for selecting from 𝒳?
• Learning a reliable imp() function is not feasible
• What's the use of history? You never see the same change twice! We must somehow use evidence from related changes.
• A possible approach:
  • ReComp makes probabilistic decisions, takes chances
  • associate a reward to each ReComp decision → reinforcement learning
  • Bayesian inference (use new evidence to update probabilities)
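The editor's notes below include pseudocode for this evidence-driven loop. A Python transliteration with placeholder helpers (select, exec_P, imp and update_evidence mirror the names used in the notes and are assumptions, not an implemented API):

    def recomp_loop(evidence, population, change):
        while True:
            dv = select(evidence, change)     # binary decision vector over population
            if not any(dv):
                break                         # nothing left worth re-computing
            for x, decided in zip(population, dv):
                if decided:
                    y_new = exec_P(x, change)                 # re-compute
                    evidence = update_evidence(evidence, imp(x.old_output, y_new))
        return evidence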

Editor's notes

  1. $M \rightarrow \mathit{sim}(M) = F$
  2. $M \rightarrow \mathit{sim}(M) = F$; $\Delta_M(M,M')$; $\Delta_F(F,F')$; $\Delta_F(F,F') > \theta_O$
  3. $\mathit{imp}(M,F,M') = f(\Delta_M(M,M'), F)$; $\mathit{ReComp}(M' \mid M,F) = \begin{cases} \text{True} & \text{if } \mathit{imp}(M,F,M') > \theta \\ \text{False} & \text{otherwise} \end{cases}$; errors: $\Delta_F(F,F') > \theta_O \wedge \mathit{imp}(M,F,M') < \theta$ (FN), $\Delta_F(F,F') < \theta_O \wedge \mathit{imp}(M,F,M') > \theta$ (FP)
  4. $\Delta_M(M,M') = \{B^+, B^-, L^+, L^-\}$ with $B^- = B_{\mathrm{old}} \setminus B_{\mathrm{new}}$, $B^+ = B_{\mathrm{new}} \setminus B_{\mathrm{old}}$, $L^- = L_{\mathrm{old}} \setminus L_{\mathrm{new}}$, $L^+ = L_{\mathrm{new}} \setminus L_{\mathrm{old}}$; change types: $(B \rightarrow L) = B^- \cap L^+$, $(B \rightarrow H) = B^- \setminus L^+$, $(L \rightarrow B) = L^- \cap B^+$, $(L \rightarrow H) = L^- \setminus B^+$, $(H \rightarrow B) = B^+ \setminus L^-$, $(H \rightarrow L) = L^+ \setminus B^-$; $\Delta_F(F,F')$
  5. $\langle M_i, M_j, F_i, F_j \rangle \rightarrow \langle \Delta_F(F_i,F_j), \mathit{imp}(M_i, M_j, F_j) \rangle$; FP/FN encoded as $\langle 1,0 \rangle$ / $\langle 0,1 \rangle$
  6. Genomics is a form of data-intensive / computation-intensive analysis
  7. Each sample included 2-lane, pair-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
  8. Changes in the reference databases have an impact on the classification
  9. Returns updates in mappings to genes that have changed between the two versions (including possibly new mappings): $\mathit{diff}_{OM}(OM^t, OM^{t'}) = \{ \langle dt, \mathit{genes}(dt) \rangle \mid \mathit{genes}(dt) \neq \mathit{genes}'(dt) \}$, where $\mathit{genes}'(dt)$ is the new mapping for disease term $dt$ in $OM^{t'}$. $\mathit{diff}_{CV}(CV^t, CV^{t'}) = \{ \langle v, \mathit{varst}(v) \rangle \mid \mathit{varst}(v) \neq \mathit{varst}'(v) \} \cup (CV^{t'} \setminus CV^t) \cup (CV^t \setminus CV^{t'})$, where $\mathit{varst}'(v)$ is the new class associated with $v$ in $CV^{t'}$.
  10. Point of slide: sparsity of impact demands better than blind recomp. We recorded four types of outcomes: (i) confirming the current diagnosis, which happens when additional variants are added to the Red class; (ii) retracting the diagnosis, which may happen (rarely) when all red variants are retracted; (iii) changes in the amber class which do not alter the diagnosis; and (iv) no change at all. The table reports results from nearly 500 executions, concerning a cohort of 33 patients, for a total runtime of about 58.7 hours. As merely 14 relevant output changes were detected, this is about 4.2 hours of computation per change: a steep cost, considering that an actual execution of SVI takes a little over 7 minutes.
  11. Our recommendation is the BWA-MEM and SamTools pipeline for SNP calls and the BWA-MEM and GATK-HC pipeline for indel calls.
  12. In four cases a change in the caller version changes the classification.
  13. Changes can be frequent or rare, disruptive or marginal
  14. Changes can be frequent or rare, disruptive or marginal
  15. How to make computational experiments reusable, all or in part, through a combination of data and code sharing and re-purposing (reusable Research Objects) and virtualisation mechanisms
  16. $\mathit{diff}_X(X^t, X^{t'})$, $\mathit{diff}_Y(Y^t, Y^{t'})$, $\mathit{diff}_D(D^t, D^{t'})$; $C = \{ D^t \rightarrow D^{t'},\ X^t \rightarrow X^{t'} \}$; $y^{t'} = \mathit{exec}(P, x^{t'}, d^{t'})$; $\mathit{impact}_P(C, y) = f_Y(\mathit{diff}_Y(y^t, y^{t'}))$; $\mathit{imp}_{SVI}(C, X) = f_{SVI}(\mathit{diff}_Y(Y^t, Y^{t'})) \in \{\text{None}, \text{Low}, \text{High}\}$
  17. $\mathit{impact}_P(\{x^t \rightarrow x^{t'}\}, y) = f_Y(\mathit{diff}_Y(y^t, \mathit{exec}(P, x^{t'}, d^t)))$; $\mathit{impact}_P(\{d^t \rightarrow d^{t'}\}, y) = f_Y(\mathit{diff}_Y(y^t, \mathit{exec}(P, x^t, d^{t'})))$
  18. Let $v \in \mathit{diff}_Y(Y^t, Y^{t'})$: for any $X$, $\mathit{impact}_P(C, X) = \text{High}$ if $v.\mathrm{status}$ moves $* \rightarrow \mathrm{red}$ or $\mathrm{red} \rightarrow *$.
  19. Firstly, if we can analyse the structure and semantics of process P, to recompute an instance of P more effectively we may be able to reduce re-computation to only those parts of the process that are actually involved in the processing of the changed data. For this, we are inspired by techniques for smart rerun of workflow-based applications [6, 7], as well as by more general approaches to incremental computation [8, 9].
  20. Same as note 19.
  21. Same as note 19.
  22. Experimental setup for our study of ReComp techniques: SVI workflow with automated provenance recording; cohort of about 100 exomes (neurological disorders); changes in ClinVar and OMIM GeneMap.
  23. Same as note 19.
  24. $y^t = \mathit{exec}(P, x, D^t)$; $y^{t'}_+ = \mathit{exec}(P, x, \delta^+)$
  25. This is only a small selection of rows and a subset of columns. In total there were 30 columns, 349,074 rows in the old set and 543,841 rows in the new set, with 200,746 added rows, 5,979 removed rows and 27,662 changed rows. As on the previous slide, you may want to highlight that the selection of key columns and "where" columns is very important. For example, using #AlleleID, Assembly and Chromosome as the key columns, entry #AlleleID 15091 looks very similar in both the added (green) and removed (red) sets; the rows differ, however, in the Chromosome column. Considering the "where" columns, using only ClinicalSignificance returns blue rows which differ between versions only in that column. Changes in other columns (e.g. LastEvaluated) are not reported, which may have ramifications if such a difference is used to produce the new output.
  26. Same as note 25.
  27. $y^t = \mathit{exec}(P, x, D^t)$; $y^{t'}_+ = \mathit{exec}(P, x, \delta^+)$; $\delta^- \cup \delta^+$
  28. Also, as in Tab. 2 and 3 in the paper, mention whether this reduction was possible with a generic diff function or with a function tailored to SVI. What is also interesting, and worth highlighting, is that even if the reduction is very close to (but below) 100%, the cost of re-computing the process may still be significant because of constant-time overheads related to running a process (e.g. loading data into memory). e-SC workflows suffer from exactly this issue (every block serializes and deserializes data), which is why Fig. 6 shows an increase in runtime for GeneMap executed with two δs even though the reduction is 99.94% (cf. Tab. 2 and Fig. 6 for the GeneMap diff between 16-10-30 → 16-10-31).
  29. Same as note 19.
  30. $v \in (\delta^- \cup \delta^+) \wedge \mathit{used}(p_j, v) \Rightarrow p_j \text{ in scope}$; $v.\mathit{phenotype} = p_j.\mathit{phenotype} \Rightarrow p_j \text{ in scope}$
  31. Regarding the algorithm: the slide shows the simplified version (Alg. 1). Please also take a look at Alg. 2 and mention that the loop can only run if distributivity holds for all P in the downstream graph. Otherwise, you need to break and re-execute on the full inputs as soon as the first non-distributive task produces a non-empty output. The hope, obviously, is that with a well-tailored diff function the output will be empty in the majority of cases.
  32. $er = \langle P, X^t, D^t, Y^t, c^t, T \rangle$; $\mathit{HDB} = \{ er_1, er_2, \dots, er_N \}$; ${\cal X} = \{ er.X \mid er \in \mathit{HDB} \}$
  33. $\mathit{impact}_P(\{x^t \rightarrow x^{t'}\}, y) = f_Y(\mathit{diff}_Y(y^t, \mathit{exec}(P, x^{t'}, d^t)))$; $\mathit{impact}_P(\{d^t \rightarrow d^{t'}\}, y) = f_Y(\mathit{diff}_Y(y^t, \mathit{exec}(P, x^t, d^{t'})))$
  34. $C = \{ D^t \rightarrow D^{t'},\ X^t \rightarrow X^{t'} \}$; $X \in {\cal X}$; $\widehat{\mathit{imp}}_P(C,y)$; $\widehat{\mathit{cost}}(C,y)$; $\mathit{recomp}_P(C,y)$; $\mathit{impact}_P(C,y)$; $\langle X^t, D^t, Y^t, c^t \rangle$, $\langle X^{t'}, D^{t'}, Y^{t'}, c^{t'} \rangle$; $\mathit{recomp}_{SVI}(C,X) = \begin{cases} \text{True} & \text{if } \widehat{\mathit{imp}}_{SVI}(C,y) \neq \text{None} \\ \text{False} & \text{otherwise} \end{cases}$
  35. $\langle y_i^t, c_i^t \rangle = \mathit{exec}(P, x_i^t, \{ d_1^t \dots d_m^t \})$; $y = P(x)$; $\widehat{f}(x) = f(x) + \epsilon$; for a change $x \rightarrow x'$, approximate $\mathit{diff}_Y(f(x), f(x'))$ by $\mathit{diff}_Y(\widehat{f}(x), \widehat{f}(x'))$
  36. $er = \langle X^t, D^t, Y^t, c^t \rangle$, $er' = \langle X^{t'}, D^{t'}, Y^{t'}, c^{t'} \rangle$; $dr = \langle \mathit{diff}_X(X^t, X^{t'}), \mathit{diff}_D(D^t, D^{t'}), \mathit{diff}_Y(Y^t, Y^{t'}), \mathit{imp}(C, X) \rangle$; $\mathit{DDB} = \{ dr_1, dr_2, \dots, dr_M \}$
  37. \begin{algorithm}[H]
  \KwData{Evidence $E = \{\mathit{HDB}, \mathit{DDB}\}$, population ${\cal X}$, change $C$}
  \KwResult{Updated outcomes for a subset ${\cal X}' \subseteq {\cal X}$, updated evidence}
  $\mathit{dv} = \mathbf{1}$\;
  \While{$\mathit{dv} \neq \mathbf{0}$}{
    $\mathit{dv} = \mathit{select}(E, C)$ \tcc{binary decision vector of size $|{\cal X}|$}
    $[Y_i^{t'}]_{i:1 \dots k} = \mathit{execAll}(\mathit{dv}, {\cal X})$ \tcc{re-compute all $k$ selected $X \in {\cal X}$}
    $I = [\mathit{imp}(Y_i^t, Y_i^{t'})]_{i:1 \dots k}$ \tcc{calculate impact from the new outcomes}
    $E = \mathit{updateEvidence}(E, I)$ \tcc{update evidence, adding new impact}
  }
  \end{algorithm}