Center for Doctoral Training – Newcastle
Seminar Series – Nov. 2015, P. Missier
ReComp: preserving the value of big data insights over time
Panta Rhei (Heraclitus, through Plato)
Paolo Missier
Paolo.Missier@ncl.ac.uk
November, 2015
Cloud CDT seminar series
Newcastle
(*) Painting by Johannes Moreelse
Generating analytical knowledge
[Diagram: Specification → Deployment → Execution → KA (KA = Knowledge Asset), with dependencies on algorithms, libs, packages; the system; external state (DBs); input data; config]
Example: machine learning with Python and scikit-learn.
• Specification: learn a model to recognise an activity pattern
• Deployment: Python 3, Ubuntu x.y.z, on an Azure VM
• Execution: model training
• KA: the trained model
• Dependencies: scikit-learn, NumPy, Pandas, Ubuntu on Azure
• Input: training + testing dataset, config
Generating analytical knowledge
[Diagram: Specification → Deployment → Execution → KA (KA = Knowledge Asset)]
Example: workflow to identify mutations in a patient's genome.
• Specification: workflow specification (analyse the input genome, produce variants)
• Deployment: WF manager on a Linux VM cluster on Azure
• Dependencies: GATK/Picard/BWA, the workflow manager (and its own dependencies), Ubuntu on Azure
• Input: patient genome, config, reference genome, variant DBs
Rate of change
What changes, and how frequently?
[Diagram: resources placed on a spectrum from long-lived/slow-changing to short-lived/fast-changing]
• Input data, e.g. historical time series data (long-lived)
• External DBs, e.g. reference DBs (slow-changing)
• Data streams, e.g. the current Twitter graph (short-lived, fast-changing)
How fast does knowledge advance?
• Life Sciences knowledge:
• Genes (GenBank, Ensembl), Proteins, SNPs, Human Variants DBs (ClinVar)
• Life Sciences ontologies (GO, HPO,…)
• The human genome assembly
• The collection of all PubMed articles
• DBPedia, Wikipedia, etc.
• All current {Twitter, FB, G+, …} users and their connections
• A map of all buildings in a city, with their location and footprint
• The Hubble Atlas of Ancient Galaxies
• The catalogue of all known Exoplanets (about 2000)
How do we know which changes are relevant?
What analytics?
Genomics
• Diagnosis of rare genetic diseases
• Analyse soil, water composition (metagenomics)
Social media analytics, e.g. Twitter content analysis
• Sentiment analysis
• Topic discovery
• Emergency response
• Fostering new communities
Climate modelling
• Predicting local climate changes
• Ecology: understanding change by monitoring local species
Environment risk assessment
• Flood modelling and simulation
Case study: NGS data processing pipeline (Genomics)
Pipeline stages:
• Stage 1 (per sample): raw sequences → align → clean → recalibrate alignments → calculate coverage → coverage information
• Stage 2 (all samples jointly): call variants → recalibrate variants → filter variants
• Stage 3 (per sample): annotate → annotated variants
Step details:
• Align: aligns the sample sequence to the HG19 reference genome using the BWA aligner
• Clean: cleaning and duplicate elimination (Picard tools)
• Recalibrate alignments: corrects for systematic bias in the quality scores assigned by the sequencer (GATK)
• Calculate coverage: computes the coverage of each read
• Call variants: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels
• Recalibrate variants: attempts to reduce the false-positive rate from the caller
• Filter variants: VCF subsetting by filtering, e.g. non-exomic variants
• Annotate: Annovar functional annotations (e.g. MAF, synonymy, SNPs…) followed by in-house annotations
Case study: metagenomics
From environment to DNA sequence:
[Diagram: Sample → size fractioning → {DNA extraction | mRNA extraction | PCR} → Sequencing → Analysis?; the three extraction routes yield, respectively, a metagenome, a metatranscriptome, and an amplicon]
Case study: flood modelling in Newcastle
CityCAT (City Catchment Analysis Tool)
A unique software tool for modelling, analysis and
visualisation of surface water flooding
• High resolution flood model
• Integrates hydraulic modelling algorithms
• Subsurface flow modelling
• Topography (DEMs from LIDAR)
• Physical structures (buildings etc.)
• Landuse data
• Outputs high resolution grid of flood depths
• Extensively tested
• Multi-platform
• Integrated into CONDOR and Microsoft Azure
What kind of changes affect these analytics tasks?

Application | Knowledge | Algorithms and tools
LS Diagnosis of rare genetic diseases | PubMed; Human Variants DBs; the human genome assembly; SNP DBs | Numerous algorithms and tools used for sequence alignment, cleaning, variant calling…
LS Metagenomics | Collections of known DNA sequences for multiple species | Same as for genomics
SM Sentiment analysis | Past predictive models | Content-analysis NLP tools; statistical model learning (classification)
SM Topic discovery | | Clustering algorithms
SM Emergency response | | Content-analysis NLP tools; predictive models; topical trend analysis
SM Fostering new communities | | Hubs & authorities algorithms; clustering
CS Predicting local climate changes | Historical and current time series at multiple resolutions; past and current models | Statistical model learning
CS Ecology: understanding change by monitoring local species | Local species count & behaviour observations | Statistical model learning
CE Flood modelling and simulation | Local topography; location of buildings | Simulation packages (e.g. CityCAT)
Volume: how many data products are affected?

Application | Volume
LS Diagnosis of rare genetic diseases | 100K Genomes Project in the UK alone; thousands of samples in Newcastle alone
LS Metagenomics | A few thousand (EBI Metagenomics portal)
SM Sentiment analysis | The number of users whose sentiment is being analysed
SM Topic discovery | A few clusters, containing a large number of tweets
SM Emergency response | A few key decisions
SM Fostering new communities | A few key users
CS Predicting local climate changes | Local effects
CS Ecology: understanding change by monitoring local species | Local effects
CE Flood modelling and simulation | Local effects
How fast do these products become obsolete?
[Chart: each application placed on a timescale running from minutes to hours, days, months, and years: LS diagnosis of rare genetic diseases; LS metagenomics; SM sentiment analysis; SM topic discovery; SM emergency response; SM fostering new communities; CS predicting local climate changes; CS ecology (monitoring local species); CE flood modelling and simulation]
How sensitive are data products to change?
[Chart: the same nine applications, from LS diagnosis of rare genetic diseases to CE flood modelling and simulation, compared by their sensitivity to change]
How much do they cost?
Note: cost per product vs cost over all products.
Cost components:
- Design
- Development
- System
- Runtime
[Chart: the same nine applications, from LS diagnosis of rare genetic diseases to CE flood modelling and simulation, compared by cost]
Case study: NGS data processing pipeline (Genomics)
Pipeline stages:
• Stage 1 (per sample): raw sequences → align → clean → recalibrate alignments → calculate coverage → coverage information
• Stage 2 (all samples jointly): call variants → recalibrate variants → filter variants
• Stage 3 (per sample): annotate → annotated variants
Step details:
• Align: aligns the sample sequence to the HG19 reference genome using the BWA aligner
• Clean: cleaning and duplicate elimination (Picard tools)
• Recalibrate alignments: corrects for systematic bias in the quality scores assigned by the sequencer (GATK)
• Calculate coverage: computes the coverage of each read
• Call variants: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels
• Recalibrate variants: attempts to reduce the false-positive rate from the caller
• Filter variants: VCF subsetting by filtering, e.g. non-exomic variants
• Annotate: Annovar functional annotations (e.g. MAF, synonymy, SNPs…) followed by in-house annotations
Workflow Deployment on the Azure Cloud
[Architecture diagram:
• e-Science Central main server (Azure VM): Web UI and REST API, JMS queue
• Clients: web browser and rich client app, issuing workflow invocations
• Storage: Azure Blob store, e-SC db backend, e-SC blob store (e-SC control data, workflow data)
• Workflow engines deployed as worker roles; module configuration: 3 nodes, 24 cores]
Azure workflow engines: D13 VMs with an 8-core CPU, 56 GiB of memory and a 400 GB SSD, running Ubuntu 14.04.
Cost
[Chart: cost in GBP (0 to 18) vs number of samples (0 to 24), for three configurations: 3 engines (24 cores), 6 engines (48 cores), 12 engines (96 cores)]
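A back-of-the-envelope model of a cost curve like the one above: samples are processed in parallel waves across the engines, and every VM is billed for the wall-clock duration. The hourly price and per-sample runtime below are hypothetical placeholders, not the measured figures behind the chart.

```python
import math

def run_cost(samples, engines, hours_per_sample=1.0, price_per_vm_hour=0.5):
    """Cost in GBP: samples run in parallel waves, all VMs billed per hour."""
    waves = math.ceil(samples / engines)   # sequential batches of parallel work
    hours = waves * hours_per_sample       # wall-clock hours, all VMs billed
    return engines * hours * price_per_vm_hour

for engines in (3, 6, 12):
    print(engines, "engines:", run_cost(samples=24, engines=engines), "GBP")
```

Under this idealised model, adding engines shortens wall-clock time but leaves cost unchanged; the measured curves differ between configurations because of startup and idle-time overheads that the sketch ignores.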
Changes in reference knowledge (ClinVar DB)
Case study: the Metagenomics portal at the EBI
From environment to DNA sequence:
[Diagram: Sample → size fractioning → {DNA extraction | mRNA extraction | PCR} → Sequencing → Analysis?, yielding a metagenome, a metatranscriptome, or an amplicon]
EBI metagenomics portal:
• An open resource for the archiving and analysis of metagenomics and metatranscriptomics data
• A generic, yet standardised, analysis platform for all metagenomics studies
• Offers a service that small groups would struggle to achieve on their own
Portal functions: submission of sequence data for archiving and analysis; data analysis using selected EBI and external software tools; data presentation and visualisation through a web interface.
[Figures: visualisation examples and a pipeline overview; marine datasets: the portal contains over 30 marine metagenomes, comprising millions of sequences]
Case study: flood modelling in Newcastle
CityCAT (City Catchment Analysis Tool)
A unique software tool for modelling, analysis and
visualisation of surface water flooding
• High resolution flood model
• Integrates hydraulic modelling algorithms
• Subsurface flow modelling
• Topography (DEMs from LIDAR)
• Physical structures (buildings etc.)
• Landuse data
• Outputs high resolution grid of flood depths
• Extensively tested
• Multi-platform
• Integrated into CONDOR and Microsoft Azure
Fusing UO data and modelling
[Diagram: the CityCAT flood model combined with traffic data and weather data]
The ReComp project
Aims: to create a decision support system for
1. detecting changes that affect time-sensitive analytical knowledge,
2. assessing its reprocessing options, and
3. estimating their cost.
[Diagram: the ReComp DSS takes change events, utility functions, priority rules, and previously computed KAs with their metadata, and produces prioritised KAs, cost estimates, and a reproducibility assessment]
Funded by the EPSRC (Making Sense from Data), Feb. 2016 to Feb. 2019, with 2 Research Associates.
In collaboration with:
- Newcastle Civil Engineering (Phil James)
- Department of Clinical Neurosciences, Cambridge University (Prof. Patrick Chinnery)
ReComp: target operating region
[Charts: volume (low to high) vs rate of change (slow to fast), and volume vs cost (low to high), with the ReComp target region marked]
Recomputation analysis: abstraction
[Diagram: knowledge assets KA1…KA5 laid out over time (t1, t2, t3), each depending on a subset of assets {a, b, c, d} (e.g. KA1 depends on a, b, and c). Change events a → a', b → b', c → c' arrive over time; each event touches only the KAs that depend on the changed asset]
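The dependency abstraction on this slide can be sketched in a few lines: an index from each KA to the assets it depends on, scanned when a change event arrives. The dependency sets below are hypothetical examples, not the exact assignments in the figure.

```python
# Illustrative dependency index: each Knowledge Asset (KA) records the data
# assets it depends on. The sets below are hypothetical examples.
deps = {
    "KA1": {"a", "b", "c"},
    "KA2": {"a", "b"},
    "KA3": {"d"},
    "KA4": {"a", "b", "c", "d"},
    "KA5": {"a", "c"},
}

def affected(changed_asset, deps):
    """KAs whose dependency set contains the changed asset."""
    return {ka for ka, d in deps.items() if changed_asset in d}

# A change event d -> d' only touches the KAs that depend on d.
print(sorted(affected("d", deps)))   # → ['KA3', 'KA4']
```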
Recomputation analysis: conceptual steps
Assume we have a growing universe KA of Knowledge Assets.
Each ka ∈ KA has dependencies dep(ka) on other assets in a set DA (input data, algorithms, libs…).
ReComp analysis steps:
Monitor and detect relevant change events {da_i → da_i'} with da_i ∈ DA.
For each change event {da_i → da_i'}:
• Identify the candidate recomputation population karec ⊆ KA:
• ka ∈ KA such that da_i ∈ dep(ka)
• For each ka ∈ karec:
• Estimate the effect of recomputing ka using da_i' instead of da_i
• Quantitative estimation of the impact due to the change da_i → da_i'
• Determine the time and cost associated with recomputing ka
• Use these estimates, along with utility functions, to rank karec
• Carry out the top-k recomputations given a budget: ka → ka'
• Perform post-hoc analysis to improve the estimation models:
• Compare actual effects with estimates
• Differential data analysis: Δ(ka, ka')
• Change cause analysis: has any other element contributed to Δ(ka, ka')?
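The ranking and budgeted-selection steps above can be sketched as follows. The utility function, impact scores, and cost figures are hypothetical placeholders, not ReComp's actual models.

```python
# Illustrative sketch of the ReComp prioritisation step: rank candidate KAs
# by utility(impact, cost), then greedily recompute top-ranked KAs within a
# budget. All numbers and the utility function are made up for illustration.
def rank_candidates(candidates, impact, cost, utility):
    """Score each candidate KA by utility(estimated impact, estimated cost)."""
    scored = [(ka, utility(impact[ka], cost[ka])) for ka in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def select_within_budget(ranked, cost, budget):
    """Greedily take top-ranked KAs while their cumulative cost fits the budget."""
    chosen, spent = [], 0.0
    for ka, _score in ranked:
        if spent + cost[ka] <= budget:
            chosen.append(ka)
            spent += cost[ka]
    return chosen

impact = {"KA1": 0.9, "KA2": 0.2, "KA4": 0.7}   # estimated effect of the change
cost   = {"KA1": 5.0, "KA2": 1.0, "KA4": 4.0}   # estimated recomputation cost
ranked = rank_candidates(impact, impact, cost,
                         utility=lambda i, c: i / c)  # impact per unit cost
print(select_within_budget(ranked, cost, budget=6.0))  # → ['KA2', 'KA1']
```

In practice the post-hoc analysis step would feed actual impacts and costs back into the estimators; here they are fixed inputs.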
Recomputation analysis through sampling
[Diagram: change events are monitored; ReComp identifies recomputation candidates, assesses the effects of change and the reproducibility cost via small-scale sampling recomputations, estimates recomputation cost, then prioritises (using utility functions and a budget) before committing to large-scale recomputation. Meta-K, the metadata store, feeds the estimation steps]
Recomputation analysis through modelling
[Diagram: change events feed a change impact model and a cost model, which drive the estimation of change impact and of reproducibility cost/effort; candidates are prioritised (utility, budget) over the target population before large-scale recomputation, and model updates flow back from actual runs]
Change impact model: Δ(x, x') → Δ(y, y'). This is challenging!
Can we do better?
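A minimal, hypothetical illustration of what a change impact model Δ(x, x') → Δ(y, y') could look like in the simplest 1-D case: observe (input delta, output delta) pairs from past recomputations and fit a least-squares line to predict the output change caused by a new input change. The data points are invented; a real impact model would be far richer.

```python
# Toy change-impact model: predict output delta from input delta by fitting
# a 1-D least-squares line over (made-up) observations from past recomputations.
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical (input delta, output delta) pairs observed in past recomputations.
dx = [0.1, 0.2, 0.4, 0.8]
dy = [0.05, 0.11, 0.19, 0.42]
slope, intercept = fit_line(dx, dy)

# Estimated impact of a new change with input delta 0.5, without recomputing.
predicted_dy = slope * 0.5 + intercept
```

The point of such a model is exactly what the slide suggests: deciding whether a recomputation is worthwhile without paying for it first.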
Metadata + Analytics
The knowledge is in the metadata!
Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items which describe the details of past computations.
[Diagram: as before, change events drive the estimation of change impact and of reproducibility cost/effort ahead of large-scale recomputation; here the models are fed from Meta-K, a store of logs, provenance, and dependencies]
High-level architecture
[Architecture diagram:
• ReComp decision dashboard: select/prioritise, execute, curate
• Meta-Knowledge Repository, storing Research Objects: provenance, logs, data and process versions, process dependencies (WP1)
• Analysis components: Change Impact Analysis, Cost Estimation, Differential Analysis, Reproducibility Assessment
• Domain knowledge: utility functions, priority policies, data similarity functions
• Per-environment instrumentation (Python and other analytics environments): runtime monitor, logging, runtime provenance recorder
• Prospective provenance curation (YesWorkflow)]
Project objectives
Obj 1.
To investigate analytics techniques aimed at supporting re-computation decisions
Obj 2.
To research techniques for assessing under what conditions it is practically feasible
to re-compute an analytical process.
• Specific target system environments:
• Python / Jupyter
• The eScience Central workflow manager (developed at Newcastle)
Obj 3.
To create a decision support system for the selective recomputation of complex
data-centric analytical processes and demonstrate its viability on two target case
studies
• Genomics (human variant analysis)
• Urban Observatory (flood modelling)
Expected outcomes
Research Outcomes:
Algorithms that operate on metadata to perform:
• impact analysis
• cost estimation
• differential data and change cause analysis of past and new knowledge
outcomes
• estimation of reproducibility effort
System Outcomes:
• A software framework consisting of domain-independent, reusable components,
which implement the metadata infrastructure and the research outcomes
• A user-facing decision support dashboard.
It must be possible to integrate the framework with domain-specific components, to
support specific scenarios, exemplified by our case studies.
Challenge 1: estimating impact and cost
[Diagram: the change impact model and the cost model drive the estimation of change impact and of reproducibility cost/effort, prioritisation over the target population, and large-scale recomputation, with model updates fed back]
Change impact model: Δ(x, x') → Δ(y, y'). This is challenging!
Challenge 2: managing the metadata
How do we generate / capture / store / index / query across multiple metadata
types and formats?
Relevant Metadata:
• Logs of past executions, automatically collected;
• Provenance traces:
• Runtime (“retrospective”) provenance
• Automatically collected data dependency graph captured from the
computation
• Process structure (“prospective provenance”)
• obtained by manually annotating a script
• External data and system dependencies, process and data versions, and system
requirements
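Retrospective provenance of this kind lends itself directly to the recomputation question: below, a toy data-dependency graph in the style of PROV's wasDerivedFrom relation, walked transitively to find every artifact affected by a changed input. The artifact names are made up for illustration.

```python
# Hypothetical retrospective provenance as "wasDerivedFrom" edges, with a
# transitive walk to find every output that depends on a changed input.
derived_from = {            # output -> the data items it was derived from
    "annotated_variants": {"filtered_vcf", "variant_db"},
    "filtered_vcf": {"raw_vcf"},
    "raw_vcf": {"aligned_reads", "ref_genome"},
    "aligned_reads": {"raw_sequences", "ref_genome"},
}

def downstream(changed, edges):
    """All artifacts transitively derived from the changed item."""
    hit = set()
    progress = True
    while progress:
        progress = False
        for out, ins in edges.items():
            if out not in hit and (ins & ({changed} | hit)):
                hit.add(out)
                progress = True
    return hit

print(sorted(downstream("ref_genome", derived_from)))
```

A change to the reference genome reaches every downstream artifact, while a change to the variant DB only touches the annotated variants; this is the kind of query the metadata store must answer efficiently at scale.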
Challenge 3: Reproducibility
Example: workflow to identify mutations in a patient's genome.
• Specification: workflow specification (analyse the input genome, produce variants)
• Deployment: WF manager on a Linux VM cluster on Azure
• Dependencies: GATK/Picard/BWA, the workflow manager (and its own dependencies), Ubuntu on Azure
• Input: patient genome, config, reference genome, variant DBs
What happens when any of the dependencies change?
Challenge 4: reusability of the solution across cases
• How do we make case-specific solutions generic?
• How do we make the DSS reusable?
• Refactor: Generic framework + case-specific components
• This is hard: most elements are case-specific!
• Metadata formats
• Metadata capture
• Change impact
• Cost models
• Utility functions
• …
Available technology components
• The W3C PROV model for describing data dependencies (provenance)
• DataONE "metacat" for data and metadata management
• The eScience Central Workflow Management System
• Natively provenance-aware
• noWorkflow: an (experimental) Python provenance recorder
• Cloud resources: Azure, and our own private cloud (CIC)
[Architecture diagram repeated: ReComp decision dashboard (select/prioritise, execute, curate); Meta-Knowledge Repository storing Research Objects (provenance, logs, data and process versions, process dependencies, WP1); Change Impact Analysis, Cost Estimation, Differential Analysis, Reproducibility Assessment; domain knowledge (utility functions, priority policies, data similarity functions); runtime monitors, logging, and provenance recorders for Python and other analytics environments; prospective provenance curation (YesWorkflow)]
Specific areas for PhD research
Modelling and analytics:
• Impact and cost estimation
• […]
Software engineering
• Generic framework + plugins architecture
• Metadata management
• Capture, storage, index, query
• Reproducibility for recomputation
• […]
Case studies
• Genomics
• Flood modelling / smart cities
• […]
Summary
• Value from Big Data analytics may decay as the resources it is
built on change
• Resources = {data, external state, algorithms, libs, …}
• Value = “Knowledge Assets” (KA)
• When should such value be restored?
• How do you estimate the cost of re-computation?
• How do you prioritise over a large pool of KAs for a given budget?
ReComp:
• A decision support tool aimed at answering these questions
• Through a metadata management infrastructure with metadata
analytics on top
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
Project Risk Analysis in Aerospace Industry
Project Risk Analysis in Aerospace IndustryProject Risk Analysis in Aerospace Industry
Project Risk Analysis in Aerospace Industry
 
Kafka Summit SF 2017 - Accelerating Particles to Explore the Mysteries of the...
Kafka Summit SF 2017 - Accelerating Particles to Explore the Mysteries of the...Kafka Summit SF 2017 - Accelerating Particles to Explore the Mysteries of the...
Kafka Summit SF 2017 - Accelerating Particles to Explore the Mysteries of the...
 
Streaming Model Transformations by Complex Event Processing
Streaming Model Transformations by Complex Event ProcessingStreaming Model Transformations by Complex Event Processing
Streaming Model Transformations by Complex Event Processing
 
2017 nov reflow sbtb
2017 nov reflow sbtb2017 nov reflow sbtb
2017 nov reflow sbtb
 
ACC-2012, Bangalore, India, 28 July, 2012
ACC-2012, Bangalore, India, 28 July, 2012ACC-2012, Bangalore, India, 28 July, 2012
ACC-2012, Bangalore, India, 28 July, 2012
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
 
The Concurrent Constraint Programming Research Programmes -- Redux
The Concurrent Constraint Programming Research Programmes -- ReduxThe Concurrent Constraint Programming Research Programmes -- Redux
The Concurrent Constraint Programming Research Programmes -- Redux
 
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...
 
B Kindilien-Does Manufacturing Have a Future?
B Kindilien-Does Manufacturing Have a Future?B Kindilien-Does Manufacturing Have a Future?
B Kindilien-Does Manufacturing Have a Future?
 

Más de Paolo Missier

Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
Paolo Missier
 

Más de Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
 
ReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for HealthReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for Health
 
algorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparencyalgorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparency
 
Provenance Annotation and Analysis to Support Process Re-Computation
Provenance Annotation and Analysis to Support Process Re-ComputationProvenance Annotation and Analysis to Support Process Re-Computation
Provenance Annotation and Analysis to Support Process Re-Computation
 

Último

Último (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

ReComp: challenges in selective recomputation of (expensive) data analytics tasks

  • 1. Center for Doctoral Training – Newcastle, Seminar Series – Nov. 2015, P. Missier. ReComp: preserving the value of big data insights over time. "Panta Rhei" (Heraclitus, through Plato). Paolo Missier, Paolo.Missier@ncl.ac.uk. November 2015, Cloud CDT seminar series, Newcastle. (*) Painting by Johannes Moreelse.
  • 2. Generating analytical knowledge. Specification → Deployment → Execution → KA (Knowledge Asset). Dependencies: algorithms, libs, packages; system; external state (DBs); input data; config. Ex.: machine learning using Python and scikit-learn, learning a model to recognise an activity pattern. Specification: model training. Deployment: Python 3, Ubuntu x.y.z, Azure VM. Dependencies: scikit-learn, NumPy, Pandas, Ubuntu on Azure. Input: training + testing dataset, config. Output: model.
  • 3. Generating analytical knowledge. Specification → Deployment → Execution → KA (Knowledge Asset). Ex.: workflow to identify mutations in a patient's genome. Specification: workflow specification. Deployment: WF manager, Linux VM cluster on Azure. Execution: analyse input genome, producing variants. Dependencies: GATK/Picard/BWA, the workflow manager (and its own dependencies), Ubuntu on Azure. Input: input genome, config, reference genome, variants DBs.
  • 5. How fast does knowledge advance? Life Sciences knowledge: genes (GenBank, Ensembl), proteins, SNPs, human variant DBs (ClinVar); Life Sciences ontologies (GO, HPO, …); the human genome assembly. The collection of all PubMed articles. DBpedia, Wikipedia, etc. All current {Twitter, FB, G+, …} users and their connections. A map of all buildings in a city, with their location and footprint. The Hubble Atlas of Ancient Galaxies. The catalogue of all known exoplanets (about 2000).
  • 7. What analytics? Genomics: diagnosis of rare genetic diseases; analysis of soil and water composition (metagenomics). Social media analytics, e.g. Twitter content analysis: sentiment analysis, topic discovery, emergency response, fostering new communities. Climate modelling: predicting local climate changes; ecology: understanding change by monitoring local species. Environmental risk assessment: flood modelling and simulation.
  • 8. Case study: NGS data processing pipeline (Genomics). Stage 1: raw sequences are aligned to the HG19 reference genome using the BWA aligner; cleaning and duplicate elimination (Picard tools); recalibration corrects for systematic bias in the quality scores assigned by the sequencer (GATK); coverage is computed for each read. Stage 2: variant calling operates on multiple samples simultaneously, splitting the samples into chunks; the haplotype caller detects both SNVs and longer indels; variant recalibration attempts to reduce the false positive rate from the caller. Stage 3: VCF subsetting by filtering, e.g. non-exomic variants; Annovar functional annotations (e.g. MAF, synonymy, SNPs, …) followed by in-house annotations, producing annotated variants.
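The three-stage structure of the pipeline can be sketched as composed steps. The function names below mirror the slide's steps; the string stubs are purely illustrative stand-ins for the real tools (BWA, Picard tools, GATK, Annovar), not their interfaces.

```python
# Sketch of the three-stage NGS pipeline structure (illustrative stubs only).

def stage1(raw_sequences):
    """Per-sample: align, clean, recalibrate alignments, calculate coverage."""
    aligned = f"aligned({raw_sequences})"
    cleaned = f"cleaned({aligned})"
    recalibrated = f"recalibrated({cleaned})"
    coverage = f"coverage({recalibrated})"
    return recalibrated, coverage

def stage2(recalibrated_samples):
    """Variant calling operates on multiple samples simultaneously."""
    called = f"variants({'+'.join(recalibrated_samples)})"
    return f"recalibrated({called})"   # variant recalibration step

def stage3(variants):
    """Filter (e.g. non-exomic variants), then annotate."""
    return f"annotated(filtered({variants}))"

recals = [stage1(s)[0] for s in ["sample1", "sample2"]]
annotated = stage3(stage2(recals))
print(annotated)
```

The point of the sketch is the data-flow shape: Stage 1 is per-sample, Stage 2 joins multiple samples, Stage 3 is again a per-output transformation, which is what makes a change in any shared dependency (e.g. the reference genome) fan out across all downstream knowledge assets.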
  • 9. Case study: metagenomics. From environment to DNA sequence: Sample → size fractioning → DNA extraction / mRNA extraction / PCR → (Metagenome / Metatranscriptome / Amplicon) → Sequencing → Analysis.
  • 10. Case study: flood modelling in Newcastle. CityCAT (City Catchment Analysis Tool): a unique software tool for modelling, analysis and visualisation of surface water flooding. High-resolution flood model; integrates hydraulic modelling algorithms; subsurface flow modelling; topography (DEMs from LiDAR); physical structures (buildings etc.); land-use data; outputs a high-resolution grid of flood depths. Extensively tested, multi-platform, integrated into CONDOR and Microsoft Azure.
  • 11. What kinds of changes affect these analytics tasks? (For each application: knowledge sources; algorithms and tools.)
    LS, diagnosis of rare genetic diseases: PubMed, human variant DBs, the human genome assembly, SNP DBs; numerous algorithms and tools for sequence alignment, cleaning, variant calling, …
    LS, metagenomics: collections of known DNA sequences for multiple species; same tools as for genomics.
    SM, sentiment analysis: past predictive models; content analysis, NLP tools, statistical model learning (classification).
    SM, topic discovery: clustering algorithms.
    SM, emergency response: content analysis, NLP tools, predictive models, topical trend analysis.
    SM, fostering new communities: hubs & authorities algorithms, clustering.
    CS, predicting local climate changes: historical and current time series at multiple resolutions, past and current models; statistical model learning.
    CS, ecology (understanding change by monitoring local species): local species counts and behaviour observations; statistical model learning.
    CE, flood modelling and simulation: local topography, location of buildings; simulation packages (e.g. CityCAT).
  • 12. Volume: how many data products are affected?
    LS, diagnosis of rare genetic diseases: the 100K genome project in the UK alone; thousands of samples in Newcastle alone.
    LS, metagenomics: a few thousand (EBI Metagenomics portal).
    SM, sentiment analysis: the number of users whose sentiment is being analysed.
    SM, topic discovery: a few clusters, containing a large number of tweets.
    SM, emergency response: a few key decisions.
    SM, fostering new communities: a few key users.
    CS, predicting local climate changes: local effects.
    CS, ecology (monitoring local species): local effects.
    CE, flood modelling and simulation: local effects.
  • 13. How fast do these products become obsolete? [Chart: timescale from minutes and hours to days, months and years, for each application: LS diagnosis of rare genetic diseases; LS metagenomics; SM sentiment analysis; SM topic discovery; SM emergency response; SM fostering new communities; CS predicting local climate changes; CS ecology (monitoring local species); CE flood modelling and simulation.]
  • 14. How sensitive are data products to change? [Chart over the same applications: LS diagnosis of rare genetic diseases; LS metagenomics; SM sentiment analysis; SM topic discovery; SM emergency response; SM fostering new communities; CS predicting local climate changes; CS ecology (monitoring local species); CE flood modelling and simulation.]
  • 15. How much do they cost? Note: cost per product vs cost over all products. Cost components: design, development, system, runtime. [Chart over the same applications as the previous slides.]
  • 16. Case study: NGS data processing pipeline (Genomics). [The three-stage pipeline of slide 8, repeated: align, clean, recalibrate alignments, calculate coverage; call and recalibrate variants; filter and annotate variants.]
  • 17. Workflow deployment on the Azure cloud. An Azure VM hosts the e-Science Central main server (JMS queue, REST API, Web UI) with its DB backend; clients are a web browser and a rich client app. Workflow invocations are dispatched to worker-role workflow engines; e-SC control data and workflow data flow through the Azure Blob store (e-SC blob store). Module configuration: 3 nodes, 24 cores. Azure workflow engines: D13 VMs with an 8-core CPU, 56 GiB of memory and a 400 GB SSD, running Ubuntu 14.04.
  • 18. Cost. [Chart: cost in GBP (0 to 18) vs number of samples (0 to 24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores).]
  • 21. From environment to DNA sequence: Sample → size fractioning → DNA extraction / mRNA extraction / PCR → (Metagenome / Metatranscriptome / Amplicon) → Sequencing → Analysis.
  • 22. EBI Metagenomics portal: an open resource for the archiving and analysis of metagenomics and metatranscriptomics data. A generic yet standardised analysis platform for all metagenomics studies; it offers a service that small groups would struggle to achieve. Submission of sequence data for archiving and analysis; data analysis using selected EBI and external software tools; data presentation and visualisation through a web interface.
  • 24. Marine datasets: the portal contains over 30 marine metagenomes. [Chart: millions of sequences per dataset.]
  • 25. Case study: flood modelling in Newcastle. [CityCAT, the City Catchment Analysis Tool, repeated from slide 10.]
  • 26.
  • 27. Fusing UO data and modelling: the CityCAT flood model combined with traffic data and weather data.
  • 28. The ReComp project. Aims: to create a decision support system for (1) detecting changes that affect time-sensitive analytical knowledge, (2) assessing its reprocessing options, and (3) estimating their cost. Inputs: change events, utility functions, priority rules. Outputs: prioritised KAs, cost estimates, reproducibility assessment. The ReComp DSS draws on previously computed KAs and their metadata. Funded by the EPSRC (Making Sense from Data), Feb. 2016 to Feb. 2019, with 2 Research Associates. In collaboration with Newcastle Civil Engineering (Phil James) and the Department of Clinical Neurosciences, Cambridge University (Prof. Patrick Chinnery).
  • 29. ReComp: target operating region. [Charts: volume (low to high) vs rate of change (slow to fast), and volume vs cost (low to high); the ReComp target region is marked in the volume vs rate-of-change space.]
  • 30. Recomputation analysis: abstraction. [Diagram: knowledge assets KA1 to KA5, produced at times t1, t2, t3, each depending on subsets of the inputs a, b, c, d; change events a → a', b → b', c → c' propagate to the assets that depend on them.]
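The dependency structure in this abstraction can be sketched in a few lines: an inverted index over dep(ka) maps a changed dependency to the assets that must be considered for recomputation. The asset and dependency names below are hypothetical, chosen to echo the diagram.

```python
# Minimal sketch of candidate identification from dependency sets.
from collections import defaultdict

deps = {                       # dep(ka) for each knowledge asset ka
    "KA1": {"a", "b", "c"},
    "KA2": {"a", "b", "d"},
    "KA3": {"a", "c"},
}

index = defaultdict(set)       # inverted index: dependency -> assets using it
for ka, ds in deps.items():
    for d in ds:
        index[d].add(ka)

def candidates(changed_dep):
    """karec: assets affected by a change event on `changed_dep`."""
    return sorted(index[changed_dep])

print(candidates("c"))   # only KA1 and KA3 depend on c
```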
  • 31. Recomputation analysis: conceptual steps. Assume we have a growing universe KA of knowledge assets. Each ka ∈ KA has dependencies dep(ka) on other assets in a set DA (input data, algorithms, libs, …). ReComp analysis steps: monitor and detect relevant change events {dai → dai'} with dai ∈ DA. For each change event {dai → dai'}: identify the candidate recomputation population karec ⊆ KA, i.e. each ka ∈ KA such that dai ∈ dep(ka). For each ka ∈ karec: estimate the effect of recomputing ka using dai' instead of dai (a quantitative estimate of the impact of the change dai → dai'), and determine the time and cost associated with recomputing ka. Use these estimates, along with utility functions, to rank karec, then carry out the top-k recomputations given a budget: ka → ka'. Finally, perform post-hoc analysis to improve the estimation models: compare actual effects with estimates; run differential data analysis Δ(ka, ka'); and run change-cause analysis: has any other element contributed to Δ(ka, ka')?
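The ranking and top-k selection step described above can be sketched as a greedy choice under a budget. All numbers and the utility function below are illustrative assumptions, not the project's actual models.

```python
# Sketch of the ReComp selection step: rank candidates by utility of
# recomputation, then take top-ranked ones whose cumulative cost fits a budget.

def rank_and_select(candidates, impact, cost, utility, budget):
    scored = sorted(candidates,
                    key=lambda ka: utility(impact[ka], cost[ka]),
                    reverse=True)
    chosen, spent = [], 0.0
    for ka in scored:
        if spent + cost[ka] <= budget:
            chosen.append(ka)
            spent += cost[ka]
    return chosen

impact = {"KA1": 0.9, "KA2": 0.2, "KA3": 0.6}   # estimated effect of the change
cost   = {"KA1": 5.0, "KA2": 1.0, "KA3": 4.0}   # estimated recomputation cost
utility = lambda i, c: i / c                      # e.g. impact per unit cost

selected = rank_and_select(["KA1", "KA2", "KA3"], impact, cost, utility,
                           budget=6.0)
print(selected)   # ['KA2', 'KA1']: KA3 would exceed the remaining budget
```

A greedy knapsack-style choice is only one plausible policy; the post-hoc analysis step exists precisely because the impact and cost estimates driving it are uncertain and need refining over time.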
  • 32. Recomputation analysis through sampling. A monitor detects change events; recomputation candidates are identified and prioritised subject to utility functions and a budget. Small-scale sampling recomputations are used to assess the effects of a change, estimate recomputation cost and assess reproducibility cost, feeding the Meta-K store, before committing to large-scale recomputation.
  • 33. Recomputation analysis through modelling. Change events feed a change impact model and a cost model, which drive the identification of recomputation candidates, the estimation of change impact and of reproducibility cost/effort, and prioritisation over the target population (subject to utility and budget); large-scale recomputation in turn updates both models. Change impact model: Δ(x, x') → Δ(y, y'). Challenging! Can we do better?
  • 34. Metadata + analytics: the knowledge is in the metadata! Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items (logs, provenance, dependencies, held in Meta-K) which describe details of past computations. The same loop applies: change events, identification of recomputation candidates, estimation of change impact and of reproducibility cost/effort, with model updates flowing back from large-scale recomputation.
  • 35. High-level architecture. A ReComp decision dashboard sits over a Select/prioritise → Execute → Curate cycle. A Meta-Knowledge Repository (Research Objects) stores provenance, logs, data and process versions, and process dependencies, populated by prospective provenance curation (Yworkflow) and by runtime monitors (logging, runtime provenance recorders) for Python and other analytics environments. Analysis components: change impact analysis, cost estimation, differential analysis, reproducibility assessment, driven by domain knowledge in the form of utility functions, priority policies and data similarity functions. (WP1)
  • 36. Project objectives. Obj 1: to investigate analytics techniques aimed at supporting re-computation decisions. Obj 2: to research techniques for assessing under what conditions it is practically feasible to re-compute an analytical process; specific target system environments: Python / Jupyter, and the eScience Central workflow manager (developed at Newcastle). Obj 3: to create a decision support system for the selective recomputation of complex data-centric analytical processes and demonstrate its viability on two target case studies: genomics (human variant analysis) and the Urban Observatory (flood modelling).
  • 37. Expected outcomes. Research outcomes: algorithms that operate on metadata to perform impact analysis, cost estimation, differential data and change-cause analysis of past and new knowledge outcomes, and estimation of reproducibility effort. System outcomes: a software framework consisting of domain-independent, reusable components, which implement the metadata infrastructure and the research outcomes, plus a user-facing decision support dashboard. It must be possible to integrate the framework with domain-specific components to support specific scenarios, exemplified by our case studies.
  • 38. Challenge 1: estimating impact and cost. [Diagram: change events drive impact estimation over a target population, prioritisation, and large-scale recomp; the Change Impact Model and Cost Model are themselves updated from observed outcomes.] Change impact model: Δ(x,x') → Δ(y,y') - challenging!!
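A toy illustration of the Δ(x,x') → Δ(y,y') idea. The per-process sensitivity coefficient, the linear form, and the averaging cost model are all illustrative assumptions, not part of ReComp; they only show where learned models would plug in.

```python
# Toy change-impact and cost estimation, assuming (hypothetically) a learned
# per-process sensitivity coefficient and logged past runtimes.

def input_delta(x_old, x_new):
    """Δ(x, x'): here simply the fraction of records that changed."""
    changed = sum(1 for a, b in zip(x_old, x_new) if a != b)
    return changed / max(len(x_old), 1)

def estimated_output_delta(dx, sensitivity):
    """Δ(y, y') estimated as sensitivity * Δ(x, x') (a strong linearity assumption)."""
    return sensitivity * dx

def estimated_cost(past_runtimes):
    """Cost model: average of logged past execution times (from Meta-K logs)."""
    return sum(past_runtimes) / len(past_runtimes)

dx = input_delta([1, 2, 3, 4], [1, 2, 9, 8])   # 2 of 4 records changed
dy = estimated_output_delta(dx, sensitivity=0.8)
cost = estimated_cost([110.0, 95.0, 101.0])
print(dx, dy, cost)
```

In practice Δ would be a domain-specific similarity function (one of the pluggable "data similarity functions" in the architecture), and the impact model would be learned from past (Δ-input, Δ-output) pairs rather than fixed.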
  • 39. Challenge 2: managing the metadata. How do we generate, capture, store, index and query across multiple metadata types and formats? Relevant metadata:
  • Logs of past executions, automatically collected;
  • Provenance traces:
    • Runtime ("retrospective") provenance: an automatically collected data dependency graph, captured from the computation;
    • Process structure ("prospective" provenance), obtained by manually annotating a script;
  • External data and system dependencies, process and data versions, and system requirements.
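One way to picture the "query across multiple metadata types" problem is a single store of heterogeneous records sharing a few common keys. The record schema below is hypothetical; a real Meta-K repository would index richer formats (PROV graphs, Research Objects).

```python
# Sketch: heterogeneous metadata records (logs, provenance, dependencies)
# queried uniformly by attribute matching. Field names are illustrative.

metadata_store = [
    {"type": "log", "run": "r1", "process": "variant-calling", "runtime_s": 5400},
    {"type": "provenance", "run": "r1", "derived": "report-p01",
     "used": ["genome-p01", "ClinVar-2015-10"]},
    {"type": "dependency", "process": "variant-calling",
     "requires": ["GATK-3.4", "BWA-0.7", "Ubuntu-14.04"]},
]

def query(store, **criteria):
    """Return every record, of any type, matching all attribute/value pairs."""
    return [m for m in store
            if all(m.get(k) == v for k, v in criteria.items())]

# All metadata recorded for run r1, regardless of type:
for m in query(metadata_store, run="r1"):
    print(m["type"])
```

The design point is the shared keys (`run`, `process`): cross-type queries like "logs and provenance for the same run" only work if capture tools agree on identifiers, which is exactly the integration challenge the slide raises.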
  • 40. Challenge 3: reproducibility. Ex.: workflow to identify mutations in a patient's genome (workflow specification; WF manager; Linux VM cluster on Azure; inputs: input genome, reference genome, config; external state: variants DBs; dependencies: GATK/Picard/BWA, the workflow manager and its own dependencies, Ubuntu on Azure; output: variants). What happens when any of the dependencies change?
  • 41. Challenge 4: reusability of the solution across cases.
  • How do we make case-specific solutions generic?
  • How do we make the DSS reusable?
  • Refactor: generic framework + case-specific components
  • This is hard: most elements are case-specific!
    • Metadata formats
    • Metadata capture
    • Change impact
    • Cost models
    • Utility functions
    • …
  • 42. Available technology components.
  • W3C PROV model for describing data dependencies (provenance)
  • DataONE "metacat" for data and metadata management
  • The eScience Central workflow management system - natively provenance-aware
  • noWorkflow: an (experimental) Python provenance recorder
  • Cloud resources: Azure, our own private cloud (CIC)
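The W3C PROV dependency model is what makes impact analysis mechanical: following `wasDerivedFrom` edges transitively from a changed entity yields every downstream artefact. A minimal sketch in plain Python, rather than through a PROV serialisation or library; the graph and entity names are made up.

```python
# PROV-style impact traversal: derived entity -> set of entities it was
# derived from (PROV's wasDerivedFrom relation). Entity names are illustrative.

was_derived_from = {
    "aligned-reads": {"input-genome", "ref-genome-GRCh37"},
    "raw-variants":  {"aligned-reads"},
    "report-p01":    {"raw-variants", "ClinVar-2015-10"},
}

def affected_by(changed, edges):
    """All entities transitively derived from `changed`."""
    affected, frontier = set(), {changed}
    while frontier:
        # Entities directly derived from anything in the current frontier:
        frontier = {d for d, srcs in edges.items() if srcs & frontier} - affected
        affected |= frontier
    return affected

print(sorted(affected_by("ClinVar-2015-10", was_derived_from)))
print(sorted(affected_by("ref-genome-GRCh37", was_derived_from)))
```

A change to the reference genome invalidates the whole chain down to the report, while a ClinVar update touches only the final interpretation step - which is precisely the distinction ReComp needs to avoid blind, full recomputation.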
  • 43. Specific areas for PhD research.
  Modelling and analytics: impact and cost estimation; […]
  Software engineering: generic framework + plugin architecture; metadata management (capture, storage, indexing, querying); reproducibility for recomputation; […]
  Case studies: genomics; flood modelling / smart cities; […]
  • 44. Summary.
  • Value from Big Data analytics may decay as the resources it is built on change
    • Resources = {data, external state, algorithms, libs, …}
    • Value = "Knowledge Assets" (KA)
  • When should such value be restored?
  • How do you estimate the cost of re-computation?
  • How do you prioritise over a large pool of KAs for a given budget?
  ReComp: a decision support tool aimed at answering these questions, through a metadata management infrastructure with metadata analytics on top.
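The budget question in the summary is essentially a knapsack-style selection. A greedy utility-per-cost heuristic is a simple stand-in for ReComp's (open) prioritisation problem; the candidate assets and their utility/cost figures are invented for illustration.

```python
# Greedy prioritisation of re-computation candidates under a fixed budget.
# Utilities and costs would come from the impact and cost models; here they
# are hard-coded illustrative values.

candidates = [
    # (asset, estimated utility of refreshing it, estimated recomp cost)
    ("report-p01",     9.0, 12.0),
    ("report-p02",     4.0,  2.0),
    ("activity-model", 6.0,  8.0),
]

def prioritise(cands, budget):
    """Pick candidates in decreasing utility/cost order until budget runs out."""
    chosen, spent = [], 0.0
    for name, utility, cost in sorted(cands, key=lambda c: c[1] / c[2],
                                      reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

chosen, spent = prioritise(candidates, budget=15.0)
print(chosen, spent)
```

Greedy selection is not optimal for knapsack in general; a real prioritisation policy would also weigh the domain-specific utility functions the architecture treats as pluggable components.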
  • 45. References
[1] V. Stodden, F. Leisch, and R. D. Peng, Implementing Reproducible Research. CRC Press, 2014.
[2] R. Peng, "Reproducible Research in Computational Science," Science, vol. 334, no. 6060, pp. 1226-1227, Dec. 2011.
[3] R. Qasha, J. Cala, and P. Watson, "Towards Automated Workflow Deployment in the Cloud using TOSCA," in Procs. IEEE 8th International Conference on Cloud Computing (IEEE CLOUD 2015), 2015.
[4] D. C. Koboldt, L. Ding, E. Mardis, and R. Wilson, "Challenges of sequencing human genomes," Brief. Bioinform., Jun. 2010.
[5] A. Nekrutenko, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biol., vol. 11, no. 8, p. R86, 2010.
[6] J. Cala, Y. X. Xu, E. A. Wijaya, and P. Missier, "From scripted HPC-based NGS pipelines to workflows on the cloud," in Procs. C4Bio workshop, co-located with the 2014 CCGrid conference, 2013.
[7] P. Missier, E. Wijaya, R. Kirby, and M. Keogh, "SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use," in Procs. 11th International Conference on Data Integration in the Life Sciences, 2015.
[8] D. G. MacArthur, T. A. Manolio, D. P. Dimmock, H. L. Rehm, et al., "Guidelines for investigating causality of sequence variants in human disease," Nature, vol. 508, no. 7497, pp. 469-476, Apr. 2014.
[9] H. Johnson, R. S. Kovats, G. McGregor, J. Stedman, M. Gibbs, and H. Walton, "The impact of the 2003 heat wave on daily mortality in England and Wales and the use of rapid weekly mortality estimates," Euro Surveill., vol. 10, no. 7, pp. 168-171, 2005.
[10] T. Holderness, S. Barr, R. Dawson, and J. Hall, "An evaluation of thermal Earth observation for characterizing urban heatwave event dynamics using the urban heat island intensity metric," International Journal of Remote Sensing, vol. 34, no. 3, pp. 864-884, 2013.
[11] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, S. Miles, P. Missier, et al., "The Open Provenance Model - Core Specification (v1.1)," Futur. Gener. Comput. Syst., vol. 27, no. 6, pp. 743-756, 2011.
[12] H. Hiden, P. Watson, S. Woodman, and D. Leahy, "e-Science Central: Cloud-based e-Science and its application to chemical property modelling," Newcastle University Technical Report series, http://www.ncl.ac.uk/computing/research/techreports/, 2011.
[13] T. McPhillips, T. Song, T. Kolisnik, S. Aulenbach, K. Belhajjame, et al., "YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts," in Procs. 10th Intl. Digital Curation Conference (IDCC), 2015.
[14] S. Bechhofer, D. De Roure, M. Gamble, C. Goble, and I. Buchan, "Research Objects: Towards Exchange and Reuse of Digital Knowledge," in Procs. Int'l Workshop on Future of the Web for Collaborative Science (FWCS), WWW '10, 2010.
[15] S. Woodman, H. Hiden, P. Watson, and P. Missier, "Achieving Reproducibility by Combining Provenance with Service and Workflow Versioning," in Procs. WORKS 2011, 2011.
[16] L. Murta, V. Braganholo, F. Chirigati, D. Koop, and J. Freire, "noWorkflow: Capturing and Analyzing Provenance of Scripts," in Procs. IPAW '14, 2014.
[17] L. Moreau, P. Missier, K. Belhajjame, R. B'Far, J. Cheney, S. Coppens, S. Cresswell, Y. Gil, P. Groth, G. Klyne, T. Lebo, J. McCusker, S. Miles, J. Myers, S. Sahoo, and C. Tilmes, "PROV-DM: The PROV Data Model," 2012.
[18] T. Miu and P. Missier, "Predicting the Execution Time of Workflow Activities Based on Their Input Features," in Procs. WORKS, 2012.
[19] P. Missier, S. Woodman, H. Hiden, and P. Watson, "Provenance and data differencing for workflow reproducibility analysis," Concurr. Comput. Pract. Exp., 2013.

Editor's notes

  1. The times they are a’changin
  2. So, how do we go from the environment to a whole-genome shotgun sequencing project? This slide schematically presents the 'typical' approach. From the sample there is usually an extraction process, for example washing the bacteria from the solid matter in a soil sample. This is typically followed by some size fractionation. More often this has focused on the bacterial component, but is now moving in both directions of size, to viruses and picoeukaryotes. After you have isolated the micro-organisms in a given size range, there is normally a process where the DNA is extracted and processed ready for sequencing. The sequencing approach most widely used today is Illumina, having originally been 454; however, some of this depends on the nature of the study. With the cost of sequencing ever decreasing, the bottleneck in the process is now the analysis of the DNA samples. Most samples submitted to the portal are about 10 times greater than the average bacterial genome, and you may be looking at a series of samples.
  3. S1. Identify re-computation candidates and understand the impact of changes in Information Assets on a corpus of knowledge outcomes: which outcomes are affected by the changes, and to what extent? This step defines the target re-computation population.
  S2. Estimate effects, costs and benefits of re-computation across the target population (S1).
  S3. Establish re-computation priorities within the target population, based on a budget for computational resources, a problem-specific definition of utility functions and prioritisation policy, and the estimates from (S2).
  S4. Selectively carry out priority re-computations, when the processes are reproducible.
  S5. Differential data analysis and change-cause analysis: assess the effects of the re-computation. This involves understanding how the new outcomes differ from the original (differential data analysis), and which of the changes in the process are responsible for the changes observed in the outcomes (change-cause analysis). The latter analysis helps data scientists understand the actual effect of an improved process "post hoc", and also has the potential to improve future effect estimates.
  4. Problem: this is “blind” and expensive. Can we do better?
  5. These items are partly collected automatically and partly as manual annotations. They include:
  - Logs of past executions, automatically collected, to be used for post-hoc performance analysis and estimation of future resource requirements and thus costs (S1);
  - Runtime provenance traces and prospective provenance. The former are automatically collected graphs of data dependencies, captured from the computation [11]. The latter are formal descriptions of the analytics process, obtained from the workflow specification, or more generally by manually annotating a script. Both are instrumental to understanding how the knowledge outcomes have changed and why (S5), as well as to estimating future re-computation effects;
  - External data and system dependencies, process and data versions, and system requirements associated with the analytics process, which are used to understand whether it will be practically possible to re-compute the process.