SlideShare a Scribd company logo
1 of 21
ReCompkickoff
NewcaslteMarch11,2016
ReComp: preserving the value of big data insights over time
Panta Rhei (Heraclitus, through Plato)
Project Kickoff
March, 2016
(*) Painting by Johannes Moreelse
(*)
ReCompkickoff
NewcaslteMarch11,2016
Data to Knowledge
Meta-knowledge
Big
Data
The Big
Analytics
Machine
Algorithms
Tools
Middleware
Reference
datasets
“Valuable
Knowledge”
The Data-to-knowledge axiom of the Knowledge Economy:
What is the Total Cost of Ownership (TCO) of these knowledge assets?
ReCompkickoff
NewcaslteMarch11,2016
Learning from data (supervised)
Meta-knowledge
Training
set
Model
learning
Classification
algorithms
Predictive
classifier
Background
Knowledge
(prior)
ReCompkickoff
NewcaslteMarch11,2016
Learning from data (unsupervised)
Meta-knowledge
Observations Model
learning
Clustering
algorithms
Clustering
scheme
Background
Knowledge
ReCompkickoff
NewcaslteMarch11,2016
Stream Analytics
Meta-knowledge
Data
stream
Time
Series
analysis
Pattern
recognition
algorithms
Temporal
Patterns
Background
Knowledge
ReCompkickoff
NewcaslteMarch11,2016
The missing element: time
Big
Data
The Big
Analytics
Machine
“Valuable
Knowledge”
V3
V2
V1
Meta-knowledge
Algorithms
Tools
Middleware
Reference
datasets
t
t
t
ReCompkickoff
NewcaslteMarch11,2016
The ReComp decision support system
Observe change
• In big data
• In meta-knowledge
Assess and
measure
• knowledge decay
Estimate
• Cost and benefits of refresh
Enact
• Reproduce (analytics)
processes
Big
Data
The Big
Analytics
Machine
“Valuable
Knowledge”
V3
V2
V1
Meta-knowledge
Algorithms
Tools
Middleware
Reference
datasets
t
t
t
ReCompkickoff
NewcaslteMarch11,2016
ReComp
Change
Events
Diff(.,.)
functions
“business
Rules”
Prioritised KAs
Cost estimates
Reproducibility
assessment
ReComp DSS
Previously
Computed KAs
And their metadata
Observe
change
Assess and
measure
Estimate
Enact
KA: Knowledge Assets
ReCompkickoff
NewcaslteMarch11,2016
ReComp: project objectives
Obj 1.
To investigate analytics techniques aimed at supporting re-computation decisions
Obj 2.
To research techniques for assessing under what conditions it is practically feasible
to re-compute an analytical process.
• Specific target system environments:
• Python / Jupyter
• The eScience Central, workflow manager
Obj 3.
To create a decision support system for the selective recomputation of complex
data-centric analytical processes and demonstrate its viability on two target case
studies
• Genomics (human variant analysis)
• Urban Observatory (flood modelling)
ReCompkickoff
NewcaslteMarch11,2016
ReComp: Expected outcomes
Research Outcomes:
Algorithms that operate on metadata to perform:
• impact analysis
• cost estimation
• differential data and change cause analysis of past and new knowledge
outcomes
• estimation of reproducibility effort
System Outcomes:
• A software framework consisting of domain-independent, reusable components,
which implement the metadata infrastructure and the research outcomes
• A user-facing decision support dashboard.
It must be possible to integrate the framework with domain-specific components, to
support specific scenarios, exemplified by our case studies.
ReCompkickoff
NewcaslteMarch11,2016
ReComp: Target operating region
Rate of change
Volume
slowfast
low
high
Cost
Volume
highlow
low
high
Volume
Rate of change
ReComp target region
ReCompkickoff
NewcaslteMarch11,2016
Recomputation analysis: abstraction
t1 t2 t3
KA5
KA4
KA3
KA2
KA1a b c
a b
d
a b c d
a c
Change
Events
a a’
a
a
a
a
b b’
c c’
b,c
b
b,c
c
ReCompkickoff
NewcaslteMarch11,2016
Recomputation analysis through sampling
Change
Events
Monitor
identify
recomp
candidates
prioritisation budgetutility
assess
effects of
change
estimate
recomp
cost
assess
reproducibility
cost
sampling
recomp
recompsmall scale
recomp
Meta-K
large-scale
recomp
estimate
recomp
cost
ReCompkickoff
NewcaslteMarch11,2016
Recomputation analysis through modelling
identify
recomp
candidates
large-scale
recomp
estimate
change
impact
Estimate
reproducibility
cost/effort
prioritisation
target
population
utility budget
Change
Events
Change
Impact
Model
Cost
Model
Model
updates
Model
updates
Change impact model: Δ(x,x’)  Δ(y,y’)
-- challenging!!
Can we do better??
- Batching given an
allocation of resources?
- Consolidating jobs with
different resource reqs to
optimise resource
allocation
ReCompkickoff
NewcaslteMarch11,2016
Metadata + Analytics
The knowledge is
in the metadata!
Research hypothesis:
supporting the analysis can be achieved through analytical reasoning applied to a
collection of metadata items, which describe details of past computations.
identify
recomp
candidates
large-scale
recomp
estimate
change
impact
Estimate
reproducibility
cost/effort
Change
Events
Change
Impact
Model
Cost
Model
Model
updates
Model
updates
Meta-K • Logs
• Provenance
• Dependencies
ReCompkickoff
NewcaslteMarch11,2016
High level architecture
ReComp decision dashboard
Execute
Curate
Select/
prioritise
prospective
provenance
curation
(Yworkflow)
Meta-Knowledge
Repository
Research
Objects
Change
Impact
Analysis
Cost
Estimation
Differential
Analysis
Reproducibility
Assessment
- Utility functions
- Priorities policies
- Data similarity functions
domain knowledge
runtime
monitor
Logging
Runtime
Provenance recorder
runtime
monitor
Logging
Runtime
Provenance recorder
Python
WP1
- provenance
- logs
- data and process versions
- process dependencies
(other analytics environments)
ReCompkickoff
NewcaslteMarch11,2016
Available technology components
• W3C PROV model for describing data dependencies (provenance)
• DataONE “metacat” for data and metadata management
• The eScience Central Workflow Management System
• Natively provenance-aware
• NoWorkflow: a (experimental) Python provenance recorder
• Cloud resources:
• Azure, our own private cloud (CIC)
ReComp decision dashboard
Execute
Curate
Select/
prioritise
prospective
provenance
curation
(Yworkflow)
Meta-Knowledge
Repository
Research
Objects
Change
Impact
Analysis
Cost
Estimation
Differential
Analysis
Reproducibility
Assessment
- Utility functions
- Priorities policies
- Data similarity functions
domain knowledge
runtime
monitor
Logging
Runtime
Provenance recorder
runtime
monitor
Logging
Runtime
Provenance recorder
Python
WP1
- provenance
- logs
- data and process versions
- process dependencies
(other analytics environments)
ReCompkickoff
NewcaslteMarch11,2016
Challenge 1: estimating impact and cost
large-scale
recomp
estimate
change
impact
Estimate
reproducibility
cost/effort
prioritisation
target
population
Change
Impact
Model
Cost
Model
Model
updates
Model
updates
Change impact model: Δ(x,x’)  Δ(y,y’)
-- challenging!!
ReCompkickoff
NewcaslteMarch11,2016
Challenge 2: managing the metadata
How do we generate / capture / store / index / query across multiple metadata
types and formats?
Relevant Metadata:
• Logs of past executions, automatically collected;
• Provenance traces:
• Runtime (“retrospective”) provenance
• Automatically collected data dependency graph captured from the
computation
• Process structure (“prospective provenance”)
• obtained by manually annotating a script
• External data and system dependencies, process and data versions, and system
requirements
ReCompkickoff
NewcaslteMarch11,2016
Challenge 3: Reproducibility
Ex.: workflow to
Identify mutations
in a patient’s
genome
Workflow
specification
WF manager
Linux VM
cluster on
Azure
Analyse
Input genome
variant
s
GATK/Picard/BWA
Workflow Manager
(and its own dependencies)
Ubuntu
on Azure
Dep.
Input
genome config
Ref
genome
Variants
DBs
What happens when any of the dependencies change?
ReCompkickoff
NewcaslteMarch11,2016
Challenge 4: reusability of the solution across cases
• How do we make case-specific solutions generic?
• How do we make the DSS reusable?
• Refactor: Generic framework + case-specific components
• This is hard: most elements are case-specific!
• Metadata formats
• Metadata capture
• Change impact
• Cost models
• Utility functions
• …

More Related Content

What's hot

Selective and incremental re-computation in reaction to changes: an exercise ...
Selective and incremental re-computation in reaction to changes: an exercise ...Selective and incremental re-computation in reaction to changes: an exercise ...
Selective and incremental re-computation in reaction to changes: an exercise ...Paolo Missier
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
 
Unsupervised Learning and Image Classification in High Performance Computing ...
Unsupervised Learning and Image Classification in High Performance Computing ...Unsupervised Learning and Image Classification in High Performance Computing ...
Unsupervised Learning and Image Classification in High Performance Computing ...HPCC Systems
 
accessible-streaming-algorithms
accessible-streaming-algorithmsaccessible-streaming-algorithms
accessible-streaming-algorithmsFarhan Zaki
 
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...Ioannis Katakis
 
A framework for mining signatures from event sequences and its applications i...
A framework for mining signatures from event sequences and its applications i...A framework for mining signatures from event sequences and its applications i...
A framework for mining signatures from event sequences and its applications i...JPINFOTECH JAYAPRAKASH
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introductionNeeraj Tewari
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...Till Blume
 
Multiple Models for Recommending Temporal Aspects of Entities
Multiple Models for Recommending Temporal Aspects of EntitiesMultiple Models for Recommending Temporal Aspects of Entities
Multiple Models for Recommending Temporal Aspects of EntitiesTu Nguyen
 
Nimrita koul Machine Learning
Nimrita koul  Machine LearningNimrita koul  Machine Learning
Nimrita koul Machine LearningNimrita Koul
 
La résolution de problèmes à l'aide de graphes
La résolution de problèmes à l'aide de graphesLa résolution de problèmes à l'aide de graphes
La résolution de problèmes à l'aide de graphesData2B
 
NG2S: A Study of Pro-Environmental Tipping Point via ABMs
NG2S: A Study of Pro-Environmental Tipping Point via ABMsNG2S: A Study of Pro-Environmental Tipping Point via ABMs
NG2S: A Study of Pro-Environmental Tipping Point via ABMsKan Yuenyong
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Bertram Ludäscher
 
Himansu sahoo resume-ds
Himansu sahoo resume-dsHimansu sahoo resume-ds
Himansu sahoo resume-dsHimansu Sahoo
 
Towards reproducibility and maximally-open data
Towards reproducibility and maximally-open dataTowards reproducibility and maximally-open data
Towards reproducibility and maximally-open dataPablo Bernabeu
 
IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079ibankuk
 
Master Thesis Presentation
Master Thesis PresentationMaster Thesis Presentation
Master Thesis PresentationEhab Qadah
 

What's hot (19)

Selective and incremental re-computation in reaction to changes: an exercise ...
Selective and incremental re-computation in reaction to changes: an exercise ...Selective and incremental re-computation in reaction to changes: an exercise ...
Selective and incremental re-computation in reaction to changes: an exercise ...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
Unsupervised Learning and Image Classification in High Performance Computing ...
Unsupervised Learning and Image Classification in High Performance Computing ...Unsupervised Learning and Image Classification in High Performance Computing ...
Unsupervised Learning and Image Classification in High Performance Computing ...
 
accessible-streaming-algorithms
accessible-streaming-algorithmsaccessible-streaming-algorithms
accessible-streaming-algorithms
 
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web In...
 
A framework for mining signatures from event sequences and its applications i...
A framework for mining signatures from event sequences and its applications i...A framework for mining signatures from event sequences and its applications i...
A framework for mining signatures from event sequences and its applications i...
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introduction
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
 
Multiple Models for Recommending Temporal Aspects of Entities
Multiple Models for Recommending Temporal Aspects of EntitiesMultiple Models for Recommending Temporal Aspects of Entities
Multiple Models for Recommending Temporal Aspects of Entities
 
Nimrita koul Machine Learning
Nimrita koul  Machine LearningNimrita koul  Machine Learning
Nimrita koul Machine Learning
 
Big data
Big dataBig data
Big data
 
Skillwise Big data
Skillwise Big dataSkillwise Big data
Skillwise Big data
 
La résolution de problèmes à l'aide de graphes
La résolution de problèmes à l'aide de graphesLa résolution de problèmes à l'aide de graphes
La résolution de problèmes à l'aide de graphes
 
NG2S: A Study of Pro-Environmental Tipping Point via ABMs
NG2S: A Study of Pro-Environmental Tipping Point via ABMsNG2S: A Study of Pro-Environmental Tipping Point via ABMs
NG2S: A Study of Pro-Environmental Tipping Point via ABMs
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
 
Himansu sahoo resume-ds
Himansu sahoo resume-dsHimansu sahoo resume-ds
Himansu sahoo resume-ds
 
Towards reproducibility and maximally-open data
Towards reproducibility and maximally-open dataTowards reproducibility and maximally-open data
Towards reproducibility and maximally-open data
 
IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079
 
Master Thesis Presentation
Master Thesis PresentationMaster Thesis Presentation
Master Thesis Presentation
 

Similar to ReComp project kickoff presentation 11-03-2016

Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Anubhav Jain
 
QRS16: NOTICE: A Framework for Non-functional Testing of Compilers
QRS16: NOTICE: A Framework for Non-functional Testing of CompilersQRS16: NOTICE: A Framework for Non-functional Testing of Compilers
QRS16: NOTICE: A Framework for Non-functional Testing of CompilersMohamed BOUSSAA
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cyclehktripathy
 
Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...RINUSATHYAN
 
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...HostedbyConfluent
 
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuantUniversity
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsEsther Vasiete
 
ABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack BenchmarkABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack Benchmarkt_ivanov
 
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...University of California, San Diego
 
Time Series Analysis… using an Event Streaming Platform
Time Series Analysis… using an Event Streaming PlatformTime Series Analysis… using an Event Streaming Platform
Time Series Analysis… using an Event Streaming Platformconfluent
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
 
Converting Scripts into Reproducible Workflow Research Objects
Converting Scripts into Reproducible Workflow Research ObjectsConverting Scripts into Reproducible Workflow Research Objects
Converting Scripts into Reproducible Workflow Research ObjectsLucas Augusto Carvalho
 
Converting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsConverting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsKhalid Belhajjame
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Databricks
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...MLconf
 

Similar to ReComp project kickoff presentation 11-03-2016 (20)

Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
QRS16: NOTICE: A Framework for Non-functional Testing of Compilers
QRS16: NOTICE: A Framework for Non-functional Testing of CompilersQRS16: NOTICE: A Framework for Non-functional Testing of Compilers
QRS16: NOTICE: A Framework for Non-functional Testing of Compilers
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cycle
 
Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...
 
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
 
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
 
ABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack BenchmarkABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack Benchmark
 
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
 
Time Series Analysis… using an Event Streaming Platform
Time Series Analysis… using an Event Streaming PlatformTime Series Analysis… using an Event Streaming Platform
Time Series Analysis… using an Event Streaming Platform
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
Converting Scripts into Reproducible Workflow Research Objects
Converting Scripts into Reproducible Workflow Research ObjectsConverting Scripts into Reproducible Workflow Research Objects
Converting Scripts into Reproducible Workflow Research Objects
 
Converting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsConverting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objects
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
03_aiops-1.pptx
03_aiops-1.pptx03_aiops-1.pptx
03_aiops-1.pptx
 

More from Paolo Missier

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...Paolo Missier
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...Paolo Missier
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...Paolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...Paolo Missier
 

More from Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
 

Recently uploaded

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 

Recently uploaded (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 

ReComp project kickoff presentation 11-03-2016

  • 1. ReCompkickoff NewcaslteMarch11,2016 ReComp: preserving the value of big data insights over time Panta Rhei (Heraclitus, through Plato) Project Kickoff March, 2016 (*) Painting by Johannes Moreelse (*)
  • 2. ReCompkickoff NewcaslteMarch11,2016 Data to Knowledge Meta-knowledge Big Data The Big Analytics Machine Algorithms Tools Middleware Reference datasets “Valuable Knowledge” The Data-to-knowledge axiom of the Knowledge Economy: What is the Total Cost of Ownership (TCO) of these knowledge assets?
  • 3. ReCompkickoff NewcaslteMarch11,2016 Learning from data (supervised) Meta-knowledge Training set Model learning Classification algorithms Predictive classifier Background Knowledge (prior)
  • 4. ReCompkickoff NewcaslteMarch11,2016 Learning from data (unsupervised) Meta-knowledge Observations Model learning Clustering algorithms Clustering scheme Background Knowledge
  • 6. ReCompkickoff NewcaslteMarch11,2016 The missing element: time Big Data The Big Analytics Machine “Valuable Knowledge” V3 V2 V1 Meta-knowledge Algorithms Tools Middleware Reference datasets t t t
  • 7. ReCompkickoff NewcaslteMarch11,2016 The ReComp decision support system Observe change • In big data • In meta-knowledge Assess and measure • knowledge decay Estimate • Cost and benefits of refresh Enact • Reproduce (analytics) processes Big Data The Big Analytics Machine “Valuable Knowledge” V3 V2 V1 Meta-knowledge Algorithms Tools Middleware Reference datasets t t t
  • 8. ReCompkickoff NewcaslteMarch11,2016 ReComp Change Events Diff(.,.) functions “business Rules” Prioritised KAs Cost estimates Reproducibility assessment ReComp DSS Previously Computed KAs And their metadata Observe change Assess and measure Estimate Enact KA: Knowledge Assets
  • 9. ReCompkickoff NewcaslteMarch11,2016 ReComp: project objectives Obj 1. To investigate analytics techniques aimed at supporting re-computation decisions Obj 2. To research techniques for assessing under what conditions it is practically feasible to re-compute an analytical process. • Specific target system environments: • Python / Jupyter • The eScience Central, workflow manager Obj 3. To create a decision support system for the selective recomputation of complex data-centric analytical processes and demonstrate its viability on two target case studies • Genomics (human variant analysis) • Urban Observatory (flood modelling)
  • 10. ReCompkickoff NewcaslteMarch11,2016 ReComp: Expected outcomes Research Outcomes: Algorithms that operate on metadata to perform: • impact analysis • cost estimation • differential data and change cause analysis of past and new knowledge outcomes • estimation of reproducibility effort System Outcomes: • A software framework consisting of domain-independent, reusable components, which implement the metadata infrastructure and the research outcomes • A user-facing decision support dashboard. It must be possible to integrate the framework with domain-specific components, to support specific scenarios, exemplified by our case studies.
  • 11. ReCompkickoff NewcaslteMarch11,2016 ReComp: Target operating region Rate of change Volume slowfast low high Cost Volume highlow low high Volume Rate of change ReComp target region
  • 12. ReCompkickoff NewcaslteMarch11,2016 Recomputation analysis: abstraction t1 t2 t3 KA5 KA4 KA3 KA2 KA1a b c a b d a b c d a c Change Events a a’ a a a a b b’ c c’ b,c b b,c c
  • 13. ReCompkickoff NewcaslteMarch11,2016 Recomputation analysis through sampling Change Events Monitor identify recomp candidates prioritisation budgetutility assess effects of change estimate recomp cost assess reproducibility cost sampling recomp recompsmall scale recomp Meta-K large-scale recomp estimate recomp cost
  • 14. ReCompkickoff NewcaslteMarch11,2016 Recomputation analysis through modelling identify recomp candidates large-scale recomp estimate change impact Estimate reproducibility cost/effort prioritisation target population utility budget Change Events Change Impact Model Cost Model Model updates Model updates Change impact model: Δ(x,x’)  Δ(y,y’) -- challenging!! Can we do better?? - Batching given an allocation of resources? - Consolidating jobs with different resource reqs to optimise resource allocation
  • 15. ReCompkickoff NewcaslteMarch11,2016 Metadata + Analytics The knowledge is in the metadata! Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items, which describe details of past computations. identify recomp candidates large-scale recomp estimate change impact Estimate reproducibility cost/effort Change Events Change Impact Model Cost Model Model updates Model updates Meta-K • Logs • Provenance • Dependencies
  • 16. ReCompkickoff NewcaslteMarch11,2016 High level architecture ReComp decision dashboard Execute Curate Select/ prioritise prospective provenance curation (Yworkflow) Meta-Knowledge Repository Research Objects Change Impact Analysis Cost Estimation Differential Analysis Reproducibility Assessment - Utility functions - Priorities policies - Data similarity functions domain knowledge runtime monitor Logging Runtime Provenance recorder runtime monitor Logging Runtime Provenance recorder Python WP1 - provenance - logs - data and process versions - process dependencies (other analytics environments)
  • 17. ReCompkickoff NewcaslteMarch11,2016 Available technology components • W3C PROV model for describing data dependencies (provenance) • DataONE “metacat” for data and metadata management • The eScience Central Workflow Management System • Natively provenance-aware • NoWorkflow: a (experimental) Python provenance recorder • Cloud resources: • Azure, our own private cloud (CIC) ReComp decision dashboard Execute Curate Select/ prioritise prospective provenance curation (Yworkflow) Meta-Knowledge Repository Research Objects Change Impact Analysis Cost Estimation Differential Analysis Reproducibility Assessment - Utility functions - Priorities policies - Data similarity functions domain knowledge runtime monitor Logging Runtime Provenance recorder runtime monitor Logging Runtime Provenance recorder Python WP1 - provenance - logs - data and process versions - process dependencies (other analytics environments)
  • 18. ReCompkickoff NewcaslteMarch11,2016 Challenge 1: estimating impact and cost large-scale recomp estimate change impact Estimate reproducibility cost/effort prioritisation target population Change Impact Model Cost Model Model updates Model updates Change impact model: Δ(x,x’)  Δ(y,y’) -- challenging!!
  • 19. ReCompkickoff NewcaslteMarch11,2016 Challenge 2: managing the metadata How do we generate / capture / store / index / query across multiple metadata types and formats? Relevant Metadata: • Logs of past executions, automatically collected; • Provenance traces: • Runtime (“retrospective”) provenance • Automatically collected data dependency graph captured from the computation • Process structure (“prospective provenance”) • obtained by manually annotating a script • External data and system dependencies, process and data versions, and system requirements
  • 20. ReCompkickoff NewcaslteMarch11,2016 Challenge 3: Reproducibility Ex.: workflow to Identify mutations in a patient’s genome Workflow specification WF manager Linux VM cluster on Azure Analyse Input genome variant s GATK/Picard/BWA Workflow Manager (and its own dependencies) Ubuntu on Azure Dep. Input genome config Ref genome Variants DBs What happens when any of the dependencies change?
  • 21. ReCompkickoff NewcaslteMarch11,2016 Challenge 4: reusability of the solution across cases • How do we make case-specific solutions generic? • How do we make the DSS reusable? • Refactor: Generic framework + case-specific components • This is hard: most elements are case-specific! • Metadata formats • Metadata capture • Change impact • Cost models • Utility functions • …

Editor's Notes

  1. The times they are a’changin
  2. S1. identifying re-computation candidates and understand the impact of changes and in Information Assets on a corpus of knowledge outcomes: which outcomes are affected by the changes, and to what extent? This step defines the target re-computation population; S2. Estimate effects, costs and benefits of re-computation, across the target population (S1); S3. Establish re-computation priorities within the target population, based on a budget for computational resources, a problem-specific definition of utility functions and prioritisation policy, and estimates as in (S2); S4. Selectively carry out priority re-computations, when the processes are reproducible; S5. Differential data analysis and change cause analysis: Assess the effects of the re- computation. This involves understanding how the new outcomes differ from the original (differential data analysis), and which of the changes in the process are responsible for the changes observed in ReComp the outcomes (change cause analysis). The latter analysis helps data scientists understand the actual effect of an improved process “post hoc”, and has also the potential to improve future effect estimates.
  3. Problem: this is “blind” and expensive. Can we do better?
  4. These items are partly collected automatically, and partly as manual annotations. They include: Logs of past executions, automatically collected, to be used for post hoc performance analysis and estimation of future resource requirements and thus costs (S1) ; Runtime provenance traces and prospective provenance. The former are automatically collected graphs of data dependencies, captured from the computation [11]. The latter are formal descriptions of the analytics process, obtained from the workflow specification, or more generally by manually annotating a script. Both are instrumental to understanding how the knowledge outcomes have changed and why (S5), as well as to estimate future re-computation effects. External data and system dependencies, process and data versions, and system requirements associated with the analytics process, which are used to understand whether it will be practically possible to re-compute the process.