2. Welcome & Bem-Vindo!
• Boa tarde e bem-vindo ao tutorial sobre proveniência! (Good afternoon and welcome to the tutorial on provenance!)
• Welcome to the Tutorial on: Provenance in
Databases and Scientific Workflows
– Proveniência em bancos (bases) de dados e fluxos de
trabalho (workflows) científicos
• Desculpas ... (Apologies ...)
– Back to English (my 2nd language)
• Feel free to interrupt and ask questions!
• (You can also ask questions in German or Spanish ...)
Provenance @ SBBD'16
4. Outline of the Tutorial: A “Tour de Provenance”
• Part I: Provenance in Scientific Workflows
– Alta Vista: Provenance everywhere!
– Provenance & Scientific Workflows
– Provenance Models and Standards (not so much)
– Provenance Tools
• Example & Demo: YesWorkflow
• Part II: Provenance in Databases
– Foundations of provenance in databases
– Why-, How-, and Why-Not provenance
6. Provenance - Proveniência
• Oxford English Dictionary
– The place of origin or earliest known history of something:
• an orange rug of Iranian provenance
– The beginning of something’s existence; its origin:
• they try to understand the whole universe, its provenance and fate
– A record of ownership of a work of art or an antique, used as a
guide to authenticity or quality:
• the manuscript has a distinguished provenance
• What is the origin (provenance!) of “provenance” ?
7. The Many Faces of Provenance
• What are those?
• Cosmology
• Geology, Stratigraphy
• Phylogeny
– the Tree of Life
• Genealogy
– your family: literally
• Academic Pedigree
– “Doktorvater” (Doktor-Mutter?)
• Etymology
• Chain of custody
– of art(ifacts)
• Yes: all about origins and history …
11. 2nd Stop: Liberal Arts & Sciences
• Can you “see provenance” in this image?
• Grand Canyon’s rock layers are a record of the early geologic history of North America.
The ancestral puebloan granaries at Nankoweap Creek tell archaeologists about more
recent human history. (By Drenaline, licensed under CC BY-SA 3.0)
13. Computational Provenance
• Origin and processing history of an artifact
– usually: data (products), figures, ...
– sometimes: workflow (and script) evolution …
• Different sub-communities:
– Provenance in (scientific) workflows (Tutorial Part I)
– Provenance in databases (Tutorial Part II)
– Wait, there is more:
• ... programming languages, systems/security, …
14. Why should you care about provenance?
• It’s an important problem:
– reproducibility crisis, transparency, data sharing, …
• There are (still) many deeply technical and practical challenges:
– Efficient capture, management, use of provenance
– Models, semantics, query languages
– Provenance .. for others? Or provenance for self!
– Interdisciplinary work; cross-fertilization: databases, workflows,
programming languages, security, …, various scientific communities
(bioinformatics, ...)
• You have a head start here!
– Marta Mattoso, Daniel de Oliveira, Vanessa Braganholo, Juliana Freire, ...
(e.g. SBBD proceedings ..)
• … oh, and it’s also a fun topic ...
17. Use Provenance for
Transparency, Reproducibility
• What input data went into
this study?
• What methods were used?
• … with what parameter
settings, calibrations, …?
• Can we trust the data and
methods?
• Provenance (lineage): track origin and processing history of data → trust, data quality ~ audit trail for attribution, credit
• Discovery of data, methodologies, experiments
20. Provenance today: Important but hard
→ many research projects and groups conduct R&D on provenance methods, tools, …
Example:
Climate Change Impacts
in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
“This report is the result of a three-year
analytical effort by a team of over 300
experts, overseen by a broadly
constituted Federal Advisory Committee
of 60 members. It was developed from
information and analyses gathered in
over 70 workshops and listening sessions
held across the country.”
25. Provenance in Action: Benefits & Impact
A DataONE search (here: “grass”) yields different packages with provenance
26. DataONE: Support for Provenance
• Yaxing’s script with inputs & output products
• Christopher’s YesWorkflow model
• Christopher using Yaxing’s outputs as inputs for his script
• Christopher’s results can be traced back all the way to Yaxing’s input
27. REWIND: From Provenance to Reproducible Science …
Capturing provenance is crucial for
transparency, interpretation, debugging, …
=> repeatable experiments,
=> reproducible science
=> need workflow-system agnostic model
29. … 3rd stop: Scientific Workflows: ASAP
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
→ traceable data- and wf-evolution
→ Reproducible Science
[Figure labels: Trident Workbench, VisTrails]
Once upon a time …
30. 10 Essential Functions of a Scientific Workflow System
1. Automate programs and services scientists already use.
2. Schedule invocations of programs and services correctly and efficiently – in
parallel where possible.
3. Manage dataflow to, from, and between programs and services.
4. Enable scientists (not just developers) to author or modify workflows easily.
5. Predict what a workflow will do when executed: prospective provenance.
6. Record what happened during workflow execution: retrospective provenance.
7. Reveal and query provenance – how workflow products were derived from inputs
via programs and services.
8. Organize intermediate and final data products as desired by users.
9. Enable scientists to version, share and publish their workflows.
10. Empower scientists who wish to automate additional programs and services
themselves.
These functions (not just dataflow & actors) distinguish scientific workflow automation
from general (scientific) software development.
Src: Tim McPhillips
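Functions 5 to 7 (predict, record, query) can be illustrated with a minimal sketch. It is not taken from any workflow system: the names TRACE, PLAN, record, and derived_by are hypothetical, and the log lives in memory only.

```python
import functools

# Hypothetical in-memory provenance log; a real system would persist this.
TRACE = []

def record(step):
    """Wrap a step so that every invocation is logged with inputs and output."""
    @functools.wraps(step)
    def wrapper(*args):
        result = step(*args)
        TRACE.append({"step": step.__name__, "inputs": args, "output": result})
        return result
    return wrapper

@record
def normalize(values):
    total = sum(values)
    return tuple(v / total for v in values)

@record
def largest(values):
    return max(values)

# Prospective provenance: the plan, known before anything is executed.
PLAN = ["normalize", "largest"]

def run(values):
    """Execute the two-step workflow; TRACE then holds retrospective provenance."""
    return largest(normalize(values))

def derived_by(output):
    """Provenance query: which recorded invocations produced this value?"""
    return [event["step"] for event in TRACE if event["output"] == output]
```

After one call to run, the sequence of step names in TRACE can be compared against PLAN, and derived_by answers a simple "where did this value come from" question.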
34. Motif-Catcher workflow, implemented in Kepler
S Köhler et al. Improved Motif Detection in Large Sequence Sets with
Random Sampling in a Kepler workflow, ICCS-WS, 2012
35. Kepler Workflows & Decision Making
(Kruger Natl. Park, South Africa)
SANParks Matt Jones, NCEAS @ UC Santa Barbara
39. So what is “provenance” (sensu W3C) ?
• Provenance refers to the sources of information, including entities
and processes, involved in producing or delivering an artifact (*)
• Provenance is a description of how things came to be, and how
they came to be in the state they are in today (*)
• Provenance is a record that describes the people, institutions, entities,
and activities, involved in producing, influencing, or delivering a piece
of data or a thing in the world
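As a rough illustration of the third definition, a provenance record can be written down as entities, activities, and agents plus relations between them. The relation names (used, wasGeneratedBy, wasAssociatedWith) are W3C PROV terms; the plain-dictionary layout and all identifiers (raw.csv, make_plot, alice) are assumptions made for this sketch, not a PROV serialization.

```python
# Plain-dict sketch of a PROV-style record; identifiers are hypothetical.
RECORD = {
    "entity": ["raw.csv", "figure.png"],
    "activity": ["make_plot"],
    "agent": ["alice"],
    "used": [("make_plot", "raw.csv")],               # (activity, entity)
    "wasGeneratedBy": [("figure.png", "make_plot")],  # (entity, activity)
    "wasAssociatedWith": [("make_plot", "alice")],    # (activity, agent)
}

def generating_activities(entity, rec):
    """Activities that generated the given entity."""
    return {a for e, a in rec["wasGeneratedBy"] if e == entity}

def sources_of(entity, rec):
    """Entities used by the activities that generated `entity`."""
    acts = generating_activities(entity, rec)
    return sorted(e for a, e in rec["used"] if a in acts)

def responsible_for(entity, rec):
    """Agents associated with the activities that generated `entity`."""
    acts = generating_activities(entity, rec)
    return sorted(g for a, g in rec["wasAssociatedWith"] if a in acts)
```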
49. ProvONE: PROV for scientific workflows
(Transfer station to any of several other “standard extensions”)
“Trace-Land” (retrospective provenance)
“Data-Land”
Yang Cao (1), Christopher Jones (2), Víctor Cuevas-Vicenttín (3), Matthew B. Jones (2), Bertram Ludäscher (1), Timothy McPhillips (1), Paolo Missier (4), Christopher Schwalm (5), Peter Slaughter (2), Dave Vieglais (6), Lauren Walker (2), Yaxing Wei (7)
(1) University of Illinois, Urbana-Champaign; (2) National Center for Ecological Analysis and Synthesis, UCSB; (3) Universidad Popular Autónoma del Estado de Puebla, Mexico; (4) School of Computing Science, Newcastle University, UK; (5) Woods Hole Research Center, Falmouth, MA; (6) University of Kansas, Lawrence; (7) Environmental Sciences Division, Oak Ridge National Lab, TN
Also: A. Marinho, L. Murta, C. Werner, V. Braganholo, S. Serra da Cruz, E. Ogasawara, M. Mattoso. “ProvManager: A Provenance Management System for Scientific Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530
…
“Workflow-Land” (prospective prov.)
50. Provenance Sleuth or Engineer?
• Scientists are Provenance (i.e., Natural History) Sleuths
• {Computational, Computer, Information}-Scientists should
(also) be Provenance Engineers
– Ensure your “Data Tree of Life” (data provenance) is correct!
– What is the origin and processing history of your data?
• With great provenance come great questions!
– “We store everything!”
– Huh? Yes, provenance is the answer… (yawn..)
– But what is the question??
• Engineer’s Stance:
– What questions do you want to answer?
– Let’s find out what observables we need to capture, what query
language we should use, how we do that efficiently (later), …
51. Drilling down into “Trace-Land”:
From MoC to MoP via Observables
• Model of Computation MoC
– specification/algorithm to compute Outputs = MoC(Wf,Params,Inputs)
– a director or scheduler implements MoC
– gives rise to formal notions of
• computation (aka run) R
– Formalisms to define M?
• Model of Provenance MoP
– associate with a MoC a “default” MoP (= MoC ± Δ)
– the MoP is a “trimmed” MoC
• T = R – I + M
– Trace = Run – Ignored-observables + Modeled-observables
• Observables (of a MoC / MoP)
– functional observables (may influence output o)
• token rate, notions of firing, …
– non-functional observables (not part of M, do not influence o)
• token timestamp, size, … (unless the MoC cares about those)
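The equation T = R – I + M can be read operationally, as in the sketch below. Everything concrete here is an assumption chosen for illustration: the run is a list of firing events, the ignored observables are timestamp and size, and the single modeled observable is a token dependency derived from what each firing reads and writes.

```python
# R: a run, as a list of observed firing events (hypothetical data).
RUN = [
    {"actor": "A", "firing": 1, "reads": ("t1",), "writes": ("t2",),
     "timestamp": 10.5, "size": 128},
    {"actor": "B", "firing": 1, "reads": ("t2",), "writes": ("t3",),
     "timestamp": 11.0, "size": 64},
]

# I: non-functional observables this MoP chooses to drop.
IGNORED = {"timestamp", "size"}

def modeled(event):
    """M: observables added by the model, here (written, read) token pairs."""
    pairs = tuple((w, r) for w in event["writes"] for r in event["reads"])
    return {"depends": pairs}

def trace(run, ignored, model):
    """T = R - I + M, computed event by event."""
    result = []
    for event in run:
        kept = {k: v for k, v in event.items() if k not in ignored}  # R - I
        kept.update(model(event))                                    # + M
        result.append(kept)
    return result
```

If a MoC did care about timestamps (as the last bullet allows), the only change would be to remove "timestamp" from IGNORED.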
52. From MoCs to Models of Provenance (MoPs)
M. Anand, S. Bowers, et al., SSDBM’09
53. Fine-grained, Data & MoC-aware MoP
M. Anand, S. Bowers,
et al., SSDBM’09
54. Types of Data Provenance
• Black-box
– know (next to) nothing at compile-time
– at runtime: keep some data lineage
– most provenance work in the workflow (WF) sense uses this
• White-box
– statically (compile-time) analyzable
– q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2)
– most provenance work in the database (DB) sense uses this
• Grey-box
– can “look inside” (some black boxes)
– … e.g. b/c they have subworkflows
– … or FP signatures: A :: t1, t2 → t3, t4
– … or semantic annotations (sem.types)
[Figure labels: f; A with t1, t2, t3, t4; q with X1, X2, Y1, Y2]
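Because the white-box rule q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2) is fully visible, each answer can be paired with the input facts that derive it. The sketch below evaluates the rule with nested loops and collects one witness (a set of input facts) per successful join; the relation contents are made-up sample data, and the function name is hypothetical.

```python
from collections import defaultdict

# Made-up sample relations for p, r, and s.
P = {(1, 2), (3, 2)}
R = {(1, "a"), (3, "a")}
S = {(2, "b")}

def q_with_witnesses(p, r, s):
    """Evaluate q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2).

    Returns {answer: set of witnesses}, where each witness is the frozenset
    of input facts used in one derivation of that answer.
    """
    witnesses = defaultdict(set)
    for (x1, x2) in p:
        for (rx, y1) in r:
            if rx != x1:
                continue
            for (sx, y2) in s:
                if sx != x2:
                    continue
                facts = frozenset({("p", x1, x2), ("r", x1, y1), ("s", x2, y2)})
                witnesses[(y1, y2)].add(facts)
    return dict(witnesses)
```

With this sample data there is a single answer, ("a", "b"), and it has two independent derivations, one through p(1,2) and one through p(3,2).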
55. ✔ Provenance capture (Matlab, R, Python, … scientific workflow systems)
✔ Uploading, sharing, linking provenance through various provenance tools
✗ Tools for scientists to exploit (≠ capture, share, link) provenance for their own
day-to-day work.
→ Prime the provenance pump and increase provenance generation
→ Scientists accelerate their work via new, active uses of provenance.
But … how to prime the provenance pump??
Must support “Provenance for Self” !
Provenance for Self?!
Provenance for Others
58. SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of the Anasazi
– Four Corners; AD 600–1500. Climate change influenced the Mesa Verde migrations in the late 13th century AD. The study uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. The algorithm estimates the joint information in tree rings and a climate signal to identify the “best” tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed
maize agricultural niche in the US Southwest. Nature
Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
61. YesWorkflow.org
• YesWorkflow (YW)
– Started as a grass-roots effort (Kurator, SKOPE, ..)
– … meeting the scientists/users where they R!
• R, Matlab, (i)Python, Jupyter, …
– Scripts + simple user annotations
• => Reveal the workflow model/abstraction
… that underlies the (script) implementation
• => YW can give us more of ASAP!
– First YW: ASAP (Abstraction)...
– Then YW-recon: ASAP (reconstructing runtime Provenance)
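The idea of "scripts + simple user annotations" can be sketched with a toy extractor. It assumes comment tags of the form @BEGIN, @END, @IN, and @OUT (the tag names YW uses), but it is not the YesWorkflow tool: the sample script, the function extract_model, and the flat (non-nested) block handling are all simplifications made for this example.

```python
import re

# A hypothetical annotated script, held as a string for the example.
SCRIPT = '''\
# @BEGIN clean_data
# @IN raw_file
# @OUT clean_file
def clean(): pass
# @END clean_data
# @BEGIN plot
# @IN clean_file
# @OUT figure
# @END plot
'''

TAG = re.compile(r"#\s*@(BEGIN|END|IN|OUT)\s+(\w+)", re.IGNORECASE)

def extract_model(script):
    """Return {block: {"in": [...], "out": [...]}} from comment tags.

    Simplification: blocks are assumed not to be nested.
    """
    model, current = {}, None
    for line in script.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.group(1).upper(), match.group(2)
        if tag == "BEGIN":
            current = name
            model[current] = {"in": [], "out": []}
        elif tag == "END":
            current = None
        elif current is not None:
            model[current][tag.lower()].append(name)
    return model
```

The resulting dictionary is a small workflow model: clean_file is an output of clean_data and an input of plot, which is exactly the kind of dataflow edge a workflow view would draw.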
62. YW (prospective) and
YW-Recon (retrospective) Provenance
• 1. YW: Annotate Script => YW Model
– Annotate @BEGIN..@END, @IN, @OUT
– Visualize, share, be happy :-)
• 2. Run script
– Files are read and written
– Folder- & Filenames have metadata
• 3. YW-Recon
– Use @URI tags that link YW Model ↔ Persisted Data
– Run URI-template queries
• cf. “ls -R” & RegEx matching
• 4. YW-Query
– Answer the user’s provenance queries
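Step 3 (matching file names against URI templates, cf. "ls -R" and RegEx matching) can be sketched as follows. The {name} placeholder syntax, the example template, and the file paths are assumptions for illustration; the list PATHS stands in for the output of a recursive directory listing.

```python
import re

def template_to_regex(template):
    """Turn 'run/{sample}/img_{frame}.raw' into a regex with named groups."""
    # re.split with a capturing group alternates literal text and variable names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2:                       # odd positions are variable names
            pattern += f"(?P<{part}>[^/]+)"
        else:                               # even positions are literal text
            pattern += re.escape(part)
    return re.compile(pattern + r"\Z")

def reconstruct(template, paths):
    """Recover metadata (variable bindings) from paths that match the template."""
    regex = template_to_regex(template)
    bindings = []
    for path in paths:
        match = regex.match(path)
        if match:
            bindings.append(match.groupdict())
    return bindings

# Hypothetical stand-in for the result of listing the run directory.
PATHS = [
    "run/DRT240/img_001.raw",
    "run/DRT240/img_002.raw",
    "run/notes.txt",
]
```

Calling reconstruct("run/{sample}/img_{frame}.raw", PATHS) yields one binding per matching file, which is the runtime metadata that the folder and file names carry.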
68. Figure 4: Process workflow view of an Affymetrix analysis script (in R).
4 YesWorkflow Examples
In the following we show YesWorkflow views extracted from real-world scientific use cases.
The scripts were annotated with YW tags by scientists and script authors, using a very
modest training and mark-up effort.
Due to lack of space, the actual MATLAB and R
scripts with their YW markup are not included here. However, they are all available
from the yw-idcc-15 repository on the YW GitHub site [Yes15].
Gene Expression Microarray Data Analysis
• [Normalize]
– Normalization of data across microarray datasets
• [SelectDEGs]
– Selection of differentially expressed genes between conditions
• [GO Analysis]
– determination of gene ontology statistics for the resulting datasets
• [MakeHeatmap]
– creation of a heatmap of the differentially expressed genes.
Tyler Kolisnik, Mark Bieda
85. What is the lineage of corrected_image?
[Figure: subgraph resulting from a lineage query on the YW workflow model. Node labels: simulate_data_collection, load_screening_results, calculate_strategy, collect_data_set, transform_images, cassette_id, sample_spreadsheet, sample_score_cutoff, data_redundancy, calibration_image, sample_name, sample_quality, accepted_sample, num_images, energies, sample_id, energy, frame_number, raw_image, corrected_image]
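A lineage query of this kind amounts to a backward traversal of the workflow graph. In the sketch below, the step and data names come from the slide, but the wiring between them (which step reads and writes what) is an assumption inferred from the names, since the figure's edges are not reproduced in this text.

```python
# step -> (inputs, outputs); the wiring is assumed, not read off the figure.
WORKFLOW = {
    "load_screening_results": (
        ["cassette_id", "sample_spreadsheet"],
        ["sample_name", "sample_quality"],
    ),
    "calculate_strategy": (
        ["sample_name", "sample_quality", "sample_score_cutoff", "data_redundancy"],
        ["accepted_sample", "num_images", "energies"],
    ),
    "collect_data_set": (
        ["cassette_id", "accepted_sample", "num_images", "energies"],
        ["sample_id", "energy", "frame_number", "raw_image"],
    ),
    "transform_images": (
        ["sample_id", "energy", "frame_number", "raw_image", "calibration_image"],
        ["corrected_image"],
    ),
}

def lineage(target, workflow):
    """Return every step and data item upstream of `target`."""
    producers = {out: step
                 for step, (_, outputs) in workflow.items()
                 for out in outputs}
    seen, todo = set(), [target]
    while todo:
        item = todo.pop()
        step = producers.get(item)
        if step is None or step in seen:
            continue                      # workflow input, or step already visited
        seen.add(step)
        for data in workflow[step][0]:
            if data not in seen:
                seen.add(data)
                todo.append(data)
    return seen
```

Under this assumed wiring, lineage("corrected_image", WORKFLOW) reaches all four steps, while lineage("sample_name", WORKFLOW) stops at load_screening_results and its two inputs.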
87. We’re off to see the Wizard of Prov ...
We're off to see the Wizard,
The wonderful Wizard of Prov!
--
We hear he is a wiz of a wiz
If ever a wiz there was.
--
If ever, oh ever, a wiz there was,
The Wizard of Prov is one because,
Because, because, because, because, because,
Because of the wonderful things he does.
• Enrich the YW conceptual view
with noWorkflow (NW) Python provenance!
• Get the best of both worlds!
• How hard can it be to bridge
YW and NW …
(cf. TaPP’15 prototype)
91. Secret Reproducible Sauce
• Combining provenance information from
noWorkflow and YesWorkflow
• Using all the good stuff:
– make, docker, Prolog, SQL, Graphviz
• Open source
– github.com/yesworkflow-org/yw-noworkflow
– github.com/gems-uff/yin-yang-demo
• Have a closer look at the demo!