SlideShare una empresa de Scribd logo
1 de 93
Descargar para leer sin conexión
Provenance	in	Databases	
and	Scientific	Workflows
Bertram	Ludäscher
ludaesch@illinois.edu
31st	Brazilian	Symposium	on	Databases
October	4-7,	2016,	Salvador	Bahia
Director,	Center	for	Informatics	Research	in	Science	&	Scholarship	(CIRSS)	
School	of	Information	Sciences	(iSchool@Illinois)
&	National	Center	for	Supercomputing	Applications	(NCSA)
&	Department	of	Computer	Science	(CS@Illinois)
Welcome	&	Bem-Vindo!
• Boa	tarde	e	bem-vindo	ao	tutorial	sobre	
proveniência!	
• Welcome to	the Tutorial	on:	Provenance in	
Databases and Scientific Workflows
– Proveniência	em	bancos	(bases)	de	dados	e	fluxos	de	
trabalho	(workflows)	científicos
• Desculpas ...
– Back to	English (my 2nd	language)
• Feel free	to	interrupt and ask questions!
• (You can	also ask questions in	German or Spanish ...)
Provenance	@	SBBD'16
Introductions	should	come	first	…	
• MS	Computer	Science,	U	Karlsruhe	(K.I.T.)
• PhD	Computer	Science,	U	Freiburg,	Germany
• Research	Scientist,	UC	San	Diego,	SDSC
• Dept.	of	Computer	Science,	UC	Davis
• University	of	Illinois	at	Urbana-Champaign
– School	of	Information	Sciences	
– Natl.	Center	for	Supercomputing	Applications	
– Dept.	of	Computer	Science	
Provenance	@	SBBD'16
• Part	I:	Provenance	in	Scientific	Workflows
– Alta	Vista:	Provenance	everywhere!	
– Provenance	&	Scientific	Workflows	
– Provenance	Models	and	Standards	(not	so	much)
– Provenance	Tools
• Example	&	Demo:	YesWorkflow
• Part	II:	Provenance	in	Databases
– Foundations	of	provenance	in	databases
– Why-,	How-,	and	Why-Not provenance
Outline	of	the	Tutorial:	
A “Tour	de	Provenance”
Provenance	@	SBBD'16
1st Tour	Stop:	The	Fine	Arts
• One	of	these	is	has	been	sold	for	nearly	$180	million.
• The	other	could be	worth	as	much	or	more.
• Which	is	which?
• What	is	the	difference?	
Provenance	@	SBBD'16
Provenance	- Proveniência	
• Oxford	English	Dictionary	
– The	place	of	origin	or	earliest	known history of	something:
• an	orange	rug	of	Iranian	provenance
– The	beginning of	something’s	existence;	its	origin:
• they	try	to	understand	the	whole	universe,	its	provenance and	fate
– A	record	of	ownership	of	a	work	of	art	or	an	antique,	used	as	a	
guide	to	authenticity or	quality:
• the	manuscript	has	a	distinguished	provenance
• What	is	the	origin	(provenance!)	of	“provenance”	?
Provenance	@	SBBD'16
The	Many	Faces	of	Provenance	
• What	are	those?
• Cosmology
• Geology,	Stratigraphy
• Phylogeny
– the	Tree	of	Life
• Genealogy
– your	family:	literally
• Academic	Pedigree
– “Doktorvater”	(Doktor-Mutter?)
• Etymology
• Chain	of	custody
– of	art(ifacts)
• Yes:	all	about	origins	and history	…	
Provenance	@	SBBD'16
Pedigree
Provenance	@	SBBD'16
Provenance	or	Provenience?	
Provenance	@	SBBD'16
Provenience	vs Provenance
• In	archaeology:	special	sense	of	location	(3D),	layer	..
Provenance	@	SBBD'16
2nd Stop:	Liberal	Arts	&	Sciences
• Can	you	“see	provenance”	in	this	image?
• Grand	Canyon’s	rock	layers	are	a	record	of	the	early	geologic	history	of	North	America.	
The	ancestral	puebloan granaries	at	Nankoweap Creek	tell	archaeologists	about	more	
recent	human	history.	(By	Drenaline,	licensed	under	CC	BY-SA	3.0)
Provenance	@	SBBD'16
Science	&	Natural	History:	
Understanding what	happened…
Zrzavý,	Jan,	David	Storch,	and Stanislav	Mihulka.	
Evolution:	Ein	Lese-Lehrbuch.	Springer-Verlag,	2009.
Author:	Jkwchui (Based	on	
drawing	by	Truth-seeker2004)
Provenance	@	SBBD'16
Computational	Provenance
• Origin and	processing	history	of	an	artifact
– usually:	data	(products),	figures,	...
– sometimes:	workflow (and	script)	evolution	…
• Different	sub-communities:
– Provenance	in	(scientific)	workflows	(Tutorial	Part	I)
– Provenance	in	databases	(Tutorial	Part	II)	
– Wait,	there	is	more:
• ...	programming	languages,	systems/security,	…	
Provenance	@	SBBD'16
Why	should	you	care	about	provenance?
• It’s	an	important problem:	
– reproducibility	crisis,	transparency,	data	sharing,	…
• There	are	(still)	many	deeply	technical and	practical challenges:
– Efficient	capture,	management,	use	of	provenance
– Models,	semantics,	query	languages
– Provenance	..	for	others?	Or	provenance	for	self!
– Interdisciplinary	work;	cross-fertilization:	databases,	workflows,	
programming	languages,	security,	…,	various	scientific	communities	
(bioinformatics,	...)
• You	have	a	head	start	here!
– Marta	Mattoso,	Daniel	de	Oliveira,		Vanessa	Braganholo,		Juliana	Freire,	...	
(e.g.	SBBD	proceedings	..)
• …	oh,	and	it’s	also	a	fun	topic	...	
Provenance	@	SBBD'16
Provenance	Research	everywhere	…	
Provenance	@	SBBD'16
OneProblem:	Reproducibility	Crisis	
• Different	sciences	have	
different	reproducibility	
crises	…	
• Focus	here:	
Computational	
Reproducibility
– R,	Matlab,	Python,	..	scripts
– Scientific	workflows,	...
• How	to	facilitate	reproducibility	
for	computational	and	data	scientists?
Provenance	@	SBBD'16
Use	Provenance	for	
Transparency,	Reproducibility
• What	input data went	into	
this	study?
• What	methods were	used?
• …	with	what	 parameter
settings,	 calibrations,	…?	
• Can	we	trust the	data	and	
methods?
§ Provenance (lineage):	track	origin and	processing	history	of	data	è
trust,	data	quality	~	audit	trail for	attribution,	credit	
§ Discovery of	data,	methodologies,	experiments
Provenance	@	SBBD'16
Climate	Change:	Whodunnit?
Provenance	@	SBBD'16
Tracing	the	sources	(data,	code)	
Provenance	@	SBBD'16
Provenance today: Important but hard
èmany	research	projects,	
groups	conduct	R&D	on	
provenance	methods,	
tools,	…	
Example:
Climate Change Impacts
in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
“This	report	is	the	result	of	a	three-year	
analytical	effort	by	a	team	of	over	300	
experts,	overseen	by	a	broadly	
constituted	Federal	Advisory	Committee	
of	60	members.	It	was	developed	from	
information	and	analyses	gathered	in	
over	70	workshops	and	listening	sessions	
held	across	the	country.”	
Provenance	@	SBBD'16
A	scientific	data	federation:	DataONE
Data Observation	Network	for	Earth
search.dataone.orgProvenance	@	SBBD'16
DataONE Cyberinfrastucture:
Coordinating	Nodes
www.dataone.org/coordinating-nodes
Coordinating	Nodes
• retain	complete	metadata	
catalog	
• indexing	for	search
• network-wide	services
• ensure	content	availability	
(preservation)		
• replication	services
Components	for	a	flexible,	scalable,	sustainable	
network
Provenance	@	SBBD'16
Components	for	a	flexible,	scalable,	sustainable	
network
DataONE Cyberinfrastructure:
Member	Nodes
www.dataone.org/member-nodes
Coordinating	Nodes
• retain	complete	metadata	
catalog	
• indexing	for	search
• network-wide	services
• ensure	content	availability	
(preservation)		
• replication	services
Member	Nodes
• diverse	institutions
• serve	local	community
• provide	resources	for	
managing	their	data
• retain	copies	of	data
Provenance	@	SBBD'16
DataONE Cyberinfrastructure:
Investigator	Toolkit
www.dataone.org/investigator-toolkit
Coordinating	Nodes
• retain	complete	metadata	
catalog	
• indexing	for	search
• network-wide	services
• ensure	content	availability	
(preservation)		
• replication	services
Member	Nodes
• diverse	institutions
• serve	local	community
• provide	resources	for	
managing	their	data
• retain	copies	of	data
Investigator	Toolkit
Components	for	a	flexible,	scalable,	sustainable	
network
Provenance	@	SBBD'16
Provenance in Action: Benefits & Impact
A	DataONE search	(here:	“grass”)	yields	different	packages	with	provenance		
Provenance	@	SBBD'16
DataONE: Support for Provenance
Yaxing’s script with	
inputs &	output	
products
Christopher’s	
YesWorkflow
model
Christopher	using
Yaxing’s outputs	as	
inputs	for	his	script
Christopher’s	results	
can	be	traced	back	all	
the	way	to	Yaxing’s
input
Provenance	@	SBBD'16
REWIND: From Provenance to Reproducible Science …
Capturing provenance is crucial for
transparency, interpretation, debugging, …
=> repeatable experiments,
=> reproducible science
=> need workflow-system agnostic model
Provenance	@	SBBD'16
... via scientific workflows (… and scripts)
Provenance	@	SBBD'16
… 3rd stop: Scientific Workflows: ASAP
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
è traceable data- and wf-evolution
è Reproducible Science
Trident
Workbench
VisTrails
Es war	einmal …		
Provenance	@	SBBD'16
10	Essential	Functions	
of	a	Scientific	Workflow	System1. Automate programs	and	services	scientists	already	use.	
2. Schedule invocations	of	programs	and	services	correctly	and	efficiently	– in	
parallel	where	possible.
3. Manage	dataflow	to,	from,	and	between	programs	and	services.
4. Enable scientists (not	just	developers)	to	author or	modify workflows	easily.
5. Predict what	a	workflow	will	do	when	executed:	prospective provenance.
6. Record what	happened	during	workflow	execution:	retrospective provenance.
7. Reveal	and	query	provenance	– how	workflow	products	were	derived	from	inputs	
via	programs	and	services.
8. Organize intermediate	and	final	data	products	as	desired	by	users.
9. Enable	scientists	to	version,	share and	publish their	workflows.	
10. Empower	scientists	who	wish	to	automate	additional	programs	and	services	
themselves.
These	functions	(not	just	dataflow	&	actors)	distinguish	scientific	workflow	automation
from	general	(scientific)	software	development.
Src:	Tim	McPhillipsProvenance	@	SBBD'16
Find	OTUs
(OTUHunter)
Assign	Taxonomy			
(STAP)
Profile	alignment
(STAP	or	Infernal)
Build	phylogenetic	
tree	(RaxML	or	
Quicktree)
View	tree:	
Dendroscope
UniFrac:		tree	&
environment	file
Assembled
contigs
Chimera	check
(Mallard)
Diversity	statistics:
Text:	OUT	list,	Chao1,	Shannon
Graphs:	rarefaction	curves,	rank-
abundance	curves
Visualization	tools:	
Cytoscape	networks	&	
Heat	map
WATERS:
Workflow	for	Alignment,	Taxonomy,	
Ecology	of	Ribosomal	Sequences	
(Amber	Hartman;	Eisen	Lab;	UC	Davis)
+/- cipres
+/- cluster
+/- cluster
+/- cluster
Provenance	@	SBBD'16
Executable WATERS Workflow in Kepler
Provenance	@	SBBD'16
Example
Bioinformatics
Workflow:
Motif-Catcher
Marc	Facciotti et	al.
UC	Davis	Genome	Center
Provenance	@	SBBD'16
Motif-Catcher workflow, implemented in Kepler
S	Köhler et	al.	Improved	Motif	Detection	in	Large	Sequence	Sets	with	
Random	Sampling	in	a	Kepler	workflow,	ICCS-WS,	2012
Provenance	@	SBBD'16
Kepler Workflows & Decision Making
(Kruger Natl. Park, South Africa)
SANParks Matt	Jones,	NCEAS	@	UC	Santa	Barbara
Provenance	@	SBBD'16
A Data-Streaming Workflow over Sensor Data
Provenance	@	SBBD'16
• Monitor	and	control	supercomputer	
simulations			
– 50+	composite	actors	(subworkflows)
– 4	levels	of	hierarchy	
– 1000+	atomic	(Java)	actors
43	actors,	3	levels
196	actors,	4	levels
30	actors
206	actors,	4	levels
137	actors
33	actors
150
123	actors
66	actors
12	actors
243	actors,	4	levels
Norbert	Podhorszki
ORNL	(then:	UC	Davis)				
“Plumbing”	workflow	
Provenance	@	SBBD'16
Data Curation Workflows
(Filtered-Push … Kepler … Kurator projects)
Provenance	@	SBBD'16
So	what	is	“provenance”	(sensu W3C)	?
• Provenance refers	to	the	sources of	information,	including	entities
and	processes,	involved	in	producing	or	delivering	an	artifact	(*)
• Provenance	is	a	description	of	how	things	came	to	be,	and	how	
they	came	to	be	in	the	state	they	are	in	today		(*)
• Provenance	is	a	record that	describes	the	people,	institutions,	entities,	
and	activities,	involved	in	producing,	influencing,	or	delivering	a	piece	
of	data	or	a	thing	in	the	world
Provenance	@	SBBD'16
W3C	PROV	Family	of	Specifications:	Modeling
Provenance	Analysis	and	RDF	Query	Processing,	
Satya	S.	Sahoo,	Praveen	Rao,	ISWC,	October,	2015.
W3C PROV Family of Specifications: Provenance Modeling
•  W3C Recommendations
o  PROV Data Model (PROV-
DM)
o  PROV Ontology (PROV-O)
o  PROV-Constraints
o  PROV Notations (PROV-
N)
•  PROV Working Group Notes
(selected)
o  PROV-Access and Querying (AQ)
o  PROV Dictionary
o  PROV XML
o  PROV and Dublin Core Mappings
(PROV-DC)
o  PROV Semantics (using first-order logic)
(PROV-SEM)
Provenance	@	SBBD'16
Provenance	&	Semantic	Web	Layer	Cake
Provenance	Analysis	and	RDF	Query	Processing,	
Satya	S.	Sahoo,	Praveen	Rao,	ISWC,	October,	2015.
Provenance and Semantic Web Layer Cake
•  Proof layer aka
Provenance
•  Trust is derived from
provenance
information
Provenance	@	SBBD'16
Back	to	the	basics:	Open	Provenance	
Model	(OPM)	=>	W3C	Prov
Provenance	@	SBBD'16
W3C	Prov:	some	finer	points
Provenance	@	SBBD'16
Runtime	Provenance	
(a.k.a.	traces,	logs,		
retrospective	
provenance,
“Trace-land”)
4th Stop:	Different	Kinds	of	Data	Provenance	in	
Workflows Workflow	Modeling	&	Design
(a.k.a.	prospective provenance
“Workflow-land”)
Provenance	@	SBBD'16
Another	PROV	Extension:	OPMW
Provenance	@	SBBD'16
Another	PROV	Extension:	OPMW
• Approximately:	
– Workflow-Land	vs	Trace-Land
Provenance	@	SBBD'16
Provenance	Standards	vs	Tools	
• Do	we	need	more	standards	to	sort	this	out?
– …	or	do	we	already	have	too	many	“standards”?
• How	should	we	think	about	provenance?
– ...		in	workflows	and	databases?
• What	can	we	do with	provenance?
– ...	in	workflows	and	databases?
• Tools	to	create,	share,	use	provenance	
– …	not	just	for	“provenance	for	others”
– ...	need	more	“provenance	for	self”
Provenance	@	SBBD'16
A	simple	(simplistic?)	World	View:
Workflow-Land	ó Trace-Land
• Not	a	“standard”	– but	helps	(me!)	think	about	
workflows (prospective provenance)	and	traces
(retrospective provenance).	Provenance	@	SBBD'16
ProvONE:	PROV	for	scientific	workflows
(Transfer	station	to	any	of	several	other	“standard	extensions”)
“Trace-Land” (retrospective provenance)
“Data-Land”
Yang	Cao1
,	Christopher	Jones2
,	Víctor Cuevas-
Vicenttín3
,	Matthew	B.	Jones2
,	Bertram	
Ludäscher1
,	Timothy	McPhillips1
,	Paolo	
Missier4
,	Christopher	Schwalm5
,	Peter	
Slaughter2
,	Dave	Vieglais6
,	Lauren	Walker2
,	
Yaxing Wei7
1
University	of	Illinois,	Urbana-Champaign,	2
National	Center	for	
Ecological	Analysis	and	Synthesis,	UCSB,	3
Universidad	Popular	
Autónoma del	Estado	de	Puebla,	Mexico,	4
School	of	Computing	
Science,	
Newcastle	University,	UK,	5
Woods	Hole	Research	Center,	Falmouth,	
MA,	6
University	of	Kansas,	Lawrence,	7
Environmental	Sciences	
Division,	Oak	Ridge	National	Lab,	TN
Also: A.	Marinho,	L.	Murta,	C.Werner,	V.Braganholo,	S.	Serra	da	Cruz,	
E.Ogasawara,	M.	Mattoso.	“ProvManager:	A	Provenance	Management	
System	for	Scientific	Workflows.”	Concurrency	and	Computation:	
Practice	and	Experience	24,	no.	13	(2012):	1513–1530
…	
“Workflow-Land” (prospective	prov.)
Provenance	@	SBBD'16
Provenance	Sleuth	or	Engineer?
• Scientists are	Provenance	(i.e.,	Natural	History)	Sleuths
• {Computational,	Computer,	Information}-Scientists	should	
(also)	be	Provenance	Engineers
– Ensure	your	“Data	Tree	of	Life”	(data	provenance)	is	correct!
– What	is	the	origin	and	processing	history	of	your	data?
• With	great	provenance	come	great	questions!
– “We	store	everything!”		
– Huh?	Yes,	provenance	is	the	answer…	(yawn..)
– But	what	is	the	question??
• Engineer’s	Stance:
– What	questions	do	you	want	to	answer?
– Let’s	find	out	what	observables we	need	to	capture,	what	query	
language	we	should	use,	how	we	do	that	efficiently	(later),	…	
Provenance	@	SBBD'16
Drilling down into “Trace-Land”:
From MoC to MoP via Observables
• Model	of	Computation MoC
– specification/algorithm	to	compute	Outputs	=	MoC(Wf,Params,Inputs)
– a	director												or	scheduler	implements	MoC
– gives	rise	to	formal	notions	of
• computation (aka	run)	R	
– Formalisms	to	define	M?	
• Model	of	Provenance MoP
– associate	with	a	MoC a	“default”	MoP (=	MoC ± Δ)	
– the	MoP is	a	“trimmed”	MoC
• T	=	R	– I	+	M	
– Trace	=	Run	– Ignored-observables	+	Modeled-observables
• Observables (of	a	MoC /	MoP)
– functional	observables (may	influence	output	o)
• token	rate,	notions	of	firing,	…	
– non-functional	observables (not	part	of	M,	do	not	influence	o)
• token	timestamp,	size,	…	(unless	the	MoC cares	about	those)
Provenance	@	SBBD'16
M. Anand, S. Bowers,
et al., SSDBM’09
From MoCs to Models of Provenance (MoPs)
Provenance	@	SBBD'16
Fine-grained, Data & MoC-aware MoP
M. Anand, S. Bowers,
et al., SSDBM’09
Provenance	@	SBBD'16
Types of Data Provenance
• Black-box
– know (next to) nothing at compile-time
– at runtime: keep some data lineage
– most prov sensu WF work use this
• White-box
– statically (compile-time) analyzable
– q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2)
– Most prov sensu DB work use this
• Grey-box
– can “look inside” (some black boxes)
– … e.g. b/c they have subworkflows
– … or FP signatures: A :: t1, t2à t3,t4
– … or semantic annotations (sem.types)
f
A
q
t1
t2
t3
t4
X1
X2
Y1
Y2
Provenance	@	SBBD'16
✔ Provenance	capture (Matlab,	R,	Python,	…	scientific	workflow	systems)	
✔ Uploading,	sharing,	linking provenance	through	various	provenance	tools
✗ Tools	for	scientists	to	exploit (≠ capture,	share,	link)	provenance for	their	own	
day-to-day	work.
è Prime	the	provenance	pump and	increase	provenance	generation
è Scientists	accelerate	their	work	via	new,	active uses	of	provenance.
But	…	how	to	prime	the	provenance	pump??	
Must support “Provenance	for	Self”	!
Provenance	
for	Self?!
Provenance	
for	Others
Provenance	@	SBBD'16
From	Workflows	&	Provenance	to	
Provenance	for	Script-based	Workflows	…	
• What	workflow	tools	are	(most)	scientists	using?
– Workflow	systems
– …	vs	scripts	(Python,	R,	MATLAB,	...)
• What	provenance	tools	are	their?
– Workflow	system	support
– Tools	for	“workflow”	scripts!?
Provenance	@	SBBD'16
Yes, scripts are (can be) workflows too!
Interactive Visualization
Provenance	@	SBBD'16
SKOPE:	Synthesized	Knowledge	Of	Past	Environments
Bocinsky,	Kohler	et	al.	study	rain-fed	maize	of Anasazi
– Four	Corners;	AD	600–1500. Climate	change	influenced	Mesa	Verde	Migrations;	late	
13th	century	AD.	Uses	network	of	tree-ring	chronologies	to	reconstruct	a	spatio-
temporal	climate	field	at	a	fairly	high	resolution	(~800	m)	from	AD	1–2000.	Algorithm	
estimates	joint	information	in	tree-rings	and	a	climate	signal	to	identify	“best”	 tree-ring	
chronologies	for	climate	reconstructing.
K.	Bocinsky,	T.	Kohler,	A	2000-year	reconstruction	of	the	rain-fed	
maize	agricultural	niche	in	the	US	Southwest.	Nature
Communications.	doi:10.1038/ncomms6618
… implemented as an R Script …
Provenance	@	SBBD'16
Provenance	Support	for	Reproducible	Science	
Example:	Paleoclimate	Reconstruction
Science	paper	(OA)	uses:
• open	source	code:
– R,	PaleoCAR,	…
• Is	that	all	we	need?
• What	was	the	
“workflow”?
• Is	there	prospective	
and/or	retrospective	
provenance?
Provenance	@	SBBD'16
GetModernClimate
PRISM_annual_growing_season_precipitation
SubsetAllData
dendro_series_for_calibration
dendro_series_for_reconstruction CAR_Analysis_unique
cellwise_unique_selected_linear_models
CAR_Analysis_union
cellwise_union_selected_linear_models
CAR_Reconstruction_union
raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors
CAR_Reconstruction_union_output
ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif
master_data_directory prism_directory
tree_ring_datacalibration_years retrodiction_years
?
5th Stop:	YesWorkflow:	
Yes,	scripts	are	workflows,	too!
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
Provenance	@	SBBD'16
YesWorkflow.org
• YesWorkflow (YW)
– Started	as	a	grass-roots	effort		(Kurator,	SKOPE,	..)
– …	meeting	the	scientists/users	where	they	R!
• R,	Matlab,	(i)Python,	Jupyter,	…
– Scripts	+	simple user	annotations
• =>	Reveal	the	workflow	model/abstraction	
…	that	underlies	the	(script)	implementation
• =>	YW	can	give	us	more	of	ASAP!
– First	YW:	 ASAP	(Abstraction)...
– Then	YW-recon:	ASAP	(reconstructing	runtime Provenance)
Provenance	@	SBBD'16
YW (prospective)	and	
YW-Recon	(retrospective)	Provenance
• 1.	YW:	Annotate	Script	=>	YW	Model
– Annotate	@BEGIN..@END,	@IN,	@OUT
– Visualize,	share,	be	happy	J
• 2.	Run	script
– Files	are	read	and	written
– Folder- &	Filenames	have	metadata
• 3.	YW-Recon
– Use	@URI	tags	that	link	YW	Model	ó Persisted	Data
– Run	URI-template	queries	
• cf.	“ls -R”	&	RegEx matching
• 4.	YW-Query
– Answer	the	user’s	provenance	queries	
Provenance	@	SBBD'16
YW	annotations:	Model	your	Workflow!
Provenance	@	SBBD'16
YesWorkflow:	Prospective &	Retrospective	
Provenance	…	(almost)	for	free!	
• YW	annotations	in	
the	script	(R,	
Python,	Matlab)	
are	used	to	
recreate	the	
workflow	view	
from	the	script	…	
cassette_id
sample_score_cutoff
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_namesample_quality
calculate_strategy
rejected_sample accepted_sample num_images energies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_id energy frame_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
YW!
Provenance	@	SBBD'16
GetModernClimate
PRISM_annual_growing_season_precipitation
SubsetAllData
dendro_series_for_calibration
dendro_series_for_reconstruction CAR_Analysis_unique
cellwise_unique_selected_linear_models
CAR_Analysis_union
cellwise_union_selected_linear_models
CAR_Reconstruction_union
raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors
CAR_Reconstruction_union_output
ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif
master_data_directory prism_directory
tree_ring_datacalibration_years retrodiction_years
Paleoclimate Reconstruction	(EnviRecon.org)
• …	explained	using	YesWorkflow!
Kyle	B.,	(computational)	archaeologist:	
"It	took	me	about	20	minutes	to	comment.	Less	
than	an	hour	to	learn	and	YW-annotate,	all-told."
Provenance	@	SBBD'16
main
fetch_mask
input_mask_file
load_data
input_data_file standardize_with_mask
land_water_mask
NEE_data simple_diagnose
standardized_NEE_data result_NEE_pdf
Get	3	views	for	the	price	of	1!
result_NEE_pdf
input_mask_file land_water_mask
fetch_mask
input_data_file NEE_data
load_data
standardized_NEE_data
standardize_with_mask
standardize_with_mask
simple_diagnose
fetch_mask land_water_mask
load_data NEE_data
standardize_with_mask standardized_NEE_data simple_diagnose result_NEE_pdf
input_mask_file
input_data_file
Process	view
Data	view
Combined	view
Provenance	@	SBBD'16
Multi-Scale	Synthesis	and	Terrestrial	Model	Intercomparison
Project	(MsTMIP)
fetch_drought_variable
drought_variable_1
fetch_effect_variable
effect_variable_1
convert_effect_variable_units
effect_variable_2
create_land_water_mask
land_water_mask
init_data_variables
predrought_effect_variable_1 drought_value_variable_1 recovery_time_variable_1 drought_number_variable_1
define_droughts
sigma_dv_event month_dv_length
detrend_deseasonalize_effect_variable
effect_variable_3
calculate_data_variables
recovery_time_variable_2 drought_value_variable_2 predrought_effect_variable_2 drought_number_variable_2
export_recovery_time_figure
output_recovery_time_figure
export_drought_value_variable_figure
output_drought_value_variable_figure
export_predrought_effect_variable_figure
output_predrought_effect_variable_figure
export_drought_number_variable_figure
output_drought_number_figure
input_drough_variable
input_effect_variable
Christopher	Schwalm,
Yaxing Wei
Provenance	@	SBBD'16
Figure 4: Process workflow view of an A↵ymetrix analysis script (in R).
4 YesWorkflow Examples
In the following we show YesWorkflow views extracted from real-world scientific use cases.
The scripts were annoted with YW tags by scientists and script authors, using a very
modest training and mark-up e↵ort.1
Due to lack of space, the actual MATLAB and R
scripts with their YW markup are not included here. However, they are all available
from the yw-idcc-15 repository on the YW GitHub site [Yes15].
Gene	Expression	Microarray	Data	Analysis
• [Normalize]	
– Normalization	of	data	across	microarray	datasets	
• [SelectDEGs]	
– Selection	of	differentially	expressed	genes	between	conditions	
• [GO	Analysis]	
– determination	of	gene	ontology	statistics	for	the	resulting	datasets	
• [MakeHeatmap]	
– creation	of	a	heatmap of	the	differentially	expressed	genes.	
Tyler	Kolisnik,	Mark	Bieda
Provenance	@	SBBD'16
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_name sample_quality
calculate_strategy
rejected_sample accepted_sample num_imagesenergies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_idenergyframe_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
cassette_id
sample_score_cutoff
Data	collection	workflow (X-ray	diffraction)
Provenance	@	SBBD'16
run/  
├──  raw  
│      └──  q55  
│              ├──  DRT240  
│              │      ├──  e10000  
│              │      │      ├──  image_001.raw  
...          ...  ...  ...  
│              │      │      └──  image_037.raw  
│              │      └──  e11000  
│              │              ├──  image_001.raw  
...          ...          ...  
│              │              └──  image_037.raw  
│              └──  DRT322  
│                      ├──  e10000  
│                      │      ├──  image_001.raw  
...                  ...  ...  
│                      │      └──  image_030.raw  
│                      └──  e11000  
│                              ├──  image_001.raw  
...                          ...  
│                              └──  image_030.raw  
├──  data  
│      ├──  DRT240  
│      │      ├──  DRT240_10000eV_001.img  
...  ...  ...  
│      │      └──  DRT240_11000eV_037.img  
│      └──  DRT322  
│              ├──  DRT322_10000eV_001.img  
...          ...  
│              └──  DRT322_11000eV_030.img  
│  
├──  collected_images.csv  
├──  rejected_samples.txt  
└──  run_log.txt  
  
YW-RECON:	Prospective	&	Retrospective
Provenance	…	(almost)	for	free!	
cassette_id
sample_score_cutoff
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_namesample_quality
calculate_strategy
rejected_sample accepted_sample num_images energies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_id energy frame_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
• URI-templates	link conceptual	entities	
to	runtime	provenance	“left	behind”	by	
the	script	author	…	
• …	facilitating	provenance	reconstructionProvenance	@	SBBD'16
YW (prospective)	and	
YW-Recon	(retrospective)	Provenance
• 1.	YW:	Annotate	Script	=>	YW	Model
– Annotate	@BEGIN..@END,	@IN,	@OUT
– Visualize,	share,	be	happy	J
• 2.	Run	script
– Files	are	read	and	written
– Folder- &	Filenames	have	metadata
• 3.	YW-Recon
– Use	@URI	tags	that	link	YW	Model	ó Persisted	Data
– Run	URI-template	queries	
• cf.	“ls -R”	&	RegEx matching
• 4.	YW-Query
– Answer	the	user’s	provenance	queries	
Provenance	@	SBBD'16
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_name sample_quality
calculate_strategy
rejected_sample accepted_sample num_imagesenergies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_idenergyframe_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
cassette_id
sample_score_cutoff
Data	collection	workflow:	runtime	data
run/  
├──  raw  
│      └──  q55  
│              ├──  DRT240  
│              │      ├──  e10000  
│              │      │      ├──  image_001.raw  
...          ...  ...  ...  
│              │      │      └──  image_037.raw  
│              │      └──  e11000  
│              │              ├──  image_001.raw  
...          ...          ...  
│              │              └──  image_037.raw  
│              └──  DRT322  
│                      ├──  e10000  
│                      │      ├──  image_001.raw  
...                  ...  ...  
│                      │      └──  image_030.raw  
│                      └──  e11000  
│                              ├──  image_001.raw  
...                          ...  
│                              └──  image_030.raw  
├──  data  
│      ├──  DRT240  
│      │      ├──  DRT240_10000eV_001.img  
...  ...  ...  
│      │      └──  DRT240_11000eV_037.img  
│      └──  DRT322  
│              ├──  DRT322_10000eV_001.img  
...          ...  
│              └──  DRT322_11000eV_030.img  
│  
├──  collected_images.csv  
├──  rejected_samples.txt  
└──  run_log.txt  
  
1. YW	annotations	=>	YW	model
2. Files	&	Folders	left	by	a	run	=>	runtime	(meta-)data
Provenance	@	SBBD'16
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_name sample_quality
calculate_strategy
rejected_sample accepted_sample num_imagesenergies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_idenergyframe_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
cassette_id
sample_score_cutoff
Q1:	What	samples did	the	script	run	collect	images	
from?
run/  
├──  raw  
│      └──  q55  
│              ├──  DRT240  
│              │      ├──  e10000  
│              │      │      ├──  image_001.raw  
...          ...  ...  ...  
│              │      │      └──  image_037.raw  
│              │      └──  e11000  
│              │              ├──  image_001.raw  
...          ...          ...  
│              │              └──  image_037.raw  
│              └──  DRT322  
│                      ├──  e10000  
│                      │      ├──  image_001.raw  
...                  ...  ...  
│                      │      └──  image_030.raw  
│                      └──  e11000  
│                              ├──  image_001.raw  
...                          ...  
│                              └──  image_030.raw  
├──  data  
│      ├──  DRT240  
│      │      ├──  DRT240_10000eV_001.img  
...  ...  ...  
│      │      └──  DRT240_11000eV_037.img  
│      └──  DRT322  
│              ├──  DRT322_10000eV_001.img  
...          ...  
│              └──  DRT322_11000eV_030.img  
│  
├──  collected_images.csv  
├──  rejected_samples.txt  
└──  run_log.txt  
  
Provenance	@	SBBD'16
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_name sample_quality
calculate_strategy
rejected_sample accepted_sample num_imagesenergies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_idenergyframe_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
cassette_id
sample_score_cutoff
Q2:	What	energies were	used	for	image	collection	from	
sample	DRT322?
run/  
├──  raw  
│      └──  q55  
│              ├──  DRT240  
│              │      ├──  e10000  
│              │      │      ├──  image_001.raw  
...          ...  ...  ...  
│              │      │      └──  image_037.raw  
│              │      └──  e11000  
│              │              ├──  image_001.raw  
...          ...          ...  
│              │              └──  image_037.raw  
│              └──  DRT322  
│                      ├──  e10000  
│                      │      ├──  image_001.raw  
...                  ...  ...  
│                      │      └──  image_030.raw  
│                      └──  e11000  
│                              ├──  image_001.raw  
...                          ...  
│                              └──  image_030.raw  
├──  data  
│      ├──  DRT240  
│      │      ├──  DRT240_10000eV_001.img  
...  ...  ...  
│      │      └──  DRT240_11000eV_037.img  
│      └──  DRT322  
│              ├──  DRT322_10000eV_001.img  
...          ...  
│              └──  DRT322_11000eV_030.img  
│  
├──  collected_images.csv  
├──  rejected_samples.txt  
└──  run_log.txt  
  
Provenance	@	SBBD'16
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_name sample_quality
calculate_strategy
rejected_sample accepted_sample num_imagesenergies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_idenergyframe_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
cassette_id
sample_score_cutoff
Q3:	Where	is	the	raw	image	of	the	corrected	image	
DRT322_11000ev_030.img?	run/  
├──  raw  
│      └──  q55  
│              ├──  DRT240  
│              │      ├──  e10000  
│              │      │      ├──  image_001.raw  
...          ...  ...  ...  
│              │      │      └──  image_037.raw  
│              │      └──  e11000  
│              │              ├──  image_001.raw  
...          ...          ...  
│              │              └──  image_037.raw  
│              └──  DRT322  
│                      ├──  e10000  
│                      │      ├──  image_001.raw  
...                  ...  ...  
│                      │      └──  image_030.raw  
│                      └──  e11000  
│                              ├──  image_001.raw  
...                          ...  
│                              └──  image_030.raw  
├──  data  
│      ├──  DRT240  
│      │      ├──  DRT240_10000eV_001.img  
...  ...  ...  
│      │      └──  DRT240_11000eV_037.img  
│      └──  DRT322  
│              ├──  DRT322_10000eV_001.img  
...          ...  
│              └──  DRT322_11000eV_030.img  
│  
├──  collected_images.csv  
├──  rejected_samples.txt  
└──  run_log.txt  
  
Provenance	@	SBBD'16
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_name sample_quality
calculate_strategy
rejected_sample accepted_sample num_imagesenergies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_idenergyframe_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
cassette_id
sample_score_cutoff
run/  
├──  raw  
│      └──  q55  
│              ├──  DRT240  
│              │      ├──  e10000  
│              │      │      ├──  image_001.raw  
...          ...  ...  ...  
│              │      │      └──  image_037.raw  
│              │      └──  e11000  
│              │              ├──  image_001.raw  
...          ...          ...  
│              │              └──  image_037.raw  
│              └──  DRT322  
│                      ├──  e10000  
│                      │      ├──  image_001.raw  
...                  ...  ...  
│                      │      └──  image_030.raw  
│                      └──  e11000  
│                              ├──  image_001.raw  
...                          ...  
│                              └──  image_030.raw  
├──  data  
│      ├──  DRT240  
│      │      ├──  DRT240_10000eV_001.img  
...  ...  ...  
│      │      └──  DRT240_11000eV_037.img  
│      └──  DRT322  
│              ├──  DRT322_10000eV_001.img  
...          ...  
│              └──  DRT322_11000eV_030.img  
│  
├──  collected_images.csv  
├──  rejected_samples.txt  
└──  run_log.txt  
  
Q5:	What	cassette-id	had	the	sample	leading	to	
DRT240_10000ev_001.img?
Provenance	@	SBBD'16
João	F.	Pimentel,	Saumen	Dey,	Timothy	McPhillips,	
Khalid	Belhajjame,	David	Koop,	Leonardo	Murta,	
Vanessa	Braganholo,	Bertram	Ludäscher
Yin	&	Yang:	Demonstrating complementary	
provenance	from	noWorkflow &	
YesWorkflow
Provenance	of	the	Yin &	Yang	Demo	
Dagstuhl’16	
Reproducibility	Seminar
Provenance-Week’16	Demo!
was_derived_from__via
TaPP’15,	Edinburgh
Provenance	@	SBBD'16
Using	Provenance	from	Script	Runs	
Example	from	the	log-file:	
2016-06-07 20:32:36 Wrote run/data/DRT240/DRT240_11000eV_002.img
But	how	was	that	image	derived??	(“Provenance	for	Self!”)Provenance	@	SBBD'16
module.__build_class__
module.__build_class__
simulate_data_collection
180 return
180 run_logger
201 return
201 new_image_file
230 parser
231 cassette_id
236 add_option
241 add_option
246 add_option
248 set_usage
251 parse_args
251 args
251 options
254 module.len
24 cassette_id
24 sample_score_cutoff
24 data_redundancy
24 calibration_image_file
30 exists
33 exists
32 filepath
34 module.remove
33 exists
32 filepath
34 module.remove
33 exists
32 filepath
34 module.remove
36 run_log
37 write
38 str(sample_score_cutoff)
38 write
38 str(sample_score_cutoff)
49 str.format
49 sample_spreadsheet_file
50 spreadsheet_rows
cassette_q55_spreadsheet.csv
50 spreadsheet_rows(sample_spreadsheet_file)
51 str.format 51 write
50 sample_name
50 sample_quality
61 calculate_strategy
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
72 str.format
72 write
73 open
73 rejection_log
74 str.format
74 TextIOWrapper.write
50 spreadsheet_rows
50 spreadsheet_rows(sample_spreadsheet_file)
51 str.format
51 write
50 sample_name
50 sample_quality
61 calculate_strategy
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
90 str.format
90 write
91 sample_id
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
calibration.img
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format 93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer 120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format 106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format 93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
50 spreadsheet_rows
50 spreadsheet_rows(sample_spreadsheet_file)
51 str.format
51 write
50 sample_name
50 sample_quality
61 calculate_strategy
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
90 str.format
90 write
91 sample_id
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file 120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer 120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open 119 collection_log_file 120 module.writer 120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file 120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
50 spreadsheet_rows
128 return
run/run_log.txt
run/rejected_samples.txt
run/raw/q55/DRT240/e10000/image_001.raw
run/data/DRT240/DRT240_10000eV_001.img
run/collected_images.csv
run/raw/q55/DRT240/e10000/image_002.raw
run/data/DRT240/DRT240_10000eV_002.img
run/raw/q55/DRT240/e11000/image_001.raw
run/data/DRT240/DRT240_11000eV_001.img
run/raw/q55/DRT240/e11000/image_002.raw
run/data/DRT240/DRT240_11000eV_002.img
run/raw/q55/DRT240/e12000/image_001.raw
run/data/DRT240/DRT240_12000eV_001.img
run/raw/q55/DRT240/e12000/image_002.raw
run/data/DRT240/DRT240_12000eV_002.img
run/raw/q55/DRT322/e10000/image_001.raw
run/data/DRT322/DRT322_10000eV_001.img
run/raw/q55/DRT322/e10000/image_002.raw
run/data/DRT322/DRT322_10000eV_002.img
run/raw/q55/DRT322/e11000/image_001.raw
run/data/DRT322/DRT322_11000eV_001.img
run/raw/q55/DRT322/e11000/image_002.raw
run/data/DRT322/DRT322_11000eV_002.img
noWorkflow:
not only
Workflow!
• Scripts	have	provenance,	too!
• Transparently capture	some/all	
provenance	from	Python	script	
runs.
• Use	filter	queries to	“zoom”	into	
relevant	parts	..		
Provenance	@	SBBD'16
simulate_data_collection
230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0 24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 spreadsheet_rows(sample_spreadsheet_file)
50 sample_name = 'DRT240'50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'61 num_images = 2
61 energies = [10000, 11000, 12000] 91 sample_id = 'DRT240'
92 collect_next_image(casset ... _{frame_number:03d}.raw')
92 energy = 11000 92 frame_number = 292 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
calibration.img
run/data/DRT240/DRT240_11000eV_002.img
$	now dataflow	-f	"run/data/DRT240/DRT240_11000eV_002.img"
$(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
now helper df_style.py
now dataflow -v 55 -f
$(RETROSPECTIVE_LINEAGE_VALUE) -m simulation
| python df_style.py -d BT -e >
$(NW_FILTERED_LINEAGE_GRAPH).gv
..	auto-“make” this!
noWorkflow lineage	
of	an	image	file
Provenance	information	
about	Python	function	calls,	
variable assignments,	etc.
Provenance	@	SBBD'16
simulate_data_collection
initialize_run
run_log load_screening_results
sample_namesample_quality
calculate_strategy
accepted_samplerejected_sample num_imagesenergies
log_rejected_sample
rejection_log
collect_data_set
sample_id energyframe_number raw_image
transform_images
corrected_imagetotal_intensitypixel_count
log_average_image_intensity
collection_log
sample_spreadsheet
calibration_image
sample_score_cutoffdata_redundancy
cassette_id
YesWorkflow:	Yes,	scripts	are	Workflows,	too!
• Use	YW annotations
@begin...@end, @in,
@out to	reveal	hidden	
conceptual workflow
(prospective provenance)	
• Script	isn't	changed:
– annotations	via	comments
(=>	language	independent)
• For	understanding	and	
sharing	the	“big	picture”
• Query and	visualize!
Provenance	@	SBBD'16
Alternate	YW	Views	
simulate_data_collection
initialize_run
load_screening_results calculate_strategy
log_rejected_sample
collect_data_set transform_images log_average_image_intensity
simulate_data_collection
initialize_run
run_log load_screening_results
sample_namesample_quality
calculate_strategy
accepted_samplerejected_sample num_imagesenergies
log_rejected_sample
rejection_log
collect_data_set
sample_id energyframe_number raw_image
transform_images
corrected_imagetotal_intensitypixel_count
log_average_image_intensity
collection_log
sample_spreadsheet
calibration_image
sample_score_cutoffdata_redundancy
cassette_id
Process	view
Data	view
Workflow	view
Provenance	@	SBBD'16
simulate_data_collection
initialize_run
run_log load_screening_results
sample_namesample_quality
calculate_strategy
accepted_samplerejected_sample num_imagesenergies
log_rejected_sample
rejection_log
collect_data_set
sample_id energyframe_number raw_image
transform_images
corrected_imagetotal_intensitypixel_count
log_average_image_intensity
collection_log
sample_spreadsheet
calibration_image
sample_score_cutoffdata_redundancy
cassette_id
What	is	the	lineage of	“corrected_image”?
From	here	on	“upwards”:
What	led	(leads)	to	this?	
..	and	what	is	irrelevant
and	should	be	pruned??
Provenance	@	SBBD'16
simulate_data_collection
collect_data_set
sample_id energy frame_number raw_image
calculate_strategy
accepted_sample num_imagesenergies
load_screening_results
sample_namesample_quality
transform_images
corrected_image
sample_spreadsheet
calibration_image
sample_score_cutoff data_redundancy
cassette_id
Subgraph	
resulting	from	
lineage	query		
on	YW	workflow	
model	
What	is	the	lineage	of	
corrected_image?
Provenance	@	SBBD'16
simulate_data_collection
initialize_run
run_log load_screening_results
sample_namesample_quality
calculate_strategy
accepted_samplerejected_sample num_imagesenergies
log_rejected_sample
rejection_log
collect_data_set
sample_id energyframe_number raw_image
transform_images
corrected_imagetotal_intensitypixel_count
log_average_image_intensity
collection_log
sample_spreadsheet
calibration_image
sample_score_cutoffdata_redundancy
cassette_id
simulate_data_collection
collect_data_set
sample_id energy frame_number raw_image
calculate_strategy
accepted_sample num_imagesenergies
load_screening_results
sample_namesample_quality
transform_images
corrected_image
sample_spreadsheet
calibration_image
sample_score_cutoff data_redundancy
cassette_id
module.__build_class__
module.__build_class__
simulate_data_collection
180 return
180 run_logger
201 return
201 new_image_file
230 parser
231 cassette_id
236 add_option
241 add_option
246 add_option
248 set_usage
251 parse_args
251 args
251 options
254 module.len
24 cassette_id
24 sample_score_cutoff
24 data_redundancy
24 calibration_image_file
30 exists
33 exists
32 filepath
34 module.remove
33 exists
32 filepath
34 module.remove
33 exists
32 filepath
34 module.remove
36 run_log
37 write
38 str(sample_score_cutoff)
38 write
38 str(sample_score_cutoff)
49 str.format
49 sample_spreadsheet_file
50 spreadsheet_rows
cassette_q55_spreadsheet.csv
50 spreadsheet_rows(sample_spreadsheet_file)
51 str.format 51 write
50 sample_name
50 sample_quality
61 calculate_strategy
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
72 str.format
72 write
73 open
73 rejection_log
74 str.format
74 TextIOWrapper.write
50 spreadsheet_rows
50 spreadsheet_rows(sample_spreadsheet_file)
51 str.format
51 write
50 sample_name
50 sample_quality
61 calculate_strategy
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
90 str.format
90 write
91 sample_id
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
calibration.img
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format 93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer 120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format 106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format 93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
50 spreadsheet_rows
50 spreadsheet_rows(sample_spreadsheet_file)
51 str.format
51 write
50 sample_name
50 sample_quality
61 calculate_strategy
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
90 str.format
90 write
91 sample_id
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file 120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer 120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open 119 collection_log_file 120 module.writer 120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file 120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
50 spreadsheet_rows
128 return
run/run_log.txt
run/rejected_samples.txt
run/raw/q55/DRT240/e10000/image_001.raw
run/data/DRT240/DRT240_10000eV_001.img
run/collected_images.csv
run/raw/q55/DRT240/e10000/image_002.raw
run/data/DRT240/DRT240_10000eV_002.img
run/raw/q55/DRT240/e11000/image_001.raw
run/data/DRT240/DRT240_11000eV_001.img
run/raw/q55/DRT240/e11000/image_002.raw
run/data/DRT240/DRT240_11000eV_002.img
run/raw/q55/DRT240/e12000/image_001.raw
run/data/DRT240/DRT240_12000eV_001.img
run/raw/q55/DRT240/e12000/image_002.raw
run/data/DRT240/DRT240_12000eV_002.img
run/raw/q55/DRT322/e10000/image_001.raw
run/data/DRT322/DRT322_10000eV_001.img
run/raw/q55/DRT322/e10000/image_002.raw
run/data/DRT322/DRT322_10000eV_002.img
run/raw/q55/DRT322/e11000/image_001.raw
run/data/DRT322/DRT322_11000eV_001.img
run/raw/q55/DRT322/e11000/image_002.raw
run/data/DRT322/DRT322_11000eV_002.img
simulate_data_collection
230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0 24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 spreadsheet_rows(sample_spreadsheet_file)
50 sample_name = 'DRT240'50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'61 num_images = 2
61 energies = [10000, 11000, 12000] 91 sample_id = 'DRT240'
92 collect_next_image(casset ... _{frame_number:03d}.raw')
92 energy = 11000 92 frame_number = 292 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
calibration.img
run/data/DRT240/DRT240_11000eV_002.img
lineage	query
lineage	query
YesWorkflow:
Conceptual workflow	model
noWorkflow:	
Python trace	model
But	how	do	we	
bridge	this	gap???
Would	like	to	use	YW	
model	to	query	NW	
data!
Provenance	@	SBBD'16
We’re	off	to	see	the	Wizard	of	Prov	...
We're off to see the Wizard,
The wonderful Wizard of Prov!
--
We hear he is a wiz of a wiz
If ever a wiz there was.
--
If ever, oh ever, a wiz there was,
The Wizard of Prov is one because,
Because, because, because, because, because,
Because of the wonderful things he does.
• Enrich	YW	conceptual	view	
with	NW	Python	provenance!
• Get	the	best	of	both	worlds!
• How	hard	can	it	be	to	bridge	
YW	and	NW	…	
(cf. TaPP’15	prototype)		
Provenance	@	SBBD'16
Diamonds	are	forever	
Bridges	aren’t	…	
Provenance	@	SBBD'16
…	new	bridge-building	can	be	stressful
…	even	if	just	painting	over.	
Provenance	@	SBBD'16
Habemus	Pons!
We’ve	got	the	Bridge!	
The	bridge	is	the	journey..		
(The	journey	is	the	destination)
Lineage	of	image	file
in	terms	of	YW	
model,	with	details	
from	NW	provenance
Provenance	@	SBBD'16
Secret Reproducible	Sauce
• Combining	provenance	information	from	
noWorkflow and	YesWorkflow
• Using	all	the	good	stuff:	
– make,	docker,	Prolog,	SQL,	Graphviz
• Open	source
– github.com/yesworkflow-org/yw-noworkflow
– github.com/gems-uff/yin-yang-demo
• Have	a	closer	look	at	the	demo!
Provenance	@	SBBD'16
Demo	Time	…	
Provenance	@	SBBD'16
Demo	Time
Provenance	@	SBBD'16

Más contenido relacionado

La actualidad más candente

Carmen O'Dell and Barbara Sen JIBS-RLUK event July 2012
Carmen O'Dell and Barbara Sen JIBS-RLUK event July 2012Carmen O'Dell and Barbara Sen JIBS-RLUK event July 2012
Carmen O'Dell and Barbara Sen JIBS-RLUK event July 2012sherif user group
 
The Experimental Project of DOI Registration for Research Data at Japan Link...
The Experimental Project of DOI Registration for Research Data at Japan Link...The Experimental Project of DOI Registration for Research Data at Japan Link...
The Experimental Project of DOI Registration for Research Data at Japan Link...National Institute of Informatics (NII)
 
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...John Kunze
 
Research Data-DOI Experiment in Japanese DOI Registration Agency (Japan Link ...
Research Data-DOI Experiment in Japanese DOI Registration Agency (Japan Link ...Research Data-DOI Experiment in Japanese DOI Registration Agency (Japan Link ...
Research Data-DOI Experiment in Japanese DOI Registration Agency (Japan Link ...National Institute of Informatics (NII)
 
What do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St AndrewsWhat do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St AndrewsCIGScotland
 
Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015LYRASIS
 
As bibliotecas do mundo conectadas. A um mundo conectado!
As bibliotecas do mundo conectadas. A um mundo conectado!As bibliotecas do mundo conectadas. A um mundo conectado!
As bibliotecas do mundo conectadas. A um mundo conectado!OCLC LAC
 
Linked Data at the BNE. Elena Escolano Rodríguez, Daniel Vila Suero
Linked Data at the BNE. Elena Escolano Rodríguez, Daniel Vila SueroLinked Data at the BNE. Elena Escolano Rodríguez, Daniel Vila Suero
Linked Data at the BNE. Elena Escolano Rodríguez, Daniel Vila SueroBiblioteca Nacional de España
 
ReEnvisioning E-Resource Holdings Management
ReEnvisioning E-Resource Holdings ManagementReEnvisioning E-Resource Holdings Management
ReEnvisioning E-Resource Holdings ManagementNASIG
 
Building an institutional repository using dspace
Building an institutional repository using dspaceBuilding an institutional repository using dspace
Building an institutional repository using dspaceBharat Chaudhari
 
Next Steps for IMLS's National Digital Platform
Next Steps for IMLS's National Digital PlatformNext Steps for IMLS's National Digital Platform
Next Steps for IMLS's National Digital PlatformTrevor Owens
 
Access to and specifics of detailed national LFS data – the case of Slovenia
Access to and specifics of detailed national LFS data – the case of SloveniaAccess to and specifics of detailed national LFS data – the case of Slovenia
Access to and specifics of detailed national LFS data – the case of SloveniaArhiv družboslovnih podatkov
 

La actualidad más candente (16)

Carmen O'Dell and Barbara Sen JIBS-RLUK event July 2012
Carmen O'Dell and Barbara Sen JIBS-RLUK event July 2012Carmen O'Dell and Barbara Sen JIBS-RLUK event July 2012
Carmen O'Dell and Barbara Sen JIBS-RLUK event July 2012
 
The Experimental Project of DOI Registration for Research Data at Japan Link...
The Experimental Project of DOI Registration for Research Data at Japan Link...The Experimental Project of DOI Registration for Research Data at Japan Link...
The Experimental Project of DOI Registration for Research Data at Japan Link...
 
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...
 
Research Data-DOI Experiment in Japanese DOI Registration Agency (Japan Link ...
Research Data-DOI Experiment in Japanese DOI Registration Agency (Japan Link ...Research Data-DOI Experiment in Japanese DOI Registration Agency (Japan Link ...
Research Data-DOI Experiment in Japanese DOI Registration Agency (Japan Link ...
 
What do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St AndrewsWhat do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St Andrews
 
Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015
 
NISO/DCMI Webinar: Metadata for Managing Scientific Research Data
NISO/DCMI Webinar: Metadata for Managing Scientific Research DataNISO/DCMI Webinar: Metadata for Managing Scientific Research Data
NISO/DCMI Webinar: Metadata for Managing Scientific Research Data
 
Escena Digital, a new product to dissemination of heritage
Escena Digital, a new product to dissemination of heritageEscena Digital, a new product to dissemination of heritage
Escena Digital, a new product to dissemination of heritage
 
As bibliotecas do mundo conectadas. A um mundo conectado!
As bibliotecas do mundo conectadas. A um mundo conectado!As bibliotecas do mundo conectadas. A um mundo conectado!
As bibliotecas do mundo conectadas. A um mundo conectado!
 
Why Link?
Why Link?Why Link?
Why Link?
 
Linked Data at the BNE. Elena Escolano Rodríguez, Daniel Vila Suero
Linked Data at the BNE. Elena Escolano Rodríguez, Daniel Vila SueroLinked Data at the BNE. Elena Escolano Rodríguez, Daniel Vila Suero
Linked Data at the BNE. Elena Escolano Rodríguez, Daniel Vila Suero
 
Just Digitise It! - Daniel Wilksch
Just Digitise It! - Daniel WilkschJust Digitise It! - Daniel Wilksch
Just Digitise It! - Daniel Wilksch
 
ReEnvisioning E-Resource Holdings Management
ReEnvisioning E-Resource Holdings ManagementReEnvisioning E-Resource Holdings Management
ReEnvisioning E-Resource Holdings Management
 
Building an institutional repository using dspace
Building an institutional repository using dspaceBuilding an institutional repository using dspace
Building an institutional repository using dspace
 
Next Steps for IMLS's National Digital Platform
Next Steps for IMLS's National Digital PlatformNext Steps for IMLS's National Digital Platform
Next Steps for IMLS's National Digital Platform
 
Access to and specifics of detailed national LFS data – the case of Slovenia
Access to and specifics of detailed national LFS data – the case of SloveniaAccess to and specifics of detailed national LFS data – the case of Slovenia
Access to and specifics of detailed national LFS data – the case of Slovenia
 

Similar a Provenance in Databases and Scientific Workflows: Part I

Transparent Licenses: Making user rights clear (OLA Super Conference 2015)
Transparent Licenses: Making user rights clear (OLA Super Conference 2015)Transparent Licenses: Making user rights clear (OLA Super Conference 2015)
Transparent Licenses: Making user rights clear (OLA Super Conference 2015)Hong (Jenny) Jing
 
Research Data Management in GLAM: Managing Data for Cultural Heritage
Research Data Management in GLAM: Managing Data for Cultural HeritageResearch Data Management in GLAM: Managing Data for Cultural Heritage
Research Data Management in GLAM: Managing Data for Cultural HeritageSarah Anna Stewart
 
Presentation - First International Library Staff Exchange Week, Zagreb
Presentation - First International Library Staff Exchange Week, ZagrebPresentation - First International Library Staff Exchange Week, Zagreb
Presentation - First International Library Staff Exchange Week, ZagrebIva Vrkic
 
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IU
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IUOne Discovery Layer, Eight Front Doors: Implementing Blacklight @ IU
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IUCourtney McDonald
 
HIBERLINK: Reference Rot and Linked Data: Threat and Remedy
HIBERLINK: Reference Rot and Linked Data: Threat and RemedyHIBERLINK: Reference Rot and Linked Data: Threat and Remedy
HIBERLINK: Reference Rot and Linked Data: Threat and RemedyPRELIDA Project
 
Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Beth Plale
 
Workshop 4: Open Science & Open Data for Librarians/Ina Smith
Workshop 4: Open Science & Open Data for Librarians/Ina SmithWorkshop 4: Open Science & Open Data for Librarians/Ina Smith
Workshop 4: Open Science & Open Data for Librarians/Ina SmithAfrican Open Science Platform
 
Curation Service Models - Michael Witt - RDAP12
Curation Service Models - Michael Witt - RDAP12Curation Service Models - Michael Witt - RDAP12
Curation Service Models - Michael Witt - RDAP12ASIS&T
 
RDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management EcosystemRDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management EcosystemASIS&T
 
The Data Management Ecosystem
The Data Management EcosystemThe Data Management Ecosystem
The Data Management EcosystemJohn Kunze
 
Open Science, Open Data: towards a new transparent and reproducible ecosystem
Open Science, Open Data:   towards a new transparent and reproducible ecosystemOpen Science, Open Data:   towards a new transparent and reproducible ecosystem
Open Science, Open Data: towards a new transparent and reproducible ecosystemLIBER Europe
 
Staffing Research Data Services at University of Edinburgh
Staffing Research Data Services at University of EdinburghStaffing Research Data Services at University of Edinburgh
Staffing Research Data Services at University of EdinburghRobin Rice
 
NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)Christine Stohn
 

Similar a Provenance in Databases and Scientific Workflows: Part I (20)

Transparent Licenses: Making user rights clear (OLA Super Conference 2015)
Transparent Licenses: Making user rights clear (OLA Super Conference 2015)Transparent Licenses: Making user rights clear (OLA Super Conference 2015)
Transparent Licenses: Making user rights clear (OLA Super Conference 2015)
 
Reference Rot and Linked Data: Threat and Remedy
Reference Rot and Linked Data: Threat and RemedyReference Rot and Linked Data: Threat and Remedy
Reference Rot and Linked Data: Threat and Remedy
 
Research Data Management in GLAM: Managing Data for Cultural Heritage
Research Data Management in GLAM: Managing Data for Cultural HeritageResearch Data Management in GLAM: Managing Data for Cultural Heritage
Research Data Management in GLAM: Managing Data for Cultural Heritage
 
Presentation - First International Library Staff Exchange Week, Zagreb
Presentation - First International Library Staff Exchange Week, ZagrebPresentation - First International Library Staff Exchange Week, Zagreb
Presentation - First International Library Staff Exchange Week, Zagreb
 
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
 
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IU
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IUOne Discovery Layer, Eight Front Doors: Implementing Blacklight @ IU
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IU
 
HIBERLINK: Reference Rot and Linked Data: Threat and Remedy
HIBERLINK: Reference Rot and Linked Data: Threat and RemedyHIBERLINK: Reference Rot and Linked Data: Threat and Remedy
HIBERLINK: Reference Rot and Linked Data: Threat and Remedy
 
Where's the Data?
Where's the Data?Where's the Data?
Where's the Data?
 
Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014
 
NISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to RealityNISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to Reality
 
Workshop 4: Open Science & Open Data for Librarians/Ina Smith
Workshop 4: Open Science & Open Data for Librarians/Ina SmithWorkshop 4: Open Science & Open Data for Librarians/Ina Smith
Workshop 4: Open Science & Open Data for Librarians/Ina Smith
 
Access to Digital Back Copy
Access to Digital Back CopyAccess to Digital Back Copy
Access to Digital Back Copy
 
Curation Service Models - Michael Witt - RDAP12
Curation Service Models - Michael Witt - RDAP12Curation Service Models - Michael Witt - RDAP12
Curation Service Models - Michael Witt - RDAP12
 
RDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management EcosystemRDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management Ecosystem
 
The Data Management Ecosystem
The Data Management EcosystemThe Data Management Ecosystem
The Data Management Ecosystem
 
Digitization and public libraries
Digitization and public librariesDigitization and public libraries
Digitization and public libraries
 
Open Science, Open Data: towards a new transparent and reproducible ecosystem
Open Science, Open Data:   towards a new transparent and reproducible ecosystemOpen Science, Open Data:   towards a new transparent and reproducible ecosystem
Open Science, Open Data: towards a new transparent and reproducible ecosystem
 
Staffing Research Data Services at University of Edinburgh
Staffing Research Data Services at University of EdinburghStaffing Research Data Services at University of Edinburgh
Staffing Research Data Services at University of Edinburgh
 
"In the Early Days of a Better Nation": Enhancing the power of metadata today...
"In the Early Days of a Better Nation": Enhancing the power of metadata today..."In the Early Days of a Better Nation": Enhancing the power of metadata today...
"In the Early Days of a Better Nation": Enhancing the power of metadata today...
 
NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)
 

Más de Bertram Ludäscher

Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionBertram Ludäscher
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Bertram Ludäscher
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database RulesBertram Ludäscher
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database RulesBertram Ludäscher
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsBertram Ludäscher
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Bertram Ludäscher
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueBertram Ludäscher
 
From Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesFrom Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesBertram Ludäscher
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesBertram Ludäscher
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsBertram Ludäscher
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseBertram Ludäscher
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...Bertram Ludäscher
 
Dissecting Reproducibility: A case study with ecological niche models in th...
Dissecting Reproducibility:  A case study with ecological niche models  in th...Dissecting Reproducibility:  A case study with ecological niche models  in th...
Dissecting Reproducibility: A case study with ecological niche models in th...Bertram Ludäscher
 
Incremental Recomputation: Those who cannot remember the past are condemned ...
Incremental Recomputation:  Those who cannot remember the past are condemned ...Incremental Recomputation:  Those who cannot remember the past are condemned ...
Incremental Recomputation: Those who cannot remember the past are condemned ...Bertram Ludäscher
 
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsValidation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsBertram Ludäscher
 
An ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsAn ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsBertram Ludäscher
 
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachBertram Ludäscher
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchBertram Ludäscher
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatBertram Ludäscher
 
From Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceFrom Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceBertram Ludäscher
 

Más de Bertram Ludäscher (20)

Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query Patterns
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A Dialogue
 
From Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesFrom Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science Tales
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science Tales
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
 
Dissecting Reproducibility: A case study with ecological niche models in th...
Dissecting Reproducibility:  A case study with ecological niche models  in th...Dissecting Reproducibility:  A case study with ecological niche models  in th...
Dissecting Reproducibility: A case study with ecological niche models in th...
 
Incremental Recomputation: Those who cannot remember the past are condemned ...
Incremental Recomputation:  Those who cannot remember the past are condemned ...Incremental Recomputation:  Those who cannot remember the past are condemned ...
Incremental Recomputation: Those who cannot remember the past are condemned ...
 
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsValidation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
 
An ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsAn ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflows
 
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of Research
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's Seat
 
From Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceFrom Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable Provenance
 

Último

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 

Último (20)

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 

Provenance in Databases and Scientific Workflows: Part I

  • 2. Welcome & Bem-Vindo! • Boa tarde e bem-vindo ao tutorial sobre proveniência! • Welcome to the Tutorial on: Provenance in Databases and Scientific Workflows – Proveniência em bancos (bases) de dados e fluxos de trabalho (workflows) científicos • Desculpas ... – Back to English (my 2nd language) • Feel free to interrupt and ask questions! • (You can also ask questions in German or Spanish ...) Provenance @ SBBD'16
  • 3. Introductions should come first … • MS Computer Science, U Karlsruhe (K.I.T.) • PhD Computer Science, U Freiburg, Germany • Research Scientist, UC San Diego, SDSC • Dept. of Computer Science, UC Davis • University of Illinois at Urbana-Champaign – School of Information Sciences – Natl. Center for Supercomputing Applications – Dept. of Computer Science Provenance @ SBBD'16
  • 4. • Part I: Provenance in Scientific Workflows – Alta Vista: Provenance everywhere! – Provenance & Scientific Workflows – Provenance Models and Standards (not so much) – Provenance Tools • Example & Demo: YesWorkflow • Part II: Provenance in Databases – Foundations of provenance in databases – Why-, How-, and Why-Not provenance Outline of the Tutorial: A “Tour de Provenance” Provenance @ SBBD'16
  • 5. 1st Tour Stop: The Fine Arts • One of these is has been sold for nearly $180 million. • The other could be worth as much or more. • Which is which? • What is the difference? Provenance @ SBBD'16
  • 6. Provenance - Proveniência • Oxford English Dictionary – The place of origin or earliest known history of something: • an orange rug of Iranian provenance – The beginning of something’s existence; its origin: • they try to understand the whole universe, its provenance and fate – A record of ownership of a work of art or an antique, used as a guide to authenticity or quality: • the manuscript has a distinguished provenance • What is the origin (provenance!) of “provenance” ? Provenance @ SBBD'16
  • 7. The Many Faces of Provenance • What are those? • Cosmology • Geology, Stratigraphy • Phylogeny – the Tree of Life • Genealogy – your family: literally • Academic Pedigree – “Doktorvater” (Doktor-Mutter?) • Etymology • Chain of custody – of art(ifacts) • Yes: all about origins and history … Provenance @ SBBD'16
  • 11. 2nd Stop: Liberal Arts & Sciences • Can you “see provenance” in this image? • Grand Canyon’s rock layers are a record of the early geologic history of North America. The ancestral puebloan granaries at Nankoweap Creek tell archaeologists about more recent human history. (By Drenaline, licensed under CC BY-SA 3.0) Provenance @ SBBD'16
  • 13. Computational Provenance • Origin and processing history of an artifact – usually: data (products), figures, ... – sometimes: workflow (and script) evolution … • Different sub-communities: – Provenance in (scientific) workflows (Tutorial Part I) – Provenance in databases (Tutorial Part II) – Wait, there is more: • ... programming languages, systems/security, … Provenance @ SBBD'16
  • 14. Why should you care about provenance? • It’s an important problem: – reproducibility crisis, transparency, data sharing, … • There are (still) many deeply technical and practical challenges: – Efficient capture, management, use of provenance – Models, semantics, query languages – Provenance .. for others? Or provenance for self! – Interdisciplinary work; cross-fertilization: databases, workflows, programming languages, security, …, various scientific communities (bioinformatics, ...) • You have a head start here! – Marta Mattoso, Daniel de Oliveira, Vanessa Braganholo, Juliana Freire, ... (e.g. SBBD proceedings ..) • … oh, and it’s also a fun topic ... Provenance @ SBBD'16
  • 16. OneProblem: Reproducibility Crisis • Different sciences have different reproducibility crises … • Focus here: Computational Reproducibility – R, Matlab, Python, .. scripts – Scientific workflows, ... • How to facilitate reproducibility for computational and data scientists? Provenance @ SBBD'16
  • 17. Use Provenance for Transparency, Reproducibility • What input data went into this study? • What methods were used? • … with what parameter settings, calibrations, …? • Can we trust the data and methods? § Provenance (lineage): track origin and processing history of data è trust, data quality ~ audit trail for attribution, credit § Discovery of data, methodologies, experiments Provenance @ SBBD'16
  • 20. Provenance today: Important but hard èmany research projects, groups conduct R&D on provenance methods, tools, … Example: Climate Change Impacts in the United States U.S. National Climate Assessment U.S. Global Change Research Program “This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.” Provenance @ SBBD'16
  • 22. DataONE Cyberinfrastucture: Coordinating Nodes www.dataone.org/coordinating-nodes Coordinating Nodes • retain complete metadata catalog • indexing for search • network-wide services • ensure content availability (preservation) • replication services Components for a flexible, scalable, sustainable network Provenance @ SBBD'16
  • 23. Components for a flexible, scalable, sustainable network DataONE Cyberinfrastructure: Member Nodes www.dataone.org/member-nodes Coordinating Nodes • retain complete metadata catalog • indexing for search • network-wide services • ensure content availability (preservation) • replication services Member Nodes • diverse institutions • serve local community • provide resources for managing their data • retain copies of data Provenance @ SBBD'16
  • 24. DataONE Cyberinfrastructure: Investigator Toolkit www.dataone.org/investigator-toolkit Coordinating Nodes • retain complete metadata catalog • indexing for search • network-wide services • ensure content availability (preservation) • replication services Member Nodes • diverse institutions • serve local community • provide resources for managing their data • retain copies of data Investigator Toolkit Components for a flexible, scalable, sustainable network Provenance @ SBBD'16
  • 25. Provenance in Action: Benefits & Impact A DataONE search (here: “grass”) yields different packages with provenance Provenance @ SBBD'16
  • 26. DataONE: Support for Provenance Yaxing’s script with inputs & output products Christopher’s YesWorkflow model Christopher using Yaxing’s outputs as inputs for his script Christopher’s results can be traced back all the way to Yaxing’s input Provenance @ SBBD'16
  • 27. REWIND: From Provenance to Reproducible Science … Capturing provenance is crucial for transparency, interpretation, debugging, … => repeatable experiments, => reproducible science => need workflow-system agnostic model Provenance @ SBBD'16
  • 28. ... via scientific workflows (… and scripts) Provenance @ SBBD'16
  • 29. … 3rd stop: Scientific Workflows: ASAP • Automation – wfs to automate computational aspects of science • Scaling (exploit and optimize machine cycles) – wfs should make use of parallel compute resources – wfs should be able handle large data • Abstraction, Evolution, Reuse (human cycles) – wfs should be easy to (re-)use, evolve, share • Provenance – wfs should capture processing history, data lineage è traceable data- and wf-evolution è Reproducible Science Trident Workbench VisTrails Es war einmal … Provenance @ SBBD'16
  • 30. 10 Essential Functions of a Scientific Workflow System1. Automate programs and services scientists already use. 2. Schedule invocations of programs and services correctly and efficiently – in parallel where possible. 3. Manage dataflow to, from, and between programs and services. 4. Enable scientists (not just developers) to author or modify workflows easily. 5. Predict what a workflow will do when executed: prospective provenance. 6. Record what happened during workflow execution: retrospective provenance. 7. Reveal and query provenance – how workflow products were derived from inputs via programs and services. 8. Organize intermediate and final data products as desired by users. 9. Enable scientists to version, share and publish their workflows. 10. Empower scientists who wish to automate additional programs and services themselves. These functions (not just dataflow & actors) distinguish scientific workflow automation from general (scientific) software development. Src: Tim McPhillipsProvenance @ SBBD'16
  • 32. Executable WATERS Workflow in Kepler Provenance @ SBBD'16
  • 34. Motif-Catcher workflow, implemented in Kepler S Köhler et al. Improved Motif Detection in Large Sequence Sets with Random Sampling in a Kepler workflow, ICCS-WS, 2012 Provenance @ SBBD'16
  • 35. Kepler Workflows & Decision Making (Kruger Natl. Park, South Africa) SANParks Matt Jones, NCEAS @ UC Santa Barbara Provenance @ SBBD'16
  • 36. A Data-Streaming Workflow over Sensor Data Provenance @ SBBD'16
  • 37. • Monitor and control supercomputer simulations – 50+ composite actors (subworkflows) – 4 levels of hierarchy – 1000+ atomic (Java) actors 43 actors, 3 levels 196 actors, 4 levels 30 actors 206 actors, 4 levels 137 actors 33 actors 150 123 actors 66 actors 12 actors 243 actors, 4 levels Norbert Podhorszki ORNL (then: UC Davis) “Plumbing” workflow Provenance @ SBBD'16
  • 38. Data Curation Workflows (Filtered-Push … Kepler … Kurator projects) Provenance @ SBBD'16
  • 39. So what is “provenance” (sensu W3C) ? • Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*) • Provenance is a description of how things came to be, and how they came to be in the state they are in today (*) • Provenance is a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing in the world Provenance @ SBBD'16
  • 40. W3C PROV Family of Specifications: Modeling Provenance Analysis and RDF Query Processing, Satya S. Sahoo, Praveen Rao, ISWC, October, 2015. W3C PROV Family of Specifications: Provenance Modeling •  W3C Recommendations o  PROV Data Model (PROV- DM) o  PROV Ontology (PROV-O) o  PROV-Constraints o  PROV Notations (PROV- N) •  PROV Working Group Notes (selected) o  PROV-Access and Querying (AQ) o  PROV Dictionary o  PROV XML o  PROV and Dublin Core Mappings (PROV-DC) o  PROV Semantics (using first-order logic) (PROV-SEM) Provenance @ SBBD'16
  • 41. Provenance & Semantic Web Layer Cake Provenance Analysis and RDF Query Processing, Satya S. Sahoo, Praveen Rao, ISWC, October, 2015. Provenance and Semantic Web Layer Cake •  Proof layer aka Provenance •  Trust is derived from provenance information Provenance @ SBBD'16
  • 47. Provenance Standards vs Tools • Do we need more standards to sort this out? – … or do we already have too many “standards”? • How should we think about provenance? – ... in workflows and databases? • What can we do with provenance? – ... in workflows and databases? • Tools to create, share, use provenance – … not just for “provenance for others” – ... need more “provenance for self” Provenance @ SBBD'16
  • 48. A simple (simplistic?) World View: Workflow-Land ó Trace-Land • Not a “standard” – but helps (me!) think about workflows (prospective provenance) and traces (retrospective provenance). Provenance @ SBBD'16
  • 49. ProvONE: PROV for scientific workflows (Transfer station to any of several other “standard extensions”) “Trace-Land” (retrospective provenance) “Data-Land” Yang Cao1 , Christopher Jones2 , Víctor Cuevas- Vicenttín3 , Matthew B. Jones2 , Bertram Ludäscher1 , Timothy McPhillips1 , Paolo Missier4 , Christopher Schwalm5 , Peter Slaughter2 , Dave Vieglais6 , Lauren Walker2 , Yaxing Wei7 1 University of Illinois, Urbana-Champaign, 2 National Center for Ecological Analysis and Synthesis, UCSB, 3 Universidad Popular Autónoma del Estado de Puebla, Mexico, 4 School of Computing Science, Newcastle University, UK, 5 Woods Hole Research Center, Falmouth, MA, 6 University of Kansas, Lawrence, 7 Environmental Sciences Division, Oak Ridge National Lab, TN Also: A. Marinho, L. Murta, C.Werner, V.Braganholo, S. Serra da Cruz, E.Ogasawara, M. Mattoso. “ProvManager: A Provenance Management System for Scientific Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530 … “Workflow-Land” (prospective prov.) Provenance @ SBBD'16
  • 50. Provenance Sleuth or Engineer? • Scientists are Provenance (i.e., Natural History) Sleuths • {Computational, Computer, Information}-Scientists should (also) be Provenance Engineers – Ensure your “Data Tree of Life” (data provenance) is correct! – What is the origin and processing history of your data? • With great provenance come great questions! – “We store everything!” – Huh? Yes, provenance is the answer… (yawn..) – But what is the question?? • Engineer’s Stance: – What questions do you want to answer? – Let’s find out what observables we need to capture, what query language we should use, how we do that efficiently (later), … Provenance @ SBBD'16
  • 51. Drilling down into “Trace-Land”: From MoC to MoP via Observables • Model of Computation MoC – specification/algorithm to compute Outputs = MoC(Wf,Params,Inputs) – a director or scheduler implements MoC – gives rise to formal notions of • computation (aka run) R – Formalisms to define M? • Model of Provenance MoP – associate with a MoC a “default” MoP (= MoC ± Δ) – the MoP is a “trimmed” MoC • T = R – I + M – Trace = Run – Ignored-observables + Modeled-observables • Observables (of a MoC / MoP) – functional observables (may influence output o) • token rate, notions of firing, … – non-functional observables (not part of M, do not influence o) • token timestamp, size, … (unless the MoC cares about those) Provenance @ SBBD'16
  • 52. M. Anand, S. Bowers, et al., SSDBM’09 From MoCs to Models of Provenance (MoPs) Provenance @ SBBD'16
  • 53. Fine-grained, Data & MoC-aware MoP M. Anand, S. Bowers, et al., SSDBM’09 Provenance @ SBBD'16
  • 54. Types of Data Provenance • Black-box – know (next to) nothing at compile-time – at runtime: keep some data lineage – most prov sensu WF work use this • White-box – statically (compile-time) analyzable – q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2) – Most prov sensu DB work use this • Grey-box – can “look inside” (some black boxes) – … e.g. b/c they have subworkflows – … or FP signatures: A :: t1, t2à t3,t4 – … or semantic annotations (sem.types) f A q t1 t2 t3 t4 X1 X2 Y1 Y2 Provenance @ SBBD'16
  • 55. ✔ Provenance capture (Matlab, R, Python, … scientific workflow systems) ✔ Uploading, sharing, linking provenance through various provenance tools ✗ Tools for scientists to exploit (≠ capture, share, link) provenance for their own day-to-day work. è Prime the provenance pump and increase provenance generation è Scientists accelerate their work via new, active uses of provenance. But … how to prime the provenance pump?? Must support “Provenance for Self” ! Provenance for Self?! Provenance for Others Provenance @ SBBD'16
  • 56. From Workflows & Provenance to Provenance for Script-based Workflows … • What workflow tools are (most) scientists using? – Workflow systems – … vs scripts (Python, R, MATLAB, ...) • What provenance tools are their? – Workflow system support – Tools for “workflow” scripts!? Provenance @ SBBD'16
  • 57. Yes, scripts are (can be) workflows too! Interactive Visualization Provenance @ SBBD'16
  • 58. SKOPE: Synthesized Knowledge Of Past Environments Bocinsky, Kohler et al. study rain-fed maize of Anasazi – Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio- temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstructing. K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618 … implemented as an R Script … Provenance @ SBBD'16
  • 59. Provenance Support for Reproducible Science Example: Paleoclimate Reconstruction Science paper (OA) uses: • open source code: – R, PaleoCAR, … • Is that all we need? • What was the “workflow”? • Is there prospective and/or retrospective provenance? Provenance @ SBBD'16
  • 60. GetModernClimate PRISM_annual_growing_season_precipitation SubsetAllData dendro_series_for_calibration dendro_series_for_reconstruction CAR_Analysis_unique cellwise_unique_selected_linear_models CAR_Analysis_union cellwise_union_selected_linear_models CAR_Reconstruction_union raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors CAR_Reconstruction_union_output ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif master_data_directory prism_directory tree_ring_datacalibration_years retrodiction_years ? 5th Stop: YesWorkflow: Yes, scripts are workflows, too! • Script vs Workflows/ASAP: – Automation: ***** – Scaling: ** – Abstraction: * – Provenance: ** Provenance @ SBBD'16
  • 61. YesWorkflow.org • YesWorkflow (YW) – Started as a grass-roots effort (Kurator, SKOPE, ..) – … meeting the scientists/users where they R! • R, Matlab, (i)Python, Jupyter, … – Scripts + simple user annotations • => Reveal the workflow model/abstraction … that underlies the (script) implementation • => YW can give us more of ASAP! – First YW: ASAP (Abstraction)... – Then YW-recon: ASAP (reconstructing runtime Provenance) Provenance @ SBBD'16
  • 62. YW (prospective) and YW-Recon (retrospective) Provenance • 1. YW: Annotate Script => YW Model – Annotate @BEGIN..@END, @IN, @OUT – Visualize, share, be happy J • 2. Run script – Files are read and written – Folder- & Filenames have metadata • 3. YW-Recon – Use @URI tags that link YW Model ó Persisted Data – Run URI-template queries • cf. “ls -R” & RegEx matching • 4. YW-Query – Answer the user’s provenance queries Provenance @ SBBD'16
  • 64. YesWorkflow: Prospective & Retrospective Provenance … (almost) for free! • YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script … cassette_id sample_score_cutoff sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv calibration_image file:calibration.img initialize_run run_log file:run/run_log.txt load_screening_results sample_namesample_quality calculate_strategy rejected_sample accepted_sample num_images energies log_rejected_sample rejection_log file:/run/rejected_samples.txt collect_data_set sample_id energy frame_number raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw transform_images corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img total_intensitypixel_count corrected_image_path log_average_image_intensity collection_log file:run/collected_images.csv YW! Provenance @ SBBD'16
  • 65. GetModernClimate PRISM_annual_growing_season_precipitation SubsetAllData dendro_series_for_calibration dendro_series_for_reconstruction CAR_Analysis_unique cellwise_unique_selected_linear_models CAR_Analysis_union cellwise_union_selected_linear_models CAR_Reconstruction_union raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors CAR_Reconstruction_union_output ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif master_data_directory prism_directory tree_ring_datacalibration_years retrodiction_years Paleoclimate Reconstruction (EnviRecon.org) • … explained using YesWorkflow! Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told." Provenance @ SBBD'16
  • 66. main fetch_mask input_mask_file load_data input_data_file standardize_with_mask land_water_mask NEE_data simple_diagnose standardized_NEE_data result_NEE_pdf Get 3 views for the price of 1! result_NEE_pdf input_mask_file land_water_mask fetch_mask input_data_file NEE_data load_data standardized_NEE_data standardize_with_mask standardize_with_mask simple_diagnose fetch_mask land_water_mask load_data NEE_data standardize_with_mask standardized_NEE_data simple_diagnose result_NEE_pdf input_mask_file input_data_file Process view Data view Combined view Provenance @ SBBD'16
  • 67. Multi-Scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP) fetch_drought_variable drought_variable_1 fetch_effect_variable effect_variable_1 convert_effect_variable_units effect_variable_2 create_land_water_mask land_water_mask init_data_variables predrought_effect_variable_1 drought_value_variable_1 recovery_time_variable_1 drought_number_variable_1 define_droughts sigma_dv_event month_dv_length detrend_deseasonalize_effect_variable effect_variable_3 calculate_data_variables recovery_time_variable_2 drought_value_variable_2 predrought_effect_variable_2 drought_number_variable_2 export_recovery_time_figure output_recovery_time_figure export_drought_value_variable_figure output_drought_value_variable_figure export_predrought_effect_variable_figure output_predrought_effect_variable_figure export_drought_number_variable_figure output_drought_number_figure input_drough_variable input_effect_variable Christopher Schwalm, Yaxing Wei Provenance @ SBBD'16
  • 68. Figure 4: Process workflow view of an A↵ymetrix analysis script (in R). 4 YesWorkflow Examples In the following we show YesWorkflow views extracted from real-world scientific use cases. The scripts were annoted with YW tags by scientists and script authors, using a very modest training and mark-up e↵ort.1 Due to lack of space, the actual MATLAB and R scripts with their YW markup are not included here. However, they are all available from the yw-idcc-15 repository on the YW GitHub site [Yes15]. Gene Expression Microarray Data Analysis • [Normalize] – Normalization of data across microarray datasets • [SelectDEGs] – Selection of differentially expressed genes between conditions • [GO Analysis] – determination of gene ontology statistics for the resulting datasets • [MakeHeatmap] – creation of a heatmap of the differentially expressed genes. Tyler Kolisnik, Mark Bieda Provenance @ SBBD'16
  • 69. initialize_run run_log file:run/run_log.txt load_screening_results sample_name sample_quality calculate_strategy rejected_sample accepted_sample num_imagesenergies log_rejected_sample rejection_log file:/run/rejected_samples.txt collect_data_set sample_idenergyframe_number raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw transform_images corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img total_intensitypixel_count corrected_image_path log_average_image_intensity collection_log file:run/collected_images.csv sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv calibration_image file:calibration.img cassette_id sample_score_cutoff Data collection workflow (X-ray diffraction) Provenance @ SBBD'16
  • 70. run/   ├──  raw   │      └──  q55   │              ├──  DRT240   │              │      ├──  e10000   │              │      │      ├──  image_001.raw   ...          ...  ...  ...   │              │      │      └──  image_037.raw   │              │      └──  e11000   │              │              ├──  image_001.raw   ...          ...          ...   │              │              └──  image_037.raw   │              └──  DRT322   │                      ├──  e10000   │                      │      ├──  image_001.raw   ...                  ...  ...   │                      │      └──  image_030.raw   │                      └──  e11000   │                              ├──  image_001.raw   ...                          ...   │                              └──  image_030.raw   ├──  data   │      ├──  DRT240   │      │      ├──  DRT240_10000eV_001.img   ...  ...  ...   │      │      └──  DRT240_11000eV_037.img   │      └──  DRT322   │              ├──  DRT322_10000eV_001.img   ...          ...   │              └──  DRT322_11000eV_030.img   │   ├──  collected_images.csv   ├──  rejected_samples.txt   └──  run_log.txt     YW-RECON: Prospective & Retrospective Provenance … (almost) for free! cassette_id sample_score_cutoff sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv calibration_image file:calibration.img initialize_run run_log file:run/run_log.txt load_screening_results sample_namesample_quality calculate_strategy rejected_sample accepted_sample num_images energies log_rejected_sample rejection_log file:/run/rejected_samples.txt collect_data_set sample_id energy frame_number raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw transform_images corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img total_intensitypixel_count corrected_image_path log_average_image_intensity collection_log file:run/collected_images.csv • URI-templates link conceptual entities to runtime provenance “left behind” by the script author … • … facilitating provenance reconstructionProvenance @ SBBD'16
  • 71. YW (prospective) and YW-Recon (retrospective) Provenance • 1. YW: Annotate Script => YW Model – Annotate @BEGIN..@END, @IN, @OUT – Visualize, share, be happy J • 2. Run script – Files are read and written – Folder- & Filenames have metadata • 3. YW-Recon – Use @URI tags that link YW Model ó Persisted Data – Run URI-template queries • cf. “ls -R” & RegEx matching • 4. YW-Query – Answer the user’s provenance queries Provenance @ SBBD'16
  • 72. initialize_run run_log file:run/run_log.txt load_screening_results sample_name sample_quality calculate_strategy rejected_sample accepted_sample num_imagesenergies log_rejected_sample rejection_log file:/run/rejected_samples.txt collect_data_set sample_idenergyframe_number raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw transform_images corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img total_intensitypixel_count corrected_image_path log_average_image_intensity collection_log file:run/collected_images.csv sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv calibration_image file:calibration.img cassette_id sample_score_cutoff Data collection workflow: runtime data run/   ├──  raw   │      └──  q55   │              ├──  DRT240   │              │      ├──  e10000   │              │      │      ├──  image_001.raw   ...          ...  ...  ...   │              │      │      └──  image_037.raw   │              │      └──  e11000   │              │              ├──  image_001.raw   ...          ...          ...   │              │              └──  image_037.raw   │              └──  DRT322   │                      ├──  e10000   │                      │      ├──  image_001.raw   ...                  ...  ...   │                      │      └──  image_030.raw   │                      └──  e11000   │                              ├──  image_001.raw   ...                          ...   │                              └──  image_030.raw   ├──  data   │      ├──  DRT240   │      │      ├──  DRT240_10000eV_001.img   ...  ...  ...   │      │      └──  DRT240_11000eV_037.img   │      └──  DRT322   │              ├──  DRT322_10000eV_001.img   ...          ...   │              └──  DRT322_11000eV_030.img   │   ├──  collected_images.csv   ├──  rejected_samples.txt   └──  run_log.txt     1. YW annotations => YW model 2. Files & Folders left by a run => runtime (meta-)data Provenance @ SBBD'16
  • 73. initialize_run run_log file:run/run_log.txt load_screening_results sample_name sample_quality calculate_strategy rejected_sample accepted_sample num_imagesenergies log_rejected_sample rejection_log file:/run/rejected_samples.txt collect_data_set sample_idenergyframe_number raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw transform_images corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img total_intensitypixel_count corrected_image_path log_average_image_intensity collection_log file:run/collected_images.csv sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv calibration_image file:calibration.img cassette_id sample_score_cutoff Q1: What samples did the script run collect images from? run/   ├──  raw   │      └──  q55   │              ├──  DRT240   │              │      ├──  e10000   │              │      │      ├──  image_001.raw   ...          ...  ...  ...   │              │      │      └──  image_037.raw   │              │      └──  e11000   │              │              ├──  image_001.raw   ...          ...          ...   │              │              └──  image_037.raw   │              └──  DRT322   │                      ├──  e10000   │                      │      ├──  image_001.raw   ...                  ...  ...   │                      │      └──  image_030.raw   │                      └──  e11000   │                              ├──  image_001.raw   ...                          ...   │                              └──  image_030.raw   ├──  data   │      ├──  DRT240   │      │      ├──  DRT240_10000eV_001.img   ...  ...  ...   │      │      └──  DRT240_11000eV_037.img   │      └──  DRT322   │              ├──  DRT322_10000eV_001.img   ...          ...   │              └──  DRT322_11000eV_030.img   │   ├──  collected_images.csv   ├──  rejected_samples.txt   └──  run_log.txt     Provenance @ SBBD'16
  • 74. initialize_run run_log file:run/run_log.txt load_screening_results sample_name sample_quality calculate_strategy rejected_sample accepted_sample num_imagesenergies log_rejected_sample rejection_log file:/run/rejected_samples.txt collect_data_set sample_idenergyframe_number raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw transform_images corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img total_intensitypixel_count corrected_image_path log_average_image_intensity collection_log file:run/collected_images.csv sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv calibration_image file:calibration.img cassette_id sample_score_cutoff Q2: What energies were used for image collection from sample DRT322? run/   ├──  raw   │      └──  q55   │              ├──  DRT240   │              │      ├──  e10000   │              │      │      ├──  image_001.raw   ...          ...  ...  ...   │              │      │      └──  image_037.raw   │              │      └──  e11000   │              │              ├──  image_001.raw   ...          ...          ...   │              │              └──  image_037.raw   │              └──  DRT322   │                      ├──  e10000   │                      │      ├──  image_001.raw   ...                  ...  ...   │                      │      └──  image_030.raw   │                      └──  e11000   │                              ├──  image_001.raw   ...                          ...   │                              └──  image_030.raw   ├──  data   │      ├──  DRT240   │      │      ├──  DRT240_10000eV_001.img   ...  ...  ...   │      │      └──  DRT240_11000eV_037.img   │      └──  DRT322   │              ├──  DRT322_10000eV_001.img   ...          ...   │              └──  DRT322_11000eV_030.img   │   ├──  collected_images.csv   ├──  rejected_samples.txt   └──  run_log.txt     Provenance @ SBBD'16
  • 75. initialize_run run_log file:run/run_log.txt load_screening_results sample_name sample_quality calculate_strategy rejected_sample accepted_sample num_imagesenergies log_rejected_sample rejection_log file:/run/rejected_samples.txt collect_data_set sample_idenergyframe_number raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw transform_images corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img total_intensitypixel_count corrected_image_path log_average_image_intensity collection_log file:run/collected_images.csv sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv calibration_image file:calibration.img cassette_id sample_score_cutoff Q3: Where is the raw image of the corrected image DRT322_11000ev_030.img? run/   ├──  raw   │      └──  q55   │              ├──  DRT240   │              │      ├──  e10000   │              │      │      ├──  image_001.raw   ...          ...  ...  ...   │              │      │      └──  image_037.raw   │              │      └──  e11000   │              │              ├──  image_001.raw   ...          ...          ...   │              │              └──  image_037.raw   │              └──  DRT322   │                      ├──  e10000   │                      │      ├──  image_001.raw   ...                  ...  ...   │                      │      └──  image_030.raw   │                      └──  e11000   │                              ├──  image_001.raw   ...                          ...   │                              └──  image_030.raw   ├──  data   │      ├──  DRT240   │      │      ├──  DRT240_10000eV_001.img   ...  ...  ...   │      │      └──  DRT240_11000eV_037.img   │      └──  DRT322   │              ├──  DRT322_10000eV_001.img   ...          ...   │              └──  DRT322_11000eV_030.img   │   ├──  collected_images.csv   ├──  rejected_samples.txt   └──  run_log.txt     Provenance @ SBBD'16
  • 76. initialize_run run_log file:run/run_log.txt load_screening_results sample_name sample_quality calculate_strategy rejected_sample accepted_sample num_imagesenergies log_rejected_sample rejection_log file:/run/rejected_samples.txt collect_data_set sample_idenergyframe_number raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw transform_images corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img total_intensitypixel_count corrected_image_path log_average_image_intensity collection_log file:run/collected_images.csv sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv calibration_image file:calibration.img cassette_id sample_score_cutoff run/   ├──  raw   │      └──  q55   │              ├──  DRT240   │              │      ├──  e10000   │              │      │      ├──  image_001.raw   ...          ...  ...  ...   │              │      │      └──  image_037.raw   │              │      └──  e11000   │              │              ├──  image_001.raw   ...          ...          ...   │              │              └──  image_037.raw   │              └──  DRT322   │                      ├──  e10000   │                      │      ├──  image_001.raw   ...                  ...  ...   │                      │      └──  image_030.raw   │                      └──  e11000   │                              ├──  image_001.raw   ...                          ...   │                              └──  image_030.raw   ├──  data   │      ├──  DRT240   │      │      ├──  DRT240_10000eV_001.img   ...  ...  ...   │      │      └──  DRT240_11000eV_037.img   │      └──  DRT322   │              ├──  DRT322_10000eV_001.img   ...          ...   │              └──  DRT322_11000eV_030.img   │   ├──  collected_images.csv   ├──  rejected_samples.txt   └──  run_log.txt     Q5: What cassette-id had the sample leading to DRT240_10000ev_001.img? Provenance @ SBBD'16
  • 79. Using Provenance from Script Runs Example from the log-file: 2016-06-07 20:32:36 Wrote run/data/DRT240/DRT240_11000eV_002.img But how was that image derived?? (“Provenance for Self!”)Provenance @ SBBD'16
  • 80. module.__build_class__ module.__build_class__ simulate_data_collection 180 return 180 run_logger 201 return 201 new_image_file 230 parser 231 cassette_id 236 add_option 241 add_option 246 add_option 248 set_usage 251 parse_args 251 args 251 options 254 module.len 24 cassette_id 24 sample_score_cutoff 24 data_redundancy 24 calibration_image_file 30 exists 33 exists 32 filepath 34 module.remove 33 exists 32 filepath 34 module.remove 33 exists 32 filepath 34 module.remove 36 run_log 37 write 38 str(sample_score_cutoff) 38 write 38 str(sample_score_cutoff) 49 str.format 49 sample_spreadsheet_file 50 spreadsheet_rows cassette_q55_spreadsheet.csv 50 spreadsheet_rows(sample_spreadsheet_file) 51 str.format 51 write 50 sample_name 50 sample_quality 61 calculate_strategy 61 rejected_sample 61 energies 61 accepted_sample 61 num_images 72 str.format 72 write 73 open 73 rejection_log 74 str.format 74 TextIOWrapper.write 50 spreadsheet_rows 50 spreadsheet_rows(sample_spreadsheet_file) 51 str.format 51 write 50 sample_name 50 sample_quality 61 calculate_strategy 61 rejected_sample 61 energies 61 accepted_sample 61 num_images 90 str.format 90 write 91 sample_id 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image calibration.img 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 50 spreadsheet_rows 50 spreadsheet_rows(sample_spreadsheet_file) 51 str.format 51 write 50 sample_name 50 sample_quality 61 calculate_strategy 61 rejected_sample 61 energies 61 accepted_sample 61 num_images 90 str.format 90 write 91 sample_id 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 50 spreadsheet_rows 128 return run/run_log.txt run/rejected_samples.txt run/raw/q55/DRT240/e10000/image_001.raw run/data/DRT240/DRT240_10000eV_001.img run/collected_images.csv run/raw/q55/DRT240/e10000/image_002.raw run/data/DRT240/DRT240_10000eV_002.img run/raw/q55/DRT240/e11000/image_001.raw run/data/DRT240/DRT240_11000eV_001.img run/raw/q55/DRT240/e11000/image_002.raw run/data/DRT240/DRT240_11000eV_002.img run/raw/q55/DRT240/e12000/image_001.raw run/data/DRT240/DRT240_12000eV_001.img run/raw/q55/DRT240/e12000/image_002.raw run/data/DRT240/DRT240_12000eV_002.img run/raw/q55/DRT322/e10000/image_001.raw run/data/DRT322/DRT322_10000eV_001.img run/raw/q55/DRT322/e10000/image_002.raw run/data/DRT322/DRT322_10000eV_002.img run/raw/q55/DRT322/e11000/image_001.raw run/data/DRT322/DRT322_11000eV_001.img run/raw/q55/DRT322/e11000/image_002.raw run/data/DRT322/DRT322_11000eV_002.img noWorkflow: not only Workflow! • Scripts have provenance, too! • Transparently capture some/all provenance from Python script runs. • Use filter queries to “zoom” into relevant parts .. Provenance @ SBBD'16
  • 81. simulate_data_collection 230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8> 251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55']) 251 args = ['q55'] 251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}> 24 cassette_id = 'q55' 24 sample_score_cutoff = 12.0 24 data_redundancy = 0.0 24 calibration_image_file = 'calibration.img' 49 str.format 49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv' 50 spreadsheet_rows(sample_spreadsheet_file) 50 sample_name = 'DRT240'50 sample_quality = 45 61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000]) 61 accepted_sample = 'DRT240'61 num_images = 2 61 energies = [10000, 11000, 12000] 91 sample_id = 'DRT240' 92 collect_next_image(casset ... _{frame_number:03d}.raw') 92 energy = 11000 92 frame_number = 292 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw' 106 str.format 106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img') calibration.img run/data/DRT240/DRT240_11000eV_002.img $ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img" $(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS) now helper df_style.py now dataflow -v 55 -f $(RETROSPECTIVE_LINEAGE_VALUE) -m simulation | python df_style.py -d BT -e > $(NW_FILTERED_LINEAGE_GRAPH).gv .. auto-“make” this! noWorkflow lineage of an image file Provenance information about Python function calls, variable assignments, etc. Provenance @ SBBD'16
  • 82. simulate_data_collection initialize_run run_log load_screening_results sample_namesample_quality calculate_strategy accepted_samplerejected_sample num_imagesenergies log_rejected_sample rejection_log collect_data_set sample_id energyframe_number raw_image transform_images corrected_imagetotal_intensitypixel_count log_average_image_intensity collection_log sample_spreadsheet calibration_image sample_score_cutoffdata_redundancy cassette_id YesWorkflow: Yes, scripts are Workflows, too! • Use YW annotations @begin...@end, @in, @out to reveal hidden conceptual workflow (prospective provenance) • Script isn't changed: – annotations via comments (=> language independent) • For understanding and sharing the “big picture” • Query and visualize! Provenance @ SBBD'16
  • 83. Alternate YW Views simulate_data_collection initialize_run load_screening_results calculate_strategy log_rejected_sample collect_data_set transform_images log_average_image_intensity simulate_data_collection initialize_run run_log load_screening_results sample_namesample_quality calculate_strategy accepted_samplerejected_sample num_imagesenergies log_rejected_sample rejection_log collect_data_set sample_id energyframe_number raw_image transform_images corrected_imagetotal_intensitypixel_count log_average_image_intensity collection_log sample_spreadsheet calibration_image sample_score_cutoffdata_redundancy cassette_id Process view Data view Workflow view Provenance @ SBBD'16
  • 84. simulate_data_collection initialize_run run_log load_screening_results sample_namesample_quality calculate_strategy accepted_samplerejected_sample num_imagesenergies log_rejected_sample rejection_log collect_data_set sample_id energyframe_number raw_image transform_images corrected_imagetotal_intensitypixel_count log_average_image_intensity collection_log sample_spreadsheet calibration_image sample_score_cutoffdata_redundancy cassette_id What is the lineage of “corrected_image”? From here on “upwards”: What led (leads) to this? .. and what is irrelevant and should be pruned?? Provenance @ SBBD'16
  • 85. simulate_data_collection collect_data_set sample_id energy frame_number raw_image calculate_strategy accepted_sample num_imagesenergies load_screening_results sample_namesample_quality transform_images corrected_image sample_spreadsheet calibration_image sample_score_cutoff data_redundancy cassette_id Subgraph resulting from lineage query on YW workflow model What is the lineage of corrected_image? Provenance @ SBBD'16
  • 86. simulate_data_collection initialize_run run_log load_screening_results sample_namesample_quality calculate_strategy accepted_samplerejected_sample num_imagesenergies log_rejected_sample rejection_log collect_data_set sample_id energyframe_number raw_image transform_images corrected_imagetotal_intensitypixel_count log_average_image_intensity collection_log sample_spreadsheet calibration_image sample_score_cutoffdata_redundancy cassette_id simulate_data_collection collect_data_set sample_id energy frame_number raw_image calculate_strategy accepted_sample num_imagesenergies load_screening_results sample_namesample_quality transform_images corrected_image sample_spreadsheet calibration_image sample_score_cutoff data_redundancy cassette_id module.__build_class__ module.__build_class__ simulate_data_collection 180 return 180 run_logger 201 return 201 new_image_file 230 parser 231 cassette_id 236 add_option 241 add_option 246 add_option 248 set_usage 251 parse_args 251 args 251 options 254 module.len 24 cassette_id 24 sample_score_cutoff 24 data_redundancy 24 calibration_image_file 30 exists 33 exists 32 filepath 34 module.remove 33 exists 32 filepath 34 module.remove 33 exists 32 filepath 34 module.remove 36 run_log 37 write 38 str(sample_score_cutoff) 38 write 38 str(sample_score_cutoff) 49 str.format 49 sample_spreadsheet_file 50 spreadsheet_rows cassette_q55_spreadsheet.csv 50 spreadsheet_rows(sample_spreadsheet_file) 51 str.format 51 write 50 sample_name 50 sample_quality 61 calculate_strategy 61 rejected_sample 61 energies 61 accepted_sample 61 num_images 72 str.format 72 write 73 open 73 rejection_log 74 str.format 74 TextIOWrapper.write 50 spreadsheet_rows 50 spreadsheet_rows(sample_spreadsheet_file) 51 str.format 51 write 50 sample_name 50 sample_quality 61 calculate_strategy 61 rejected_sample 61 energies 61 accepted_sample 61 num_images 90 str.format 90 write 91 sample_id 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image calibration.img 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 50 spreadsheet_rows 50 spreadsheet_rows(sample_spreadsheet_file) 51 str.format 51 write 50 sample_name 50 sample_quality 61 calculate_strategy 61 rejected_sample 61 energies 61 accepted_sample 61 num_images 90 str.format 90 write 91 sample_id 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 50 spreadsheet_rows 128 return run/run_log.txt run/rejected_samples.txt run/raw/q55/DRT240/e10000/image_001.raw run/data/DRT240/DRT240_10000eV_001.img run/collected_images.csv run/raw/q55/DRT240/e10000/image_002.raw run/data/DRT240/DRT240_10000eV_002.img run/raw/q55/DRT240/e11000/image_001.raw run/data/DRT240/DRT240_11000eV_001.img run/raw/q55/DRT240/e11000/image_002.raw run/data/DRT240/DRT240_11000eV_002.img run/raw/q55/DRT240/e12000/image_001.raw run/data/DRT240/DRT240_12000eV_001.img run/raw/q55/DRT240/e12000/image_002.raw run/data/DRT240/DRT240_12000eV_002.img run/raw/q55/DRT322/e10000/image_001.raw run/data/DRT322/DRT322_10000eV_001.img run/raw/q55/DRT322/e10000/image_002.raw run/data/DRT322/DRT322_10000eV_002.img run/raw/q55/DRT322/e11000/image_001.raw run/data/DRT322/DRT322_11000eV_001.img run/raw/q55/DRT322/e11000/image_002.raw run/data/DRT322/DRT322_11000eV_002.img simulate_data_collection 230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8> 251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55']) 251 args = ['q55'] 251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}> 24 cassette_id = 'q55' 24 sample_score_cutoff = 12.0 24 data_redundancy = 0.0 24 calibration_image_file = 'calibration.img' 49 str.format 49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv' 50 spreadsheet_rows(sample_spreadsheet_file) 50 sample_name = 'DRT240'50 sample_quality = 45 61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000]) 61 accepted_sample = 'DRT240'61 num_images = 2 61 energies = [10000, 11000, 12000] 91 sample_id = 'DRT240' 92 collect_next_image(casset ... _{frame_number:03d}.raw') 92 energy = 11000 92 frame_number = 292 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw' 106 str.format 106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img') calibration.img run/data/DRT240/DRT240_11000eV_002.img lineage query lineage query YesWorkflow: Conceptual workflow model noWorkflow: Python trace model But how do we bridge this gap??? Would like to use YW model to query NW data! Provenance @ SBBD'16
  • 87. We’re off to see the Wizard of Prov ... We're off to see the Wizard, The wonderful Wizard of Prov! -- We hear he is a wiz of a wiz If ever a wiz there was. -- If ever, oh ever, a wiz there was, The Wizard of Prov is one because, Because, because, because, because, because, Because of the wonderful things he does. • Enrich YW conceptual view with NW Python provenance! • Get the best of both worlds! • How hard can it be to bridge YW and NW … (cf. TaPP’15 prototype) Provenance @ SBBD'16
  • 91. Secret Reproducible Sauce • Combining provenance information from noWorkflow and YesWorkflow • Using all the good stuff: – make, docker, Prolog, SQL, Graphviz • Open source – github.com/yesworkflow-org/yw-noworkflow – github.com/gems-uff/yin-yang-demo • Have a closer look at the demo! Provenance @ SBBD'16