1. Provenance Central:
More Mileage from Provenance Metadata
Bertram Ludäscher
UC Davis, USA
ludaesch@ucdavis.edu
Paolo Missier
Newcastle University, UK
paolo.missier@ncl.ac.uk
Members of the DataONE Provenance Working Group
CAMP-4-DATA workshop @ iPRES 2013
Sept. 6, 2013
Lisbon, Portugal
Friday, 6 September 13
2. Outline
• A foundation for provenance management: the PROV data model
– From the W3C; a Recommendation as of Spring 2013
– Generic, extensible model
• The role of provenance in the DataONE project
– Provenance enables search and discovery, reuse, reproducibility
– PBase: Provenance warehousing
– Integration with the DataONE architecture
– Provenance mining: the social life of research data
3. PROV: scope and structure
[Figure: the PROV family of documents, with the Recommendation-track documents marked; plus: PROV-Dictionary]
Source: http://www.w3.org/TR/prov-overview/
5. PROV Core Elements (graph depiction)
• Entity: a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
• Activity: something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.
• Agent: something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
[Figure: example PROV graph, laid out from remote past to recent past]
• Activities: drafting, commenting
• Entities: paper1, paper2, draft v1, draft comments
• Agents: Alice, Bob
• Relations: used, wasGeneratedBy, wasAssociatedWith, actedOnBehalfOf
• Attributes shown: distribution=internal, status=draft, version=0.1, type=person, ex:role=main_editor, ex:role=sr_editor, prov:role=editor, time=...
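The three core elements and the relations between them can be sketched in a few lines of Python. This is only an illustration: `ProvGraph` and `SIGNATURES` are hypothetical names, not part of any PROV library, and the particular edges below are an assumed reading of the figure, whose exact layout is not recoverable here. The sketch checks that each relation connects the kinds of nodes PROV allows.

```python
from dataclasses import dataclass, field

# (subject kind, object kind) allowed for each PROV relation named on the slide
SIGNATURES = {
    "used": ("activity", "entity"),
    "wasGeneratedBy": ("entity", "activity"),
    "wasAssociatedWith": ("activity", "agent"),
    "actedOnBehalfOf": ("agent", "agent"),
}


@dataclass
class ProvGraph:
    """Hypothetical in-memory provenance graph (illustration only)."""
    kinds: dict = field(default_factory=dict)   # node id -> "entity" | "activity" | "agent"
    attrs: dict = field(default_factory=dict)   # node id -> attribute dict
    edges: list = field(default_factory=list)   # (subject, relation, object) triples

    def node(self, node_id, kind, **attributes):
        self.kinds[node_id] = kind
        self.attrs[node_id] = attributes

    def relate(self, subject, relation, obj):
        expected = SIGNATURES[relation]
        actual = (self.kinds[subject], self.kinds[obj])
        if actual != expected:
            raise ValueError(f"{relation} expects {expected}, got {actual}")
        self.edges.append((subject, relation, obj))


g = ProvGraph()
g.node("paper1", "entity")
g.node("draft_v1", "entity", status="draft", version="0.1")
g.node("drafting", "activity")
g.node("Alice", "agent", type="person")
g.node("Bob", "agent", type="person")

# Assumed edges: the drafting activity used paper1 and generated draft v1.
g.relate("drafting", "used", "paper1")
g.relate("draft_v1", "wasGeneratedBy", "drafting")
g.relate("drafting", "wasAssociatedWith", "Alice")
g.relate("Alice", "actedOnBehalfOf", "Bob")
```

A call such as `g.relate("Alice", "used", "paper1")` raises `ValueError`, because `used` relates an activity to an entity, not an agent to an entity.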
6. Summary of the PROV Core model
– PROV-DC mapping available
– Recent Tutorial @EDBT’13 (June, 2013) [1]
• Model, Constraints, Applications
[1] Missier, Paolo, Khalid Belhajjame, and James Cheney. “The W3C PROV Family of Specifications for
Modelling Provenance Metadata.” In Procs. EDBT’13 (Tutorial). Genova, Italy: ACM, 2013.
8. Context: ProvWG@DataONE
• DataONE: Data Observation Network for Earth
– 5-year NSF DataNet data preservation project (current phase)
– Provides a large-scale, federated data infrastructure to the Earth sciences community
• Provenance Working Group
– Active until July 2014 (current phase; an extension is being considered)
– One or two interns per year since 2010
– One dedicated researcher (postdoc) since 2012
– 12 core members, plus additional guest members on rotation
• Specific focus on the provenance of workflow-based e-science data
9. DataONE collaboration scenario - 2012
Alice’s Workflow: generates benchmark climate data for model comparison
Input is retrieved from DataONE to generate an output file
10. DataONE collaboration scenario - 2012
The workflow, provenance, and other metadata are uploaded to DataONE
A data package is created and indexed
11. Searching
Bob: Search based on keywords in the metadata
➡ including provenance terms
Bob discovers Alice’s workflow. He may be able to execute it again
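14. DataONE Provenance components I: D-PROV
[Figure: D-PROV example graph linking a workflow's structure to its execution trace]
• Workflow structure: wf, tasks T1 and T2, output port op1, input port ip1
• Structure relations: isTaskOf, hasInputPort, hasOutputPort, dataLink
• Execution trace: wfInv, T1Inv, T2Inv, data item d
• Trace relations: wasAssociatedWith, wasStartedBy, onOutPort, onInPort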
D-PROV extends PROV: it connects trace metadata to workflow structure.
Missier, Paolo, Saumen Dey, Khalid Belhajjame, Víctor Cuevas-Vicenttín, and Bertram Ludäscher. “D-PROV: Extending the
PROV Provenance Model with Workflow Structure.” In Procs. TaPP’13. Lombard, IL, 2013.
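The idea of tying a trace back to the workflow definition can be sketched with plain triples. This is not the D-PROV vocabulary as specified in the paper: the node and relation names come from the slide, but every edge below, the `generated_on` and `used_on` tables, and the helper functions are assumptions made for illustration.

```python
# Workflow structure (assumed edges): two tasks connected by a data link.
structure = [
    ("T1", "isTaskOf", "wf"),
    ("T2", "isTaskOf", "wf"),
    ("T1", "hasOutputPort", "op1"),
    ("T2", "hasInputPort", "ip1"),
    ("op1", "dataLink", "ip1"),
]

# Execution trace (assumed edges): one invocation per task, started by wfInv.
trace = [
    ("wfInv", "wasAssociatedWith", "wf"),
    ("T1Inv", "wasAssociatedWith", "T1"),
    ("T2Inv", "wasAssociatedWith", "T2"),
    ("T1Inv", "wasStartedBy", "wfInv"),
    ("T2Inv", "wasStartedBy", "wfInv"),
]

# Hypothetical port annotations: where data item d was produced and consumed.
generated_on = {"d": ("T1Inv", "op1")}    # data -> (invocation, output port)
used_on = {"d": [("T2Inv", "ip1")]}       # data -> [(invocation, input port)]


def objects(triples, subject, relation):
    """All objects o such that (subject, relation, o) is in triples."""
    return [o for s, r, o in triples if s == subject and r == relation]


def explain(data_id):
    """Map a data item's trace back to tasks and check it against the declared data links."""
    producer_inv, out_port = generated_on[data_id]
    producer_task = objects(trace, producer_inv, "wasAssociatedWith")[0]
    results = []
    for consumer_inv, in_port in used_on.get(data_id, []):
        consumer_task = objects(trace, consumer_inv, "wasAssociatedWith")[0]
        declared = in_port in objects(structure, out_port, "dataLink")
        results.append((producer_task, consumer_task, declared))
    return results


flows = explain("d")  # [("T1", "T2", True)]
```

The last element of each tuple says whether the observed data flow matches a `dataLink` declared in the workflow, which is the kind of question plain PROV traces cannot answer without the structural part.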
17. DataONE Provenance components II: PBase
[Figure: PBase architecture]
• Converters: R ➞ DProv, T ➞ DProv, V ➞ DProv, eSc ➞ DProv, Tr ➞ DProv, K ➞ DProv
• In-house components
• Neo4j loader
• Graph storage: Neo4j graph DBMS [AllegroGraph] [Graph-*]
• Query layer: Cypher (Neo, declarative) [Gremlin (procedural)]
– Can we do better? Scaling graph queries
• Indexing
– Can we do better than the built-in Neo indexing?
• Analytical services
– To be developed
18. Baseline provenance queries in PBase
Ancestors and descendants (lineage): [2,3]
– Which datasets were involved in the production of data at node “e33”?
– Reachability: was task “e11_miny” involved in producing data at node “e38”?
Execution analysis: [3]
– Which tasks did not execute to completion for execution X of a given workflow?
– Find all inputs [outputs] of a given workflow across all its executions
– Given a data item, find all workflows / tasks that have used it as input
– Suppose we discover that service S has a bug: which data products were impacted by it?
– How many times was task T activated across a pool of workflow executions?
Provenance differencing: [4]
– Why do the results from two executions of the same workflow differ?
Attribution: [5]
– Who was responsible for this {data {usage, production}, service invocation}?
[2] Dey, Saumen, Víctor Cuevas-Vicenttín, Sven Köhler, Eric Gribkoff, Michael Wang, and Bertram Ludäscher. “On
Implementing Provenance-Aware Regular Path Queries with Relational Query Engines.” In Procs. Joint
EDBT/ICDT 2013 Workshops, pp. 214-223. ACM, 2013.
[3] Dey, Saumen, Sven Köhler, Shawn Bowers, and Bertram Ludäscher. “Datalog as a Lingua Franca for
Provenance Querying and Reasoning.” In Procs. TaPP’12, 2012.
[4] Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for
Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience, 2013.
[5] Missier, Paolo, Bertram Ludäscher, Saumen Dey, Michael Wang, Tim McPhillips, Shawn Bowers, Michael Agun,
and Ilkay Altintas. “Golden Trail: Retrieving the Data History that Matters from a Comprehensive Provenance
Repository.” International Journal of Digital Curation 7, no. 1 (2012): 139-150.
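The lineage and reachability queries above reduce to graph traversal. The sketch below uses plain Python rather than the Neo4j/Cypher stack PBase is built on, so it is not the PBase implementation. The node names `e33`, `e11_miny` and `e38` are taken from the slide; the dependency edges, the extra nodes `e05` and `e07`, and the function names are invented for illustration.

```python
from collections import deque

# Hypothetical trace: each node maps to the nodes it directly depends on.
DEPENDS_ON = {
    "e38": ["e11_miny"],
    "e11_miny": ["e33"],
    "e33": ["e05", "e07"],
    "e05": [],
    "e07": [],
}


def ancestors(node, depends_on=DEPENDS_ON):
    """Everything involved in producing `node` (breadth-first traversal)."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for parent in depends_on.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def descendants(node, depends_on=DEPENDS_ON):
    """Everything impacted by `node`, e.g. by a buggy service or dataset."""
    return {n for n in depends_on if node in ancestors(n, depends_on)}


def was_involved(candidate, target, depends_on=DEPENDS_ON):
    """Reachability: did `candidate` contribute to `target`?"""
    return candidate in ancestors(target, depends_on)


lineage_e33 = ancestors("e33")                # {"e05", "e07"}
reachable = was_involved("e11_miny", "e38")   # True
impacted = descendants("e33")                 # {"e11_miny", "e38"}
```

Recomputing `ancestors` for every node in `descendants` is quadratic and only acceptable for a toy graph; at scale this is the sort of query that the graph store and the work cited in [2,3] address.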
19. Application - The social life of research data
• We know all about searching in the publications space
– who else is working on problems similar to mine?
– which results are available?
• In the data and process space:
1. Search and discovery
• Who else has used the {datasets, services, workflows,...} I am using?
– how do others rate them?
• Who used my {datasets, services, workflows,...}? How were they used?
2. Reuse, reproduction, validation
• Can I reproduce these results?
– using exactly the same method
– using a variation of the method
• How do I apply this method to my data?
• ...
Social provenance for community building
20. From Pull (client queries) to Push (notifications)
• Uncovering latent connections amongst services / data / people:
– Ranking, clustering, association rules
– Requires new similarity metrics
• A recommender system for scientists
– Analytics layer activated when new traces are added
• Challenges:
– How large a corpus of provenance graphs is needed?
– How global should the community be?
• Little new to discover in a small community
– Requires graphs with rich attribution / association relations
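As a baseline for the kind of similarity such a recommender would need, the sketch below scores pairs of scientists by the Jaccard overlap of the resources they have used. Jaccard similarity is a standard measure swapped in here for illustration, not the new metric the slide calls for; the users, resource names, threshold, and function names are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical usage records, as might be extracted from provenance traces.
USAGE = {
    "alice": {"dataset:climate_benchmark", "service:regrid", "workflow:model_compare"},
    "bob": {"dataset:climate_benchmark", "service:regrid"},
    "carol": {"dataset:soil_moisture"},
}


def recommend_peers(user, usage=USAGE, threshold=0.5):
    """Other users whose resource usage overlaps with `user`, most similar first."""
    scores = {
        other: jaccard(usage[user], items)
        for other, items in usage.items()
        if other != user
    }
    matches = [other for other, score in scores.items() if score >= threshold]
    return sorted(matches, key=lambda other: -scores[other])


peers_of_alice = recommend_peers("alice")   # ["bob"]
peers_of_carol = recommend_peers("carol")   # []
```

With only three users there is little to discover, which mirrors the slide's point that a small community limits what the analytics layer can uncover.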
[Figure: PBase layers: graph storage, query layer, indexing, analytical services]
21. Credits
Current members of the DataONE Provenance Working Group:
In the USA:
Bertram Ludäscher, UC Davis (co-lead)
Víctor Cuevas-Vicenttín, UC Davis (DataONE postdoc researcher)
Saumen Dey, UC Davis (researcher)
Parisa Kianmajd, UC Davis (intern)
Juliana Freire, NYU-Poly
David Koop, NYU-Poly
Fernando Chirigati, NYU-Poly
Shawn Bowers, Gonzaga University
Ilkay Altintas, SDSC/UCSD
Karthik Ram, UC Berkeley
Yolanda Gil, USC/ISI
Yaxing Wei, ORNL
Dave Vieglais, DataONE Technical Lead
In the UK:
Paolo Missier, Newcastle University
James Cheney, University of Edinburgh
Khalid Belhajjame, University of Manchester