Since its inception, the PROV standard has been widely adopted as a standardized exchange format for provenance information. Surprisingly, this standard is currently not supported by provenance-aware database systems, limiting their interoperability with other provenance-aware systems. In this work we introduce techniques for exporting database provenance as PROV documents, importing PROV graphs alongside data, and linking outputs of an SQL operation to the imported provenance for its inputs. Our implementation in the GProM system offloads generation of PROV documents to the backend database. This implementation enables provenance tracking for applications that use a relational database for managing (part of) their data, but also execute some non-database operations.
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
1. Interoperability for Provenance-aware Databases using PROV and JSON
Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy
Oracle Corporation
Raghav Kapoor, Boris Glavic
Illinois Institute of Technology
Venkatesh Radhakrishnan
Facebook
Xing Niu
Illinois Institute of Technology
xniu7@hawk.iit.edu
3. Introduction
• The PROV standards
A standardized, extensible representation of provenance
graphs
Exchange of provenance information between systems
• Provenance-aware DBMS
Computing the provenance of database operations
E.g., Perm[1], GProM [2], DBNotes[3], Orchestra[4],
LogicBlox[5]
[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In In Search
of Elegance in the Theory and Practice of Computation, pages 291–320. Springer, 2013.
[2] B. Arab, D. Gawlick, V. Radhakrishnan, H. Guo, and B. Glavic. A generic provenance middleware for database queries,
updates, and transactions. In TaPP, 2014.
[3] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases.
VLDB Journal, 14(4):373–396, 2005.
[4] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance.
TODS, 38(3):19, 2013.
[5] S. Huang, T. J. Green, and B. T. Loo. Datalog and emerging applications: an interactive tutorial. In SIGMOD, pages 1213–1216, 2011.
5. Introduction
• Problem:
No relational database system supports tracking of
database provenance as well as import and export of
provenance in PROV
Not capable of exporting provenance into standardized
formats
• E.g., GProM:
Essentially produces wasDerivedFrom edges
• Between the output tuples of a query Q and its inputs.
However, not available as PROV graphs
• No way to track the derivation back to non-database entities
6. Introduction
• GProM System
Computes provenance for database
operations
• Queries, updates, transactions
Using SQL language extensions
• e.g., PROVENANCE OF (SELECT ...)
7. Introduction
• Example of GProM in action
The result of PROVENANCE OF for query Q
Each tuple in this result represents one wasDerivedFrom
assertion
• E.g., tuple to1 was derived from tuple t1
8. Introduction
• Goal: make databases interoperable with other
provenance systems
• Approach:
Export and import of provenance
• PROV-JSON
Propagation of imported provenance
Implemented in GProM using SQL
10. Related Work
• How to integrate provenance graphs by identifying common
elements? [6]
• Address interoperability problem between databases and other
provenance-aware systems through
– Common model for both types of provenance [7][8][9]
– Monitoring database access to link database provenance with other
provenance systems [10][11]
[6] A. Gehani and D. Tariq. Provenance integration. In TaPP, 2014.
[7] U. Acar, P. Buneman, J. Cheney, J. van den Bussche, N. Kwasnikowska, and S. Vansummeren. A graph model of data and workflow
provenance. In TaPP, 2010.
[8] Y. Amsterdamer, S. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen. Putting Lipstick on Pig: Enabling Database-style
Workflow Provenance. PVLDB, 5(4):346–357, 2011.
[9] D. Deutch, Y. Moskovitch, and V. Tannen. A provenance framework for data-dependent process analysis. PVLDB, 7(6), 2014.
[10] F. Chirigati and J. Freire. Towards integrating workflow and database provenance. In IPAW, pages 11–23, 2012.
[11] Q. Pham, T. Malik, B. Glavic, and I. Foster. LDV: Light-weight Database Virtualization. In ICDE, pages 1179–1190, 2015.
12. Overview
• We introduce techniques for exporting database provenance
as PROV documents
• Importing PROV graphs alongside data
• Linking outputs of SQL operations to imported provenance
for their inputs
– Implementation in GProM offloads generation of PROV documents
to backend database
• SQL and string concatenation
14. Export and Import
• Export
– Added TRANSLATE AS clause
• e.g., PROVENANCE OF (SELECT ...) TRANSLATE AS ...
– Construct PROV-JSON document from database
provenance
① Running several projections over the provenance
computation
– E.g., '"_:wgb(' || F0.STATE || '|' || F0."AVG(AGE)" || ')' ...
② Uses aggregation to concatenate all snippets of a certain
type
– E.g., entity nodes, wasGeneratedBy edges, used edges
③ Uses string concatenation to create final document
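The three construction steps above can be sketched outside the database as well. The following Python sketch mimics what the generated SQL does with projections, aggregation, and string concatenation; all row values, attribute names, and node identifiers are hypothetical and only illustrate the shape of the computation:

```python
import json

# Hypothetical provenance rows for query Q: each row pairs a result tuple
# (identified by OUT_ID) with the input tuple it was derived from (PROV_ID).
prov_rows = [
    {"STATE": "Illinois", "AVG_AGE": 50, "OUT_ID": "to2", "PROV_ID": "t2"},
    {"STATE": "Indiana",  "AVG_AGE": 60, "OUT_ID": "to1", "PROV_ID": "t1"},
]

# Step 1: one "projection" per element type creates a JSON snippet per row,
# e.g. a wasGeneratedBy edge from an output entity to the query activity q.
wgb = ['"_:wgb{}": {{"prov:entity": "{}", "prov:activity": "q"}}'.format(i, r["OUT_ID"])
       for i, r in enumerate(prov_rows)]
used = ['"_:u{}": {{"prov:activity": "q", "prov:entity": "{}"}}'.format(i, r["PROV_ID"])
        for i, r in enumerate(prov_rows)]

# Step 2: "aggregation" concatenates all snippets of one type into one string.
wgb_str, used_str = ",".join(wgb), ",".join(used)

# Step 3: a final concatenation assembles the complete PROV-JSON document.
doc = '{{"wasGeneratedBy": {{{}}}, "used": {{{}}}}}'.format(wgb_str, used_str)
parsed = json.loads(doc)  # the concatenated result is valid JSON
```

In the real system these three steps run as SQL projections and aggregation inside the backend database, so the document is produced by the same query that computes the provenance.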
15. Export and Import
• Example: part of the final PROV document
Red dotted lines in DB
16. Export and Import
• Import
Import PROV for an existing relation
Provide a language construct IMPORT PROV FOR ...
Import available PROV graphs for imported tuples and
store them alongside the data
Add three columns to each table to store imported
provenance
• prov_doc: stores a PROV-JSON snippet representing the tuple's
provenance
• prov_eid: indicates which of the entities in this snippet
represents the imported tuple
• prov_time: stores a timestamp of the time when the tuple was
imported
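To illustrate the storage layout, here is a Python sketch of what one imported tuple might look like with the three extra columns populated; the relation, column values, and entity identifiers are hypothetical:

```python
import json
from datetime import datetime, timezone

# Hypothetical imported tuple from a relation "user" with the three extra
# provenance columns: prov_doc holds a PROV-JSON snippet, prov_eid names
# the entity in that snippet which represents this tuple, and prov_time
# records when the tuple was imported.
row = {
    "state": "Illinois", "age": 40, "gender": "male",
    "prov_doc": json.dumps({
        "entity": {"tuple:t2": {}, "tweet:tw2": {}},
        "wasDerivedFrom": {
            "_:wdf1": {"prov:generatedEntity": "tuple:t2",
                       "prov:usedEntity": "tweet:tw2"}},
    }),
    "prov_eid": "tuple:t2",
    "prov_time": datetime(2015, 7, 6, tzinfo=timezone.utc).isoformat(),
}

# prov_eid lets us locate, inside the snippet, the entity for the tuple itself.
doc = json.loads(row["prov_doc"])
tuple_entity = doc["entity"][row["prov_eid"]]
```

This is why prov_eid matters: a snippet typically contains many entities, and the export step needs to know which one stands for the stored tuple.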
17. Export and Import
• Import: example
Relation user with imported provenance
Attribute value d is the PROV graph from the running
example without database activities and entities
18. Export and Import
• Using Imported Provenance During Export
Include the imported provenance as bundles in the
generated PROV graph
• Bundles [13] enable nesting of PROV graphs within
PROV graphs, treating a nested graph as a new entity.
Connect the entities representing input tuples in the
imported provenance to the query activity and output
tuple entities
[13] P. Missier, K. Belhajjame, and J. Cheney. The W3C PROV family of specifications for modelling
provenance metadata. In EDBT, pages 773–776, 2013.
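A minimal sketch of what such a bundled export could look like in PROV-JSON, built here with Python dictionaries; the identifiers (query:q, tuple:t2, bundle:b1, and so on) are hypothetical:

```python
import json

# Imported provenance for input tuple t2 (hypothetical snippet).
imported = {
    "entity": {"tuple:t2": {}, "tweet:tw2": {}},
    "wasDerivedFrom": {"_:wdf1": {"prov:generatedEntity": "tuple:t2",
                                  "prov:usedEntity": "tweet:tw2"}},
}

# Exported document: the imported graph is nested as a bundle (a PROV graph
# treated as a new entity), and the entity for input tuple t2 is connected
# to the query activity and the output-tuple entity.
exported = {
    "activity": {"query:q": {}},
    "entity": {"tuple:to2": {}, "bundle:b1": {"prov:type": "prov:bundle"}},
    "bundle": {"bundle:b1": imported},
    "used": {"_:u1": {"prov:activity": "query:q", "prov:entity": "tuple:t2"}},
    "wasGeneratedBy": {"_:wgb1": {"prov:entity": "tuple:to2",
                                  "prov:activity": "query:q"}},
}
doc = json.dumps(exported)
```

The bundle keeps the imported graph intact as one unit, while the used and wasGeneratedBy edges tie its input-tuple entity into the database-side provenance.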
20. Export and Import
• Handling Updates
If a tuple is modified, that should be reflected when
provenance is exported
• E.g., by running an SQL UPDATE statement
• Example
Assume the user has run an update to correct tuple t1’s age value
(setting age to 70) before running the query
21. Export and Import
• Challenge
How to track the provenance of updates under
transactional semantics
• Solution
GProM uses the novel concept of reenactment
queries
• The user can request the provenance of a past update,
transaction, or set of updates executed within a given
time interval
• Construct PROV document using provenance
for updates computed on-the-fly
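The idea behind reenactment can be illustrated with a small sketch (this is not GProM's implementation, which works via SQL and temporal database features; the function and data below are hypothetical): an update is simulated as a query that maps each input tuple to the version the update would have produced, yielding output/input provenance pairs without replaying the update destructively.

```python
# Sketch: reenact "UPDATE user SET age = 70 WHERE id = 't1'" as a mapping
# over the input relation; each output tuple is paired with the input
# tuple it was derived from, which is exactly the provenance we need.
def reenact_update(rows, predicate, setter):
    """Return (new_version, old_version) pairs for every input tuple."""
    return [(setter(dict(r)) if predicate(r) else dict(r), r) for r in rows]

rows = [{"id": "t1", "age": 17}, {"id": "t2", "age": 40}]
pairs = reenact_update(rows,
                       predicate=lambda r: r["id"] == "t1",
                       setter=lambda r: {**r, "age": 70})
```

Because the reenactment has the same provenance as the original update, the pairs it produces can be translated into PROV edges (e.g., t'1 wasDerivedFrom t1) on the fly at export time.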
23. Experimental Results
• TPC-H [14] benchmark datasets
Scale factor from 0.01 to 10 (10MB up to 10GB size)
• Run on a machine with
2 x AMD Opteron 3.3Ghz Processors
128GB RAM
4 x 1 TB 7.2K RPM disks configured in RAID 5
• Queries
Provenance of a three way join between relations customer,
order, and nation
With additional selection conditions to control selectivity (and,
thus, the size of the exported PROV-JSON document).
[14] TPC. TPC-H Benchmark Specification, 2009.
26. Conclusions and Future Work
Conclusions
• Integrated import and export of provenance represented as
PROV-JSON into/from provenance-aware databases
• Construct PROV graphs on-the-fly using SQL
• Connect database provenance to imported PROV data
Future Work
• Full implementation for updates
• Automatic storage management (e.g., deduplication) for
imported provenance
• Automatic cross-referencing
27. Questions
• My Webpage
– http://www.cs.iit.edu/~dbgroup/people/xniu.php
• Our Group’s Webpage
– http://cs.iit.edu/~dbgroup/research/index.html
• GProM
– http://www.cs.iit.edu/~dbgroup/research/gprom.php
Hello everyone, I am Xing, a Ph.D. student in the database group at the Illinois Institute of Technology.
I am presenting Interoperability for Provenance-aware Databases using PROV and JSON. It is joint work with Raghav and Boris.
This is the outline of my talk. I will first give an introduction to the problem I am solving, in particular provenance-aware systems, and explain why interoperability between provenance-aware systems is hard to achieve.
Then I will give an overview of related work. Afterwards I will present an overview of our solution.
Then I will discuss in detail how we realize export and import of provenance in our approach.
Finally, I will show some experimental results and finish with conclusions and future work.
The PROV standard is a standardized, extensible representation of provenance graphs, which is very useful for exchanging provenance information between systems.
A provenance-aware database management system computes the provenance of database operations.
Examples include Perm, GProM, DBNotes, Orchestra, and LogicBlox.
Here we use an example of a simplified PROV graph, which will help us illustrate the interoperability problems that arise when data is created by multiple systems.
The example is about extracting demographic information for Twitter users from monthly logs of tweets.
For our example pipeline, we have two input files, Jan and Feb, which are monthly logs of tweets. They are fed into extractors, E1 and E2 here, which extract single tweets from each file.
In this case, we get tweets tw1 to tw3.
Then the three tweets are fed into a classifier which predicts demographic information about the user, in this case the state, age, and gender of the user.
For each tweet, we insert a tuple, the result of the classifier, into the database. For example, this would be t1 for the first tweet.
Then we run a query, shown at the top here, which computes the average age of Twitter users per state.
This gives us two result tuples, to1 and to2; you can see the result here at the bottom.
Such a provenance graph is useful, for example, for tracing erroneous outputs back to the inputs.
For example, the red dotted lines here represent data dependencies, which are wasDerivedFrom edges in the PROV graph.
Of course, such a graph requires that we can track provenance outside and inside of the database and connect the two types of provenance.
Provenance-aware relational database systems support tracking of data provenance, but they do not support import and export of provenance in PROV. In particular, for example, in GProM we can produce wasDerivedFrom edges between the output tuples and the input tuples.
For example, between to1 and t1 (going back to the previous slide).
However, we cannot export this information as a PROV graph, and we lose track of how tuples were created by processes outside the database. For example, we would not know that t1 was actually derived through the whole Twitter extraction workflow.
The GProM system used in this work is a generic provenance middleware which computes provenance over multiple database backends by translating SQL with provenance extensions into the SQL dialect of the backend database system.
The left graph shows the simplified flow of the GProM system. First, the parser receives the provenance request and translates the SQL code into an algebra tree, which is the input of the provenance rewriter. The provenance rewriter then rewrites the algebra tree and generates a new one which computes the provenance of the query. Finally, the SQL code generator translates the algebra tree back into SQL code which can be run on the database.
Now let's see how we would compute the database-side provenance in a system like GProM. We use the PROVENANCE OF SQL language construct, which computes the provenance of the query.
Now let's look at the result of PROVENANCE OF for query Q as returned by GProM; Q is the query shown on the previous slide.
It gives us a table that essentially represents wasDerivedFrom assertions.
For example, consider the first row here: the result tuple is Illinois 50, that is tuple to2, which was derived from our Illinois input tuple, in this case tuple t2.
So this row represents the wasDerivedFrom edge between to2 and t2.
Our goal here is to make provenance-aware databases interoperable with other provenance-aware systems, so that we can create the kind of PROV graphs I showed you in the beginning.
Our approach to achieve this goal is as follows.
We implement export and import of provenance information stored as PROV-JSON.
JSON is the JavaScript Object Notation, a lightweight data-interchange format.
PROV-JSON is a JSON serialization of the PROV data model.
This allows us to import provenance when importing tuples so that we do not lose the link to how a tuple was created, and it allows us to export the provenance of queries to be used in other provenance-aware systems.
This only really works if we also propagate imported provenance during query processing, because otherwise we lose the connection between a query's output tuples and the provenance of the input tuples that were used.
We implemented this in the GProM system using SQL.
There is one approach which tries to link provenance recorded by independent systems by identifying common nodes in independently developed provenance graphs.
This approach differs from ours in that it addresses the problem of finding connections between provenance graphs, but not how to physically integrate them into one object, which is what we are addressing.
Essentially, two other kinds of approaches have been developed for addressing the interoperability problem between databases and other provenance-aware systems: either introducing a common model for both types of provenance, or monitoring access to the database and linking database provenance with other provenance-aware systems.
Our approach differs from these. We also use a common model by using PROV; however, we do not require a central authority for monitoring and recording provenance as both types of systems do. Provenance can be developed independently, and we support interoperability by importing and exporting provenance. This makes our approach less tightly coupled with other systems.
Our approach for tracking provenance of JSON path expressions is similar to work on tracking the provenance of path expressions in XML query languages [12].
[12] J. N. Foster, T. J. Green, and V. Tannen. Annotated XML: Queries and Provenance. In PODS, pages 271–280, 2008.
We introduce techniques for exporting database provenance as PROV documents and importing PROV graphs alongside data.
We link outputs of SQL operations to imported provenance for their inputs.
Our implementation in GProM offloads generation of PROV documents to the backend database using SQL and string concatenation.
For export, we add the new TRANSLATE AS clause, which allows users to specify that the provenance should be translated into a different format such as PROV-JSON.
If the user asks for translation, we construct the PROV-JSON document on the fly by adding new SQL operations on top of the database provenance computation.
The steps are as follows.
First, we run several projections over the result of the provenance computation; each projection creates elements of one particular type, such as entity nodes or wasDerivedFrom and wasGeneratedBy edges.
Essentially, we concatenate fixed string templates with values from the actual tuples. For example, this snippet shows how we would create the JSON fragment for a single wasGeneratedBy edge; we refer to actual attributes and concatenate them with fixed strings. We do this for all types of nodes and edges.
In the next step, we use aggregation to concatenate all snippets of a certain type; for example, all used edges are aggregated into one string representing these edges.
Finally, a projection uses string concatenation to create the final JSON document.
The GProM system can compute provenance for database operations such as queries, updates, and transactions.
It uses SQL language extensions: we add PROVENANCE OF in front of the query.
For export, we add TRANSLATE AS ... at the end to export the provenance in PROV.
We implement this by constructing snippets of the PROV-JSON document.
If the user imports data, she can store any available PROV graphs alongside the data.
Here we add IMPORT PROV FOR ... at the end of the query.
We add three columns to each table to store imported provenance:
prov_doc stores a PROV-JSON snippet representing the tuple's provenance,
prov_eid indicates which of the entities in this snippet represents the imported tuple, and
prov_time stores a timestamp of the time when the tuple was imported.
These attributes are useful for querying imported provenance and, even more importantly, for correctly including it in exported provenance.
(Figure: the example from the fourth slide, without the query and query outputs.)
To export the provenance of a query over data with imported provenance, we include the imported provenance as bundles in the generated PROV graph and connect the entities representing input tuples in the imported provenance to the query activity and output tuple entities.
Bundles enable nesting of PROV graphs within PROV graphs, treating a nested graph as a new entity.
We also handle updates.
If a tuple is modified, e.g., by running an SQL UPDATE statement, then this should be reflected when provenance is exported. For instance, assume the user has run an update to correct tuple t1's age value (setting age to 70) before running the query.
This update should be reflected in the exported provenance.
The update activity should be shown in the graph; here u denotes the update activity. There should be two versions of the tuple t1 entity, where t'1 is the updated version of t1.
- GProM uses temporal database features
The main challenge in supporting updates for export is how to track the provenance of updates under transactional semantics.
This problem has been recently addressed in GProM using the novel concept of reenactment queries.
Reenactment queries are queries which simulate part of the history and have the same provenance as the history, so we can use them to compute provenance.
Using GProM, the user can request the provenance of a past update, transaction, or set of updates executed within a given time interval.
We then construct the PROV document using provenance for updates computed on the fly.
We used TPC-H benchmark datasets with scale factors from 0.01 to 10 (10MB up to 10GB in size).
Experiments were run on the following configuration.
Here we computed the provenance of a three way join between relations customer, order, and nation with additional selection
conditions to control selectivity (and, thus, the size of the exported PROV-JSON document).
Provenance export for queries is fully functional, while import of PROV is currently done manually. Exporting of propagated imported provenance is supported, but we only support the default storage layout.
The two figures show the runtime of these experiments averaged over 100 runs for database sizes of 1GB and 10GB varying the number of
result tuples between 15 and 15K tuples.
Generating the PROV-JSON document comes at some additional cost over just computing the provenance.
However, this cost is linear in the size of the generated document and independent of the size of the database.
To conclude, I showed you our approach of integrating import and export of provenance represented as PROV-JSON into/from provenance-aware databases, achieving the goal of interoperability between provenance-aware databases and other provenance-aware systems. This solves the problem I showed at the beginning of the talk.
We construct PROV graphs on the fly using SQL; the PROV construction is part of the query that generates the provenance.
We connect database provenance to imported PROV data by propagating it during provenance computation, which creates the links between final query results and the provenance of the imported data.
As future work, we plan to finalize our implementation for updates; we already know how to do this, but it is not implemented yet.
It would also be nice to support automatic storage management, for example deduplication, for imported provenance. There are database-side techniques we could use, such as Oracle SecureFiles deduplication, or we could treat this as a design problem.
Finally, it would be nice to have automatic support for cross-referencing, like the approach discussed in the related work section, to support the use case where the user does not know how the provenance graphs relate to each other.