Interoperability for
Provenance-aware Databases
using PROV and JSON
Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy
Oracle Corporation
Raghav Kapoor, Boris Glavic
Illinois Institute of Technology
Venkatesh Radhakrishnan
Facebook
Xing Niu
Illinois Institute of Technology
xniu7@hawk.iit.edu
Outline
① Introduction
② Related work
③ Overview
④ Export and Import
⑤ Experimental Results
⑥ Conclusions and Future Work
Introduction
• The PROV standards
– A standardized, extensible representation of provenance graphs
– Exchange of provenance information between systems
• Provenance-aware DBMS
– Computing the provenance of database operations
– E.g., Perm [1], GProM [2], DBNotes [3], Orchestra [4], LogicBlox [5]
3
[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In In Search of Elegance in the Theory and Practice of Computation, pages 291–320. Springer, 2013.
[2] B. Arab, D. Gawlick, V. Radhakrishnan, H. Guo, and B. Glavic. A Generic Provenance Middleware for Database Queries, Updates, and Transactions. In TaPP, 2014.
[3] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396, 2005.
[4] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative Data Sharing via Update Exchange and Provenance. TODS, 38(3):19, 2013.
[5] S. Huang, T. J. Green, and B. T. Loo. Datalog and Emerging Applications: An Interactive Tutorial. In SIGMOD, pages 1213–1216, 2011.
Introduction
• Example: extracting demographic information
from tweets
4
Introduction
• Problem:
– No relational database system supports both tracking of database provenance and import/export of provenance in PROV
– Thus, these systems are not capable of exporting provenance in standardized formats
• E.g., GProM:
– Essentially produces wasDerivedFrom edges
• Between the output tuples of a query Q and its inputs
– However, these edges are not available as PROV graphs
• No way to trace a derivation back to non-database entities
5
Introduction
• GProM System
– Computes provenance for database operations
• Queries, updates, transactions
– Uses SQL language extensions
• E.g., PROVENANCE OF (SELECT ...)
6
Introduction
• Example of GProM in action
– The result of PROVENANCE OF for a query Q
– Each tuple in this result represents one wasDerivedFrom assertion
• E.g., tuple to1 was derived from tuple t1
7
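To make the mapping concrete, here is a minimal Python sketch (not part of GProM; tuple identifiers and the helper name are illustrative) that renders such derivation pairs as PROV-JSON wasDerivedFrom assertions:

```python
import json

# Hypothetical rows from a PROVENANCE OF query: each row pairs an
# output tuple identifier with the input tuple it was derived from.
provenance_rows = [("to1", "t1"), ("to2", "t2")]

def rows_to_prov_json(rows):
    """Render the rows as wasDerivedFrom assertions in PROV-JSON."""
    doc = {"entity": {}, "wasDerivedFrom": {}}
    for i, (out_id, in_id) in enumerate(rows):
        doc["entity"][out_id] = {}
        doc["entity"][in_id] = {}
        # Each derivation edge gets its own (blank-node style) identifier.
        doc["wasDerivedFrom"]["_:wdf%d" % i] = {
            "prov:generatedEntity": out_id,
            "prov:usedEntity": in_id,
        }
    return json.dumps(doc, indent=2)

print(rows_to_prov_json(provenance_rows))
```

Each relational result tuple thus becomes one edge in the PROV graph, with its endpoints recorded as entities.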
Introduction
• Goal: make databases interoperable with other provenance systems
• Approach:
– Export and import of provenance
• Using PROV-JSON
– Propagation of imported provenance
– Implemented in GProM using SQL
8
Outline
① Introduction
② Related work
③ Overview
④ Export and Import
⑤ Experimental Results
⑥ Conclusions and Future Work
Related Work
• How to integrate provenance graphs by identifying common elements [6]
• Addressing the interoperability problem between databases and other provenance-aware systems through
– A common model for both types of provenance [7][8][9]
– Monitoring database access to link database provenance with other provenance systems [10][11]
10
[6] A. Gehani and D. Tariq. Provenance integration. In TaPP, 2014.
[7] U. Acar, P. Buneman, J. Cheney, J. van den Bussche, N. Kwasnikowska, and S. Vansummeren. A graph model of data and workflow
provenance. In TaPP, 2010.
[8] Y. Amsterdamer, S. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen. Putting Lipstick on Pig: Enabling Database-style
Workflow Provenance. PVLDB, 5(4):346–357, 2011.
[9] D. Deutch, Y. Moskovitch, and V. Tannen. A provenance framework for data-dependent process analysis. PVLDB, 7(6), 2014.
[10] F. Chirigati and J. Freire. Towards integrating workflow and database provenance. In IPAW, pages 11–23, 2012.
[11] Q. Pham, T. Malik, B. Glavic, and I. Foster. LDV: Light-weight Database Virtualization. In ICDE, pages 1179–1190, 2015.
Outline
① Introduction
② Related work
③ Overview
④ Export and Import
⑤ Experimental Results
⑥ Conclusions and Future Work
Overview
• We introduce techniques for
– Exporting database provenance as PROV documents
– Importing PROV graphs alongside data
– Linking the outputs of SQL operations to the imported provenance of their inputs
• The implementation in GProM offloads generation of PROV documents to the backend database
– Using SQL and string concatenation
12
Outline
① Introduction
② Related work
③ Overview
④ Export and Import
⑤ Experimental Results
⑥ Conclusions and Future Work
Export and Import
• Export
– Added a TRANSLATE AS clause
• E.g., PROVENANCE OF (SELECT ...) TRANSLATE AS ...
– Constructs a PROV-JSON document from the database provenance:
① Runs several projections over the provenance computation
– E.g., '"_:wgb(' || F0.STATE || '|' || F0."AVG(AGE)" || ')' ...
② Uses aggregation to concatenate all snippets of a certain type
– E.g., entity nodes, wasGeneratedBy edges, used edges
③ Uses string concatenation to create the final document
14
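The three steps can be mimicked in a few lines of Python (the actual system performs them in SQL on the database backend; the row layout and identifiers here are illustrative assumptions):

```python
import json

# Hypothetical provenance-computation output: (output tuple id, input tuple id)
rows = [("to1", "t1"), ("to2", "t2")]

# Step 1: per-row projections produce JSON snippets of each type,
# analogous to the SQL string-concatenation expressions on the slide.
entity_snippets = [f'"{out}": {{}}' for out, _ in rows] + \
                  [f'"{inp}": {{}}' for _, inp in rows]
wdf_snippets = [
    f'"_:wdf{i}": {{"prov:generatedEntity": "{out}", "prov:usedEntity": "{inp}"}}'
    for i, (out, inp) in enumerate(rows)
]

# Step 2: aggregation concatenates all snippets of a given type.
entities = ", ".join(entity_snippets)
derivations = ", ".join(wdf_snippets)

# Step 3: string concatenation assembles the final document.
doc = '{"entity": {' + entities + '}, "wasDerivedFrom": {' + derivations + '}}'

# Sanity check: the concatenated string is valid JSON.
parsed = json.loads(doc)
print(json.dumps(parsed, indent=2))
```

Building the document as strings rather than objects mirrors how a pure-SQL pipeline has to work: each operator only ever sees and concatenates text.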
Export and Import
• Example: part of the final PROV document
15
(Figure annotation: red dotted lines mark elements in the DB)
Export and Import
• Import
– Import PROV for an existing relation
– Provide a language construct IMPORT PROV FOR ...
– Import available PROV graphs for imported tuples and store them alongside the data
– Add three columns to each table to store imported provenance:
• prov_doc: stores a PROV-JSON snippet representing the tuple's provenance
• prov_eid: indicates which of the entities in this snippet represents the imported tuple
• prov_time: stores a timestamp recording when the tuple was imported
16
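A minimal sketch of how the three columns fit together, assuming a single tuple of a hypothetical user relation (all attribute values and entity names are illustrative):

```python
import datetime
import json

# A tuple's imported provenance: a small PROV-JSON snippet in which one
# entity ("user1") stands for the imported tuple itself.
prov_snippet = {
    "entity": {"tweet1": {}, "user1": {}},
    "wasDerivedFrom": {
        "_:wdf0": {"prov:generatedEntity": "user1",
                   "prov:usedEntity": "tweet1"}
    },
}

row = {
    "name": "Alice",                               # regular attributes
    "age": 70,
    "prov_doc": json.dumps(prov_snippet),          # imported PROV-JSON
    "prov_eid": "user1",                           # entity representing this tuple
    "prov_time": datetime.datetime(2015, 6, 1).isoformat(),  # import time
}

# Invariant: the entity named by prov_eid must exist in the stored snippet.
assert row["prov_eid"] in json.loads(row["prov_doc"])["entity"]
print(row["prov_eid"], row["prov_time"])
```

Storing the snippet per tuple keeps the imported graph available later, when the tuple participates in a query whose provenance is exported.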
Export and Import
• Import: example
– Relation user with imported provenance
– Attribute value d is the PROV graph from the running example, without database activities and entities
17
Export and Import
• Using Imported Provenance During Export
– Include the imported provenance as bundles in the generated PROV graph
• Bundles [13] enable nesting of PROV graphs within PROV graphs, treating a nested graph as a new entity
– Connect the entities representing input tuples in the imported provenance to the query activity and output tuple entities
18
[13] P. Missier, K. Belhajjame, and J. Cheney. The W3C PROV Family of Specifications for Modelling Provenance Metadata. In EDBT, pages 773–776, 2013.
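A sketch of the resulting document shape: the imported snippet becomes a named bundle, and the query activity is linked to the entity that prov_eid names inside it (all identifiers are illustrative, not the system's actual naming scheme):

```python
import json

# The tuple's imported prov_doc snippet and its prov_eid value.
imported = {"entity": {"tweet1": {}, "user1": {}}}
prov_eid = "user1"

doc = {
    # The nested graph is itself declared as an entity of type bundle.
    "entity": {"to1": {}, "bundle1": {"prov:type": "prov:bundle"}},
    "activity": {"q1": {}},                 # the query that produced to1
    "bundle": {"bundle1": imported},        # imported PROV graph, nested
    # Link the query activity to the imported input-tuple entity ...
    "used": {"_:u0": {"prov:activity": "q1", "prov:entity": prov_eid}},
    # ... and the output tuple entity to the query activity.
    "wasGeneratedBy": {"_:g0": {"prov:entity": "to1", "prov:activity": "q1"}},
}
print(json.dumps(doc, indent=2))
```

This way the exported graph traces to1 through the query q1 back into the imported, non-database part of its provenance.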
Export and Import
• Example of Bundles:
19
Export and Import
• Handling Updates
– If a tuple is modified, this should be reflected when provenance is exported
• E.g., modifications made by running an SQL UPDATE statement
• Example
– Assume the user has run an update to correct tuple t1's age value (setting age to 70) before running the query
20
Export and Import
• Challenge
– How to track the provenance of updates under transactional semantics
• Solution
– GProM uses the novel concept of reenactment queries
• The user can request the provenance of a past update, transaction, or set of updates executed within a given time interval
• The PROV document is constructed using provenance for updates computed on-the-fly
21
Outline
① Introduction
② Related work
③ Overview
④ Export and Import
⑤ Experimental Results
⑥ Conclusions and Future Work
Experimental Results
• TPC-H [14] benchmark datasets
– Scale factors from 0.01 to 10 (10MB up to 10GB)
• Run on a machine with
– 2 x AMD Opteron 3.3GHz processors
– 128GB RAM
– 4 x 1TB 7.2K RPM disks configured in RAID 5
• Queries
– Provenance of a three-way join between relations customer, orders, and nation
– With additional selection conditions to control selectivity (and, thus, the size of the exported PROV-JSON document)
23
[14] TPC. TPC-H Benchmark Specification, 2009.
Experimental Results
(Figure: measurements for the 1 GB and 10 GB datasets)
24
Outline
① Introduction
② Related work
③ Overview
④ Export and Import
⑤ Experimental Results
⑥ Conclusions and Future Work
Conclusions and Future Work
Conclusions
• Integrated import and export of provenance represented as
PROV-JSON into/from provenance-aware databases
• Construct PROV graphs on-the-fly using SQL
• Connect database provenance to imported PROV data
Future Work
• Full implementation for updates
• Automatic storage management (e.g., deduplication) for
imported provenance
• Automatic cross-referencing
26
Questions
• My Webpage
– http://www.cs.iit.edu/~dbgroup/people/xniu.php
• Our Group’s Webpage
– http://cs.iit.edu/~dbgroup/research/index.html
• GProM
– http://www.cs.iit.edu/~dbgroup/research/gprom.php
27
Others
• Provenance querying
• Provenance for JSON
28
 

Último (20)

High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 

2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON

  • 1. Interoperability for Provenance-aware Databases using PROV and JSON
    Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy (Oracle Corporation)
    Raghav Kapoor, Boris Glavic (Illinois Institute of Technology)
    Venkatesh Radhakrishnan (Facebook)
    Xing Niu (Illinois Institute of Technology), xniu7@hawk.iit.edu
  • 2. Outline
    ① Introduction
    ② Related Work
    ③ Overview
    ④ Export and Import
    ⑤ Experimental Results
    ⑥ Conclusions and Future Work
  • 3. Introduction
    • The PROV standards
       A standardized, extensible representation of provenance graphs
       Exchange of provenance information between systems
    • Provenance-aware DBMSs
       Compute the provenance of database operations
       E.g., Perm [1], GProM [2], DBNotes [3], Orchestra [4], LogicBlox [5]
    [1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In In Search of Elegance in the Theory and Practice of Computation, pages 291–320. Springer, 2013.
    [2] B. Arab, D. Gawlick, V. Radhakrishnan, H. Guo, and B. Glavic. A Generic Provenance Middleware for Database Queries, Updates, and Transactions. In TaPP, 2014.
    [3] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396, 2005.
    [4] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative Data Sharing via Update Exchange and Provenance. TODS, 38(3):19, 2013.
    [5] S. Huang, T. Green, and B. Loo. Datalog and Emerging Applications: An Interactive Tutorial. In SIGMOD, pages 1213–1216, 2011.
  • 4. Introduction
    • Example: extracting demographic information from tweets
  • 5. Introduction
    • Problem: no relational database system supports tracking of database provenance together with import and export of provenance in PROV; existing systems cannot export provenance in standardized formats.
    • E.g., GProM essentially produces wasDerivedFrom edges between the output tuples of a query Q and its inputs. However, these edges are not available as PROV graphs, and there is no way to trace the derivation back to non-database entities.
  • 6. Introduction
    • The GProM system
       Computes provenance for database operations (queries, updates, transactions)
       Uses SQL language extensions, e.g., PROVENANCE OF (SELECT ...)
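As a concrete illustration, the language extension above might be used as follows. This is a hedged sketch: only the PROVENANCE OF keyword comes from the slides; the relation name users and the exact surface syntax around it are assumptions based on the running example (average age of Twitter users per state).

```sql
-- Request the provenance of the running example's query from GProM.
-- "users" is an illustrative name for the relation of classified tweets.
PROVENANCE OF (
  SELECT state, AVG(age)
  FROM users
  GROUP BY state
);
```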
  • 7. Introduction
    • Example of GProM in action
       The result of PROVENANCE OF for query Q
       Each tuple in this result represents one wasDerivedFrom assertion, e.g., tuple to1 was derived from tuple t1
  • 8. Introduction
    • Goal: make databases interoperable with other provenance systems
    • Approach:
       Export and import of provenance (PROV-JSON)
       Propagation of imported provenance
       Implemented in GProM using SQL
  • 9. Outline
    ① Introduction
    ② Related Work
    ③ Overview
    ④ Export and Import
    ⑤ Experimental Results
    ⑥ Conclusions and Future Work
  • 10. Related Work
    • Integrating provenance graphs by identifying common elements [6]
    • Other work addresses the interoperability problem between databases and other provenance-aware systems through
       a common model for both types of provenance [7][8][9]
       monitoring database access to link database provenance with other provenance systems [10][11]
    [6] A. Gehani and D. Tariq. Provenance Integration. In TaPP, 2014.
    [7] U. Acar, P. Buneman, J. Cheney, J. van den Bussche, N. Kwasnikowska, and S. Vansummeren. A Graph Model of Data and Workflow Provenance. In TaPP, 2010.
    [8] Y. Amsterdamer, S. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen. Putting Lipstick on Pig: Enabling Database-style Workflow Provenance. PVLDB, 5(4):346–357, 2011.
    [9] D. Deutch, Y. Moskovitch, and V. Tannen. A Provenance Framework for Data-Dependent Process Analysis. PVLDB, 7(6), 2014.
    [10] F. Chirigati and J. Freire. Towards Integrating Workflow and Database Provenance. In IPAW, pages 11–23, 2012.
    [11] Q. Pham, T. Malik, B. Glavic, and I. Foster. LDV: Light-weight Database Virtualization. In ICDE, pages 1179–1190, 2015.
  • 11. Outline
    ① Introduction
    ② Related Work
    ③ Overview
    ④ Export and Import
    ⑤ Experimental Results
    ⑥ Conclusions and Future Work
  • 12. Overview
    • We introduce techniques for
       exporting database provenance as PROV documents
       importing PROV graphs alongside data
       linking outputs of SQL operations to the imported provenance of their inputs
    • Our implementation in GProM offloads generation of PROV documents to the backend database using SQL and string concatenation
  • 13. Outline
    ① Introduction
    ② Related Work
    ③ Overview
    ④ Export and Import
    ⑤ Experimental Results
    ⑥ Conclusions and Future Work
  • 14. Export and Import
    • Export
       Added a TRANSLATE AS clause, e.g., PROVENANCE OF (SELECT ...) TRANSLATE AS ...
       Constructs the PROV-JSON document from database provenance:
       ① Run several projections over the provenance computation, e.g., '"_:wgb(' || F0.STATE || '|' || F0."AVG(AGE)" || ')' ...
       ② Use aggregation to concatenate all snippets of a certain type, e.g., entity nodes, wasGeneratedBy edges, used edges
       ③ Use string concatenation to create the final document
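Steps ① and ② above can be sketched roughly as follows. This is not GProM's actual generated SQL: the relation prov_result stands in for the underlying provenance computation, LISTAGG is one backend's string-aggregation function, and the JSON body after the key is illustrative; only the '"_:wgb(' || ... concatenation template is taken from the slide.

```sql
-- Step 1: a projection builds one JSON snippet per wasGeneratedBy edge
--         by concatenating fixed strings with attribute values.
-- Step 2: aggregation concatenates all snippets of that type into one string.
SELECT LISTAGG(
         '"_:wgb(' || F0.STATE || '|' || F0."AVG(AGE)" || ')": '
           || '{ ... }',   -- edge body elided; filled analogously
         ', '
       ) AS wasGeneratedBy_snippets
FROM prov_result F0;
```

A final outer projection (step ③) would concatenate the aggregated strings for all node and edge types into the complete PROV-JSON document.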
  • 15. Export and Import
    • Example: part of the final PROV document, corresponding to the red dotted lines in the database
  • 16. Export and Import
    • Import
       Imports PROV for an existing relation
       Provides a language construct IMPORT PROV FOR ...
       Imports available PROV graphs for imported tuples and stores them alongside the data
       Adds three columns to each table to store imported provenance:
       • prov_doc: stores a PROV-JSON snippet representing the tuple's provenance
       • prov_eid: indicates which of the entities in this snippet represents the imported tuple
       • prov_time: stores a timestamp of when the tuple was imported
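The storage layout described above might look as follows. The three column names and the IMPORT PROV FOR keyword are from the slides; the table name, column types, and DDL form are assumptions, and the elided part of the import statement is left as on the slide.

```sql
-- Illustrative per-table layout for imported provenance:
ALTER TABLE users ADD (
  prov_doc  CLOB,          -- PROV-JSON snippet representing the tuple's provenance
  prov_eid  VARCHAR2(200), -- id of the entity representing the imported tuple
  prov_time TIMESTAMP      -- when the tuple was imported
);

-- Language construct for importing provenance alongside data:
IMPORT PROV FOR ...
```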
  • 17. Export and Import
    • Import: example
       Relation user with imported provenance
       Attribute value d is the PROV graph from the running example, without database activities and entities
  • 18. Export and Import
    • Using imported provenance during export
       Include the imported provenance as bundles in the generated PROV graph
       • Bundles [13] enable nesting of PROV graphs within PROV graphs, treating a nested graph as a new entity
       Connect the entities representing input tuples in the imported provenance to the query activity and output tuple entities
    [13] P. Missier, K. Belhajjame, and J. Cheney. The W3C PROV Family of Specifications for Modelling Provenance Metadata. In EDBT, pages 773–776, 2013.
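To make the bundle mechanism concrete, here is a minimal plain-Python sketch (not GProM output; all identifiers such as _:q, _:b1, and _:tw1 are illustrative) of a PROV-JSON document that nests the imported provenance of input tuple t1 as a bundle and connects t1's entity to the query activity and the output tuple entity:

```python
import json

# PROV-JSON document for one query output tuple.  The provenance
# imported alongside input tuple t1 is nested as bundle _:b1; a
# "used" edge connects the query activity to t1's entity, and a
# "wasGeneratedBy" edge connects the output tuple to the query.
doc = {
    "entity": {"_:to1": {}},                    # query output tuple
    "activity": {"_:q": {}},                    # the query itself
    "wasGeneratedBy": {
        "_:wgb1": {"prov:entity": "_:to1", "prov:activity": "_:q"}
    },
    "used": {
        "_:u1": {"prov:activity": "_:q", "prov:entity": "_:t1"}
    },
    "bundle": {
        "_:b1": {                               # imported provenance of t1
            "entity": {"_:t1": {}, "_:tw1": {}},
            "wasDerivedFrom": {
                "_:wdf1": {
                    "prov:generatedEntity": "_:t1",
                    "prov:usedEntity": "_:tw1",
                }
            },
        }
    },
}

serialized = json.dumps(doc, indent=2)
```

In GProM the serialized string would be produced by the SQL concatenation pipeline rather than by a JSON library, but the nesting structure is the same.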
  • 19. Export and Import
    • Example of bundles
  • 20. Export and Import
    • Handling updates
       If a tuple is modified, e.g., by running an SQL UPDATE statement, this should be reflected when provenance is exported
       Example: assume the user has run an update to correct tuple t1's age value (setting age to 70) before running the query
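The correcting update from the example could look like this. Only the new age value 70 is from the slide; the table name and the key predicate identifying t1 are hypothetical.

```sql
-- Correct t1's age before re-running the query;
-- tid = 1 is a hypothetical key selecting tuple t1.
UPDATE users SET age = 70 WHERE tid = 1;
```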
  • 21. Export and Import
    • Challenge: how to track the provenance of updates under transactional semantics
    • Solution: GProM uses the novel concept of reenactment queries
       The user can request the provenance of a past update, transaction, or set of updates executed within a given time interval
       The PROV document is constructed using provenance for updates computed on the fly
  • 22. Outline
    ① Introduction
    ② Related Work
    ③ Overview
    ④ Export and Import
    ⑤ Experimental Results
    ⑥ Conclusions and Future Work
  • 23. Experimental Results
    • TPC-H benchmark datasets [14]
       Scale factors from 0.01 to 10 (10MB up to 10GB)
    • Run on a machine with
       2 x AMD Opteron 3.3GHz processors
       128GB RAM
       4 x 1TB 7.2K RPM disks configured in RAID 5
    • Queries
       Provenance of a three-way join between relations customer, order, and nation
       With additional selection conditions to control selectivity (and, thus, the size of the exported PROV-JSON document)
    [14] TPC. TPC-H Benchmark Specification, 2009.
  • 25. Outline
    ① Introduction
    ② Related Work
    ③ Overview
    ④ Export and Import
    ⑤ Experimental Results
    ⑥ Conclusions and Future Work
  • 26. Conclusions and Future Work
    • Conclusions
       Integrated import and export of provenance represented as PROV-JSON into/from provenance-aware databases
       Construct PROV graphs on the fly using SQL
       Connect database provenance to imported PROV data
    • Future work
       Full implementation for updates
       Automatic storage management (e.g., deduplication) for imported provenance
       Automatic cross-referencing
  • 27. Questions
    • My webpage: http://www.cs.iit.edu/~dbgroup/people/xniu.php
    • Our group's webpage: http://cs.iit.edu/~dbgroup/research/index.html
    • GProM: http://www.cs.iit.edu/~dbgroup/research/gprom.php
  • 28. Others
    • Provenance querying
    • Provenance for JSON

Editor's notes

  1. Hello everyone, I am Xing, a Ph.D. student in the database group at Illinois Institute of Technology. I am presenting Interoperability for Provenance-aware Databases using PROV and JSON. This is joint work with Raghav and Boris.
  2. This is the outline of my talk. I will first introduce the problem, in particular provenance-aware systems, and explain why interoperability between such systems is hard. Then I will give an overview of related work, present an overview of our solution, and discuss in detail how we realize export and import of provenance in our approach. Finally, I will show some experimental results and finish with conclusions and future work.
  3. The PROV standard is a standardized, extensible representation of provenance graphs, which is very useful for exchanging provenance information between systems. Provenance-aware database management systems compute the provenance of database operations; examples are Perm, GProM, DBNotes, Orchestra, and LogicBlox.
  4. Here we use a simplified PROV graph to illustrate the interoperability problems that arise when data is created by multiple systems. The example is about extracting demographic information from tweets. We have two input files, Jan and Feb, which are monthly logs of tweets. They are fed into extractors E1 and E2, which extract single tweets from each file; in this case we get tw1 to tw3. The three tweets are fed into a classifier that predicts demographic information about the user, here the state, age, and gender. For each tweet we insert the classifier's result as a tuple into the database, e.g., t1 for the first tweet. Then we run the query shown at the top, which computes the average age of Twitter users per state; this gives us two result tuples, to1 and to2, shown at the bottom. Such a provenance graph is useful, for example, for tracing erroneous outputs back to the inputs: the red dotted lines represent data dependencies, which are wasDerivedFrom edges in the PROV graph. Of course, such a graph requires that we can track provenance both outside and inside the database and connect the two types of provenance.
  5. Provenance-aware relational database systems support tracking of data provenance, but they do not support import and export of provenance in PROV. For example, GProM can produce wasDerivedFrom edges between output tuples and input tuples, e.g., between to1 and t1 (see the previous slide). However, we cannot export this information as a PROV graph, and we lose track of how tuples were created by processes outside the database; e.g., we would not know that t1 was actually derived through the Twitter extraction workflow.
  6. The GProM system used in this work is a generic provenance middleware that computes provenance over multiple database backends by translating SQL with provenance extensions into the SQL dialect of the backend database. The figure on the left shows the simplified flow of the system. First, the parser receives the provenance request and translates the SQL code into an algebra tree, which becomes the input of the provenance rewriter. The provenance rewriter then rewrites the algebra tree into a new one that computes the provenance of the query. Finally, the SQL code generator translates the algebra tree back into SQL code that can be run on the database. To compute the database-side provenance we use the PROVENANCE OF SQL language construct.
  7. Now let us look at the result of PROVENANCE OF for query Q (the query shown on the previous slide) in GProM. It gives us a table that essentially represents wasDerivedFrom assertions. For example, in the first row the result tuple is (Illinois, 50), i.e., tuple to2, which was derived from our Illinois input tuple, in this case t2; so this row represents the wasDerivedFrom edge between to2 and t2.
  8. Our goal is to make provenance-aware databases interoperable with other provenance-aware systems, so that we can create the kind of PROV graphs shown at the beginning. Our approach is as follows: we implement export and import of provenance information stored as PROV-JSON. JSON is JavaScript Object Notation, a light-weight data-interchange format, and PROV-JSON is a JSON serialization of the PROV data model. This allows us to import provenance when importing tuples, so we do not lose the link to how each tuple was created, and to export the provenance of queries for use in other provenance-aware systems. This only really works if we also propagate imported provenance during query processing; otherwise we lose the connection between a query's output tuples and the provenance of the input tuples that were used. We implemented this in the GProM system using SQL.
  9. One approach tries to link provenance recorded by independent systems by identifying common nodes in independently developed provenance graphs [6]. It differs from ours in that it addresses the problem of finding connections between provenance graphs, but not how to physically integrate them into one object, which is what we address. Essentially two other lines of work address parts of the interoperability problem between databases and other provenance-aware systems, either by introducing a common model for both types of provenance [7][8][9] or by monitoring database access and linking database provenance with other provenance systems [10][11]. With PROV we also rely on a common model, but instead of requiring a central authority for monitoring and provenance recording, we support interoperability through import and export of provenance; provenance can be developed independently, which makes our approach less tightly coupled with other systems. Our approach for tracking the provenance of JSON path expressions is similar to work on tracking the provenance of path expressions in XML query languages [12]. [12] J. N. Foster, T. J. Green, and V. Tannen. Annotated XML: Queries and Provenance. In PODS, pages 271–280, 2008.
  10. We introduce techniques for exporting database provenance as PROV documents, importing PROV graphs alongside data, and linking the outputs of SQL operations to the imported provenance of their inputs. Our implementation in GProM offloads generation of PROV documents to the backend database using SQL and string concatenation.
  11. For export, we add the new TRANSLATE AS clause, which lets users request the provenance in a different format such as PROV-JSON. If the user asks for translation, we construct the PROV-JSON document on the fly by adding new SQL operations on top of the database provenance computation. The steps are: (1) run several projections over the result of the provenance computation, each creating elements of one particular type, e.g., entity nodes or wasDerivedFrom and wasGeneratedBy edges, by concatenating fixed strings with information from the tuples; the slide shows a snippet of how we create the JSON fragment for a single wasGeneratedBy edge, referring to actual attributes and concatenating them with fixed strings; (2) use aggregation to concatenate all snippets of a certain type into one string, e.g., all used edges are aggregated into one string; (3) a final projection performs string concatenation to create the final JSON document.
  12. If the user imports data, she can store any available PROV graphs alongside the data; we add IMPORT PROV FOR at the end of the query. We add three columns to each table to store imported provenance: prov_doc stores a PROV-JSON snippet representing the tuple's provenance, prov_eid indicates which of the entities in this snippet represents the imported tuple, and prov_time stores a timestamp of when the tuple was imported. These attributes are useful for querying imported provenance and, even more importantly, for correctly including it in exported provenance.
  13. This is the figure from slide 4, without the query and the query outputs.
  14. To export the provenance of a query over data with imported provenance, we include the imported provenance as bundles in the generated PROV graph and connect the entities representing input tuples in the imported provenance to the query activity and the output tuple entities. Bundles enable nesting of PROV graphs within PROV graphs, treating a nested graph as a new entity.
  15. We also handle updates. If a tuple is modified, e.g., by running an SQL UPDATE statement, this should be reflected when provenance is exported. For instance, assume the user has run an update to correct tuple t1's age value (setting age to 70) before running the query. This update should be reflected in the exported provenance: the update activity u appears in the graph, and there are two versions of the tuple t1 entity, where t'1 is the updated version of t1. GProM uses a temporal database for this.
  16. The main challenge in supporting updates for export is how to track the provenance of updates under transactional semantics. This problem has recently been addressed in GProM using the novel concept of reenactment queries, which are queries that simulate part of the history and have the same provenance as the history, so we can use them to compute provenance. Using GProM, the user can request the provenance of a past update, transaction, or set of updates executed within a given time interval, and the PROV document is constructed from provenance for updates computed on the fly.
  17. We used the TPC-H benchmark datasets with scale factors from 0.01 to 10 (10MB up to 10GB). Experiments were run on the configuration shown. We computed the provenance of a three-way join between relations customer, order, and nation with additional selection conditions to control selectivity (and, thus, the size of the exported PROV-JSON document). Provenance export for queries is fully functional, while import of PROV is done manually for now; exporting of propagated imported provenance is supported, but only with the default storage layout.
  18. The two figures show the runtime of these experiments, averaged over 100 runs, for database sizes of 1GB and 10GB, varying the number of result tuples between 15 and 15K. Generating the PROV-JSON document comes at some additional cost over just computing the provenance; however, this cost is linear in the size of the generated document and independent of the size of the database.
  19. To conclude, I showed our approach of integrating import and export of provenance represented as PROV-JSON into and from provenance-aware databases, with the goal of supporting interoperability between provenance-aware databases and other provenance-aware systems; this solves the problem shown at the beginning of the talk. We construct PROV graphs on the fly using SQL; the PROV construction is part of the query that generates the provenance. We connect database provenance to imported PROV data by propagating it during provenance computation, which creates the links between final query results and the provenance of the imported data. As future work, we plan to finalize our implementation for updates: we already know how to do this, but it is not implemented yet. It would be nice to support automatic storage management, e.g., deduplication, for imported provenance; there are database-side techniques we could use, such as Oracle SecureFiles deduplication, or we could approach it as a design problem. It would also be nice to have automatic support for cross-referencing, like the approach discussed in the related work section, to support use cases where the user does not know how the provenance graphs relate to each other.