SWPM'12 report on the Dagstuhl seminar on Semantic Data Management
1. Provenance at the Dagstuhl seminar on
Semantic Data Management, April 2012
Paolo Missier, Jose Manuel Gómez-Perez, Satya Sahoo
Dagstuhl report @ SWPM'12 - P.Missier
SWPM'12, June 2012
2. Previously at Dagstuhl...
Much provenance, not much semantics
- final report to be published soon
- interim seminar wiki
3. The provenance day @Dagstuhl
Tuesday (main topic: provenance, person in charge: Grigoris Antoniou)
Session 1. Provenance in semantic data management
■ Tutorial: Provenance, some useful concepts (Paul, 20 minutes)
■ An introduction to the W3C PROV family of specs (Paul Groth / Luc Moreau / Paolo Missier / Olaf, 30 minutes)
■ Presentations from other attendees.
■ Manuel Salvadores: "Access Control in SPARQL: The BioPortal Use Case" (15-20 min)
■ Bryan Thompson: Simple and effective provenance mechanism for triples or quads based on composition
Session 2. Presentations
■ Kerry Taylor: Reaping the rewards: what is the provenance saying? (20 min)
■ Martin Theobald: Reasoning in Uncertain RDF Knowledge Bases with Lineage (20 min)
■ James Cheney: Database Wiki and provenance for SPARQL updates (10-15 min).
Session 3. Working groups and wrap-up
■ Objective: obtain roadmaps about typical problems on provenance
■ Working groups
■ Frank Van Harmelen: Provenance and scalability
■ Paolo Missier: Provenance-specific benchmarks and corpora
■ José Manuel Gómez-Pérez: Novel usages of provenance information
■ Norbert Fuhr: Provenance and uncertainty
4. WG: Novel usages of provenance information (José Manuel Gómez-Pérez)
• Data integration
– assisted analysis, exploration along different dimensions of quality
– SmartCities, OpenStreetMap
• Analytics in social networks
– identify influential members of social networks
• Provenance diff (hard in general)
• Billing / Privacy
– emerging pay-per-query models
• Credit, attribution, citation and licensing
• Result reproducibility (e.g., Executable Paper Challenge)
• Determining the quality of reports generated by third parties for an organisation (e.g., a Government report)
5. WG: creating provenance-specific benchmarks
• Another one of the spontaneous working group activities at Dagstuhl
• Not strictly “semantic”
– but PROV-RDF one of the expected encodings
• Led by Satya Sahoo and Paolo Missier
• A community initiative
Goal:
To collect a corpus of reference provenance traces
from multiple contributors
from multiple domains
and make it available as a community resource
6. Collecting reference provenance datasets
Why:
• to better understand actual usages of provenance
• for analysing properties of provenance graphs
– patterns in graphs
• to create a level playing field for performance comparison
– storage, compression methods
– query models, query processing
• SPARQL
• Datalog
• Graph query languages
• to test algorithms that prove interesting hypotheses
– “prov(D) contains valid indicators for quality(D)”
How:
• By collecting submissions from the community
• By generating synthetic provenance
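The graph properties mentioned above (size, patterns, depth) can be sketched as follows; the trace encoding — a plain dict of wasDerivedFrom edges — and all entity names are illustrative assumptions, not an actual submission format:

```python
# Sketch of simple structural properties one might compute over a corpus of
# provenance traces; the toy trace and its names are invented for illustration.

def trace_properties(derived_from):
    """derived_from maps each entity to the list of entities it came from."""
    nodes = set(derived_from)
    for sources in derived_from.values():
        nodes.update(sources)
    edge_count = sum(len(s) for s in derived_from.values())

    def depth(node):
        # Length of the longest derivation chain ending at `node`.
        sources = derived_from.get(node, [])
        return 0 if not sources else 1 + max(depth(s) for s in sources)

    return {
        "nodes": len(nodes),
        "edges": edge_count,
        "max_depth": max(depth(n) for n in nodes),
    }

toy_trace = {"report": ["table"], "table": ["raw_data"]}
print(trace_properties(toy_trace))  # {'nodes': 3, 'edges': 2, 'max_depth': 2}
```

Computing such statistics uniformly across submissions is one way the corpus could support the pattern analysis listed above.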
7. What: submissions
A submission comprises:
- a collection of traces
- a collection of queries
hopefully from a variety of different domains
• Interesting properties of each trace:
– graph structure: regularity, recognizable patterns
– graph size
– scaling factors
– what it is to be used for
• Interesting properties of each submission:
– diversity of structure and size within the family
– number of traces
8. What: Traces format
• The PROV assumptions:
– uptake: PROV will be successful (!)
– interoperability: PROV will be sufficiently expressive to provide interoperability
• Thus, expecting PROV encoding for submissions seems reasonable
• Advantages:
– tools are being built to parse, visualize, validate, analyse PROV-compliant traces
– multiple encodings available
• especially good if RDF is your thing
• Issues:
– Conversion: existing traces are not natively PROV
– is there a need to dereference data at the end of URIs?
– licensing: multiple tiers? specific to each dataset?
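For concreteness, a minimal trace in the PROV-N notation could look like the fragment below (all identifiers are illustrative):

```
document
  prefix ex <http://example.org/>
  entity(ex:rawData)
  entity(ex:cleanData)
  activity(ex:cleaning)
  used(ex:cleaning, ex:rawData, -)
  wasGeneratedBy(ex:cleanData, ex:cleaning, -)
  wasDerivedFrom(ex:cleanData, ex:rawData)
endDocument
```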
9. What: Queries
• Hypothesis: some queries are generic, in the sense that they apply across multiple collections of traces
Single trace queries:
• Reachability queries over data and activity dependencies
– backwards (diagnosis)
– forwards (impact analysis)
• “chains of responsibility” (delegation)
Aggregation queries:
• production/usages of data, activities across traces
– assumes uniformity within a collection
• Do graph mining problems apply? do they have interesting interpretations?
– e.g., subgraph discovery
• Feature extraction for learning, mining
• Pairwise trace comparison:
– “earliest divergence” queries between pairs of "nearly isomorphic" traces
– differencing (complex)
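The single-trace reachability queries above can be sketched as plain graph traversals; the trace representation (a dict of wasDerivedFrom edges) and all names are illustrative assumptions:

```python
from collections import deque

# Hypothetical toy trace: each entity maps to the entities it was
# derived from (all names are invented for illustration).
derived_from = {
    "report": ["table", "plot"],
    "table": ["raw_data"],
    "plot": ["raw_data"],
}

def ancestors(graph, start):
    """Backwards reachability (diagnosis): everything `start` depends on."""
    seen, frontier = set(), deque([start])
    while frontier:
        node = frontier.popleft()
        for src in graph.get(node, []):
            if src not in seen:
                seen.add(src)
                frontier.append(src)
    return seen

def descendants(graph, start):
    """Forwards reachability (impact analysis): invert the edges, reuse BFS."""
    inverted = {}
    for tgt, srcs in graph.items():
        for src in srcs:
            inverted.setdefault(src, []).append(tgt)
    return ancestors(inverted, start)

print(ancestors(derived_from, "report"))      # depends on table, plot, raw_data
print(descendants(derived_from, "raw_data"))  # impacts table, plot, report
```

In an RDF setting the same queries would typically be written as SPARQL property paths over prov:wasDerivedFrom rather than explicit BFS.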
10. A provenance repository
• If traces are submitted in one of the PROV standard encodings, then the P-rep can provide validation services upon admission
• PROV is expected to support the following encodings:
– PROV-N -- the technology-neutral notation
– RDF -- the main official encoding
– XML -- unofficial XSD available
– JSON -- unofficial
– (Datalog? -- even more unofficial but syntactically very close to PROV-N)
Available validations (PROV-N):
• Syntax:
– PROV-N syntax checking
– conversions: PROV-N to PROV-JSON, PROV-RDF, PROV-XML
– XML schema validation
• Consistency:
– validation wrt PROV-Constraints
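A minimal sketch of the kind of admission-time consistency check such a repository could run — checking that statements only refer to declared entities and activities. This is an invented example, not the actual PROV-Constraints rules, and the trace representation is assumed:

```python
# Toy consistency check: every wasGeneratedBy/used statement must refer to
# a declared entity and activity. Identifiers are illustrative.

def check_trace(entities, activities, statements):
    """Return a list of violation messages (empty list = trace admitted)."""
    errors = []
    for kind, first, second in statements:
        if kind == "wasGeneratedBy":   # (entity, activity)
            if first not in entities:
                errors.append(f"undeclared entity: {first}")
            if second not in activities:
                errors.append(f"undeclared activity: {second}")
        elif kind == "used":           # (activity, entity)
            if first not in activities:
                errors.append(f"undeclared activity: {first}")
            if second not in entities:
                errors.append(f"undeclared entity: {second}")
    return errors

trace = [
    ("used", "a1", "e1"),
    ("wasGeneratedBy", "e2", "a1"),
    ("wasGeneratedBy", "e3", "a2"),   # a2 was never declared
]
print(check_trace({"e1", "e2", "e3"}, {"a1"}, trace))
# -> ['undeclared activity: a2']
```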
11. Low-hanging fruits
• Wikipedia history pages
– dumps freely available
– or, through the Wikipedia REST API
• OpenStreetMap history pages
– very similar structure
• ...any other?
12. Can we learn from similar initiatives?
• Well-established repositories for testing Machine Learning methods
– the UCI Machine Learning repositories
– the KDD Cup datasets
– ... and more
• “Building better RDF benchmarks”: Kavitha Srinivas @Dagstuhl
– DBpedia, UniProt -- large but no representative query workload
– YAGO: Wikipedia <-> Wordnet, 8 queries
– Barton Library, 7 queries
– Linked Sensor Dataset, no queries
– TPC-H as RDF
– Berlin SPARQL Benchmark (BSBM), 12 queries + mixes
– Lehigh University Benchmark (LUBM), 14 queries
– SP2Bench (DBLP), 12 queries
– Original approach: turn every dataset into a benchmark
• by editing the dataset to enforce measures of Coverage and Coherence
13. WG: Provenance and uncertainty (Norbert Fuhr)
• Uncertainty in the data
– Sensor data, Customer reviews
• Issues
– Reliability (“is this the original painting?”)
– Authenticity
• Sources of uncertain provenance
– Information extraction / NLP methods
– Human errors
– Inferences
– Instruments
• Challenges
– We need a data model for uncertainty in provenance
• probabilistic dependency relations
– Explanation of the derivation of uncertain results
• Limitations
– Hard rules vs soft rules
– Knowledge acquisition process of those rules
– provenance incompleteness vs uncertainty
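One way to make "probabilistic dependency relations" concrete is a toy model where each derivation step carries a confidence and result confidence is their product under an independence assumption. The data model, the step names, and the numbers below are invented for illustration, not the working group's proposal:

```python
# Toy model of uncertain provenance: each derivation edge carries the
# confidence of that step; a result's confidence is the product along its
# derivation chain, assuming the steps are independent.

edges = {
    # derived entity: (source entity, confidence of the derivation step)
    "claim": ("extracted_fact", 0.9),        # NLP extraction step
    "extracted_fact": ("sensor_reading", 0.8),
    "sensor_reading": (None, 0.95),          # instrument reliability
}

def confidence(entity):
    """Multiply step confidences back to the original source."""
    conf = 1.0
    while entity is not None:
        source, p = edges[entity]
        conf *= p
        entity = source
    return conf

print(round(confidence("claim"), 3))  # 0.9 * 0.8 * 0.95 = 0.684
```

A real model would also have to handle the points listed above: soft vs hard rules, correlated sources, and the difference between incomplete and uncertain provenance.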