SWPM'12 report on the Dagstuhl seminar on Semantic Data Management
1. Provenance at the Dagstuhl seminar on
Semantic Data Management, April 2012
Paolo Missier, Jose Manuel Gómez-Perez, Satya Sahoo
Dagstuhl report @ SWPM'12 - P.Missier
SWPM'12, June 2012
2. Previously at Dagstuhl...
Much provenance, not much semantics
- final report to be published soon
- interim seminar wiki
3. The provenance day @Dagstuhl
Tuesday (main topic: provenance, person in charge: Grigoris Antoniou)
Session 1. Provenance in semantic data management
■ Tutorial: Provenance, some useful concepts (Paul, 20 minutes)
■ An introduction to the W3C PROV family of specs (Paul Groth / Luc Moreau / Paolo Missier / Olaf, 30 minutes)
■ Presentations from other attendees.
■ Manuel Salvadores: "Access Control in SPARQL: The BioPortal Use Case" (15-20 min)
■ Bryan Thompson: Simple and effective provenance mechanism for triples or quads based on composition
Session 2. Presentations
■ Kerry Taylor: Reaping the rewards: what is the provenance saying? (20 min)
■ Martin Theobald: Reasoning in Uncertain RDF Knowledge Bases with Lineage (20 min)
■ James Cheney: Database Wiki and provenance for SPARQL updates (10-15 min).
Session 3. Working groups and wrap-up
■ Objective: obtain roadmaps about typical problems on provenance
■ Working groups
■ Frank Van Harmelen: Provenance and scalability
■ Paolo Missier: Provenance-specific benchmarks and corpora
■ José Manuel Gómez-Pérez: Novel usages of provenance information
■ Norbert Fuhr: Provenance and uncertainty
4. WG: Novel usages of provenance information (José Manuel Gómez-Pérez)
• Data integration
– assisted analysis, exploration along different dimensions of quality
– SmartCities, OpenStreetMap
• Analytics in social networks
– identify influential members of social networks
• Provenance diff (hard in general)
• Billing / Privacy
– emerging pay-per-query models
• Credit, attribution, citation and licensing
• Result reproducibility (e.g., Executable Paper Challenge)
• Determining the quality of reports generated by third parties for an organisation (e.g., a Government report)
5. WG: creating provenance-specific benchmarks
• Another one of the spontaneous working group activities at Dagstuhl
• Not strictly “semantic”
– but PROV-RDF one of the expected encodings
• Led by Satya Sahoo and Paolo Missier
• A community initiative
Goal:
To collect a corpus of reference provenance traces
from multiple contributors
from multiple domains
and make it available as a community resource
6. Collecting reference provenance datasets
Why:
• to better understand actual usages of provenance
• for analysing properties of provenance graphs
– patterns in graphs
• to create a level playing field for performance comparison
– storage, compression methods
– query models, query processing
• SPARQL
• Datalog
• Graph query languages
• to test algorithms that prove interesting hypotheses
– “prov(D) contains valid indicators for quality(D)”
How:
• By collecting submissions from the community
• By generating synthetic provenance
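The graph properties mentioned above (size, patterns, depth) can be sketched as follows; the trace encoding — a plain dict of wasDerivedFrom edges — and all entity names are illustrative assumptions, not an actual submission format:

```python
# Sketch of simple structural properties one might compute over a corpus of
# provenance traces; the toy trace and its names are invented for illustration.

def trace_properties(derived_from):
    """derived_from maps each entity to the list of entities it came from."""
    nodes = set(derived_from)
    for sources in derived_from.values():
        nodes.update(sources)
    edge_count = sum(len(s) for s in derived_from.values())

    def depth(node):
        # Length of the longest derivation chain ending at `node`.
        sources = derived_from.get(node, [])
        return 0 if not sources else 1 + max(depth(s) for s in sources)

    return {
        "nodes": len(nodes),
        "edges": edge_count,
        "max_depth": max(depth(n) for n in nodes),
    }

toy_trace = {"report": ["table"], "table": ["raw_data"]}
print(trace_properties(toy_trace))  # {'nodes': 3, 'edges': 2, 'max_depth': 2}
```

Computing such statistics uniformly across submissions is one way the corpus could support the pattern analysis listed above.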
7. What: submissions
A submission comprises:
- a collection of traces
- a collection of queries
hopefully from a variety of different domains
• Interesting properties of each trace:
– graph structure: regularity, recognizable patterns
– graph size
– scaling factors
– what it is to be used for
• Interesting properties of each submission:
– diversity of structure and size within the family
– number of traces
8. What: Traces format
• The PROV assumptions:
– uptake: PROV will be successful (!)
– interoperability: PROV will be sufficiently expressive to provide interoperability
• Thus, expecting PROV encoding for submissions seems reasonable
• Advantages:
– tools are being built to parse, visualize, validate, analyse PROV-compliant traces
– multiple encodings available
• especially good if RDF is your thing
• Issues:
– Conversion: existing traces are not natively PROV
– is there a need to dereference data at the end of URIs?
– licensing: multiple tiers? specific to each dataset?
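For concreteness, a minimal trace in the PROV-N notation could look like the fragment below (all identifiers are illustrative):

```
document
  prefix ex <http://example.org/>
  entity(ex:rawData)
  entity(ex:cleanData)
  activity(ex:cleaning)
  used(ex:cleaning, ex:rawData, -)
  wasGeneratedBy(ex:cleanData, ex:cleaning, -)
  wasDerivedFrom(ex:cleanData, ex:rawData)
endDocument
```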
9. What: Queries
• Hypothesis: some queries are generic, in the sense that they apply across multiple collections of traces
Single trace queries:
• Reachability queries over data and activity dependencies
– backwards (diagnosis)
– forwards (impact analysis)
• “chains of responsibility” (delegation)
Aggregation queries:
• production/usages of data, activities across traces
– assumes uniformity within a collection
• Do graph mining problems apply? do they have interesting interpretations?
– e.g., subgraph discovery
• Feature extraction for learning, mining
• Pairwise trace comparison:
– “earliest divergence” queries between pairs of "nearly isomorphic" traces
– differencing (complex)
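The single-trace reachability queries above can be sketched as plain graph traversals; the trace representation (a dict of wasDerivedFrom edges) and all names are illustrative assumptions:

```python
from collections import deque

# Hypothetical toy trace: each entity maps to the entities it was
# derived from (all names are invented for illustration).
derived_from = {
    "report": ["table", "plot"],
    "table": ["raw_data"],
    "plot": ["raw_data"],
}

def ancestors(graph, start):
    """Backwards reachability (diagnosis): everything `start` depends on."""
    seen, frontier = set(), deque([start])
    while frontier:
        node = frontier.popleft()
        for src in graph.get(node, []):
            if src not in seen:
                seen.add(src)
                frontier.append(src)
    return seen

def descendants(graph, start):
    """Forwards reachability (impact analysis): invert the edges, reuse BFS."""
    inverted = {}
    for tgt, srcs in graph.items():
        for src in srcs:
            inverted.setdefault(src, []).append(tgt)
    return ancestors(inverted, start)

print(ancestors(derived_from, "report"))      # depends on table, plot, raw_data
print(descendants(derived_from, "raw_data"))  # impacts table, plot, report
```

In an RDF setting the same queries would typically be written as SPARQL property paths over prov:wasDerivedFrom rather than explicit BFS.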
10. A provenance repository
• If traces are submitted in one of the PROV standard encodings, then the P-rep can provide validation services upon admission
• PROV is expected to support the following encodings:
– PROV-N -- the technology-neutral notation
– RDF -- the main official encoding
– XML -- unofficial XSD available
– JSON -- unofficial
– (Datalog? -- even more unofficial but syntactically very close to PROV-N)
Available validations (PROV-N):
• Syntax:
– PROV-N syntax checking
– conversions: PROV-N to PROV-JSON, PROV-RDF, PROV-XML
– XML schema validation
• Consistency:
– validation wrt PROV-Constraints
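A minimal sketch of the kind of admission-time consistency check such a repository could run — checking that statements only refer to declared entities and activities. This is an invented example, not the actual PROV-Constraints rules, and the trace representation is assumed:

```python
# Toy consistency check: every wasGeneratedBy/used statement must refer to
# a declared entity and activity. Identifiers are illustrative.

def check_trace(entities, activities, statements):
    """Return a list of violation messages (empty list = trace admitted)."""
    errors = []
    for kind, first, second in statements:
        if kind == "wasGeneratedBy":   # (entity, activity)
            if first not in entities:
                errors.append(f"undeclared entity: {first}")
            if second not in activities:
                errors.append(f"undeclared activity: {second}")
        elif kind == "used":           # (activity, entity)
            if first not in activities:
                errors.append(f"undeclared activity: {first}")
            if second not in entities:
                errors.append(f"undeclared entity: {second}")
    return errors

trace = [
    ("used", "a1", "e1"),
    ("wasGeneratedBy", "e2", "a1"),
    ("wasGeneratedBy", "e3", "a2"),   # a2 was never declared
]
print(check_trace({"e1", "e2", "e3"}, {"a1"}, trace))
# -> ['undeclared activity: a2']
```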
11. Low-hanging fruits
• Wikipedia history pages
– dumps freely available
– or, through the Wikipedia REST API
• OpenStreetMap history pages
– very similar structure
• ...any other?
12. Can we learn from similar initiatives?
• Well-established repositories for testing Machine Learning methods
– the UCI Machine Learning repositories
– the KDD Cup datasets
– ... and more
• “Building better RDF benchmarks”: Kavitha Srinivas @Dagstuhl
– DBpedia, UniProt -- large but no representative query workload
– YAGO: Wikipedia <-> Wordnet, 8 queries
– Barton Library, 7 queries
– Linked Sensor Dataset, no queries
– TPC-H as RDF
– Berlin SPARQL Benchmark (BSBM), 12 queries + mixes
– Lehigh University Benchmark (LUBM), 14 queries
– SP2Bench (DBLP), 12 queries
– Original approach: turn every dataset into a benchmark
• by editing the dataset to enforce measures of Coverage and Coherence
13. WG: Provenance and uncertainty (Norbert Fuhr)
• Uncertainty in the data
– Sensor data, Customer reviews
• Issues
– Reliability (“is this the original painting?”)
– Authenticity
• Sources of uncertain provenance
– Information extraction / NLP methods
– Human errors
– Inferences
– Instruments
• Challenges
– We need a data model for uncertainty in provenance
• probabilistic dependency relations
– Explanation of the derivation of uncertain results
• Limitations
– Hard rules vs soft rules
– Knowledge acquisition process of those rules
– provenance incompleteness vs uncertainty
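One way to make "probabilistic dependency relations" concrete is a toy model where each derivation step carries a confidence and result confidence is their product under an independence assumption. The data model, the step names, and the numbers below are invented for illustration, not the working group's proposal:

```python
# Toy model of uncertain provenance: each derivation edge carries the
# confidence of that step; a result's confidence is the product along its
# derivation chain, assuming the steps are independent.

edges = {
    # derived entity: (source entity, confidence of the derivation step)
    "claim": ("extracted_fact", 0.9),        # NLP extraction step
    "extracted_fact": ("sensor_reading", 0.8),
    "sensor_reading": (None, 0.95),          # instrument reliability
}

def confidence(entity):
    """Multiply step confidences back to the original source."""
    conf = 1.0
    while entity is not None:
        source, p = edges[entity]
        conf *= p
        entity = source
    return conf

print(round(confidence("claim"), 3))  # 0.9 * 0.8 * 0.95 = 0.684
```

A real model would also have to handle the points listed above: soft vs hard rules, correlated sources, and the difference between incomplete and uncertain provenance.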