This document discusses formalizing and expressing completeness information about RDF data sources to enable assessing query completeness. It presents a framework for completeness statements and using them to check if a data source can fully answer a query. Specifically:
1. Completeness statements express patterns that a data source claims to contain using SPARQL-like syntax.
2. Query completeness is checked by seeing if the query results can be reconstructed from the completeness statements' patterns.
3. An example shows that DBpedia cannot be guaranteed complete for a query about movies that Tarantino both directed and acted in, while LinkedMDB is complete because its statements cover both the movies and their actors.
Completeness Statements about RDF Data Sources and Their Use for Query Answering
Fariz Darari
joint work with Werner Nutt, Giuseppe Pirrò, and Simon Razniewski
KRDB, Free University of Bozen-Bolzano, Italy
Context
Thousands of RDF data sources are available on the Web today.
Machine-readable qualitative descriptions of their content are crucial.
We focus on data completeness, an important aspect of data quality.
Problem
How to formalize and express completeness information about RDF data sources in a machine-readable way?
How to leverage such completeness information?
Contributions
1. Formal framework for expressing completeness information.
2. Study of query completeness from completeness information in various settings.
Completeness statement on the Web
Completeness statement on the Semantic Web
lv:lmdbdataset rdf:type void:Dataset.
lv:lmdbdataset c:hasComplStmt lv:st1.
lv:st1 c:hasPattern
[c:subject[spin:varName "m"]; c:predicate schema:actor; c:object[spin:varName "a"]].
lv:st1 c:hasCondition
[c:subject [spin:varName "m"]; c:predicate rdf:type; c:object schema:Movie].
lv:st1 c:hasCondition
[c:subject [spin:varName "m"]; c:predicate schema:director; c:object dbp:Tarantino].
Semantics of completeness statements
For each completeness statement, all the triple patterns defined
via hasPattern are collected into a set P1 and all the triple patterns defined
via hasCondition are collected into a set P2. A completeness statement is
interpreted as: CONSTRUCT {P1} WHERE {P1 . P2}
When a data source has a completeness statement (defined via
hasComplStmt), it means that if the query above is evaluated over
an “ideal” graph then all the results are in the data source.
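The interpretation above can be sketched in a few lines of Python. This is only an illustration, not part of CoRner: the helper name `construct_query` and the representation of triple patterns as 3-tuples of strings are assumptions made for this example. It renders the statement lv:st1 shown earlier as its CONSTRUCT query.

```python
def construct_query(patterns, conditions):
    """Build CONSTRUCT {P1} WHERE {P1 . P2}.

    patterns (P1) and conditions (P2) are lists of triple patterns, each a
    3-tuple of strings; variables are written with a leading '?'.
    """
    def render(triples):
        return " . ".join(" ".join(triple) for triple in triples)

    p1 = render(patterns)
    where = render(list(patterns) + list(conditions))
    return f"CONSTRUCT {{ {p1} }} WHERE {{ {where} }}"


# Statement lv:st1: complete for the actors of movies directed by Tarantino.
P1 = [("?m", "schema:actor", "?a")]
P2 = [("?m", "rdf:type", "schema:Movie"),
      ("?m", "schema:director", "dbp:Tarantino")]

query = construct_query(P1, P2)
print(query)
```

Note that the patterns P1 appear both in the CONSTRUCT template and in the WHERE clause, while the conditions P2 only restrict which bindings are considered.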
Users visiting this source can prefer it
to other sources.
Checking query completeness
Given a query Q and a data source with completeness statements S:
1. Create a template answer graph GQ of Q.
2. Over GQ, evaluate all CONSTRUCT queries derived from S.
3. Check whether GQ can be obtained after the evaluation.
If yes, the query is complete; otherwise, it might be incomplete.
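The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the CoRner implementation: queries and statements are restricted to basic graph patterns, triples are 3-tuples of strings, each statement is a hypothetical pair (P1, P2), and the function names (`freeze`, `match`, `apply_statement`, `is_complete`) are invented for this example.

```python
def is_var(term):
    return term.startswith("?")


def freeze(query):
    """Step 1: template answer graph GQ, with each variable ?x turned into a constant."""
    return {tuple("frozen:" + t[1:] if is_var(t) else t for t in triple)
            for triple in query}


def match(patterns, graph, binding=None):
    """Yield every variable binding under which all patterns are triples of graph."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        new = dict(binding)
        ok = True
        for p, v in zip(first, triple):
            if is_var(p):
                if new.setdefault(p, v) != v:  # conflicts with an earlier binding
                    ok = False
                    break
            elif p != v:                       # constant does not match
                ok = False
                break
        if ok:
            yield from match(rest, graph, new)


def apply_statement(statement, graph):
    """Step 2: evaluate CONSTRUCT {P1} WHERE {P1 . P2} over graph."""
    p1, p2 = statement
    result = set()
    for b in match(list(p1) + list(p2), graph):
        for triple in p1:
            result.add(tuple(b.get(t, t) for t in triple))
    return result


def is_complete(query, statements):
    """Step 3: the query is complete if GQ is obtained back from the statements."""
    gq = freeze(query)
    derived = set()
    for statement in statements:
        derived |= apply_statement(statement, gq)
    return gq <= derived


MOVIE = ("?m", "rdf:type", "schema:Movie")
DIRECTOR = ("?m", "schema:director", "dbp:Tarantino")
Q = [MOVIE, DIRECTOR, ("?m", "schema:actor", "dbp:Tarantino")]

# One statement covering Tarantino's movies only.
movies_only = [([MOVIE, DIRECTOR], [])]
# A second statement additionally covering the actors of those movies.
movies_and_actors = movies_only + [([("?m", "schema:actor", "?a")], [MOVIE, DIRECTOR])]

print(is_complete(Q, movies_only))        # False
print(is_complete(Q, movies_and_actors))  # True
```

With the movies-only statement, the actor triple of GQ is never reconstructed, so completeness cannot be guaranteed. Adding the actor statement closes the gap.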
However, the completeness
statement verified as complete is
only human readable!
Query completeness in a single data source scenario
@prefix c: <http://inf.unibz.it/ontologies/completeness#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix spin: <http://spinrdf.org/sp#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dv: <http://dbpedia.org/void/> .
@prefix lv: <http://linkedmdb.org/void/> .
@prefix dbp: <http://dbpedia.org/resource/> .
@prefix schema: <http://schema.org/> .
# Completeness statements of DBpedia (endpoint IRI: DBPe)
dv:dbpdataset rdf:type void:Dataset.
dv:dbpdataset c:hasComplStmt dv:st1.
dv:st1 c:hasPattern
[c:subject [spin:varName "m"]; c:predicate rdf:type; c:object schema:Movie].
dv:st1 c:hasPattern
[c:subject [spin:varName "m"]; c:predicate schema:director; c:object dbp:Tarantino].
# Completeness statements of LinkedMDB (endpoint IRI: LMDBe)
lv:lmdbdataset rdf:type void:Dataset.
lv:lmdbdataset c:hasComplStmt lv:st1.
lv:st1 c:hasPattern
[c:subject [spin:varName "m"]; c:predicate rdf:type; c:object schema:Movie].
lv:st1 c:hasPattern
[c:subject [spin:varName "m"]; c:predicate schema:director; c:object dbp:Tarantino].
lv:lmdbdataset c:hasComplStmt lv:st2.
lv:st2 c:hasPattern
[c:subject [spin:varName "m"]; c:predicate schema:actor; c:object [spin:varName "a"]].
lv:st2 c:hasCondition
[c:subject [spin:varName "m"]; c:predicate rdf:type; c:object schema:Movie].
lv:st2 c:hasCondition
[c:subject [spin:varName "m"]; c:predicate schema:director; c:object dbp:Tarantino].
Query Q: select all the movies for which Tarantino is the director and also an actor.
SELECT ?m
WHERE {?m rdf:type schema:Movie.
?m schema:director dbp:Tarantino.
?m schema:actor dbp:Tarantino}
DBpedia SPARQL endpoint (endpoint IRI: DBPe): DBpedia is complete for all Tarantino’s movies. The answer is incomplete.
LinkedMDB SPARQL endpoint (endpoint IRI: LMDBe): LinkedMDB is complete for all Tarantino’s movies and also their actors. The answer is complete.
Extensions
SPARQL queries with OPT
Completeness with RDFS inference
Federated query completeness
Work In Progress
SPARQL queries with negations and comparisons
Live, Web-based CoRner
Empirical evaluation of query completeness checking
Why is DBpedia not complete for the query?
The completeness statement in DBpedia (dv:st1) says that it is complete for Tarantino’s movies. However, the query asks for all movies for which Tarantino is the director and also an actor. It is not stated that DBpedia includes all the actors of Tarantino’s movies. Therefore, DBpedia is possibly not complete for this query.
Why is LinkedMDB complete?
The completeness statements in LinkedMDB say that it is complete for Tarantino’s movies (lv:st1) and also for their actors (lv:st2). Together, they cover every triple pattern of the query.
Implementation
CoRner:
Completeness Reasoner
http://rdfcorner.wordpress.com