Presentation for the 5th International Workshop on
Data Engineering meets the Semantic Web (DESWeb),
held in conjunction with ICDE 2014, Chicago, IL, USA, March 31, 2014; presented by Kai Schlegel
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
1. DESWeb 2014
ICDE 2014, Chicago IL, USA, March 31, 2014
balloon Fusion
SPARQL Rewriting Based on
Unified Co-Reference Information
Kai Schlegel (kai.schlegel@googlemail.com)
Florian Stegmaier, Sebastian Bayerl, Michael Granitzer, Harald Kosch
2. Outline
• Motivation
• SPARQL Rewriting & Federation
• Intermediate Results
Supported by the European Commission under the Seventh Framework Program
6. Motivation
Nothing-as-a-Service
• Easy access to Linked Data
• Query Linked Open Data with SPARQL
• Plethora of tools available
• Problems:
  • Business oriented
  • Complex setup
  • Maintenance
  • "Paper-only"
  • Not developer friendly
• Simple and "instant" SPARQL Query Federation (-as-a-Service)
7. Querying LOD
• How to get information about the German city "Passau"?
• Problem: LOD is not a single database!
SELECT ?p ?o WHERE {
  <http://de.dbpedia.org/resource/Passau> ?p ?o .
}
[Figure: SPARQL query sent to de.dbpedia.org; the RDF result contains relations, coordinates, leader, etc.]
What about the population?
8. Distributed Querying!
• Problem: Selection of appropriate endpoints
• Send query to some endpoints and aggregate the results?
SELECT ?p ?o WHERE {
  <http://de.dbpedia.org/resource/Passau> ?p ?o .
}
[Figure: the same SPARQL query sent to de.dbpedia.org and linkedgeodata.org, each returning RDF]
9. Misunderstanding: Co-Referencing
• Problem: Different identifiers for the same semantic concept
SELECT ?p ?o WHERE {
  <http://de.dbpedia.org/resource/Passau> ?p ?o .
}
[Figure: the same SPARQL query sent to de.dbpedia.org and linkedgeodata.org]
Known problem in linguistics:
"It's a spud!"
"What?"
"I mean potato!"
Co-referencing: Multiple expressions refer to the same thing.
10. Problem = Solution?
SPARQL-based crawling of co-reference information
Exploit co-reference information for
• accomplishing immediate SPARQL rewriting
• performing endpoint selection
• executing automatic query federation
Basic idea: Focusing distributed co-reference information
Main principle: Semantic entities over identifiers!
12. balloon Overflight
• SPARQL-based crawling of LOD endpoints
• Query: Ask for subjects and objects that are related by a specific predicate
• Simplified global view on
  • Equivalence: owl:sameAs, skos:exactMatch, coref:coreferenceData, ...
• Graph database: Neo4j
• Equivalence cluster: Multiple synonym URIs representing the same semantic entity, including provenance
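The crawling and clustering idea on this slide could be sketched roughly as follows. This is a minimal illustration, not the balloon implementation: the function names `crawl_query` and `build_clusters` are hypothetical, the clusters are kept in memory with a union-find structure instead of Neo4j, and the two example statements are assumed for illustration (they use the Passau identifiers that appear elsewhere in this deck).

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def crawl_query(predicate_iri):
    """Build a SPARQL query asking for all subject/object pairs linked by one predicate."""
    return f"SELECT ?s ?o WHERE {{ ?s <{predicate_iri}> ?o . }}"


def build_clusters(statements):
    """Group URIs into equivalence clusters with union-find.

    statements: iterable of (subject, object, endpoint) tuples, where endpoint
    is the crawled source of the statement and is kept as provenance.
    Returns a list of (set_of_uris, set_of_endpoints) pairs.
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        root = uri
        while parent[root] != root:
            root = parent[root]
        while parent[uri] != root:  # path compression
            parent[uri], uri = root, parent[uri]
        return root

    statements = list(statements)
    for s, o, _ in statements:
        parent[find(s)] = find(o)  # merge the two clusters

    uris = defaultdict(set)
    provenance = defaultdict(set)
    for s, o, endpoint in statements:
        root = find(s)
        uris[root].update((s, o))
        provenance[root].add(endpoint)
    return [(uris[root], provenance[root]) for root in uris]


# Assumed example statements, both crawled from linkedgeodata.org.
example_statements = [
    ("http://linkedgeodata.org/triplify/node240057351",
     "http://de.dbpedia.org/resource/Passau", "linkedgeodata.org"),
    ("http://linkedgeodata.org/triplify/node240057351",
     "http://rdf.freebase.com/ns/m.01h5td", "linkedgeodata.org"),
]
example_clusters = build_clusters(example_statements)
```

Under these assumptions the two statements collapse into a single cluster of three synonym URIs whose provenance is linkedgeodata.org.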
13. balloon Fusion
SPARQL Federation setup using co-reference information
SPARQL Transformation for each basic graph pattern (BGP)
1. Determine synonym URIs
2. Select suitable endpoints
3. Adapt sub-queries to endpoints
4. Federated querying
SELECT ?p ?o WHERE {
  <http://de.dbpedia.org/resource/Passau> ?p ?o .
}
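Steps 1 and 3 can be pictured as producing one variant of a triple pattern per synonym URI. The sketch below is only an illustration under simplifying assumptions: `rewrite_pattern` is a hypothetical helper, the pattern is handled as plain text rather than as parsed SPARQL algebra, and the synonym set is given by hand instead of being looked up in the co-reference index.

```python
def rewrite_pattern(pattern, uri, synonyms):
    """Return one copy of the triple pattern per synonym URI.

    pattern:  a single triple pattern as text
    uri:      the identifier used in the pattern
    synonyms: the equivalence cluster of that identifier (step 1)
    """
    variants = []
    for synonym in sorted(set(synonyms) | {uri}):
        variants.append(pattern.replace(f"<{uri}>", f"<{synonym}>"))
    return variants


passau = "http://de.dbpedia.org/resource/Passau"
passau_synonyms = {
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://linkedgeodata.org/triplify/node240057351",
}
passau_variants = rewrite_pattern(f"<{passau}> ?p ?o .", passau, passau_synonyms)
```

Textual substitution keeps the example short; a real rewriter would presumably operate on the parsed query so that URIs inside literals or comments are not touched.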
15. Step 2: Select suitable endpoints
• Provenance-based selection (PBS)
  • Endpoints that are involved in cluster composition
• Namespace-based selection (NBS)
  • Prefix and namespace matching of synonym URIs
Summarized: origin of co-reference information and origin of synonym URIs
16. Step 2: Select suitable endpoints (2)
Assumption:
• Provenance information only contains "linkedgeodata.org" as co-reference origin
• Namespaces for Freebase and DBpedia available (datahub.io)
Selected endpoints:
• PBS: LinkedGeoData endpoint
• NBS: DBpedia endpoint
• NBS: Freebase endpoint
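The two selection strategies could be sketched as below for the Passau example. This is a hedged illustration only: `select_endpoints` and both lookup tables are hypothetical, the endpoint URLs are the ones used in the federated query elsewhere in this deck, and the rule that a PBS endpoint is asked for every synonym while an NBS endpoint is asked only for URIs in its own namespace is inferred from that query.

```python
# Assumption: co-reference origin -> SPARQL endpoint (used by PBS).
PROVENANCE_ENDPOINTS = {
    "linkedgeodata.org": "http://linkedgeodata.org/sparql/",
}
# Assumption: URI namespace -> SPARQL endpoint (used by NBS).
NAMESPACE_ENDPOINTS = {
    "http://de.dbpedia.org/resource/": "http://dbpedia.org/sparql",
    "http://rdf.freebase.com/ns/": "http://www.freebase.com/base/sparql",
}


def select_endpoints(synonyms, provenance):
    """Map each selected endpoint to the synonym URIs it should be asked for."""
    selection = {}
    # PBS: an endpoint that contributed co-reference statements gets every synonym.
    for origin in provenance:
        endpoint = PROVENANCE_ENDPOINTS.get(origin)
        if endpoint:
            selection.setdefault(endpoint, set()).update(synonyms)
    # NBS: an endpoint whose namespace prefixes a synonym gets that synonym.
    for uri in synonyms:
        for namespace, endpoint in NAMESPACE_ENDPOINTS.items():
            if uri.startswith(namespace):
                selection.setdefault(endpoint, set()).add(uri)
    return selection


cluster_uris = {
    "http://de.dbpedia.org/resource/Passau",
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://linkedgeodata.org/triplify/node240057351",
}
selected = select_endpoints(cluster_uris, provenance={"linkedgeodata.org"})
```

With these inputs the LinkedGeoData endpoint is selected by provenance, and the DBpedia and Freebase endpoints are selected by namespace.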
18. Step 4: Federated Querying
• W3C SPARQL 1.1 Federated Query Extension (SERVICE)
• (Partial) query can be executed against a remote SPARQL endpoint
• Distributed sub-queries don't contain SPARQL 1.1 features
SELECT ?p ?o WHERE {
  {
    SERVICE <http://dbpedia.org/sparql> {
      <http://de.dbpedia.org/resource/Passau> ?p ?o .
    }
  } UNION {
    SERVICE <http://www.freebase.com/base/sparql> {
      <http://rdf.freebase.com/ns/m.01h5td> ?p ?o .
    }
  } UNION {
    SERVICE <http://linkedgeodata.org/sparql/> {
      { <http://rdf.freebase.com/ns/m.01h5td> ?p ?o . }
      UNION
      { <http://linkedgeodata.org/triplify/node240057351> ?p ?o . }
      UNION
      { <http://de.dbpedia.org/resource/Passau> ?p ?o . }
    }
  }
}
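A federated query of this shape could be assembled mechanically from an endpoint-to-URIs mapping. The following is a minimal sketch, assuming a hypothetical helper `federated_query` and a hand-written mapping; it only concatenates strings and does not send anything to an endpoint.

```python
def federated_query(selection, pattern_tail="?p ?o ."):
    """Build a SPARQL 1.1 query with one SERVICE block per endpoint.

    selection: dict mapping an endpoint URL to the set of synonym URIs
    that should be queried at that endpoint.
    """
    services = []
    for endpoint, uris in sorted(selection.items()):
        # One group per synonym URI, joined by UNION inside the SERVICE block.
        patterns = [f"{{ <{uri}> {pattern_tail} }}" for uri in sorted(uris)]
        body = " UNION ".join(patterns)
        services.append(f"{{ SERVICE <{endpoint}> {{ {body} }} }}")
    return "SELECT ?p ?o WHERE { " + " UNION ".join(services) + " }"


passau_selection = {
    "http://dbpedia.org/sparql": {"http://de.dbpedia.org/resource/Passau"},
    "http://www.freebase.com/base/sparql": {"http://rdf.freebase.com/ns/m.01h5td"},
    "http://linkedgeodata.org/sparql/": {
        "http://rdf.freebase.com/ns/m.01h5td",
        "http://linkedgeodata.org/triplify/node240057351",
        "http://de.dbpedia.org/resource/Passau",
    },
}
passau_query = federated_query(passau_selection)
```

Wrapping every SERVICE block in its own group keeps the UNION operands syntactically valid, at the cost of some redundant braces when an endpoint receives only one URI.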
19. Optimizations (ongoing)
• Endpoint status check
  • Check routine in terms of availability and latency
• Minimize sub-queries
  • Group sub-queries with common endpoint
  • Push join to endpoint
• SPARQL features
  • Condense PBS UNION construct of synonym URIs
  • SPARQL 1.1 VALUES or FILTER with IN operator
  • Not well implemented in Linked Data endpoints
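To make the condensing idea concrete, the sketch below shows how a UNION over synonym URIs might be expressed with SPARQL 1.1 VALUES or with FILTER ... IN instead. The helper names and the variable `?s` are hypothetical, and whether a given endpoint accepts these forms is exactly the caveat raised on the slide.

```python
def values_subquery(uris, var="s"):
    """Condense the synonym URIs into a single pattern using VALUES."""
    iris = " ".join(f"<{uri}>" for uri in sorted(uris))
    return f"VALUES ?{var} {{ {iris} }} ?{var} ?p ?o ."


def filter_in_subquery(uris, var="s"):
    """Alternative form using FILTER with the IN operator."""
    iris = ", ".join(f"<{uri}>" for uri in sorted(uris))
    return f"?{var} ?p ?o . FILTER (?{var} IN ({iris}))"


synonym_uris = {
    "http://rdf.freebase.com/ns/m.01h5td",
    "http://linkedgeodata.org/triplify/node240057351",
    "http://de.dbpedia.org/resource/Passau",
}
condensed_values = values_subquery(synonym_uris)
condensed_filter = filter_in_subquery(synonym_uris)
```

Both forms replace three UNION branches with one triple pattern, which is why they are attractive for the PBS endpoint that receives every synonym.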
23. Statistics
• Datahub.io: Linked Open Data Cloud catalog
  • 337 datasets in total
  • 237 expose a SPARQL endpoint
  • 112 successfully queried for co-reference information
• Balloon dataset (first run)
  • 17.6M co-reference statements
  • 22.4M distinct URIs
  • 8.4M equivalence clusters (~2.68 identifiers per cluster)
• Pending analysis
  • Distribution of cluster sizes, number of different hosts per cluster
  • Main representative per cluster & false friends
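The ratios behind these figures can be checked with a few lines of arithmetic. The sketch uses only the rounded numbers from the slide, so the identifiers-per-cluster value comes out slightly below the reported ~2.68, which was presumably computed from the exact counts.

```python
datasets, with_endpoint, crawled = 337, 237, 112
distinct_uris, clusters = 22.4e6, 8.4e6  # rounded values from the slide

endpoint_share = with_endpoint / datasets   # share of datasets exposing a SPARQL endpoint
crawl_share = crawled / with_endpoint       # share of endpoints queried successfully
ids_per_cluster = distinct_uris / clusters  # average cluster size

print(f"{endpoint_share:.1%} of datasets expose a SPARQL endpoint")
print(f"{crawl_share:.1%} of those endpoints were queried successfully")
print(f"{ids_per_cluster:.2f} identifiers per cluster (from rounded inputs)")
```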
24. Open Source:
• Demo, information and sources available (MIT License)
• X as a Service
  • SPARQL Rewriting (HTTP API)
  • Query Federation (SPARQL)
http://schlegel.github.io/balloon
25. Summary:
• SPARQL-based crawling of distributed co-reference information
• Exploit co-reference information for SPARQL federation
Single Point of Access