A HYBRID FRAMEWORK FOR QUERYING
LINKED DATA DYNAMICALLY

JÜRGEN UMBRICH

PhD Viva
November 26th, 2012

Classical Query Approach
MATERIALISED STORE: centralised data warehousing

fast query times

26/11/2012 PhD Viva, Jürgen Umbrich Slide 1 of 39

MOTIVATING EXAMPLE

GIVE ME THE CURRENT TEMPERATURE OF THE EUROPEAN CAPITALS.


Research Questions

How dynamic is Linked Data, and what is its impact on store-based
query processing?

Can the performance of live querying be improved by applying
lightweight reasoning?

How effective are hash-based data summaries for source
selection in live query processing?

How can live and store-based processing be combined to obtain a
trade-off between fast and fresh results?

How dynamic is Linked Data, and what is its impact
on store-based query processing?

[LDOW 2010]
[DESWEB 2010]
[COLD 2011]

DYNAMIC LINKED DATA OBSERVATORY

Enables studying and assessing the dynamics of Linked Data

- DataHub and BTC
- 95K static URIs
- 95K dynamic (2 hops)

- once a week
- started in March 2012

[http://world.yale.edu]

Weekly dumps are freely available at
http://swse.deri.org/dyldo

DYNAMICS OF LINKED DATA

How fast does a source change? (observed over 15 weeks)

no changes    58%
every week    17%
only once      8%
rest          17%
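
The four categories above can be illustrated with a minimal sketch. It assumes each source is available as an ordered list of weekly snapshots, each a set of triples; the function name `change_category` and the exact category rules are assumptions made for illustration, not the observatory's actual implementation.

```python
def change_category(snapshots):
    """Classify one source from its ordered weekly snapshots (sets of triples)."""
    # Count the week-to-week transitions in which the content differs.
    changes = sum(1 for prev, cur in zip(snapshots, snapshots[1:]) if prev != cur)
    transitions = len(snapshots) - 1
    if changes == 0:
        return "no changes"
    if changes == 1:
        return "only once"
    if changes == transitions:
        return "every week"
    return "rest"


# Two hypothetical versions of a source.
a = {("ex:s", "ex:temp", "20")}
b = {("ex:s", "ex:temp", "21")}

print(change_category([a, a, a, a]))  # no changes
print(change_category([a, a, b, b]))  # only once
print(change_category([a, b, a, b]))  # every week
print(change_category([a, b, b, a]))  # rest
```

With only two snapshots a single change is reported as "only once"; how such edge cases were handled in the study is not stated on the slide.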

DYNAMICS OF LINKED DATA

Can we observe different types of changes? (observed over 15 weeks)

only value updates           24%
value updates & adds/dels    23%
only adds                    20%
adds/dels                    19%
others                       14%
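
One way to tell these change types apart is to diff two snapshots of the same source. The sketch below is an assumption about how such a classification could work: it treats a value update as a deleted and an added triple that share subject and predicate. The "others" category from the chart is not modelled because the slide does not define it.

```python
def change_type(old, new):
    """Classify the difference between two snapshots (sets of (s, p, o) triples)."""
    added = new - old
    deleted = old - new
    if not added and not deleted:
        return "no change"
    added_keys = {(s, p) for s, p, _ in added}
    deleted_keys = {(s, p) for s, p, _ in deleted}
    # Same subject and predicate on both sides: the object value was updated.
    updated = added_keys & deleted_keys
    pure_adds = added_keys - updated
    pure_dels = deleted_keys - updated
    if updated and not pure_adds and not pure_dels:
        return "only value updates"
    if updated:
        return "value updates & adds/dels"
    if pure_adds and not pure_dels:
        return "only adds"
    return "adds/dels"


old = {("ex:a", "ex:temp", "20"), ("ex:a", "ex:name", "A")}
updated_only = {("ex:a", "ex:temp", "21"), ("ex:a", "ex:name", "A")}
added_only = old | {("ex:a", "ex:knows", "ex:b")}
mixed = updated_only | {("ex:a", "ex:knows", "ex:b")}
deleted_only = {("ex:a", "ex:name", "A")}

print(change_type(old, updated_only))  # only value updates
print(change_type(old, added_only))    # only adds
print(change_type(old, mixed))         # value updates & adds/dels
print(change_type(old, deleted_only))  # adds/dels
```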

IMPLICATIONS FOR CENTRALISED QUERYING

How coherent are the results?

COHERENCE OF QUERIES

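[Figure: two pie charts, one for the LOD cache and one for SPARQL endpoints, dividing query results into complete coherent, partially coherent, and complete incoherent. Shares shown across the two charts: 1%, 15%, 35%, 43%, 50%, 56%.]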

PROBLEM WITH CLASSICAL QUERY APPROACH
MATERIALISED STORE: centralised data warehousing

outdated results
limited coverage
fast query times

LTBQE: LINK TRAVERSAL BASED QUERY EXECUTION
Exploiting Linked Data principles:
- dereferencing URIs
- following links

SELECT ?f ?img
WHERE {
  oh:olaf foaf:knows ?f .
  ?f foaf:depiction ?img .
}

[Diagram: ohDoc: describes oh:olaf (foaf:name "Olaf Hartig", owl:sameAs dblpA:Olaf_Hartig, foaf:img http://..., foaf:knows cb:chris, plus an rdfs:seeAlso link). cbDoc: describes cb:chris (foaf:name "Chris Bizer", foaf:depiction http://..., owl:sameAs dblpA:Christian_Bizer).]

Result:
?f          ?img
cb:chris    http://...
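
The traversal idea can be sketched in a few lines. This is a toy model, not the LTBQE implementation: the "web" is an in-memory dictionary, `dereference` stands in for an HTTP lookup, and the image URL is a hypothetical placeholder for the elided one on the slide. For every triple pattern, the engine dereferences the URIs currently bound in subject or object position and matches the pattern against the retrieved triples.

```python
# Toy "web": dereferencing a URI returns the triples of its document.
WEB = {
    "oh:olaf": [
        ("oh:olaf", "foaf:name", "Olaf Hartig"),
        ("oh:olaf", "foaf:knows", "cb:chris"),
    ],
    "cb:chris": [
        ("cb:chris", "foaf:name", "Chris Bizer"),
        ("cb:chris", "foaf:depiction", "http://example.org/chris.jpg"),  # hypothetical URL
    ],
}


def dereference(uri):
    return WEB.get(uri, [])


def is_var(term):
    return term.startswith("?")


def match(pattern, triple, binding):
    """Extend the binding so that the pattern matches the triple, else None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def ltbqe(patterns):
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            # Substitute known bindings, then look up the bound subject/object URIs.
            s, _, o = (binding.get(t, t) for t in pattern)
            triples = set()
            for term in (s, o):
                if not is_var(term):
                    triples.update(dereference(term))
            for triple in sorted(triples):
                new = match(pattern, triple, binding)
                if new is not None:
                    extended.append(new)
        solutions = extended
    return solutions


QUERY = [("oh:olaf", "foaf:knows", "?f"), ("?f", "foaf:depiction", "?img")]
print(ltbqe(QUERY))
```

The second pattern can only be answered because the first one bound ?f to a dereferenceable URI, which is why execution order and dereferenceability matter for recall.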

PERFORMANCE FACTORS OF LTBQE

- query time is influenced by
  - source selection
  - number of sequential lookups

- result recall is influenced by
  - dereferenceability
  - execution order
  - connectivity

Can the performance of live querying be improved
by applying lightweight reasoning?

[RR 2012]
[SWJ submission]

OUR CONTRIBUTION TO LTBQE

- Improved recall with reasoning extensions
  to make more raw data available
  - subset of RDFS
  - explicit owl:sameAs

HOW REASONING CAN HELP LTBQE
SELECT ?label WHERE {
  oh:olaf foaf:knows ?f .
  ?f rdfs:label ?label .
}

Schema: foaf:name rdfs:subPropertyOf rdfs:label

[Diagram: ohDoc: describes oh:olaf (foaf:name "Olaf Hartig", owl:sameAs dblpA:Olaf_Hartig, foaf:img http://..., foaf:knows cb:chris, plus an rdfs:seeAlso link). cbDoc: describes cb:chris (foaf:name "Chris Bizer", foaf:depiction http://..., owl:sameAs dblpA:Christian_Bizer). dblpADoc:Christian_Bizer describes dblpA:Christian_Bizer (rdfs:label "Christian Bizer", connected to dblpP:Hartig09 via foaf:maker).]

Result:
?label
"Christian Bizer"
"Chris Bizer"
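
The effect of the two extensions on this example can be sketched as a small fixpoint computation. The sketch is a simplification made for illustration: it covers only rdfs:subPropertyOf and owl:sameAs in subject position, and the function names `infer` and `labels` are hypothetical.

```python
SUBPROPERTY_OF = {"foaf:name": "rdfs:label"}

TRIPLES = {
    ("cb:chris", "foaf:name", "Chris Bizer"),
    ("cb:chris", "owl:sameAs", "dblpA:Christian_Bizer"),
    ("dblpA:Christian_Bizer", "rdfs:label", "Christian Bizer"),
}


def infer(triples, subproperty_of):
    """Apply subPropertyOf and (subject-position) sameAs rules up to a fixpoint."""
    closure = set(triples)
    while True:
        same = {(s, o) for s, p, o in closure if p == "owl:sameAs"}
        same |= {(o, s) for s, o in same}  # owl:sameAs is symmetric
        new = set()
        for s, p, o in closure:
            if p in subproperty_of:
                new.add((s, subproperty_of[p], o))
            if p != "owl:sameAs":
                new.update((b, p, o) for a, b in same if a == s)
        if new <= closure:
            return closure
        closure |= new


def labels(entity, triples):
    return {o for s, p, o in triples if s == entity and p == "rdfs:label"}


print(labels("cb:chris", TRIPLES))                         # no label without reasoning
print(labels("cb:chris", infer(TRIPLES, SUBPROPERTY_OF)))  # both labels with reasoning
```

Without the extensions the pattern ?f rdfs:label ?label finds nothing for cb:chris; with them, both names from the slide's result become reachable.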

LTBQE ANALYSIS

Investigate how practical LTBQE is and how much more raw
data and how many more results our extensions make available.

How many URIs can be dereferenced?
How much additional data with our extensions?
How do our extensions perform in practice?

LTBQE ANALYSIS: EXPERIMENTS

How many URIs can be dereferenced?

Dataset: BTC 2011, 25.4m URIs

position               %URIs    available data
<URI> ?p ?o .          85%      95%
?s ?p <URI> .          46%      44%
?s <URI> ?o .          1%       0.00…%
?s rdf:type <URI> .    10%      0.2%
<URI>                  44%      51%

Schema data
Query time improved by around 50% by reducing the number of lookups.

LTBQE ANALYSIS: EXPERIMENTS

How much additional data with our extensions?

Dataset: BTC 2011, 18.65m URIs

position                    %URIs    available data
<URI> rdfs:seeAlso ?o .     2%       1.006x
<URI> owl:sameAs ?o .       16%      2.5x
RDFS reasoning*             81%      1.78x

* rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range;
  authoritative TBox [Bonatti] extracted from BTC 2011

QUERY GENERATION

How do our extensions perform in practice?

Existing benchmarks either target a single domain or provide
only a few queries.

QWalk: random-walk-based query generation over BTC 2011
1100 queries: 100 each for 11 “typical” shapes
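
The slide does not describe QWalk's internals, so the following is only a generic sketch of random-walk-based query generation for a path-shaped query. The function name, the toy graph, and the variable naming scheme are all assumptions. The walk follows outgoing edges from a random start node and replaces every visited node except the start with a variable.

```python
import random


def random_walk_path_query(triples, length, rng):
    """Build a path-shaped list of triple patterns by walking the graph."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((p, o))
    node = rng.choice(sorted(by_subject))  # random start among nodes with out-edges
    start = node
    patterns = []
    for i in range(length):
        edges = by_subject.get(node)
        if not edges:
            break  # dead end: the walk stops early
        p, o = rng.choice(edges)
        subject = start if i == 0 else f"?v{i}"
        patterns.append((subject, p, f"?v{i + 1}"))
        node = o
    return patterns


GRAPH = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:b", "ex:name", "B"),
    ("ex:c", "ex:name", "C"),
]

for pattern in random_walk_path_query(GRAPH, 3, random.Random(42)):
    print(pattern)
```

Because every pattern is derived from triples that exist in the data, the generated query is guaranteed to have at least one answer over that data.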

THROUGHPUT: AVERAGE RESULT/TIME RATIO
shape        LTBQE    Core-    seeAlso    sameAs    RDFS     Comb
entity-s     1        1.68     1.67       2.15      1.29     1.53
entity-o     3.97     6.48     6.16       5.7       5.37     4.33
entity-so    2.02     2.82     2.66       3.71      3.73     4.8
star-3-0     0.11     0.16     0.15       0.15      0.24     0.2
star-2-1     0.58     1.12     1          1.04      2.14     1.75
star-1-2     0.17     1.6      1.35       1.6       70.97    58.85
star-0-3     0.18     0.35     0.33       0.94      0.24     0.68
s-path-2     0.44     0.72     0.68       0.7       0.83     0.78
s-path-3     1.76     2.45     2.56       2.46      2.43     2.1
o-path-2     1.38     8.39     7.76       10.55     6.36     6.89
o-path-3     0.95     5.7      5.84       6.08      5.04     4.68

Overall average query time of ~12 seconds.

LIMITATION OF LTBQE: JOIN OVER LITERALS
SELECT ?p2
WHERE {
  oh:olaf foaf:name ?name .
  ?p2 foaf:name ?name .
}

[Diagram: ohDoc: states oh:olaf foaf:name "Olaf Hartig"; dblpADoc:Olaf_Hartig states dblpA:Olaf_Hartig foaf:name "Olaf Hartig" and connects dblpP:Hartig09 to dblpA:Olaf_Hartig via foaf:maker. The two patterns join over the literal "Olaf Hartig".]

LTBQE: ? (the join value is a literal, which cannot be dereferenced)
materialised store: outdated results

ALTERNATIVE: SOURCE SELECTION
[Diagram: a QUERY ENGINE uses a SOURCE INDEX to select documents such as ohDoc: and dblpADoc:Olaf_Hartig.]

How effective are hash-based data summaries for
source selection in live query processing?

[WWW 2010]
[WWWJ 2011]


APPROXIMATE DATA SUMMARIES
- Combined description of
  - schema and
  - instance data

- Use approximation to reduce index size
  (incurs false positives)
  - Hash-based approach
- Space complexity: O(buckets * #sources)

- QTree: combination of histograms and R-tree, inheriting the
  benefits of both data structures
  - optimal for sparse data

HASH-BASED DATA SUMMARIES
- Input: triple + source
- Hash: triple
- Insert: 3D point and save source information

Example:
Data:    oh:olaf foaf:name “Olaf Hartig” .   (source: ohDoc:)
Hash:    [ 24 , 5 , 2 ] , ohDoc:
Insert:  ( [ 24 , 5 , 2 ] , ohDoc: )

[Figure: the point plotted in a 3D space with axes s, p, and o, each ranging from 1 to 30.]
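
The three steps above (input, hash, insert) can be sketched as follows. The hash function is an assumption chosen for illustration: it uses SHA-1 from the standard library and maps each term into the range 1 to 30 used in the figure, so it will not reproduce the coordinates [24, 5, 2] shown on the slide.

```python
import hashlib

DIMENSION_SIZE = 30  # each axis ranges over 1..30, as in the slide's figure


def hash_term(term):
    """Map one RDF term to a stable coordinate in 1..DIMENSION_SIZE."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % DIMENSION_SIZE + 1


def triple_to_point(triple):
    """Hash subject, predicate, and object separately into a 3D point."""
    return tuple(hash_term(term) for term in triple)


def insert(summary, triple, source):
    """Store the source under the triple's point and return that point."""
    point = triple_to_point(triple)
    summary.setdefault(point, set()).add(source)
    return point


summary = {}
point = insert(summary, ("oh:olaf", "foaf:name", "Olaf Hartig"), "ohDoc:")
print(point, summary[point])
```

Python's built-in `hash` is deliberately avoided here because it is randomised per process for strings, which would make the summary unusable across runs.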

EFFICIENT SOURCE SELECTION
- Summarise data with buckets and store cardinality and source
  information
- Query: Lookup
  { oh:olaf ?p ?o } -> hash ( 24 , ? , ? )

[Figure: the data points summarised by an equi-width histogram and by a QTree, plotted over the s and o axes (1 to 30), with ohDoc: marked.]
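
A bucket lookup of this kind can be sketched with an equi-width histogram. The sketch assumes the points have already been hashed; the bucket width of 10, the class name `Histogram`, and the sources cbDoc2: and otherDoc: are hypothetical. It also shows where false positives come from: every source in a matching bucket is returned, whether or not it holds a matching triple.

```python
from collections import defaultdict

BUCKET_WIDTH = 10  # coordinates 1..30 give 3 buckets per dimension (assumed width)


def bucket_index(coordinate):
    return (coordinate - 1) // BUCKET_WIDTH


class Histogram:
    def __init__(self):
        # bucket -> source -> number of triples (cardinality)
        self.buckets = defaultdict(lambda: defaultdict(int))

    def insert(self, point, source):
        bucket = tuple(bucket_index(c) for c in point)
        self.buckets[bucket][source] += 1

    def select_sources(self, pattern):
        """pattern is a 3-tuple of coordinates; None stands for a variable."""
        wanted = tuple(None if c is None else bucket_index(c) for c in pattern)
        sources = set()
        for bucket, counts in self.buckets.items():
            if all(w is None or w == b for w, b in zip(wanted, bucket)):
                sources.update(counts)
        return sources


hist = Histogram()
hist.insert((24, 5, 2), "ohDoc:")      # point from the slide
hist.insert((3, 5, 17), "cbDoc2:")     # hypothetical point and source
hist.insert((29, 12, 8), "otherDoc:")  # hypothetical: shares the subject bucket of 24

print(sorted(hist.select_sources((24, None, None))))
```

The lookup for ( 24 , ? , ? ) also returns otherDoc:, whose subject coordinate 29 falls into the same bucket as 24. Narrower buckets reduce such false positives at the cost of a larger index.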

EVALUATION
Number of estimated sources as the crucial performance factor

[Figure: number of sources (log scale) estimated by the QTree and by other approaches, compared with the number of actually relevant sources.]

TRADE-OFF: FRESH OR FAST
ACCESSING DATA AT RUNTIME: fresh results, slow query times
MATERIALISED STORE: fast query times, outdated results, limited coverage

How can live and store-based query processing be
combined to obtain a trade-off between fast and
fresh results?

[DESWEB 2012]
[EKAW 2012]
[ISWC 2012]

HYBRID SPARQL EXECUTION IDEA
GIVE ME THE CURRENT TEMPERATURE OF THE EUROPEAN CAPITALS.

dynamic: fresh results
static: fast query times

HYBRID SPARQL: ARCHITECTURE

[Diagram: coherence monitor, index query interface, live query interface, and query planner, connected by update arrows.]

COHERENCE MONITOR

Computes and stores statistics about the freshness and coverage
of the cache for individual query patterns

- store independent: can be applied to any store; no indication
  of specific coverage or update rates
- store specific: more sensitive to the update patterns and
  coverage of the store
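
A coherence statistic for a single query pattern could be computed by comparing the cached results with the live results. The slide does not give the formula, so the Jaccard overlap used below is an assumption made for illustration, as are the function names. The three classes reuse the labels from the coherence charts.

```python
def coherence(store_results, live_results):
    """Overlap between cached and live results (Jaccard index, assumed measure)."""
    store, live = set(store_results), set(live_results)
    if not store and not live:
        return 1.0  # nothing to disagree about
    return len(store & live) / len(store | live)


def coherence_class(value):
    if value == 1.0:
        return "complete coherent"
    if value == 0.0:
        return "complete incoherent"
    return "partially coherent"


# Hypothetical result sets for three patterns.
samples = {
    "static pattern": ({"r1", "r2"}, {"r1", "r2"}),
    "slowly changing pattern": ({"r1", "r2"}, {"r2", "r3"}),
    "dynamic pattern": ({"r1", "r2"}, {"r3"}),
}

for name, (store, live) in samples.items():
    value = coherence(store, live)
    print(name, round(value, 2), coherence_class(value))
```

A monitor would keep one such value per pattern (for example per predicate) and refresh it periodically, so the planner can consult it without issuing live lookups at query time.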

COHERENCE OF PREDICATES

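[Figure: two pie charts, one for the LOD cache and one for SPARQL endpoints, dividing predicates into complete coherent, partially coherent, and complete incoherent. Shares shown across the two charts: 10%, 23%, 24%, 30%, 46%, 67%. Predicates named on the slide: sioc:account_of, swivt:creationDate, foaf:knows.]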

COHERENCE ESTIMATES


QUERY PLANNER

- finding the best query plan
- identifying dynamic/static patterns
- delegation and merging

QUERY PLANNING
selectivity-based plan: ((tp1 JOIN tp2) JOIN tp3) JOIN tp4
coherence-based plan:   ((tp4 JOIN tp1) JOIN tp2) JOIN tp3

Pattern    Selectivity    Coherence
tp1        0.98           0.86
tp2        0.43           0.32
tp3        0.21           0.00
tp4        0.15           0.91
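
The two orderings can be reproduced from the table with a short sketch. It assumes that patterns are joined in descending order of the chosen statistic, which matches the two plans as they appear on the slide, and it adds a single split at a coherence threshold of 0.5. The function names and the use of ">=" at the threshold are assumptions.

```python
STATS = {
    "tp1": {"selectivity": 0.98, "coherence": 0.86},
    "tp2": {"selectivity": 0.43, "coherence": 0.32},
    "tp3": {"selectivity": 0.21, "coherence": 0.00},
    "tp4": {"selectivity": 0.15, "coherence": 0.91},
}


def order_patterns(stats, key):
    """Join order: patterns sorted by the given statistic, highest first."""
    return sorted(stats, key=lambda tp: stats[tp][key], reverse=True)


def split_plan(stats, threshold=0.5):
    """Split the coherence-ordered plan into a store part and a live part."""
    ordered = order_patterns(stats, "coherence")
    static = [tp for tp in ordered if stats[tp]["coherence"] >= threshold]
    dynamic = [tp for tp in ordered if stats[tp]["coherence"] < threshold]
    return static, dynamic


print(order_patterns(STATS, "selectivity"))  # ['tp1', 'tp2', 'tp3', 'tp4']
print(order_patterns(STATS, "coherence"))    # ['tp4', 'tp1', 'tp2', 'tp3']
print(split_plan(STATS))                     # (['tp4', 'tp1'], ['tp2', 'tp3'])
```

In this reading, tp4 and tp1 would be answered from the store and tp2 and tp3 by live querying.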

REAL WORLD EXPERIMENTS
Evaluation of different hybrid query plan strategies

Methodology
- QWalk: various types of SPARQL SELECT queries
  - star-shaped, path-shaped, mixed
  - different numbers of patterns
  - at least one static and one dynamic pattern
- Variable counting ordering
- Single split with threshold (e.g. 0.5)
- Static part is executed first
- Link traversal based query execution

REAL WORLD EXPERIMENTS
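[Figure: live recall (y-axis, ticks 0.3 to 1) plotted against speedup (x-axis, ticks 1, 2, 6, 12), averaged over 43 queries, with the live and store baselines marked. Legend: ordering (coh, sel) and split (rnd., thres., fixed, opt).]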

CONCLUSION
- How dynamic is Linked Data, and what is its impact on store-based query
  processing?
  - We verified that Linked Data is dynamic and that this affects the result
    freshness and completeness of cache-based query engines.
- Can the performance of live querying be improved by applying lightweight
  reasoning?
  - Our source selection and reasoning optimisations improve query time and
    result recall compared to the state of the art.
- How effective are hash-based data summaries for source selection in live
  query processing?
  - The QTree loosens the query restrictions of pure live querying and
    outperforms similar source selection approaches.
- How can live and cache-based query processing be combined to obtain a
  trade-off between fast and fresh results?
  - Hybrid query execution uses knowledge of data dynamics to deliver fast
    and fresh results.

FUTURE WORK
- Dynamic Linked Data Observatory
  - Extended experiments
  - Data mining to discover dynamic relations

- Hybrid Query Execution
  - Develop a cost model which combines selectivity and coherence
  - Automatically find best plan and split
  - Combination of different query approaches

- SPARQL as the query language for the Web
  - Navigational features
