1. A HYBRID FRAMEWORK FOR QUERYING LINKED DATA DYNAMICALLY
JÜRGEN UMBRICH
PhD Viva
November 26th, 2012
2. Classical Query Approach
MATERIALISED STORE
centralised
data warehousing
fast query times
26/11/2012 PhD Viva, Jürgen Umbrich Slide 1 of 39
3. MOTIVATING EXAMPLE
GIVE ME THE CURRENT TEMPERATURE OF THE EUROPEAN CAPITALS.
4. Research Questions
How dynamic is Linked Data and what is its impact on store-based query processing?
Can the performance of live querying be improved by applying lightweight reasoning?
How effective are hash-based data summaries for source selection in live query processing?
How can live and store-based processing be combined to obtain a trade-off between fast and fresh results?
5. How dynamic is Linked Data and what is its impact on store-based query processing?
[LDOW 2010]
[DESWEB 2010]
[COLD 2011]
6. DYNAMIC LINKED DATA OBSERVATORY
Allows us to study and assess the dynamics of Linked Data
DataHub and BTC
95K static URIs
95K dynamic (2 hops)
once a week
started in March 2012
[http://world.yale.edu]
weekly dumps are freely available at http://swse.deri.org/dyldo
7. DYNAMICS OF LINKED DATA
How fast does a source change? (15 weeks)
no changes: 58%
every week: 17%
only once: 8%
rest: 17%
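The four categories above can be derived from weekly snapshots along the following lines. This is a minimal sketch, not the observatory's actual pipeline: the function names and the idea of comparing one content fingerprint per week are assumptions made for illustration.

```python
from collections import Counter

def classify_source(snapshots):
    """Classify one source by how often consecutive weekly snapshots differ.

    `snapshots` is a list with one comparable fingerprint per week
    (for example a content hash); this representation is an assumption.
    """
    changes = sum(1 for prev, cur in zip(snapshots, snapshots[1:]) if prev != cur)
    if changes == 0:
        return "no changes"
    if changes == 1:
        return "only once"
    if changes == len(snapshots) - 1:
        return "every week"
    return "rest"

def change_distribution(sources):
    """Map each category to the share of sources that fall into it."""
    counts = Counter(classify_source(s) for s in sources.values())
    total = sum(counts.values())
    return {label: counts[label] / total for label in counts}
```

Calling `change_distribution` on a dictionary of source name to weekly fingerprints yields shares comparable to the percentages listed on the slide.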
8. DYNAMICS OF LINKED DATA
Can we observe different types of changes? (15 weeks)
only adds: 20%
adds/dels: 19%
value updates & adds/dels: 23%
others, only value updates: 14%, 24% (pairing of labels and values not recoverable from the chart)
9. IMPLICATIONS FOR CENTRALISED QUERYING
How coherent are the results?
10. COHERENCE OF QUERIES
                      LOD cache   SPARQL endpoints
complete coherent     1%          15%
partially coherent    56%         50%
complete incoherent   43%         35%
11. PROBLEM WITH CLASSICAL QUERY APPROACH
MATERIALISED STORE
centralised
data warehousing
outdated results
limited coverage
fast query times
12. LTBQE: LINK TRAVERSAL BASED QUERY EXECUTION
Exploiting Linked Data principles:
dereferencing URIs
following links
Query:
SELECT ?f ?img
WHERE {
  oh:olaf foaf:knows ?f .
  ?f foaf:depiction ?img .
}
[Figure: RDF graph. ohDoc: describes oh:olaf (foaf:name "Olaf Hartig", owl:sameAs dblpA:Olaf_Hartig, foaf:img http://..., foaf:knows cb:chris, rdfs:seeAlso cbDoc:). cbDoc: describes cb:chris (foaf:name "Chris Bizer", foaf:depiction http://..., owl:sameAs dblpA:Christian_Bizer).]
Result: ?f = cb:chris, ?img = http://..
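The traversal idea on this slide can be sketched over a toy in-memory "Web". This is an illustrative sketch under stated assumptions, not the LTBQE implementation evaluated in the thesis: `WEB`, `DEREF`, `ltbqe` and the image URL `http://example.org/chris.jpg` are hypothetical stand-ins, and a dictionary lookup replaces the HTTP dereferencing step.

```python
# Toy stand-in for the Web: document name -> set of (s, p, o) triples.
# Document contents and the image URL are hypothetical.
WEB = {
    "ohDoc": {("oh:olaf", "foaf:knows", "cb:chris"),
              ("oh:olaf", "foaf:name", "Olaf Hartig")},
    "cbDoc": {("cb:chris", "foaf:depiction", "http://example.org/chris.jpg"),
              ("cb:chris", "foaf:name", "Chris Bizer")},
}
# Which document a URI dereferences to (replaces an HTTP lookup).
DEREF = {"oh:olaf": "ohDoc", "cb:chris": "cbDoc"}

def is_var(term):
    return term.startswith("?")

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, else return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new

def evaluate(patterns, triples):
    """Nested-loop join of the triple patterns over the local triples."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for b in bindings:
            for t in triples:
                b2 = match(pattern, t, b)
                if b2 is not None:
                    extended.append(b2)
        bindings = extended
    return bindings

def ltbqe(patterns):
    """Dereference URIs from the query, then follow URIs found in partial results."""
    triples, fetched = set(), set()
    frontier = {t for p in patterns for t in p if not is_var(t)}
    while frontier:
        uri = frontier.pop()
        doc = DEREF.get(uri)
        if doc is None or doc in fetched:
            continue  # not dereferenceable, or already retrieved
        fetched.add(doc)
        triples |= WEB[doc]
        # Every term bound by a prefix of the query becomes a lookup candidate.
        for i in range(1, len(patterns) + 1):
            for b in evaluate(patterns[:i], triples):
                frontier.update(b.values())
    return evaluate(patterns, triples), fetched
```

For the query on the slide, `ltbqe([("oh:olaf", "foaf:knows", "?f"), ("?f", "foaf:depiction", "?img")])` first retrieves ohDoc, discovers cb:chris, retrieves cbDoc and only then can bind `?img`. The number of such sequential lookups is one of the query-time factors named on the next slide.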
13. PERFORMANCE FACTORS OF LTBQE
query time is influenced by
source selection
number of sequential lookups
result recall is influenced by
dereferenceability
execution order
connectivity
14. Can the performance of live querying be improved
by applying lightweight reasoning?
[RR 2012]
[SWJ submission]
15. OUR CONTRIBUTION TO LTBQE
Improved recall with reasoning extensions
to make more raw data available
subset of RDFS
explicit owl:sameAs
16. HOW REASONING CAN HELP LTBQE
Query:
SELECT ?label WHERE {
  oh:olaf foaf:knows ?f .
  ?f rdfs:label ?label .
}
Schema: foaf:name rdfs:subPropertyOf rdfs:label
[Figure: RDF graph. ohDoc: describes oh:olaf (foaf:name "Olaf Hartig", owl:sameAs dblpA:Olaf_Hartig, foaf:img http://..., foaf:knows cb:chris, rdfs:seeAlso cbDoc:). cbDoc: describes cb:chris (foaf:name "Chris Bizer", foaf:depiction http://..., owl:sameAs dblpA:Christian_Bizer). dblpADoc:Christian_Bizer describes dblpA:Christian_Bizer (rdfs:label "Christian Bizer", linked to dblpP:Hartig09 via foaf:maker).]
Result: ?label = "Christian Bizer", "Chris Bizer"
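How the two extensions make more answers available can be sketched as a small fixpoint computation. This is a simplified illustration, assuming that all relevant triples have already been retrieved and that each property has at most one super-property; the function name `expand` and the dictionary-based schema are hypothetical and not taken from the thesis.

```python
def expand(triples, sub_property_of):
    """Apply rdfs:subPropertyOf and owl:sameAs until nothing new is inferred.

    `sub_property_of` maps a property to its super-property (assumed schema).
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        # owl:sameAs is treated as symmetric.
        same = {(s, o) for s, p, o in inferred if p == "owl:sameAs"}
        same |= {(o, s) for s, o in same}
        new = set()
        for s, p, o in inferred:
            if p in sub_property_of:
                new.add((s, sub_property_of[p], o))
            for a, b in same:
                if s == a:
                    new.add((b, p, o))  # copy statements to the equivalent URI
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

data = {
    ("cb:chris", "foaf:name", "Chris Bizer"),
    ("cb:chris", "owl:sameAs", "dblpA:Christian_Bizer"),
    ("dblpA:Christian_Bizer", "rdfs:label", "Christian Bizer"),
}
closure = expand(data, {"foaf:name": "rdfs:label"})
labels = {o for s, p, o in closure if s == "cb:chris" and p == "rdfs:label"}
```

Without the extensions no `rdfs:label` triple exists for cb:chris; with them, `labels` contains both names shown as results on the slide.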
17. LTBQE ANALYSIS
Investigate how practical LTBQE is and how much more raw data and results can be made available with our extensions.
How many URIs can be dereferenced?
How much additional data with our extensions?
How do our extensions perform in practice?
18. LTBQE ANALYSIS: EXPERIMENTS
How many URIs can be dereferenced?
Dataset: BTC 2011, 25.4m URIs
position               %URIs   available data
<URI> ?p ?o .          85%     95%
?s ?p <URI> .          46%     44%
?s <URI> ?o .          1%      0.00…%
?s rdf:type <URI> .    10%     0.2%
<URI>                  44%     51%
Schema data
Improved query time by around 50% by reducing the number of lookups
19. LTBQE ANALYSIS: EXPERIMENTS
How much additional data with our extensions?
Dataset: BTC 2011, 18.65m URIs
position                   %URIs   available data
<URI> rdfs:seeAlso ?o .    2%      1.006x
<URI> owl:sameAs ?o .      16%     2.5x
RDFS reasoning*            81%     1.78x
*rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range
authoritative TBox [Bonatti] extracted from BTC 2011
20. QUERY GENERATION
How do our extensions perform in practice?
Existing benchmarks target either a single domain or provide
only a few queries.
QWalk: random walk based query generation.
BTC 2011: 1100 queries, 100 each for 11 “typical” shapes
22. LIMITATION OF LTBQE: JOIN OVER LITERALS
Query (join over literal):
SELECT ?p2
WHERE {
  oh:olaf foaf:name ?name .
  ?p2 foaf:name ?name .
}
LTBQE: ?
materialised store: outdated results
[Figure: RDF graph. ohDoc: describes oh:olaf (foaf:name "Olaf Hartig", owl:sameAs dblpA:Olaf_Hartig, foaf:img http://..., foaf:knows cb:chris, rdfs:seeAlso cbDoc:). dblpADoc:Olaf_Hartig describes dblpA:Olaf_Hartig (foaf:name "Olaf Hartig", linked to dblpP:Hartig09 via foaf:maker).]
23. ALTERNATIVE: SOURCE SELECTION
[Figure: a SOURCE INDEX over documents such as ohDoc: and dblpADoc:Olaf_Hartig, used by the QUERY ENGINE]
24. How effective are hash-based data summaries for
source selection in live query processing?
[WWW 2010]
[WWWJ 2011]
25. APPROXIMATE DATA SUMMARIES
Combined description of schema and instance data
Use approximation to reduce index size (incurs false positives)
Hash-based approach
Space complexity: O(buckets * #sources)
QTree: combination of histograms and R-tree, inheriting the benefits of both data structures
optimal for sparse data
26. HASH-BASED DATA SUMMARIES
Input: triple + source
Hash: triple
Insert: 3D point and save source information
Example:
Data: oh:olaf foaf:name “Olaf Hartig” . (source ohDoc:)
Hash: [ 24 , 5 , 2 ] , ohDoc:
Insert: ([ 24 , 5 , 2 ] , ohDoc: )
[Figure: 3D data space with axes s, p and o, each ranging from 1 to 30, showing the inserted point]
27. EFFICIENT SOURCE SELECTION
Summarise data with buckets and store cardinality and source information
Query: { oh:olaf ?p ?o }
Hash and lookup: ( 24 , ? , ? )
[Figure: the s/o plane (axes from 1 to 30) partitioned into buckets by an equi-width histogram and by a QTree; the matching region yields ohDoc:]
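The insert and lookup steps of the last two slides can be sketched with an equi-width histogram. This is a minimal sketch under assumptions: `zlib.crc32` stands in for the hash function (so it will not reproduce the coordinates [24, 5, 2] shown on the slide), the bucket width of 10 is chosen arbitrarily, and the class name `HistogramSummary` is hypothetical. The QTree itself is not implemented here.

```python
import zlib
from collections import defaultdict

SPACE = 30   # coordinate range per dimension, matching the slide's axes
WIDTH = 10   # bucket width per dimension (assumed for illustration)

def h(term):
    """Deterministically map an RDF term to a coordinate in 1..SPACE."""
    return zlib.crc32(term.encode("utf-8")) % SPACE + 1

def bucket(coord):
    return (coord - 1) // WIDTH

class HistogramSummary:
    def __init__(self):
        # bucket id (bs, bp, bo) -> {source: number of triples in that bucket}
        self.buckets = defaultdict(lambda: defaultdict(int))

    def insert(self, triple, source):
        key = tuple(bucket(h(term)) for term in triple)
        self.buckets[key][source] += 1

    def select_sources(self, pattern):
        """Return sources that may contribute; None in `pattern` is a variable."""
        wanted = [None if t is None else bucket(h(t)) for t in pattern]
        sources = set()
        for key, per_source in self.buckets.items():
            if all(w is None or w == k for w, k in zip(wanted, key)):
                sources.update(per_source)
        return sources

summary = HistogramSummary()
summary.insert(("oh:olaf", "foaf:name", "Olaf Hartig"), "ohDoc")
summary.insert(("cb:chris", "foaf:name", "Chris Bizer"), "cbDoc")
candidates = summary.select_sources(("oh:olaf", None, None))
```

`candidates` always contains ohDoc. It may also contain cbDoc if both subjects hash into the same bucket, which is the false-positive cost of the approximation mentioned earlier. Memory grows with the number of buckets times the number of sources, in line with the stated space complexity.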
28. EVALUATION
Number of estimated sources as the crucial performance factor
[Figure: number of sources (log scale) estimated by the QTree and by other approaches, compared with the actually relevant sources]
29. TRADE-OFF: FRESH OR FAST
ACCESSING DATA AT RUNTIME: fresh results, slow query times
MATERIALISED STORE: fast query times, outdated results, limited coverage
30. How can live and store query processing be
combined to obtain a trade-off between fast and
fresh results?
[DESWEB 2012]
[EKAW 2012]
[ISWC 2012]
31. HYBRID SPARQL EXECUTION IDEA
GIVE ME THE CURRENT TEMPERATURE OF THE EUROPEAN CAPITALS.
dynamic: fresh results
static: fast query times
32. HYBRID SPARQL: ARCHITECTURE
[Figure: architecture with four components (index query interface, live query interface, coherence monitor, query planner) and two update flows]
33. COHERENCE MONITOR
[Figure: hybrid SPARQL architecture, as on the previous slide]
computes and stores statistics about the freshness and coverage
of cache for individual query patterns
store independent: can be applied to any store; no indication
of specific coverage or update rates
store specific: more sensitive to the update patterns and
coverage of the store
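One way to picture per-pattern statistics is a monitor that compares store results against live results and averages the overlap. This is a sketch under assumptions, not the estimator used in the thesis: measuring coherence as the Jaccard overlap of the two result sets, keying statistics by predicate, and the names `coherence` and `CoherenceMonitor` are all choices made for illustration.

```python
def coherence(store_results, live_results):
    """Overlap between cached and live results (assumed measure: Jaccard)."""
    store, live = set(store_results), set(live_results)
    if not store and not live:
        return 1.0
    return len(store & live) / len(store | live)

class CoherenceMonitor:
    def __init__(self):
        self.samples = {}  # pattern key (here: a predicate) -> list of scores

    def record(self, key, store_results, live_results):
        score = coherence(store_results, live_results)
        self.samples.setdefault(key, []).append(score)

    def estimate(self, key, default=0.0):
        """Average observed coherence; unknown patterns get `default`."""
        scores = self.samples.get(key)
        return sum(scores) / len(scores) if scores else default
```

A store-specific monitor would call `record` with results sampled from the store at hand, while a store-independent variant would feed it change statistics gathered elsewhere.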
38. REAL WORLD EXPERIMENTS
Evaluation of different hybrid query plan strategies
Methodology
QWalk: Various types of SPARQL SELECT queries
star-shaped, path-shaped, mixed
different numbers of patterns
at least one static and dynamic pattern
Variable counting ordering
Single split with threshold (e.g. 0.5)
Static part is executed first
Link traversal based query execution
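The planning steps listed above (variable-counting order, one threshold, static part first) might look like the following. This is one possible reading, sketched under assumptions: coherence estimates are keyed by predicate, and the predicates `ex:temperature`, `ex:capitalOf` and `ex:partOf` are hypothetical names for the motivating example.

```python
def count_vars(pattern):
    return sum(1 for term in pattern if term.startswith("?"))

def plan(patterns, coherence, threshold=0.5):
    """Split a query into a static part (store) and a dynamic part (live).

    Patterns are ordered by number of variables (fewest first); patterns whose
    estimated coherence reaches `threshold` are answered from the store, and
    this static part is meant to be executed first.
    """
    ordered = sorted(patterns, key=count_vars)  # stable sort keeps ties in input order
    static = [p for p in ordered if coherence.get(p[1], 0.0) >= threshold]
    dynamic = [p for p in ordered if coherence.get(p[1], 0.0) < threshold]
    return static, dynamic

query = [
    ("?city", "ex:temperature", "?t"),
    ("?city", "ex:capitalOf", "?country"),
    ("?country", "ex:partOf", "ex:Europe"),
]
estimates = {"ex:temperature": 0.1, "ex:capitalOf": 0.95, "ex:partOf": 0.9}
static, dynamic = plan(query, estimates)
```

Here the capital and continent patterns go to the store, and only the temperature pattern is fetched live, using the cities bound by the static part.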
39. REAL WORLD EXPERIMENTS
Avg. of 43 queries
[Figure: recall (0.3 to 1) against speedup (1 to 12) for live, store and split execution; split variants: rnd., thres., fixed, opt; orderings: coh, sel]
40. CONCLUSION
How dynamic is Linked Data and what is its impact on store-based query processing?
We verified that Linked Data is dynamic and that this affects the result freshness and completeness of cache-based query engines.
Can the performance of live querying be improved by applying lightweight reasoning?
Our source selection and reasoning optimisations improve query time and result recall compared to the state of the art.
How effective are hash-based data summaries for source selection in live query processing?
The QTree loosens the query restrictions of pure live querying and outperforms similar source selection approaches.
How can live and cache query processing be combined to obtain a trade-off between fast and fresh results?
Hybrid query execution uses knowledge of data dynamics to deliver fast and fresh results.
41. FUTURE WORK
Dynamic Linked Data Observatory
Extended experiments
Data mining to discover dynamic relations
Hybrid Query Execution
Develop a cost model which combines selectivity and
coherence
Automatically find best plan and split
Combination of different query approaches
SPARQL as the query language for the Web
Navigational features
Editor's notes
e.g. Sindice, Watson, SWSE, Virtuoso
No mention of stream processing. No infrastructure needed; not asking for events. Ad hoc. How to do query processing.
This setup allows for studying (i) dynamics within the datasets, (ii) dynamics between datasets (esp. links), and (iii) the growth of Linked Data and the arrival of new sources (although to a lesser extent).
Make it clearer.
Don’t mention two stores
We proved for two prominent stores that the problem exists.
denote
More links and connect more parts of the graph
Snapshot live
Overlay, dereferencing schema knowledge
Reasonable increase. Most inferences look reasonable.
We run it live. Query generation to the slide.
Add some query times here. If you would assume linear query times. Use a table with ratios.
Materialised store, outdated results: we need to check them again. But that means we do not use the data, only the source information.
Shrink the source index compared to a materialised index.
Introduce an example query to show that LTBQE is limited and that we can fix it by doing source selection. We do not need a full materialised index, since we retrieve the source and compute the query over it if we do a live lookup.
Investigate several lightweight source selection approaches to further improve the query times, increase the result recall and loosen the query type restriction of pure link traversal based query approaches.
Could combine with previous. Attach source to point. If we stored the source information for each point, we would end up with a full index with dic., so we split the numerical data space into buckets.
QTree is optimal for sparse data. Same number of buckets, but more fine-grained source selection.
Show experiments in a different way. Use the diagram again.
Introduce bit by bit: interfaces, coherence, query planner.
involve monitoring a large range of Linked Data sources to build a comprehensive, global picture of the dynamicity of the Web of Data. Previous empirical studies [17,15] have shown varying levels of dynamicity across Linked Data sources; furthermore, we speculate that dynamicity varies by the schema of data [17]. In terms of benefits, cache-independent estimates can be applied generically to any store (and indeed to other use-cases) [6]; however, they give no indication as to the specific coverage or update rates, etc., of the cache engine at hand.
Materialised stores: LOD cache, Sindice SPARQL. Use store icons.
Triple pattern estimates. Centered predicates. Query sampling URIs. Distinct predicates for caches.
More details
Filter out queries which produced empty results (offline sources)
We verified that Linked Data is dynamic, which has an impact on the results of materialised engines. LTBQE approaches offer fresh results but work only for dereferenceable URIs, and we can improve the recall through reasoning extensions. A compact data summary such as the QTree poses no query restrictions and can find more sources that can answer the query than LTBQE. Materialised caches and LTBQE can be combined in a hybrid execution framework to deliver fresh and fast results by integrating the knowledge about data dynamics.
Image not optimal
Insert label and delete two sources
Explain to claudio – maybe remove it