Rethinking Online SPARQL Querying to Support Incremental Result Visualization

Olaf Hartig
http://olafhartig.de
@olafhartig

Prologue

Live Querying the Web of Data
● Federated query processing
– i.e., querying a federation of SPARQL endpoints
● Linked Data query processing
– i.e., querying Linked Data by relying only on the
Linked Data principles (interface: URI lookups)
– e.g., traversal-based query execution
● Querying other Linked Data fragment servers
– e.g., triple pattern fragments

Chapter 1

“Can the progress that has been made
on (Read/Write) Linked Data change the
way we interact with the Web […] ?”

Information in Dynamic Web Pages
Support for such an incremental visualization
has not received much attention in existing
work on querying the Web of Data

“Can the progress that has been made
on (Read/Write) Linked Data change the
way we interact with the Web […] ?”
I think we have not made enough progress to even
enable well-understood interaction techniques that
are widely applied in “traditional” Web applications.

Topics
Opportunities to Optimize the Response
Times of Traversal-based Query Executions
Making the Core Fragment of SPARQL
Suitable for the Task

Chapter 2

Implementation Approach
[Diagram: a data retrieval operator and several triple pattern operators, connected via a dispatcher]
Triple pattern
( ?v1, knows, ?v2 )

Data Retrieval Operator
[Diagram: data retrieval operator, dispatcher, and triple pattern operators]
GET http://example.org/...
RDF triple
( Bob, knows, Alice )
Triple pattern
( ?v1, knows, ?v2 )

Triple Pattern Operator
[Diagram: triple pattern operator and dispatcher]
Triple pattern
( ?v1, knows, ?v2 )
RDF triple
( Bob, knows, Alice )
Intermediate Solution
Timestamp: 1
Bindings: ?v1 → Bob, ?v2 → Alice
Flags: [ ∙ | √ | ∙ | ∙ ]
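
The step shown on this slide can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the names `match`, `IntermediateSolution`, and `initial_solution` are hypothetical, and the reading of the flags as "one flag per triple pattern of the query, set once that pattern has contributed to the solution" is an assumption inferred from the example.

```python
from dataclasses import dataclass
from typing import Optional


def match(pattern, triple) -> Optional[dict]:
    """Match an RDF triple against a triple pattern.

    Terms starting with '?' are variables. Returns the variable
    bindings, or None if the triple does not match the pattern.
    """
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable that occurs twice must be bound to the same term.
            if bindings.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return bindings


@dataclass
class IntermediateSolution:
    timestamp: int
    bindings: dict
    flags: tuple  # assumption: one flag per triple pattern of the query


def initial_solution(pattern, pattern_index, num_patterns, triple, timestamp):
    """Create the intermediate solution for a matching triple (or None)."""
    bindings = match(pattern, triple)
    if bindings is None:
        return None
    flags = tuple(i == pattern_index for i in range(num_patterns))
    return IntermediateSolution(timestamp, bindings, flags)


# The example from the slide: second of four triple patterns, timestamp 1.
sol = initial_solution(("?v1", "knows", "?v2"), 1, 4, ("Bob", "knows", "Alice"), 1)
print(sol)
```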

[Diagram: dispatcher and output, with the following intermediate solution]
Output
Timestamp: 1
Bindings: ?v1 → Alice, ?v2 → Bob
Flags: [ ∙ | √ | ∙ | ∙ ]

Triple Pattern Operator cont'd
[Diagram: two intermediate solutions are combined into one]
Intermediate solution
Timestamp: 461
Bindings: ?v1 → Bob, ?v2 → Steve
Flags: [ ∙ | √ | ∙ | ∙ ]
Intermediate solution
Timestamp: 327
Bindings: ?v1 → Bob, ?v3 → Berlin
Flags: [ √ | ∙ | ∙ | ∙ ]
Combined intermediate solution
Timestamp: 461
Bindings: ?v1 → Bob, ?v2 → Steve, ?v3 → Berlin
Flags: [ √ | √ | ∙ | ∙ ]
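
A possible way to combine two intermediate solutions is sketched below. The rules used here (union of compatible bindings, element-wise OR of the flags, the larger of the two timestamps) are inferred from the single example on the slide and are therefore assumptions; `IntermediateSolution`, `compatible`, and `merge` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntermediateSolution:
    timestamp: int
    bindings: dict
    flags: tuple


def compatible(b1: dict, b2: dict) -> bool:
    """Two sets of bindings are compatible if they agree on shared variables."""
    return all(b2[var] == val for var, val in b1.items() if var in b2)


def merge(s1: IntermediateSolution, s2: IntermediateSolution) -> Optional[IntermediateSolution]:
    """Combine two intermediate solutions, or return None if they conflict."""
    if not compatible(s1.bindings, s2.bindings):
        return None
    return IntermediateSolution(
        timestamp=max(s1.timestamp, s2.timestamp),  # assumption: keep the later one
        bindings={**s1.bindings, **s2.bindings},
        flags=tuple(a or b for a, b in zip(s1.flags, s2.flags)),
    )


# The two intermediate solutions from the slide.
newer = IntermediateSolution(461, {"?v1": "Bob", "?v2": "Steve"}, (False, True, False, False))
older = IntermediateSolution(327, {"?v1": "Bob", "?v3": "Berlin"}, (True, False, False, False))
combined = merge(newer, older)
print(combined)
```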

Properties
[Diagram: data retrieval operator and TP operators, connected via the dispatcher, with the output]
● Supports:
– any reachability-based query semantics
● Highly flexible
– routing of intermediate solutions
● Inspired by “Eddies”
– Avnur & Hellerstein, SIGMOD 2000
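
To illustrate what flexible routing could look like, here is a small sketch of a dispatcher with a pluggable routing policy. It is not the implementation from the talk: the `Dispatcher` class, the policy functions, and the rule that a solution with all flags set is a result are assumptions made for this example.

```python
import random
from collections import deque


class Dispatcher:
    """Routes intermediate solutions to TP operators that have not seen them yet."""

    def __init__(self, num_operators, policy):
        self.queues = [deque() for _ in range(num_operators)]  # one input queue per TP operator
        self.output = []
        self.policy = policy  # function: list of candidate operator indexes -> chosen index

    def dispatch(self, solution):
        candidates = [i for i, done in enumerate(solution["flags"]) if not done]
        if not candidates:
            # Assumption: all flags set means the solution is complete.
            self.output.append(solution)
            return None
        target = self.policy(candidates)
        self.queues[target].append(solution)
        return target


def lowest_index_first(candidates):
    """A fixed routing policy: always pick the first remaining operator."""
    return candidates[0]


def make_random_policy(seed):
    """A random routing policy with a reproducible seed."""
    return random.Random(seed).choice


solution = {"bindings": {"?v1": "Bob", "?v2": "Alice"}, "flags": (False, True, False, False)}
dispatcher = Dispatcher(num_operators=4, policy=lowest_index_first)
target = dispatcher.dispatch(solution)
print(target)
```

Swapping `lowest_index_first` for `make_random_policy(...)` changes the routing without touching the operators, which is the kind of flexibility the slide refers to.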

Hypothesis 1
Response times can be reduced
by applying a suitable routing policy.

Test of Different Routing Policies
Setup:
● Data retrieval operator simply appends to its lookup queue
● Web simulation environment (test Web: W-62-47, test query: Q1, details: [Hartig and Özsu 2014])
● Each bar represents geometric mean of 5 separate executions
[Bar charts: response time for the first reported solution and response time for the last reported solution, each relative to overall QET]
Routing policy has no impact!
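
The reported measure can be computed along the following lines. The timings below are made up purely for illustration, and taking the geometric mean over the per-execution ratios (rather than over the raw times) is an assumption.

```python
from statistics import geometric_mean


def relative_response_time(response_time, qet):
    """Response time as a fraction of the overall query execution time (QET)."""
    return response_time / qet


# Hypothetical (response time for first solution, QET) pairs in seconds
# for 5 separate executions; these are NOT measurements from the talk.
runs = [(12.0, 60.0), (15.0, 58.0), (9.0, 62.0), (14.0, 61.0), (11.0, 59.0)]

relative = [relative_response_time(t, qet) for t, qet in runs]
bar_height = geometric_mean(relative)
print(round(bar_height, 3))
```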

Hypothesis 1
Response times can be reduced
by applying a suitable routing policy.
No!
Why?

Data Retrieval Dominates!!!
[Bar chart: avg. query exec. time (seconds, log scale!) for Query 1, Query 4, Query 5, Query 9, and Query 10; series: 10 threads, 20 threads, cache]
5 queries of the FedBench benchmark suite,
executed over real Linked Data on the WWW
● Different number of lookup threads
used by the data retrieval operator
● Data retrieval op. equipped with a cache
– Cache populated by a first execution
– Times measured for a 2nd, cache-only
execution (i.e., data retrieval deactivated)

Hypothesis 2
Response times can be reduced
by choosing a “good” strategy
of prioritizing URI lookups.

Prioritizing Lookups Randomly
[Line chart: time from begin of the query execution (in minutes) per result element, for five executions (exec1 to exec5), with the QET marked; annotations: “ca. 25% of QET” and “ca. 58%”]
Setup:
● LD10 of the FedBench benchmark suite,
over real Linked Data on the WWW
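
One way to experiment with lookup prioritization is a lookup queue with a pluggable priority function, sketched below. The `LookupQueue` class and the example URIs are hypothetical; a constant priority reproduces the "simply appends" baseline, and a random priority corresponds to the random strategy tested here.

```python
import heapq
import itertools
import random


class LookupQueue:
    """Queue of URIs to look up, ordered by a pluggable priority function."""

    def __init__(self, priority):
        self._priority = priority          # function: URI -> sortable priority (lower first)
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves insertion order

    def add(self, uri):
        heapq.heappush(self._heap, (self._priority(uri), next(self._counter), uri))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


uris = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]

# Baseline: every URI gets the same priority, so the queue behaves like FIFO.
fifo = LookupQueue(priority=lambda uri: 0)
for uri in uris:
    fifo.add(uri)
fifo_order = [fifo.pop() for _ in range(len(uris))]

# Random prioritization: each URI gets a random priority when it is added.
rng = random.Random(42)
randomized = LookupQueue(priority=lambda uri: rng.random())
for uri in uris:
    randomized.add(uri)
random_order = [randomized.pop() for _ in range(len(uris))]
print(fifo_order, random_order)
```

Other strategies would only need a different `priority` function, which keeps the data retrieval operator itself unchanged.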

Hypothesis 2
√

Question
√
What is
?

Chapter 3

Topics
Opportunities to Optimize the Response
Times of Traversal-based Query Executions √
Making the Core Fragment of SPARQL
Suitable for the Task
(by making it monotonic)

Monotonicity?
● Query Q is monotonic if for every pair ( D1, D2 ) of
possible databases, it holds that:
D1 ⊆ D2 ⟹ Q( D1 ) ⊆ Q( D2 )
● Example: the SPARQL pattern
P = (a, p, ?x) OPT (?x, p, ?y)
is not monotonic
– G1 = { (a, p, b) }
– G2 = { (a, p, b), (b, p, c) }
– ⟦P⟧G1 = { μ }, where μ = { ?x → b }
– ⟦P⟧G2 = { μ' }, where μ' = { ?x → b, ?y → c } ≠ μ !
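
The example can be checked with a few lines of code. The sketch below uses the standard set-based semantics of OPT (join, plus those left-hand solutions that have no compatible partner); solution mappings are represented as frozensets of (variable, value) pairs, and all function names are hypothetical.

```python
def eval_tp(pattern, graph):
    """Evaluate a triple pattern over a graph (a set of triples)."""
    result = set()
    for triple in graph:
        mu = {}
        for p_term, t_term in zip(pattern, triple):
            if p_term.startswith("?"):
                if mu.setdefault(p_term, t_term) != t_term:
                    break
            elif p_term != t_term:
                break
        else:  # no break: the triple matches the pattern
            result.add(frozenset(mu.items()))
    return result


def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d2[var] == val for var, val in d1.items() if var in d2)


def join(omega1, omega2):
    return {m1 | m2 for m1 in omega1 for m2 in omega2 if compatible(m1, m2)}


def diff(omega1, omega2):
    return {m1 for m1 in omega1 if not any(compatible(m1, m2) for m2 in omega2)}


def opt(omega1, omega2):
    return join(omega1, omega2) | diff(omega1, omega2)


def eval_P(graph):
    """P = (a, p, ?x) OPT (?x, p, ?y)"""
    return opt(eval_tp(("a", "p", "?x"), graph), eval_tp(("?x", "p", "?y"), graph))


G1 = {("a", "p", "b")}
G2 = {("a", "p", "b"), ("b", "p", "c")}
mu = frozenset({"?x": "b"}.items())
mu_prime = frozenset({"?x": "b", "?y": "c"}.items())

# G1 is a subset of G2, but the result over G1 is not a subset of the result over G2.
print(G1 <= G2, eval_P(G1) <= eval_P(G2))
```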

What is the Issue?
● For any non-monotonic query, elements of
the result set can be output only after we
have seen all query-relevant parts of the DB
– Hence, since we discover our DB (the Web of Data)
at runtime, we can output result elements only after
completing the discovery process
● Good news: the AND-UNION-FILTER fragment of
SPARQL is monotonic [Arenas and Perez 2011]
● Bad news: for the AND-UNION-FILTER-OPT fragment,
monotonicity is undecidable [Hartig 2014]
– i.e., queries with OPT may be non-monotonic

What is the Usage of OPT?
● DBpedia
– 46.4% of ca. 1.3M unique queries
(logs from Apr. – Jul. 2010)
Picalausa and Vansummeren, in SWIM 2011
– 16.6% (logs from USEWOD 2011 dataset)
Gallego et al., in USEWOD 2011
– 15% (logs from USEWOD 2011 dataset)
Elbedweihy et al., in COLD 2011
● Semantic Web conference corpus (SWDF)
– 0.4% (logs from USEWOD 2011 dataset)
Gallego et al., in USEWOD 2011

A Proposal: The OPT+ Operator
● Standard semantics of OPT:
⟦P1 OPT P2⟧G = ( ⟦P1⟧G ⋈ ⟦P2⟧G ) ∪ ( ⟦P1⟧G ∖ ⟦P2⟧G )
● Proposed semantics of OPT+:
⟦P1 OPT+ P2⟧G = ( ⟦P1⟧G ⋈ ⟦P2⟧G ) ∪ ⟦P1⟧G
➔ P1 OPT+ P2 ≡ (P1 AND P2) UNION P1
● Recall our example: the SPARQL pattern
P' = (a, p, ?x) OPT+ (?x, p, ?y)
is monotonic √
– G1 = { (a, p, b) }, G2 = { (a, p, b), (b, p, c) }
– ⟦P'⟧G1 = { μ }, where μ = { ?x → b }
– ⟦P'⟧G2 = { μ, μ' }, where μ' = { ?x → b, ?y → c }
– Hence, ⟦P'⟧G1 ⊆ ⟦P'⟧G2
Epilogue

Conclusions
● Returning result elements early has not yet
received sufficient attention in existing work
on live querying the Web of Data
● Prioritizing data retrieval can reduce response
times of traversal-based query executions
What approaches are suitable and effective?
Similar for federated query processing, LDFs?
● Language features have to be chosen with care
Their impact has to be studied
Dedicated optimization techniques are possible
