SUPERVISORS
PROF. DR.-ING. HABIL. KLAUS-PETER FÄHNRICH, UNIVERSITY OF LEIPZIG
DR. AXEL-CYRILLE NGONGA NGOMO, UNIVERSITY OF LEIPZIG
May 13th, 2016
EFFICIENT SOURCE SELECTION FOR
SPARQL ENDPOINT QUERY
FEDERATION
Muhammad Saleem
Faculty of Mathematics and Computer Science
University of Leipzig
PhD Defense
1
OUTLINE
1. Introduction
2. Problem Statement
3. State-of-the-art Analysis
4. HIBISCUS: Hyper graph-based source selection
5. DAW: Duplicate-aware source selection
6. SAFE: Policy-aware source selection
7. TopFed: Data distribution-aware source selection
8. FEASIBLE and LSQ
9. LargeRDFBench
10. Conclusion
11. Publication and Awards
2
INTRODUCTION
• Linked, decentralized and distributed architecture
• 9,960 datasets
• ~150B triples
• Complex information needs
• Need for federated queries
INTRODUCTION: EXAMPLE
Return the party membership and news pages about all US presidents.
• Source 1: party memberships, US presidents
• Source 2: US presidents, news pages
• Computing the results requires data from both sources
INTRODUCTION: EXECUTION OF FEDERATION
Federation engine over the SPARQL endpoints S1, S2, S3, S4 (each exposing RDF data)
1. Parsing/Rewriting: rewrite the query and get the individual triple patterns
2. Source Selection: identify capable/relevant sources
3. Federator Optimizer: generate an optimized query execution plan
4. Execute the sub-queries
5. Integrator: integrate the sub-query results
MOTIVATION: SOURCE SELECTION
FedBench (LD3): Return for all US presidents their party membership and news pages about them.
SELECT ?president ?party ?page
WHERE {
?president rdf:type dbpedia:President . # TP1
?president dbpedia:nationality dbpedia:United_States . # TP2
?president dbpedia:party ?party . # TP3
?x nyt:topicPage ?page . # TP4
?x owl:sameAs ?president . # TP5
}
Endpoints (RDF): S1 = DBpedia, S2 = KEGG, S3 = ChEBI, S4 = NYT, S5 = SWDF, S6 = LMDB, S7 = Jamendo, S8 = Geo, S9 = DrugBank
Source Selection Algorithm: triple pattern-wise source selection
TP1 = S1
MOTIVATION: SOURCE SELECTION
Source Selection Algorithm: triple pattern-wise source selection for FedBench (LD3)
TP1 = S1, TP2 = S1, TP3 = S1, TP4 = S4
TP5 = S1 S2 S4 S5-S9
Total triple pattern-wise sources selected = 1+1+1+1+8 => 12
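The count of 12 can be reproduced with a minimal sketch of triple pattern-wise source selection. The capability index below is hypothetical: it stands in for the ASK probes or index lookups a federation engine would issue, and its entries are chosen only so that the per-pattern results match the slide.

```python
from typing import Dict, List, Optional, Set, Tuple

TriplePattern = Tuple[str, str, str]

# The five triple patterns of FedBench LD3, in the order TP1..TP5.
QUERY: List[TriplePattern] = [
    ("?president", "rdf:type", "dbpedia:President"),
    ("?president", "dbpedia:nationality", "dbpedia:United_States"),
    ("?president", "dbpedia:party", "?party"),
    ("?x", "nyt:topicPage", "?page"),
    ("?x", "owl:sameAs", "?president"),
]

# Hypothetical capability index: (predicate, bound object or None) -> endpoints.
INDEX: Dict[Tuple[str, Optional[str]], Set[str]] = {
    ("rdf:type", "dbpedia:President"): {"S1"},
    ("dbpedia:nationality", "dbpedia:United_States"): {"S1"},
    ("dbpedia:party", None): {"S1"},
    ("nyt:topicPage", None): {"S4"},
    ("owl:sameAs", None): {"S1", "S2", "S4", "S5", "S6", "S7", "S8", "S9"},
}


def select_sources(query: List[TriplePattern],
                   index: Dict[Tuple[str, Optional[str]], Set[str]]) -> Dict[str, List[str]]:
    """Look up the relevant endpoints for each triple pattern independently."""
    selection: Dict[str, List[str]] = {}
    for number, (_subject, predicate, obj) in enumerate(query, start=1):
        bound_object = None if obj.startswith("?") else obj
        selection[f"TP{number}"] = sorted(index.get((predicate, bound_object), set()))
    return selection


selection = select_sources(QUERY, INDEX)
total = sum(len(sources) for sources in selection.values())
print(selection)
print("Total triple pattern-wise sources selected:", total)
```

Because every pattern is looked up in isolation, the owl:sameAs pattern keeps all eight endpoints that contain the predicate, whether or not they can contribute to the join.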
MOTIVATION: ANYTHING WRONG?
Source Selection Algorithm: triple pattern-wise source selection for FedBench (LD3)
TP1 = S1, TP2 = S1, TP3 = S1, TP4 = S4
TP5 = S1 S2 S4 S5-S9, of which only S4 contributes to the final result
317068 irrelevant intermediate results
Total triple pattern-wise sources selected = 1+1+1+1+1 => 5
PROBLEM STATEMENT
Overestimation of sources is expensive
• Extra intermediate results
• Extra network traffic
• Increased overall runtime
1. How to perform join-aware source selection with ensured result set completeness?
2. How to test the efficiency of the source selection?
Comprehensive benchmarks
• Which system is better and why?
• What are the limitations of a given system?
• How can one improve a given system?
3. How to design comprehensive benchmarks for federated SPARQL engines as well as triple stores?
STATE-OF-THE-ART
Saleem et al. A Fine-Grained Evaluation of SPARQL Endpoint Federation Systems (Semantic
PROBLEM STATEMENT AND CONTRIBUTIONS
Research Questions
1. How to perform join-aware source selection with ensured result set completeness?
2. How to perform duplicate-aware source selection?
3. How to perform policy-aware source selection?
4. How to perform data distribution-aware source selection?
5. How to design comprehensive benchmarks for federated SPARQL engines as well as triple stores?
[Figure: federation engine (Parsing/Rewriting, Source Selection, Federator Optimizer, Integrator) over the RDF endpoints S1-S4, annotated with the contributions HiBISCuS, DAW, SAFE, TopFed and QUETSAL, LargeRDFBench, State-of-the-art Evaluation]
MOTIVATION: JOIN-AWARE SOURCE SELECTION
Source Selection Algorithm: triple pattern-wise source selection for FedBench (LD3)
TP1 = S1, TP2 = S1, TP3 = S1, TP4 = S4
TP5 = S1 S2 S4 S5-S9, of which only S4 is needed once the join on ?x with TP4 is taken into account
Total triple pattern-wise sources selected = 1+1+1+1+1 => 5
HIBISCUS: HYPERGRAPH-BASED SOURCE SELECTION
• Models SPARQL queries as hypergraphs
• Makes use of URI authorities in its index
• Performs join-aware triple pattern-wise source selection
• Can be combined with any existing SPARQL endpoint federation system
Muhammad Saleem, Axel-Cyrille Ngonga Ngomo. HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation (ESWC, 2014)
HIBISCUS: HYPERGRAPH-BASED SOURCE SELECTION
• Makes use of the URI's authorities
http://dbpedia.org/ontology/party
Scheme: http
Authority: dbpedia.org
Path: /ontology/party
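The decomposition above can be reproduced with Python's standard library. This is only an illustration of what "authority" means here; the helper name split_uri is made up for this sketch and says nothing about how HiBISCuS itself extracts authorities.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_uri(uri: str) -> Tuple[str, str, str]:
    """Return (scheme, authority, path) of a URI."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path


scheme, authority, path = split_uri("http://dbpedia.org/ontology/party")
print(scheme, authority, path)
```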
HIBISCUS: HYPERGRAPH-BASED SOURCE SELECTION
Hypergraph of the LD3 query, one hyperedge per triple pattern
• ?president, rdf:type, dbpedia:President
• ?president, dbpedia:nationality, dbpedia:United_States
• ?president, dbpedia:party, ?party
• ?x, nyt:topicPage, ?page
• ?x, owl:sameAs, ?president
Legend: star, simple and hybrid vertices; tail of hyperedge
HIBISCUS: HYPERGRAPH-BASED SOURCE SELECTION
[Figure: hypergraph of the LD3 query with each hyperedge labelled by its candidate sources (DBpedia, KEGG, NYT, SWDF, LMDB, Geo, Jamendo, DrugBank) together with their subject authorities (Sbj. auth.) and object authorities (Obj. auth.)]
HIBISCUS: HYPERGRAPH-BASED SOURCE SELECTION
Total triple pattern-wise sources selected = 5 instead of 12
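The drop from 12 to 5 can be illustrated with a minimal sketch of authority-based pruning for the subject-subject join on ?x between TP4 and TP5. All authority values below are hypothetical placeholders rather than the real data summaries, and the function is a simplified reading of the idea, not the published algorithm: a source is kept for a pattern only if its subject authorities overlap with those of the joining pattern.

```python
from typing import Dict, Set, Tuple

# (source, predicate) -> subject authorities. All values are hypothetical.
SUBJECT_AUTHORITIES: Dict[Tuple[str, str], Set[str]] = {
    ("S4", "nyt:topicPage"): {"nyt.example.org"},
    ("S1", "owl:sameAs"): {"dbpedia.org"},
    ("S2", "owl:sameAs"): {"kegg.example.org"},
    ("S4", "owl:sameAs"): {"nyt.example.org"},
    ("S5", "owl:sameAs"): {"swdf.example.org"},
    ("S6", "owl:sameAs"): {"lmdb.example.org"},
    ("S7", "owl:sameAs"): {"jamendo.example.org"},
    ("S8", "owl:sameAs"): {"geo.example.org"},
    ("S9", "owl:sameAs"): {"drugbank.example.org"},
}


def authorities(predicate: str, sources: Set[str],
                summaries: Dict[Tuple[str, str], Set[str]]) -> Set[str]:
    """Union of the subject authorities of a predicate over a set of sources."""
    found: Set[str] = set()
    for source in sources:
        found |= summaries[(source, predicate)]
    return found


def prune_subject_join(pred_a: str, sources_a: Set[str],
                       pred_b: str, sources_b: Set[str],
                       summaries: Dict[Tuple[str, str], Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Keep only sources whose subject authorities appear on both sides of the join."""
    common = authorities(pred_a, sources_a, summaries) & authorities(pred_b, sources_b, summaries)
    keep_a = {s for s in sources_a if summaries[(s, pred_a)] & common}
    keep_b = {s for s in sources_b if summaries[(s, pred_b)] & common}
    return keep_a, keep_b


tp4 = {"S4"}
tp5 = {"S1", "S2", "S4", "S5", "S6", "S7", "S8", "S9"}
tp4_kept, tp5_kept = prune_subject_join("nyt:topicPage", tp4, "owl:sameAs", tp5,
                                        SUBJECT_AUTHORITIES)
total = 3 + len(tp4_kept) + len(tp5_kept)  # TP1-TP3 each keep S1
print(tp4_kept, tp5_kept, total)
```

With these placeholder summaries only S4 shares a subject authority with the nyt:topicPage pattern, so TP5 shrinks from eight sources to one.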
EFFICIENT SOURCE SELECTION
#TP = total triple pattern-wise sources selected, #AR = SPARQL ASK requests, SST = source selection time
        FedX (warm)        SPLENDID           DARQ               ANAPSID            HiBISCuS (warm)
Query   #TP  #AR  SST      #TP  #AR  SST      #TP  #AR  SST      #TP  #AR  SST      #TP  #AR  SST
CD       78    0  7.33      78   99  320.9     84    0  7.286     36   43  186       35    0  30.43
LS       56    0  7.99      56   90  307.3     77    0  7.571     44   63  477.4     41    0  23.14
LD       97    0  8.09      97  126  279      113    0  7.727     54   37  803.5     47    0  16
Net     231    0  8        231  315  299      274    0  7.56     134  143  554      123    0  22
FEDX EXTENSION WITH HIBISCUS
[Bar chart: query execution time (msec) of FedX (warm) vs. FedX+HiBISCuS for CD1-CD7, LS1-LS7, LD1-LD11 and the average]
Improvement in 20/25 queries with net performance improvement 24.61%
SPLENDID EXTENSION WITH HIBISCUS
[Bar chart: query execution time (msec) of SPLENDID vs. SPLENDID+HiBISCuS for CD1-CD7, LS1-LS7, LD1-LD11 and the average]
Improvement in 24/25 queries with net performance improvement 82.72%
DARQ EXTENSION WITH HIBISCUS
[Bar chart: query execution time (msec, log scale, in hundreds) for CD1-CD7, LS1-LS7, LD1-LD11 and the average; legend as extracted: ANAPSID, SPLENDID+HiBISCus; several queries are marked "Not supported", "Runtime error" or "Timeout"]
Improvement in 20/20 queries with net performance improvement 92.22%
SPLENDID+HIBISCUS VS. ANAPSID
[Bar chart: query execution time (msec, log scale, in hundreds) of ANAPSID vs. SPLENDID+HiBISCuS for CD1-CD7, LS1-LS7, LD1-LD11 and the average; one query is marked "Zero Results"]
Improvement in 25/25 queries with net performance improvement 98%
DAW: DUPLICATE-AWARE SOURCE SELECTION
Triple pattern-wise source selection and skipping
TP1 (?uri <p1> ?v1) = S1 S2 S3
TP2 (?uri <p2> ?v2) = S1 S2 S4
[Figure: retrieved results per source for TP1 and TP2]
Min. number of new triples (threshold) = 20
Total triple pattern-wise selected sources = 4
Total triple pattern-wise skipped sources = 2
DAW: DUPLICATE-AWARE SOURCE SELECTION
• A combination of min-wise independent permutations (MIPs) with compact data summaries
• Uses average selectivity values for bound subjects and objects
• Can be combined with any existing SPARQL endpoint federation system
• Can be used for partial result retrieval
Saleem et al. DAW: Duplicate-AWare Federated Query Processing over the Web of Data (ISWC, 2013)
DAW: MIN-WISE INDEPENDENT PERMUTATIONS
Triples T1(s,p,o) ... T6(s,p,o) are mapped to an ID set with h(concat(s,o))
Apply permutations to all IDs
$h_1 = (7x + 3) \bmod 51$, $h_2 = (5x + 6) \bmod 51$, ..., $h_N = (3x + 9) \bmod 51$
In general: $h_i = (a_i x + b_i) \bmod U$
Create MIP vector from minima of permutations
[Figure: the permuted ID sets and the minimum of each permutation]
MIPs estimated operations
VA = (8, 9, 30, 24, 36, 9)
VB = (8, 24, 20, 48, 36, 13)
Union (VA, VB) = (8, 9, 20, 24, 36, 9)
Resemblance (VA, VB) = 2/6 => 0.33
Overlap (VA, VB) = 0.33*(6+6) / (1+0.33) => 3
$\mathrm{Resemblance}(S_A, S_B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} \approx \frac{|V_A \cap V_B|}{N}$
$\mathrm{Overlap}(S_A, S_B) \approx \frac{\mathrm{Resemblance}(V_A, V_B) \times (|S_A| + |S_B|)}{\mathrm{Resemblance}(V_A, V_B) + 1}$
Error estimation $= O(1/\sqrt{N})$
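The estimates on this slide can be checked with a short sketch. The two MIP vectors and the hash coefficients are the ones shown on the slide; the ID set passed to mip_vector is a hypothetical example, and the function names are chosen for this illustration only.

```python
from typing import List, Sequence, Tuple


def mip_vector(ids: Sequence[int], permutations: Sequence[Tuple[int, int]],
               modulus: int) -> List[int]:
    """One entry per permutation h_i(x) = (a_i * x + b_i) mod U: the minimum over all IDs."""
    return [min((a * x + b) % modulus for x in ids) for a, b in permutations]


def union(va: Sequence[int], vb: Sequence[int]) -> List[int]:
    """MIP vector of the union of two sets: position-wise minimum."""
    return [min(a, b) for a, b in zip(va, vb)]


def resemblance(va: Sequence[int], vb: Sequence[int]) -> float:
    """Fraction of positions at which the two MIP vectors agree."""
    matches = sum(1 for a, b in zip(va, vb) if a == b)
    return matches / len(va)


def overlap(res: float, size_a: int, size_b: int) -> float:
    """Estimated number of shared elements, given the resemblance and the set sizes."""
    return res * (size_a + size_b) / (res + 1)


VA = [8, 9, 30, 24, 36, 9]
VB = [8, 24, 20, 48, 36, 13]

print(union(VA, VB))
print(resemblance(VA, VB))
print(overlap(resemblance(VA, VB), 6, 6))

# Hypothetical ID set, permuted with the three hash functions from the slide.
example_ids = [20, 48, 24, 36, 18, 8]
print(mip_vector(example_ids, [(7, 3), (5, 6), (3, 9)], 51))
```

Only the fixed-size vectors are compared, never the underlying triples, which is what keeps the overlap estimate cheap enough to use during source selection.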
FEDX EXTENSION WITH DAW
[Bar chart: execution time (sec) of FedX vs. DAW for the queries STP, S-1, S-2, P-1, P-2, P-3 on Diseasome, Publication, Geo Data and Movie]
Overall performance evaluation
           Diseasome   Publication   Geo Data   Movie   Overall
FedX         2.44        1.48          4.60      1.74    2.44     (average)
DAW          1.98        1.67          3.92      1.61    2.20     (average)
Gain %      18.79      -12.38         14.71      7.59    9.76
SPLENDID EXTENSION WITH DAW
[Bar chart: execution time (sec) of SPLENDID vs. DAW for the queries STP, S-1, S-2, P-1, P-2, P-3 on Diseasome, Publication, Geo and Movie]
Overall performance evaluation
           Diseasome   Publication   Geo Data   Movie   Overall
SPLENDID     3.78        2.18          7.27      1.9     3.71     (average)
DAW          3.04        2.37          6.22      1.688   3.30     (average)
Gain %      19.48       -8.94         14.40     11.16   11.11
DARQ EXTENSION WITH DAW
[Bar chart: execution time (sec) of DARQ vs. DAW for the queries STP, S-1, S-2, P-1, P-2, P-3 on Diseasome, Publication, Geo and Movie]
Overall performance evaluation
           Diseasome   Publication   Geo Data   Movie   Overall
DARQ         8.27        5.26         23.44      1.96    9.59     (average)
DAW          6.34        4.94         19.62      1.688   8.01     (average)
Gain %      23.34        6.14         16.31     13.88   16.46
SAFE: POLICY-AWARE SOURCE SELECTION
Return the number of patients who have been administered the drug Insulin and exhibit BMI > 25 and Hypertension and Diabetes as adverse events
Switzerland, Cyprus, Greece
Yasar et al. SAFE: Policy Aware SPARQL Query Federation Over RDF Data
SAFE: POLICY-AWARE SOURCE SELECTION
Source Selection → Access Policy Filtering → Query Execution
SAFE: POLICY-AWARE SOURCE SELECTION
Source Selection → Access Policy Filtering → Query Execution
Access Policy Framework
• Input: user (Oya, Clinical Researcher, Expertise: Diabetes)
• Input: requested data (S1, S2, S3)
• Output for each of S1, S2, S3: grants access or denies access
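The access policy filtering step can be sketched as a filter applied to the output of source selection. Everything concrete below is an assumption made for illustration: the policy table, the rule that unknown sources are denied, and which source ends up denied are not recoverable from the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class User:
    name: str
    role: str
    expertise: str


Policy = Callable[[User], bool]

# Hypothetical access policies: source -> rule deciding whether a user may query it.
POLICIES: Dict[str, Policy] = {
    "S1": lambda user: user.role == "Clinical Researcher",
    "S2": lambda user: user.expertise == "Diabetes",
    "S3": lambda user: user.role == "Administrator",
}


def filter_by_policy(user: User, candidates: List[str],
                     policies: Dict[str, Policy]) -> Tuple[List[str], List[str]]:
    """Split the selected sources into granted and denied ones (deny by default)."""
    granted = [s for s in candidates if s in policies and policies[s](user)]
    denied = [s for s in candidates if s not in granted]
    return granted, denied


oya = User(name="Oya", role="Clinical Researcher", expertise="Diabetes")
granted, denied = filter_by_policy(oya, ["S1", "S2", "S3"], POLICIES)
print("Grants access:", granted)
print("Denies access:", denied)
```

Only the granted sources would be handed on to query execution, so no sub-query is ever sent to an endpoint the user is not allowed to read.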
SAFE: SOURCE SELECTION EVALUATION
Sum of triple-pattern-wise sources selected for each query
Systems  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Q11  Q12  Avg
SAFE      8  10  13  16  15  13  15  16   7    7    9    7   11
FedX      9  13  16  24  20  14  16  19  15   17    9   16   16
Number of SPARQL ASK requests used for source selection
Systems  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Q11  Q12  Avg
SAFE      0   0   0   0   0   0   0   0   0    0    0    0    0
FedX     36  28  40  64  48  40  44  40  21   21    9   21   35
SAFE: QUERY RUNTIME EVALUATION
[Bar chart: query runtime (msec, log scale) of SAFE vs. FedX for Q1-Q12 and the average]
SAFE is 3.61 times faster than FedX
TOPFED: DATA DISTRIBUTION-AWARE SOURCE SELECTION
• Intelligent data distribution combined with efficient source selection to handle federation over Big Data
• Federation over 20.4 billion triples of Linked TCGA data
Saleem et al. TopFed: TCGA Tailored Federated Query Processing and Linking to
TOPFED
Predicate and class sets
A = {chromosome, result, bcr_patient_barcode}
B = {DNA-Methylation}
C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
D = {seg_mean, rpmmm, scaled_est, p_exp_val}
E = {RPKM}
F = {Expression-Exon}
G = {start, stop}
M = {beta_value, position}
Categories and their SPARQL endpoints (with the tumours each endpoint covers)
• Blue (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical): b1 (1-16), b2 (17-33)
• Pink (Exon-Expression): p1 (1-5), p2 (6-11), p3 (12-16), p4 (17-22), p5 (23-27), p6 (28-33)
• Green (Methylation): g1 (1-4), g2 (5-8), g3 (9-12), g4 (13-16), g5 (17-20), g6 (21-24), g7 (25-27), g8 (28-30), g9 (31-33)
For each query triple t(s, p, o) ∈ T
• C-1 ∨ Category Colour = blue
• C-2 ∨ Category Colour = pink
• C-3 ∨ Category Colour = green
• IF tumour lookup is successful, forward to corresponding leaf, ELSE broadcast to every one
C-1 = {{p ∈ {D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C) } ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F) }}}
C-2 = {{p ∈ {E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C) }}}
C-3 = {{p ∈ {M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B) } ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C) }}}
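The final routing step (forward to a leaf if the tumour lookup succeeds, otherwise broadcast) can be sketched as follows. This sketch rests on two assumptions: that the tumour ranges pair up with the endpoints in the order they appear on the slide, and that "broadcast" means every endpoint of the already determined category colour. The evaluation of the conditions C-1 to C-3 that determines the colour is left out.

```python
from typing import Dict, List, Optional, Tuple

# colour -> (endpoint, first tumour, last tumour); pairing of ranges is assumed.
LEAVES: Dict[str, List[Tuple[str, int, int]]] = {
    "blue": [("b1", 1, 16), ("b2", 17, 33)],
    "pink": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
             ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "green": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12), ("g4", 13, 16),
              ("g5", 17, 20), ("g6", 21, 24), ("g7", 25, 27),
              ("g8", 28, 30), ("g9", 31, 33)],
}


def route(colour: str, tumour: Optional[int] = None) -> List[str]:
    """Return the endpoints a triple pattern of the given colour is sent to."""
    leaves = LEAVES[colour]
    if tumour is not None:
        for endpoint, first, last in leaves:
            if first <= tumour <= last:
                return [endpoint]  # tumour lookup successful: single leaf
    return [endpoint for endpoint, _first, _last in leaves]  # broadcast


print(route("pink", 14))
print(route("green", 26))
print(route("blue"))
```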
TOPFED VS. FEDX
Selects 50% fewer data sources than FedX without losing recall
TOPFED VS. FEDX
• TopFed outperforms FedX significantly on 90% of the queries
• On average, the query run time of TopFed is about 1/3 of that of FedX
[Bar chart: query execution time (ms, log scale) of FedX (cached) vs. TopFed for queries 1-10 and the average]
SPARQL BENCHMARKS
Non-Federated Benchmarks
• Centralized repositories
• Queries span a single dataset
• Real or synthetic
• Examples: LUBM, SP2Bench, BSBM, WatDiv, DBPSB, FEASIBLE
Federated Benchmarks
• Multiple interlinked datasets
• Queries span multiple datasets
• Real or synthetic
• Examples: FedBench, LargeRDFBench
FEASIBLE: BENCHMARK GENERATION FRAMEWORK
• Dataset cleaning
• Feature vectors and normalization
• Selection of exemplars
• Selection of benchmark queries
Saleem et al. FEASIBLE: A Featured-Based SPARQL Benchmark Generation
FEATURE VECTORS AND
NORMALIZATION
53
SELECT DISTINCT ?entita ?nome
WHERE
{
?entita rdf:type dbo:VideoGame .
?entita rdfs:label ?nome
FILTER regex(?nome, "konami", "i")
}
LIMIT 100
Query Type: SELECT
Results Size: 13
Basic Graph Patterns (BGPs): 1
Triple Patterns: 2
Join Vertices: 1
Mean Join Vertices Degree: 2.0
Mean triple patterns selectivity: 0.01709761619798973
UNION: No
DISTINCT: Yes
ORDER BY: No
REGEX: Yes
LIMIT: Yes
OFFSET: No
OPTIONAL: No
FILTER: Yes
GROUP BY: No
Runtime (ms): 65
Feature Vector: 13 1 2 1 2 0.017 0 1 0 1 1 0 0 1 0 65
Normalized Feature Vector: 0.11 0.53 0.67 0.14 0.08 0.017 0 1 0 1 1 0 0 1 0 0.14
FEASIBLE
54
Plot feature vectors in a multidimensional space
Query F1 F2
Q1 0.2 0.2
Q2 0.5 0.3
Q3 0.8 0.3
Q4 0.9 0.1
Q5 0.5 0.5
Q6 0.2 0.7
Q7 0.1 0.8
Q8 0.13 0.65
Q9 0.9 0.5
Q10 0.1 0.5
Suppose we need a benchmark of 3 queries
[Scatter plot: Q1-Q10 plotted by F1 (x-axis, 0 to 1) and F2 (y-axis, 0 to 0.9)]
FEASIBLE
[Scatter plots: Q1-Q10 plotted by F1 and F2, one plot per step below]
1. Calculate the average point
2. Select the point of minimum Euclidean distance to the avg. point (*red is our first exemplar)
3. Select the point that is farthest from the exemplars; repeat until there are 3 exemplars
4. Calculate the distance from Q1 to each exemplar and assign Q1 to the minimum distance exemplar
5. Repeat the process for Q2, Q3, Q6, Q8, Q9 and Q10
6. Calculate the average across each cluster
7. Calculate the distance of each point in a cluster to the average
8. Select the minimum distance query as the final benchmark query from that cluster
• Q2 is the final selected query from the yellow cluster
• Q3 is the final selected query from the green cluster
• Q8 is the final selected query from the brown cluster
Our benchmark queries are Q2, Q3, and Q8
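The walkthrough can be replayed on the ten example points with a minimal sketch. Two details are assumptions, not taken from the slides: "farthest from the exemplars" is read as the largest summed distance to the exemplars chosen so far, and ties are broken by iteration order. In this toy data Q9 is equidistant from two exemplars, so the query picked from that one cluster depends on the tie-break and may differ from the slide.

```python
from math import dist
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]

# Normalized (F1, F2) feature vectors from the example table.
QUERIES: Dict[str, Point] = {
    "Q1": (0.2, 0.2), "Q2": (0.5, 0.3), "Q3": (0.8, 0.3), "Q4": (0.9, 0.1),
    "Q5": (0.5, 0.5), "Q6": (0.2, 0.7), "Q7": (0.1, 0.8), "Q8": (0.13, 0.65),
    "Q9": (0.9, 0.5), "Q10": (0.1, 0.5),
}


def centroid(points: Sequence[Point]) -> Point:
    """Average point of a set of 2D points."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def select_exemplars(queries: Dict[str, Point], k: int) -> List[str]:
    """Start with the query closest to the average, then add the farthest ones."""
    avg = centroid(list(queries.values()))
    exemplars = [min(queries, key=lambda q: dist(queries[q], avg))]
    while len(exemplars) < k:
        rest = [q for q in queries if q not in exemplars]
        # Assumption: "farthest" means largest summed distance to all exemplars.
        exemplars.append(max(rest, key=lambda q: sum(
            dist(queries[q], queries[e]) for e in exemplars)))
    return exemplars


def build_benchmark(queries: Dict[str, Point], k: int) -> Tuple[List[str], List[str]]:
    """Cluster the queries around k exemplars and pick one query per cluster."""
    exemplars = select_exemplars(queries, k)
    clusters: Dict[str, List[str]] = {e: [] for e in exemplars}
    for name, point in queries.items():
        nearest = min(exemplars, key=lambda e: dist(point, queries[e]))
        clusters[nearest].append(name)
    selected = []
    for members in clusters.values():
        avg = centroid([queries[m] for m in members])
        selected.append(min(members, key=lambda m: dist(queries[m], avg)))
    return exemplars, selected


exemplars, selected = build_benchmark(QUERIES, 3)
print("Exemplars:", exemplars)
print("Benchmark queries:", selected)
```

The exemplars only seed the clusters; the query finally picked from each cluster is the one closest to the cluster average, which is why an exemplar itself need not end up in the benchmark.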
COMPARISON OF COMPOSITE
ERROR
74
FEASIBLE’s composite error is 54.9% less than DBPSB
RANK-WISE RANKING OF TRIPLE STORES
All values are in percentages
• No system is the sole winner or loser for a particular rank
• Virtuoso mostly lies in the higher ranks, i.e., rank 1 and 2 (68.29%)
• Fuseki is mostly in the middle ranks, i.e., rank 2 and 3 (65.14%)
• OWLIM-SE is usually on the slower side, i.e., rank 3 and 4 (60.86%)
• Sesame is either fast or slow: rank 1 (31.71% of the queries) and rank 4 (23.14%)
LARGERDFBENCH
32 Queries
• 14 simple
• 10 complex
• 8 large data
14 interlinked datasets
[Figure: dataset graph grouped into Life Sciences, Cross Domain and Large Data. Datasets: Linked MDB, DBpedia, New York Times, Linked TCGA-M, Linked TCGA-E, Linked TCGA-A, Affymetrix, SW Dog Food, KEGG, Drugbank, Jamendo, ChEBI, Geo names. Link labels: basedNear, owl:sameAs, x-geneid, country, ethnicity, race, keggCompoundId, bcr_patient_barcode, same instance. #Links: 251.3k, 1.7k, 4.1k, 21.7k, 1.3k]
Saleem et al. LargeRDFBench: A Billion Triples Benchmark for SPARQL Endpoint
WHY LARGERDFBENCH?
78
LARGERDFBENCH QUERIES PROPERTIES
14 Simple
• 2-7 triple patterns
• Subset of SPARQL clauses
• Query execution time around 2 seconds on avg.
10 Complex
• 8-13 triple patterns
• Use more SPARQL clauses
• Query execution time up to 10 min
8 Large Data
• Minimum 80459 results
• Large intermediate results
• Query execution time in hours
SOURCE SELECTION EVALUATION
81
RESULT SET COMPLETENESS AND
CORRECTNESS
82
QUERIES RUNTIME RESULTS
[Bar chart: runtime (msec, log scale) of FedX (cold), FedX (100% cached), SPLENDID, ANAPSID, FedX+HiBISCuS and SPLENDID+HiBISCuS for the simple queries S1-S14 and the average]
Ranking, best first: FedX+HiBISCuS, FedX > SPLENDID+HiBISCuS > ANAPSID > SPLENDID (12/14, 8/14, 10/14)
QUERIES RUNTIME RESULTS
[Bar chart: runtime (msec, log scale) of FedX (cold), FedX (100% cached), SPLENDID, ANAPSID, FedX+HiBISCuS and SPLENDID+HiBISCuS for the complex queries C1-C10 and the average; three entries are marked "Runtime error"]
Ranking, best first: ANAPSID > SPLENDID+HiBISCuS > FedX+HiBISCuS, FedX > SPLENDID (4/7, 5/7, 5/7)
QUERIES RUNTIME RESULTS
85
CONCLUSIONS
[Figure: federation engine (Parsing/Rewriting, Source Selection, Federator Optimizer, Integrator) over the RDF endpoints S1-S4]
Better source selection leads to overall improvement of runtime performance
• HIBISCUS: 24.61% - 92.22%
• DAW: 9.79% - 16.46%
• SAFE: 84%
• TopFed: 68%
Better benchmarking allows for informed selection of RDF stores
• 55% less error than DBPSB
• Column stores (Virtuoso) not always best
LargeRDFBench addresses drawbacks of current federated benchmarks
• SPARQL features
• Size of intermediary results
• Total runtime of queries
Contributions allow for
• Informed selection of triple stores and of federation engines
• Better source selection
• Efficient query planning
• Reduction of intermediate results
• Time-efficient query execution
FUTURE DIRECTIONS
• Top-K relevant source selection
• Cost-based query planning
• Caching intermediate results
• Intelligent data distribution
• Provenance and runtime estimation
• Federated benchmarks generated from query logs
• Synthetic benchmarks that are more like real benchmarks
AWARDS
1. Best paper award at conference on Semantics in Healthcare and
Life Sciences (CSHALS 2014) with paper titled GenomeSnip:
Fragmenting the Genomic Wheel to augment discovery in cancer
research
2. Semantic Web Challenge-Big Data Track winner at ISWC 2013 with
paper titled Fostering Serendipity through Big Linked Data
3. I-CHALLENGE (Linked Data Cup) winner at I-Semantics 2013 with
paper titled Linked Cancer Genome Atlas Database
92
PUBLICATIONS AND CITATIONS
Total Publications: 25
• 5 journal papers (I.F. 2.55, 2.55, 2.26, 0.44)
• 10 conference papers (5 A ranked, CORE)
• 4 workshop papers
• 2 tutorials (A ranked, CORE)
• 1 technical report
• 3 demos (A ranked, CORE)
THANK YOU
94
PUBLICATIONS
2016
1. Muhammad Saleem, Ricardo Usbeck, Michael Roder, and Axel-Cyrille Ngonga Ngomo. SPARQL Querying Benchmarks. Tutorial at International Semantic Web Conference (ISWC), 2016.
2. Ethem Cem Ozkan, Muhammad Saleem, Erdogan Dogdu, and Axel-Cyrille Ngonga Ngomo. UPSP: Unique Predicate-based Source Selection for SPARQL Endpoint Federation. PROFILES at Extended Semantic Web Conference (ESWC), 2016.
95
PUBLICATIONS
2015
1. Muhammad Saleem, Yasar Khan, Ali Hasnain, Ivan Ermilov, and Axel-Cyrille Ngonga Ngomo. A Fine-Grained Evaluation of SPARQL Endpoint Federation Systems. Semantic Web Journal, 2015.
2. Muhammad Saleem, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo. FEASIBLE: A Feature-Based SPARQL Benchmark Generation Framework. International Semantic Web Conference (ISWC), 2015.
3. Muhammad Saleem, Muhammad Intizar Ali, Ruben Verborgh, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo. LSQ: The Linked SPARQL Queries Dataset. International Semantic Web Conference (ISWC), 2015.
4. Muhammad Saleem, Muhammad Intizar Ali, Ruben Verborgh, and Axel-Cyrille Ngonga Ngomo. Federated Query Processing over Linked Data. Tutorial at International Semantic Web Conference (ISWC), 2015.
5. Muhammad Saleem, Intizar Ali, Aidan Hogan, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo. LSQ: The Linked SPARQL Queries Dataset. Technical report.
6. Muhammad Saleem, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo. Automatic SPARQL Benchmark Generation Using FEASIBLE. Demo at International Semantic Web Conference (ISWC), 2015.
7. Muhammad Saleem, Muhammad Intizar Ali, Aidan Hogan, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo. The LSQ Dataset: Querying for Queries. Demo at International Semantic Web Conference (ISWC), 2015.
8. Syeda Sana e Zainab, Ali Hasnain, Muhammad Saleem, Qaiser Mehmood, Durre Zehra, and Stefan Decker. SPARQL Query Formulation and Execution using FedViz. Demo at International Semantic Web Conference (ISWC), 2015.
9. Syeda Sana e Zainab, Ali Hasnain, Muhammad Saleem, Qaiser Mehmood, Durre Zehra, and Stefan Decker. FedViz: A Visual Interface for SPARQL Queries Formulation and Execution. VOILA.
96
PUBLICATIONS
2014
1. Yasar Khan, Muhammad Saleem, Aftab Iqbal, Muntazir Mehdi, Aidan Hogan, Panagiotis Hasapis, Axel-Cyrille Ngonga Ngomo, Stefan Decker, and Ratnesh Sahay. SAFE: Policy Aware SPARQL Query Federation Over RDF Data Cubes. Semantic Web Applications and Tools for Life Sciences (SWAT4LS), 2014.
2. Nur Aini Rakhmawati, Muhammad Saleem, Sarasi Lalithsena, and Stefan Decker. QFed: Query Set For Federated SPARQL Query Benchmark. 16th International Conference on Information Integration and Web-based Applications & Services (iiWAS), 2014.
3. Lorenz Bühmann, Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Muhammad Saleem, Andreas Both, Valter Crescenzi, Paolo Merialdo, and Disheng Qiu. Web-Scale Extension of RDF Knowledge Bases from Templated Websites. International Semantic Web Conference (ISWC), 2014.
4. Muhammad Saleem and Axel-Cyrille Ngonga Ngomo. HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation. Extended Semantic Web Conference (ESWC), 2014.
5. Maulik R. Kamdar, Aftab Iqbal, Muhammad Saleem, Helena F. Deus, and Stefan Decker. GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research. CSHALS, 2014 (Best paper award).
6. Muhammad Saleem, Shanmukha Sampath, Axel-Cyrille Ngonga Ngomo, Aftab Iqbal, Jonas Almeida, and Helena Deus. TopFed: TCGA Tailored Federated Query Processing and Linking to LOD. Journal of Biomedical Semantics, 2014.
7. Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo. Big Linked Cancer Data: Integrating Linked TCGA and PubMed. Journal of Web Semantics, 2014.
97
PUBLICATIONS
2009-2013
1. Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo. Fostering Serendipity through Big Linked Data. Semantic Web Challenge at International Semantic Web Conference (ISWC), 2013 (Semantic Web Challenge, Big Data Track winner).
2. Muhammad Saleem, Shanmukha S Padmanabhuni, Axel-Cyrille Ngonga Ngomo, Jonas S Almeida, Stefan Decker, and Helena Deus. Linked Cancer Genome Atlas Database. Linked Data Cup, I-Semantics 2013 (I-CHALLENGE, Linked Data Cup winner).
3. Muhammad Saleem, Axel-Cyrille Ngonga Ngomo, Josian Xavier Pariera, Helena F. Deus, and Manfred Hauswirth. DAW: Duplicate-AWare Federated Query Processing over the Web of Data. International Semantic Web Conference (ISWC), 2013.
4. Muhammad Saleem, Ali Zahir, Yasir Ismail, and Bilal Saeed. Enhanced Generic Information Services Using Mobile Messaging. Grid and Pervasive Computing (GPC), 2010.
5. Muhammad Saleem and Kyung-Goo Doh. Generic Information System Using SMS Gateway. The Fourth International Conference on Computer Sciences and Convergence Information Technology (ICCIT), 2009.
6. Muhammad Saleem, Rasheed Hussain, Yasir Ismail, and Shaikh Mohsin. Cost Effective Software Engineering using Program Slicing Techniques. The 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human (ICIS), 2009.
98
ADDITIONAL SLIDES
99
STATE-OF-THE-ART: SPARQL
FEDERATION APPROACHES
• SPARQL Endpoint Federation (SEF)
• Linked Data Federation (LDF)
• Distributed Hash Tables (DHTs)
• Hybrid of SEF+LDF
100
Saleem et al. A Fine-Grained Evaluation of SPARQL Endpoint
Federation Systems (Semantic Web Journal, 2015)
STATE-OF-THE-ART: SOURCE
SELECTION
• Index-only
• Index-free (SPARQL ASK queries)
• Hybrid (index + SPARQL ASK queries)
101
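To make the index-free category concrete, here is a minimal Python sketch of triple pattern-wise source selection driven purely by SPARQL ASK queries. The endpoint URLs, the `TOY_PREDICATES` table, and the `toy_ask` callback are hypothetical stand-ins for real endpoint calls; none of them come from the slides, and the engines discussed here add caching and further pruning on top of this basic idea.

```python
from typing import Callable, Dict, List, Tuple

TriplePattern = Tuple[str, str, str]  # (subject, predicate, object)


def ask_query(tp: TriplePattern) -> str:
    """Build a SPARQL ASK query for one triple pattern (PREFIX lines omitted)."""
    s, p, o = tp
    return f"ASK {{ {s} {p} {o} }}"


def select_sources(
    patterns: List[TriplePattern],
    endpoints: List[str],
    ask: Callable[[str, str], bool],
) -> Dict[TriplePattern, List[str]]:
    """Index-free selection: send one ASK per (triple pattern, endpoint) pair."""
    return {
        tp: [ep for ep in endpoints if ask(ep, ask_query(tp))]
        for tp in patterns
    }


# Hypothetical stand-in for a network call: each toy endpoint "knows" some predicates.
TOY_PREDICATES = {
    "http://example.org/dbpedia/sparql": {"rdf:type", "dbpedia:party", "owl:sameAs"},
    "http://example.org/nyt/sparql": {"nyt:topicPage", "owl:sameAs"},
}


def toy_ask(endpoint: str, query: str) -> bool:
    """Answer an ASK query by looking only at its predicate (toy behaviour)."""
    predicate = query.split()[3]  # tokens: ASK { s p o }
    return predicate in TOY_PREDICATES[endpoint]
```

In a real deployment `ask` would issue an HTTP request per pair, which is why the number of ASK requests is reported as a cost in the comparisons that follow.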
STATE-OF-THE-ART
102
STATE-OF-THE-ART
103
HIBISCUS: DATA SUMMARIES
104
[] a ds:Service ;
   ds:endpointUrl <http://dbpedia.org/sparql> ;
   ds:capability [
      ds:predicate dbpedia:party ;
      ds:sbjAuthority <http://dbpedia.org/> ;
      ds:objAuthority <http://dbpedia.org/> ;
   ] ;
   ds:capability [
      ds:predicate rdf:type ;
      ds:sbjAuthority <http://dbpedia.org/> ;
      ds:objAuthority owl:Thing, dbpedia:President ; # we store all distinct classes
   ] ;
   ds:capability [
      ds:predicate dbpedia:postalCode ;
      ds:sbjAuthority <http://dbpedia.org/> ;
      # No objAuthority, as the object value for dbpedia:postalCode is a string
   ] .
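As a rough illustration of how such a summary can be consulted, the Python sketch below extracts the authority (scheme plus host) of a URI and checks it against the stored subject and object authorities of a capability. The dictionary layout and the `can_answer` helper are assumptions made for this example; the actual HiBISCuS pruning works on a hypergraph of the whole query and is join-aware, which this lookup does not attempt to reproduce.

```python
from typing import Dict, Optional, Set
from urllib.parse import urlparse


def authority(uri: str) -> str:
    """Return scheme://host/ of a URI, e.g. http://dbpedia.org/."""
    parts = urlparse(uri)
    return f"{parts.scheme}://{parts.netloc}/"


# Hypothetical in-memory form of one endpoint's summary:
# predicate -> subject authorities and object authorities.
Capability = Dict[str, Set[str]]
DBPEDIA_SUMMARY: Dict[str, Capability] = {
    "dbpedia:party": {
        "sbj": {"http://dbpedia.org/"},
        "obj": {"http://dbpedia.org/"},
    },
    "dbpedia:postalCode": {
        "sbj": {"http://dbpedia.org/"},
        "obj": set(),  # literal objects carry no authority
    },
}


def can_answer(
    summary: Dict[str, Capability],
    predicate: str,
    subject: Optional[str] = None,
    obj: Optional[str] = None,
) -> bool:
    """Check a triple pattern against a summary; None means an unbound variable."""
    cap = summary.get(predicate)
    if cap is None:
        return False  # the endpoint never uses this predicate
    if subject is not None and authority(subject) not in cap["sbj"]:
        return False
    if obj is not None and authority(obj) not in cap["obj"]:
        return False
    return True
```

Because the check is a local lookup, it needs no request to the endpoint, at the price of keeping the summaries up to date.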
EFFICIENT SOURCE SELECTION
        FedX (warm)        SPLENDID           DARQ               ANAPSID            HiBISCuS (warm)
Query    #TP  #AR    SST    #TP  #AR    SST    #TP  #AR    SST    #TP  #AR    SST    #TP  #AR    SST
CD        78    0   7.33     78   99  320.9     84    0  7.286     36   43    186     35    0  30.43
LS        56    0   7.99     56   90  307.3     77    0  7.571     44   63  477.4     41    0  23.14
LD        97    0   8.09     97  126    279    113    0  7.727     54   37  803.5     47    0     16
Net      231    0      8    231  315    299    274    0   7.56    134  143    554    123    0     22
(#TP: triple pattern-wise sources selected; #AR: SPARQL ASK requests; SST: source selection time)
105
DAW: DATA SUMMARIES
106
[] a sd:Service ;
   sd:endpointUrl <http://localhost:8890/sparql> ;
   sd:capability [
      sd:predicate diseasome:name ;
      sd:totalTriples 147 ;
      sd:avgSbjSel "0.0068" ;
      sd:avgObjSel "0.0069" ;
      sd:MIPs "-6908232 -7090543 -6892373 -7064247 ..." ;
   ] ;
   sd:capability [
      sd:predicate diseasome:chromosomalLocation ;
      sd:totalTriples 160 ;
      sd:avgSbjSel "0.0062" ;
      sd:avgObjSel "0.0072" ;
      sd:MIPs "-7056448 -7056410 -6845713 -6966021 ..." ;
   ] .
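The `sd:MIPs` values are min-wise independent permutation vectors, a compact sketch from which the overlap between two sources can be estimated without shipping their triples. The Python sketch below shows the general technique only: the linear hash family, the CRC32 item hashing, and all function names are assumptions chosen for the example and are not taken from the DAW implementation.

```python
import random
import zlib
from typing import Iterable, List, Tuple

PRIME = 2_147_483_647  # 2^31 - 1, modulus of the linear hash family


def make_permutations(k: int, seed: int = 42) -> List[Tuple[int, int]]:
    """Draw k random (a, b) pairs; each defines x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def mips_vector(items: Iterable[str], perms: List[Tuple[int, int]]) -> List[int]:
    """Keep, per permutation, the minimum hash over all items (items must be non-empty)."""
    ids = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in perms]


def estimated_resemblance(v1: List[int], v2: List[int]) -> float:
    """Fraction of matching positions, an estimate of the Jaccard overlap."""
    matches = sum(1 for x, y in zip(v1, v2) if x == y)
    return matches / len(v1)
```

Two sources whose vectors agree in most positions hold largely the same triples for that predicate, so a duplicate-aware selector can skip the one that would contribute few new results.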
107
SOURCE RANKING VS. RECALL
[Figure: recall (in %) against ranked sources for DAW and the optimal ranking; panels: Diseasome, Publication]
TRIPLE STORE BENCHMARKS
Synthetic Benchmarks
• Use synthetic queries and/or data
• Suitable for testing scalability
• Often fail to reflect real datasets
• Examples: LUBM, SP2Bench, BSBM, WatDiv
Query Log Benchmarks
• Use real queries from query logs
• Can be closer to reality
• Scalability can be tested
• Examples: DBPSB, FEASIBLE
108
FEASIBLE: COMPOSITE ERROR
ESTIMATION
109
L is the query log, B is the benchmark and K is the set of all features
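The formula on this slide did not survive extraction. As a hedged sketch, the Python code below assumes the definition as recalled from the FEASIBLE paper: a mean error and a standard-deviation error, each averaged over the features in K, combined by their harmonic mean. The input layout (feature name to list of values), the use of the population standard deviation, and the function name are assumptions of this example, so check them against the paper before relying on the numbers.

```python
from statistics import mean, pstdev
from typing import Dict, List

FeatureValues = Dict[str, List[float]]  # feature name -> one value per query


def composite_error(log: FeatureValues, benchmark: FeatureValues) -> float:
    """Harmonic mean of the mean error and the standard-deviation error.

    Assumed form: E_mu = (1/|K|) * sum_i (mu_i(L) - mu_i(B))^2,
    E_sigma likewise with standard deviations, E = 2*E_mu*E_sigma / (E_mu + E_sigma).
    """
    features = sorted(log)  # K: the set of all features
    k = len(features)
    e_mu = sum((mean(log[f]) - mean(benchmark[f])) ** 2 for f in features) / k
    e_sigma = sum((pstdev(log[f]) - pstdev(benchmark[f])) ** 2 for f in features) / k
    if e_mu + e_sigma == 0:
        return 0.0  # benchmark matches the log on both statistics
    return 2 * e_mu * e_sigma / (e_mu + e_sigma)
```

A smaller value means the benchmark B reproduces the per-feature statistics of the query log L more closely.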
LARGERDFBENCH DATASETS
STATISTICS
110
SPARQL 1.1 QUERIES RUNTIME
RESULTS
111
[Figure: runtime in msec (log scale) for queries S1-S14 and their average; series: FedX (100% cached) and ANAPSID; annotation: ANAPSID vs. FedX, 8/14]
SPARQL 1.1 QUERIES RUNTIME
RESULTS
112
[Figure: runtime in msec (log scale) for queries C1-C10 and their average; series: FedX (100% cached) and ANAPSID; several runs are marked as runtime error or timeout; annotation: FedX vs. ANAPSID, 6/8]
CONCLUSION
• HIBISCUS: Hyper graph-based source selection
  • FedX: 20/25 queries, net improvement 24.61%
  • SPLENDID: 24/25 queries, net improvement 82.75%
  • DARQ: 20/20 queries, net improvement 92.22%
• DAW: Duplicate-aware source selection
  • FedX: 63/79 queries, net improvement 9.79%
  • SPLENDID: 66/79 queries, net improvement 11.11%
  • DARQ: 70/79 queries, net improvement 16.46%
• SAFE: Policy-aware source selection
  • FedX: 12/12 queries, net improvement 84%
• TopFed: Data distribution-aware selection
  • FedX: 10/10 queries, net improvement 68%
113
• Join-aware source selection leads to
  • Efficient query planning
  • Reduced intermediate results
  • Decreased overall runtime
• FEASIBLE: Triple Store Benchmark
  • FEASIBLE composite error is 55% smaller than that of DBPSB
  • New insights on performance of triple stores
• LargeRDFBench
  • Benchmarks with only simple queries are not sufficient
  • Ranking changes from simple to complex queries
Efficient source selection for sparql endpoint federation

  • 1. EFFICIENT SOURCE SELECTION FOR SPARQL ENDPOINT QUERY FEDERATION. PhD Defense, May 13th, 2016. Muhammad Saleem, Faculty of Mathematics and Computer Science, University of Leipzig. Supervisors: Prof. Dr.-Ing. habil. Klaus-Peter Fähnrich, University of Leipzig; Dr. Axel-Cyrille Ngonga Ngomo, University of Leipzig.
  • 2. OUTLINE: 1. Introduction; 2. Problem Statement; 3. State-of-the-art Analysis; 4. HiBISCuS: Hypergraph-based source selection; 5. DAW: Duplicate-aware source selection; 6. SAFE: Policy-aware source selection; 7. TopFed: Data distribution-aware source selection; 8. FEASIBLE and LSQ; 9. LargeRDFBench; 10. Conclusion; 11. Publications and Awards.
  • 3. INTRODUCTION: linked, decentralized and distributed architecture; 9,960 datasets; ~150B triples; complex information needs; need for federated queries.
  • 4. INTRODUCTION: EXAMPLE. Return the party membership and news pages about all US presidents. One source holds US presidents and their party memberships; the other holds US presidents and news pages. Computing the results requires data from both sources.
  • 5. INTRODUCTION: EXECUTION OF FEDERATION. Federation engine over the SPARQL endpoints S1-S4 (RDF): Parsing/Rewriting (rewrite the query and get the individual triple patterns), Source Selection (identify capable/relevant sources), Federator Optimizer (generate an optimized query execution plan), execution of the sub-queries, and Integrator (integrate the sub-query results).
  • 6-10. MOTIVATION: SOURCE SELECTION. FedBench (LD3): Return for all US presidents their party membership and news pages about them.
        SELECT ?president ?party ?page
        WHERE {
          ?president rdf:type dbpedia:President .                  # TP1
          ?president dbpedia:nationality dbpedia:United_States .   # TP2
          ?president dbpedia:party ?party .                        # TP3
          ?x nyt:topicPage ?page .                                 # TP4
          ?x owl:sameAs ?president .                               # TP5
        }
    Sources: S1 = DBpedia, S2 = KEGG, S3 = ChEBI, S4 = NYT, S5 = SWDF, S6 = LMDB, S7 = Jamendo, S8 = Geo, S9 = DrugBank.
    Triple pattern-wise source selection (one triple pattern added per slide): TP1 = {S1}; TP2 = {S1}; TP3 = {S1}; TP4 = {S4}; TP5 = {S1, S2, S4, S5-S9}.
    Total triple pattern-wise sources selected = 1+1+1+1+8 = 12.
  • 11. MOTIVATION: ANYTHING WRONG? For the same FedBench LD3 query, triple pattern-wise source selection picks TP1 = {S1}, TP2 = {S1}, TP3 = {S1}, TP4 = {S4}, TP5 = {S1, S2, S4, S5-S9}. Only S4 is needed for TP5, so the total number of triple pattern-wise sources that should be selected is 1+1+1+1+1 = 5. The overestimated sources produce 317,068 irrelevant intermediate results.
  • 12. PROBLEM STATEMENT. Overestimation of sources is expensive: extra intermediate results, extra network traffic, and increased overall runtime.
    1. How to perform join-aware source selection with ensured result set completeness?
    2. How to test the efficiency of the source selection? This calls for comprehensive benchmarks that answer: Which system is better, and why? What are the limitations of a given system? How can one improve a given system?
    3. How to design comprehensive benchmarks for federated SPARQL engines as well as triple stores?
  • 13. STATE-OF-THE-ART. Saleem et al. A Fine-Grained Evaluation of SPARQL Endpoint Federation Systems (Semantic
  • 14. PROBLEM STATEMENT AND CONTRIBUTIONS. Research questions:
    1. How to perform join-aware source selection with ensured result set completeness?
    2. How to perform duplicate-aware source selection?
    3. How to perform policy-aware source selection?
    4. How to perform data distribution-aware source selection?
    5. How to design comprehensive benchmarks for federated SPARQL engines as well as triple stores?
    Contribution labels on the federation engine diagram (Parsing/Rewriting, Source Selection, Federator Optimizer, Integrator over endpoints S1-S4): HiBISCuS, DAW, SAFE, TopFed; QUETSAL, LargeRDFBench, State-of-the-art Evaluation.
  • 16. MOTIVATION: JOIN-AWARE SOURCE SELECTION. For the FedBench LD3 query (all US presidents, their party membership and news pages about them), TP5 (?x owl:sameAs ?president) matches S1, S2, S4 and S5-S9, but only S4 is kept, giving TP1 = {S1}, TP2 = {S1}, TP3 = {S1}, TP4 = {S4}, TP5 = {S4}. Total triple pattern-wise sources selected = 1+1+1+1+1 = 5.
  • 17. HIBISCUS: HYPERGRAPH-BASED SOURCE SELECTION. Models SPARQL queries as hypergraphs; makes use of URI authorities in its index; performs join-aware triple pattern-wise source selection; can be combined with any existing SPARQL endpoint federation system. Muhammad Saleem, Axel-Cyrille Ngonga Ngomo. HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation (ESWC, 2014)
  • 18. HIBISCUS: HYPERGRAPH-BASED SOURCE SELECTION. Makes use of the URI authorities. Example: http://dbpedia.org/ontology/party has scheme "http", authority "dbpedia.org" and path "/ontology/party".
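
The scheme/authority/path split on this slide can be reproduced with the Python standard library. This is only a small illustration of the URI decomposition; the helper name `authority` is hypothetical and is not taken from HiBISCuS itself.

```python
from urllib.parse import urlsplit


def authority(uri: str) -> str:
    """Return scheme plus authority of a URI, e.g. 'http://dbpedia.org'."""
    parts = urlsplit(uri)
    return f"{parts.scheme}://{parts.netloc}"


parts = urlsplit("http://dbpedia.org/ontology/party")
print(parts.scheme)   # scheme
print(parts.netloc)   # authority
print(parts.path)     # path
```

An index keyed on such authorities stays small because many URIs of a dataset share the same authority.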
  • 19-23. HIBISCUS: HYPERGRAPH-BASED SOURCE SELECTION. The LD3 query (SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . }) is turned into a hypergraph one triple pattern per slide. [Figure: hypergraph with the vertices ?president, rdf:type, dbpedia:President, dbpedia:nationality, dbpedia:United_States, dbpedia:party, ?party, ?x, nyt:topicPage, ?page and owl:sameAs; annotations: star, simple and hybrid vertices, tail of hyperedge.]
  • 24-25. HIBISCUS: HYPERGRAPH-BASED SOURCE SELECTION. [Figure: the LD3 query hypergraph annotated with the candidate sources (DBpedia, KEGG, NYT, SWDF, LMDB, Geo, Jamendo, DrugBank) and their subject authorities (Sbj. auth.) and object authorities (Obj. auth.).]
  • 26. HIBISCUS: HYPERGRAPH-BASED SOURCE SELECTION. Result for the LD3 query: total triple pattern-wise sources selected = 5 instead of 12.
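
The drop from 12 to 5 sources can be illustrated with a toy authority-based pruning step. The sketch below is not the HiBISCuS algorithm as published: the candidate sets come from the slides, but the authority index (`OWN`, `AUTH`), the `*.example` host names and the `prune` function are hypothetical, and only the joins on ?x (TP4/TP5) and ?president (TP1/TP5) are modelled.

```python
# Candidate sources per triple pattern, as on the motivation slides (12 in total).
CANDIDATES = {
    "TP1": {"DBpedia"}, "TP2": {"DBpedia"}, "TP3": {"DBpedia"}, "TP4": {"NYT"},
    "TP5": {"DBpedia", "KEGG", "NYT", "SWDF", "LMDB", "Jamendo", "Geo", "DrugBank"},
}

# Hypothetical authorities under which each source mints its own resources.
OWN = {
    "DBpedia": "dbpedia.org", "NYT": "data.nytimes.com", "KEGG": "kegg.example",
    "SWDF": "swdf.example", "LMDB": "lmdb.example", "Jamendo": "jamendo.example",
    "Geo": "geo.example", "DrugBank": "drugbank.example",
}

# Hypothetical index: (triple pattern, position) -> source -> authorities seen there.
AUTH = {
    ("TP1", "s"): {"DBpedia": {"dbpedia.org"}},
    ("TP4", "s"): {"NYT": {"data.nytimes.com"}},
    ("TP5", "s"): {src: {host} for src, host in OWN.items()},
    ("TP5", "o"): {**{src: {host} for src, host in OWN.items()},
                   "NYT": {"dbpedia.org"}, "DBpedia": {"freebase.example"}},
}

# Joins: ?x is subject of TP4 and TP5; ?president is subject of TP1, object of TP5.
JOINS = [(("TP4", "s"), ("TP5", "s")), (("TP1", "s"), ("TP5", "o"))]


def prune(candidates, auth, joins):
    """Drop sources whose authorities cannot match the other side of a join."""
    selected = {tp: set(srcs) for tp, srcs in candidates.items()}
    changed = True
    while changed:
        changed = False
        for (tp_a, pos_a), (tp_b, pos_b) in joins:
            union_a = set().union(*(auth[tp_a, pos_a][s] for s in selected[tp_a]))
            union_b = set().union(*(auth[tp_b, pos_b][s] for s in selected[tp_b]))
            for tp, pos, other in ((tp_a, pos_a, union_b), (tp_b, pos_b, union_a)):
                keep = {s for s in selected[tp] if auth[tp, pos][s] & other}
                if keep != selected[tp]:
                    selected[tp] = keep
                    changed = True
    return selected


pruned = prune(CANDIDATES, AUTH, JOINS)
print(sum(len(s) for s in CANDIDATES.values()), "->", sum(len(s) for s in pruned.values()))
```

With this toy index only NYT survives for TP5, because it is the only source whose owl:sameAs subjects share an authority with the TP4 subjects.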
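  • 27. EFFICIENT SOURCE SELECTION. Per FedBench query category (CD, LS, LD) and in total (Net): #TP = total triple pattern-wise sources selected, #AR = SPARQL ASK requests, SST = source selection time.
    | System           | CD #TP | CD #AR | CD SST | LS #TP | LS #AR | LS SST | LD #TP | LD #AR | LD SST | Net #TP | Net #AR | Net SST |
    | FedX (warm)      | 78     | 0      | 7.33   | 56     | 0      | 7.99   | 97     | 0      | 8.09   | 231     | 0       | 8       |
    | SPLENDID         | 78     | 99     | 320.9  | 56     | 90     | 307.3  | 97     | 126    | 279    | 231     | 315     | 299     |
    | DARQ             | 84     | 0      | 7.286  | 77     | 0      | 7.571  | 113    | 0      | 7.727  | 274     | 0       | 7.56    |
    | ANAPSID          | 36     | 43     | 186    | 44     | 63     | 477.4  | 54     | 37     | 803.5  | 134     | 143     | 554     |
    | HiBISCuS (warm)  | 35     | 0      | 30.43  | 41     | 0      | 23.14  | 47     | 0      | 16     | 123     | 0       | 22      |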
  • 28. FEDX EXTENSION WITH HIBISCUS. [Bar chart: query execution time (msec) for the FedBench queries CD1-CD7, LS1-LS7, LD1-LD11 and their average; series: FedX (warm), FedX+HiBISCuS.] Improvement in 20/25 queries with a net performance improvement of 24.61%.
  • 29. SPLENDID EXTENSION WITH HIBISCUS. [Bar chart: query execution time (msec) for CD1-CD7, LS1-LS7, LD1-LD11 and their average; series: SPLENDID, SPLENDID+HiBISCuS.] Improvement in 24/25 queries with a net performance improvement of 82.72%.
  • 30. DARQ EXTENSION WITH HIBISCUS. [Bar chart: query execution time (msec, log scale) for CD1-CD7, LS1-LS7, LD1-LD11 and their average; several queries are marked "not supported", "runtime error" or "timeout".] Improvement in 20/20 queries with a net performance improvement of 92.22%.
  • 31. SPLENDID+HIBISCUS VS. ANAPSID. [Bar chart: query execution time (msec, log scale) for CD1-CD7, LS1-LS7, LD1-LD11 and their average; series: ANAPSID, SPLENDID+HiBISCuS; one query is marked "zero results".] Improvement in 25/25 queries with a net performance improvement of 98%.
  • 33. DAW: DUPLICATE-AWARE SOURCE SELECTION. Triple pattern-wise source selection and skipping for TP1 (?uri <p1> ?v1) and TP2 (?uri <p2> ?v2): TP1 = {S1, S2, S3}; TP2 = {S1, S2, S4}. Minimum number of new triples (threshold) = 20. Total triple pattern-wise selected sources = 4; total triple pattern-wise skipped sources = 2.
  • 34. DAW: DUPLICATE-AWARE SOURCE SELECTION. Combines min-wise independent permutations (MIPs) with compact data summaries; uses average selectivity values for bound subjects and objects; can be combined with any existing SPARQL endpoint federation system; can be used for partial result retrieval. Saleem et al. DAW: Duplicate-AWare Federated Query Processing over the Web of Data (ISWC, 2013)
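
The selected/skipped counts on this slide follow from a greedy rule: keep adding the source that contributes the most new triples and stop once the best remaining contribution falls below the threshold. The sketch below uses exact sets of made-up triple IDs in place of DAW's estimates; the function name `daw_select` and all data are hypothetical.

```python
def daw_select(sources, threshold):
    """Greedily select sources until the best one adds fewer than `threshold` new triples."""
    covered, selected = set(), []
    remaining = dict(sources)
    while remaining:
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        if len(remaining[best] - covered) < threshold:
            break  # every remaining source is (mostly) duplicates: skip them all
        covered |= remaining.pop(best)
        selected.append(best)
    return selected, sorted(remaining)


# Made-up triple IDs per source for the two triple patterns.
TP1 = {"S1": set(range(0, 100)), "S2": set(range(90, 105)), "S3": set(range(200, 260))}
TP2 = {"S1": set(range(0, 50)), "S2": set(range(30, 100)), "S4": set(range(95, 110))}

sel1, skip1 = daw_select(TP1, threshold=20)
sel2, skip2 = daw_select(TP2, threshold=20)
print(sel1, skip1)
print(sel2, skip2)
```

With these toy sets, four triple pattern-wise sources are selected and two are skipped, the same totals as on the slide.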
  • 35. DAW: MIN-WISE INDEPENDENT PERMUTATIONS. Each triple T(s, p, o) is mapped to an ID by h(concat(s, o)). N permutations of the form $h_i(x) = (a_i x + b_i) \bmod U$ are applied to all IDs of the ID set, for example $h_1 = (7x + 3) \bmod 51$, $h_2 = (5x + 6) \bmod 51$, ..., $h_N = (3x + 9) \bmod 51$, and the MIP vector is created from the minima of the permutations.
    MIPs estimated operations:
    $$\mathrm{Resemblance}(S_A, S_B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} \approx \frac{|V_A \cap V_B|}{N}$$
    $$\mathrm{Overlap}(S_A, S_B) \approx \frac{\mathrm{Resemblance}(V_A, V_B) \times (|S_A| + |S_B|)}{\mathrm{Resemblance}(V_A, V_B) + 1}$$
    $$\text{Error estimation} = O(1/\sqrt{N})$$
    Example: $V_A = (8, 9, 30, 24, 36, 9)$, $V_B = (8, 24, 20, 48, 36, 13)$; Union$(V_A, V_B) = (8, 9, 20, 24, 36, 9)$; Resemblance$(V_A, V_B) = 2/6 \approx 0.33$; Overlap$(V_A, V_B) = 0.33 \times (6 + 6) / (1 + 0.33) \approx 3$.
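
The slide's numbers can be checked with a few lines of Python. This is a minimal sketch of the MIP operations as stated on the slide, not DAW's implementation; the function names are hypothetical, and the ID set `[1, 2, 3]` at the end is made up purely to show how a vector is built.

```python
def mip_vector(ids, hash_params, modulus):
    """One entry per permutation h(x) = (a*x + b) mod U: the minimum over all IDs."""
    return [min((a * x + b) % modulus for x in ids) for a, b in hash_params]


def resemblance(va, vb):
    """Fraction of positions where the two MIP vectors agree."""
    return sum(1 for x, y in zip(va, vb) if x == y) / len(va)


def overlap(res, size_a, size_b):
    """Estimated number of shared elements, given the resemblance and both set sizes."""
    return res * (size_a + size_b) / (1 + res)


def union_vector(va, vb):
    """MIP vector of the union: position-wise minimum."""
    return [min(x, y) for x, y in zip(va, vb)]


# MIP vectors from the slide.
VA = [8, 9, 30, 24, 36, 9]
VB = [8, 24, 20, 48, 36, 13]

r = resemblance(VA, VB)
print(union_vector(VA, VB))
print(round(r, 2), round(overlap(r, 6, 6), 2))

# Building a vector from a made-up ID set with the slide's three permutations.
print(mip_vector([1, 2, 3], [(7, 3), (5, 6), (3, 9)], 51))
```

Only the fixed-length vectors need to be stored and compared, which is what makes the overlap estimate cheap relative to comparing the full triple sets.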
  • 36. FEDX EXTENSION WITH DAW. [Bar chart: execution time (sec) per query (STP, S-1, S-2, P-1, P-2, P-3) for the Diseasome, Publication, Geo Data and Movie datasets; series: FedX, DAW.] Overall performance evaluation (average execution time, gain in %):
    | System | Diseasome avg | Gain % | Publication avg | Gain % | Geo Data avg | Gain % | Movie avg | Gain % | Overall avg | Gain % |
    | FedX   | 2.44          | 18.79  | 1.48            | -12.38 | 4.60         | 14.71  | 1.74      | 7.59   | 2.44        | 9.76   |
    | DAW    | 1.98          |        | 1.67            |        | 3.92         |        | 1.61      |        | 2.20        |        |
  • 37. SPLENDID EXTENSION WITH DAW. [Bar chart: execution time (sec) per query for the Diseasome, Publication, Geo and Movie datasets; series: SPLENDID, DAW.] Overall performance evaluation (average execution time, gain in %):
    | System   | Diseasome avg | Gain % | Publication avg | Gain % | Geo Data avg | Gain % | Movie avg | Gain % | Overall avg | Gain % |
    | SPLENDID | 3.78          | 19.48  | 2.18            | -8.94  | 7.27         | 14.40  | 1.9       | 11.16  | 3.71        | 11.11  |
    | DAW      | 3.04          |        | 2.37            |        | 6.22         |        | 1.688     |        | 3.30        |        |
  • 38. DARQ EXTENSION WITH DAW. [Bar chart: execution time (sec) per query for the Diseasome, Publication, Geo and Movie datasets; series: DARQ, DAW.] Overall performance evaluation (average execution time, gain in %):
    | System | Diseasome avg | Gain % | Publication avg | Gain % | Geo Data avg | Gain % | Movie avg | Gain % | Overall avg | Gain % |
    | DARQ   | 8.27          | 23.34  | 5.26            | 6.14   | 23.44        | 16.31  | 1.96      | 13.88  | 9.59        | 16.46  |
    | DAW    | 6.34          |        | 4.94            |        | 19.62        |        | 1.688     |        | 8.01        |        |
  • 40. SAFE: POLICY-AWARE SOURCE SELECTION. Example query: return the number of patients that have been administered the drug Insulin and exhibit BMI > 25 and Hypertension and Diabetes as adverse events. Data sources: Switzerland, Cyprus, Greece. Yasar et al. SAFE: Policy Aware SPARQL Query Federation Over RDF Data
  • 42. SAFE: POLICY-AWARE SOURCE SELECTION. Access Policy Framework: Source Selection, Access Policy Filtering, Query Execution. Inputs: the user (Oya, clinical researcher, expertise: diabetes) and the requested data. Each of the sources S1, S2, S3 either grants or denies access.
  • 43. SAFE: SOURCE SELECTION EVALUATION.
    Sum of triple pattern-wise sources selected for each query:
    | System | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | Q11 | Q12 | Avg |
    | SAFE   | 8  | 10 | 13 | 16 | 15 | 13 | 15 | 16 | 7  | 7   | 9   | 7   | 11  |
    | FedX   | 9  | 13 | 16 | 24 | 20 | 14 | 16 | 19 | 15 | 17  | 9   | 16  | 16  |
    Number of SPARQL ASK requests used for source selection:
    | System | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | Q11 | Q12 | Avg |
    | SAFE   | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0   | 0   | 0   | 0   |
    | FedX   | 36 | 28 | 40 | 64 | 48 | 40 | 44 | 40 | 21 | 21  | 9   | 21  | 35  |
  • 44. SAFE: QUERY RUNTIME EVALUATION. [Bar chart: query time (msec, log scale) for Q1-Q12 and their average; series: SAFE, FedX.] SAFE is 3.61 times faster than FedX.
  • 46. TOPFED: DATA DISTRIBUTION-AWARE SOURCE SELECTION. Intelligent data distribution combined with efficient source selection to handle federation over Big Data; federation over 20.4 billion triples of Linked TCGA data. Saleem et al. TopFed: TCGA Tailored Federated Query Processing and Linking to
  • 47. TOPFED. Source selection over the tumour-based data distribution. Predicate and class sets:
    - $A = \{\text{chromosome}, \text{result}, \text{bcr\_patient\_barcode}\}$, $G = \{\text{start}, \text{stop}\}$
    - $C = \{\text{CNV}, \text{SNP}, \text{E-Gene}, \text{E-Protein}, \text{miRNA}, \text{Clinical}\}$, $D = \{\text{seg\_mean}, \text{rpmmm}, \text{scaled\_est}, \text{p\_exp\_val}\}$
    - $F = \{\text{Expression-Exon}\}$, $E = \{\text{RPKM}\}$
    - $B = \{\text{DNA-Methylation}\}$, $M = \{\text{beta\_value}, \text{position}\}$
    Conditions, checked for each query triple $t(s, p, o) \in T$:
    $$\text{C-1} = \big(p \in D \cup A \cup G \;\lor\; (p = \text{rdf:type} \land o \in C)\big) \land \big((\text{S-Join}(p, D \cup C) \lor \text{P-Join}(p, D \cup C)) \lor (\lnot\text{S-Join}(p, M \cup B \cup E \cup F) \land \lnot\text{P-Join}(p, M \cup B \cup E \cup F))\big)$$
    $$\text{C-2} = \big(p \in E \cup A \cup G \;\lor\; (p = \text{rdf:type} \land o \in F)\big) \land \big((\text{S-Join}(p, E \cup F) \lor \text{P-Join}(p, E \cup F)) \lor (\lnot\text{S-Join}(p, M \cup B \cup D \cup C) \land \lnot\text{P-Join}(p, M \cup B \cup D \cup C))\big)$$
    $$\text{C-3} = \big(p \in M \cup A \;\lor\; (p = \text{rdf:type} \land o \in B)\big) \land \big((\text{S-Join}(p, M \cup B) \lor \text{P-Join}(p, M \cup B)) \lor (\lnot\text{S-Join}(p, E \cup F \cup D \cup C) \land \lnot\text{P-Join}(p, E \cup F \cup D \cup C))\big)$$
    Routing: C-1 ∨ category colour = blue (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical); C-2 ∨ category colour = pink (Exon-Expression); C-3 ∨ category colour = green (Methylation). If the tumour lookup is successful, forward to the corresponding leaf; else broadcast to every one.
    [Figure: tree over tumours 1-33 with the ranges 1-16, 17-33; 1-5, 6-11, 12-16, 17-22, 23-27, 28-33; 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-27, 28-30, 31-33; leaf SPARQL endpoints b1-b2, p1-p6, g1-g9.]
  • 48. TOPFED VS. FEDX. TopFed selects 50% fewer data sources than FedX without losing recall.
  • 49. TOPFED VS. FEDX. TopFed outperforms FedX significantly on 90% of the queries. On average, the query runtime of TopFed is about 1/3 of that of FedX. [Bar chart: query execution time (ms, log scale) for queries 1-10 and their average; series: FedX (cached), TopFed.]
  • 51. SPARQL BENCHMARKS.
    - Non-federated benchmarks: centralized repositories; queries span a single dataset; real or synthetic. Examples: LUBM, SP2Bench, BSBM, WatDiv, DBPSB, FEASIBLE.
    - Federated benchmarks: multiple interlinked datasets; queries span multiple datasets; real or synthetic. Examples: FedBench, LargeRDFBench.
  • 52. FEASIBLE: BENCHMARK GENERATION FRAMEWORK. Dataset cleaning; feature vectors and normalization; selection of exemplars; selection of benchmark queries. Saleem et al. FEASIBLE: A Featured-Based SPARQL Benchmark Generation
  • 53. FEATURE VECTORS AND NORMALIZATION. Example query: SELECT DISTINCT ?entita ?nome WHERE { ?entita rdf:type dbo:VideoGame . ?entita rdfs:label ?nome FILTER regex(?nome, "konami", "i") } LIMIT 100
    Features: Query Type: SELECT; Results Size: 13; Basic Graph Patterns (BGPs): 1; Triple Patterns: 2; Join Vertices: 1; Mean Join Vertices Degree: 2.0; Mean triple patterns selectivity: 0.01709761619798973; UNION: No; DISTINCT: Yes; ORDER BY: No; REGEX: Yes; LIMIT: Yes; OFFSET: No; OPTIONAL: No; FILTER: Yes; GROUP BY: No; Runtime (ms): 65.
    Feature vector: (13, 1, 2, 1, 2, 0.017, 0, 1, 0, 1, 1, 0, 0, 1, 0, 65).
    Normalized feature vector: (0.11, 0.53, 0.67, 0.14, 0.08, 0.017, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0.14).
  • 54-73. FEASIBLE. Worked example with two features (F1, F2); suppose we need a benchmark of 3 queries.
    | Query | F1   | F2   |
    | Q1    | 0.2  | 0.2  |
    | Q2    | 0.5  | 0.3  |
    | Q3    | 0.8  | 0.3  |
    | Q4    | 0.9  | 0.1  |
    | Q5    | 0.5  | 0.5  |
    | Q6    | 0.2  | 0.7  |
    | Q7    | 0.1  | 0.8  |
    | Q8    | 0.13 | 0.65 |
    | Q9    | 0.9  | 0.5  |
    | Q10   | 0.1  | 0.5  |
    Steps shown on the slides (each slide repeats the scatter plot of Q1-Q10 with both axes from 0 to 1):
    1. Plot the feature vectors in a multidimensional space.
    2. Calculate the average point.
    3. Select the point of minimum Euclidean distance to the average point (red); it is the first exemplar.
    4. Select the point that is farthest from the exemplars; repeat until there are 3 exemplars.
    5. Calculate the distance from Q1 to each exemplar and assign Q1 to the minimum-distance exemplar; repeat the process for Q2, Q3, Q6, Q8, Q9 and Q10.
    6. Calculate the average across each cluster.
    7. Calculate the distance of each point in a cluster to that average.
    8. Select the minimum-distance query (purple) as the final benchmark query of that cluster: Q2 from the yellow cluster, Q3 from the green cluster, Q8 from the brown cluster.
    Our benchmark queries are Q2, Q3, and Q8.
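
The walk-through can be replayed in code on the ten two-feature queries from the slides. This is a sketch rather than the FEASIBLE implementation: it reads "farthest to exemplars" as maximising the distance to the nearest exemplar, and it breaks ties by query order (Q9 is equidistant from two exemplars). Both choices are assumptions. Exact fractions and squared distances keep those ties exact.

```python
from fractions import Fraction

RAW = {
    "Q1": ("0.2", "0.2"), "Q2": ("0.5", "0.3"), "Q3": ("0.8", "0.3"),
    "Q4": ("0.9", "0.1"), "Q5": ("0.5", "0.5"), "Q6": ("0.2", "0.7"),
    "Q7": ("0.1", "0.8"), "Q8": ("0.13", "0.65"), "Q9": ("0.9", "0.5"),
    "Q10": ("0.1", "0.5"),
}
POINTS = {q: tuple(Fraction(v) for v in fs) for q, fs in RAW.items()}


def dist2(a, b):
    """Squared Euclidean distance (same ordering as the distance itself)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise average of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def select_benchmark(points, k):
    names = list(points)
    # First exemplar: the query closest to the average point.
    avg = centroid([points[q] for q in names])
    exemplars = [min(names, key=lambda q: dist2(points[q], avg))]
    # Further exemplars: the query farthest from its nearest exemplar (assumption).
    while len(exemplars) < k:
        rest = [q for q in names if q not in exemplars]
        exemplars.append(max(
            rest, key=lambda q: min(dist2(points[q], points[e]) for e in exemplars)))
    # Assign every query to its nearest exemplar to form clusters.
    clusters = {e: [] for e in exemplars}
    for q in names:
        nearest = min(exemplars, key=lambda e: dist2(points[q], points[e]))
        clusters[nearest].append(q)
    # Final query per cluster: the member closest to the cluster average.
    selected = []
    for e in exemplars:
        c = centroid([points[q] for q in clusters[e]])
        selected.append(min(clusters[e], key=lambda q: dist2(points[q], c)))
    return exemplars, selected


exemplars, selected = select_benchmark(POINTS, 3)
print(exemplars, selected)
```

Under these assumptions the exemplars are Q5, Q4 and Q7, and the selected benchmark queries are Q2, Q3 and Q8, matching the last slide of the walk-through.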
  • 74. COMPARISON OF COMPOSITE ERROR 74 FEASIBLE’s composite error is 54.9% less than DBPSB
  • 75. RANK-WISE RANKING OF TRIPLE STORES. All values are in percentages.
    - No system is the sole winner or loser for a particular rank.
    - Virtuoso mostly lies in the higher ranks, i.e., ranks 1 and 2 (68.29%).
    - Fuseki is mostly in the middle ranks, i.e., ranks 2 and 3 (65.14%).
    - OWLIM-SE is usually on the slower side, i.e., ranks 3 and 4 (60.86%).
    - Sesame is either fast or slow: rank 1 (31.71% of the queries) and rank 4 (23.14%).
  • 77. LARGERDFBENCH. 32 queries: 14 simple, 10 complex, 8 large data. 14 interlinked datasets. [Figure: dataset link graph with Linked MDB, DBpedia, New York Times, Linked TCGA-M, Linked TCGA-E, Linked TCGA-A, Affymetrix, SW Dog Food, KEGG, Drugbank, Jamendo, ChEBI and Geonames, grouped into Life Sciences, Cross Domain and Large Data; link types: basedNear, owl:sameAs, x-geneid, country, ethnicity, race, keggCompoundId, bcr_patient_barcode, same instance; link counts shown: 251.3k, 1.7k, 4.1k, 21.7k, 1.3k.] Saleem et al. LargeRDFBench: A Billion Triples Benchmark for SPARQL Endpoint
  • 80. LARGERDFBENCH QUERIES PROPERTIES.
    - 14 simple: 2-7 triple patterns; subset of SPARQL clauses; query execution time around 2 seconds on average.
    - 10 complex: 8-13 triple patterns; use more SPARQL clauses; query execution time up to 10 min.
    - 8 large data: minimum 80,459 results; large intermediate results; query execution time in hours.
  • 82. RESULT SET COMPLETENESS AND CORRECTNESS
  • 83. QUERIES RUNTIME RESULTS. [Bar chart: query time (msec, log scale) for the simple queries S1-S14 and their average; series: FedX (cold), FedX (100% cached), SPLENDID, ANAPSID, FedX+HiBISCuS, SPLENDID+HiBISCuS.] Order shown: FedX+HiBISCuS, FedX; SPLENDID+HiBISCuS; ANAPSID; SPLENDID, annotated with 12/14, 8/14 and 10/14.
  • 84. QUERIES RUNTIME RESULTS. [Bar chart: query time (msec, log scale) for the complex queries C1-C10 and their average; same series; three queries are marked "runtime error".] Order shown: ANAPSID; SPLENDID+HiBISCuS; FedX+HiBISCuS, FedX; SPLENDID, annotated with 4/7, 5/7 and 5/7.
  • 86-90. CONCLUSIONS. (Built up over the federation engine diagram: Parsing/Rewriting, Source Selection, Federator Optimizer, Integrator over endpoints S1-S4.)
    - Better source selection leads to overall improvement of runtime performance: HiBISCuS 24.61%-92.22%; DAW 9.79%-16.46%; SAFE 84%; TopFed 68%.
    - Better benchmarking allows for informed selection of RDF stores: 55% less error than DBPSB; column stores (Virtuoso) are not always best.
    - LargeRDFBench addresses drawbacks of current federated benchmarks: SPARQL features; size of intermediate results; total runtime of queries.
    - The contributions allow for: informed selection of triple stores and of federation engines; better source selection; efficient query planning; reduction of intermediate results; time-efficient query execution.
  • 91. FUTURE DIRECTIONS. Top-K relevant source selection; cost-based query planning; caching of intermediate results; intelligent data distribution; provenance and runtime estimation; federated benchmarks generated from query logs; synthetic benchmarks that are more like real benchmarks.
  • 92. AWARDS.
    1. Best paper award at the Conference on Semantics in Healthcare and Life Sciences (CSHALS 2014) for the paper "GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research".
    2. Semantic Web Challenge (Big Data Track) winner at ISWC 2013 for the paper "Fostering Serendipity through Big Linked Data".
    3. I-CHALLENGE (Linked Data Cup) winner at I-Semantics 2013 for the paper "Linked Cancer Genome Atlas Database".
  • 93. PUBLICATIONS AND CITATIONS. Total publications: 25. 5 journals (I.F. 2.55, 2.55, 2.26, 0.44); 10 conferences (5 A-ranked, CORE); 4 workshops; 2 tutorials (A-ranked, CORE); 1 technical report; 3 demos (A-ranked, CORE).
  • 95. PUBLICATIONS 2016 1. Muhammad Saleem, Ricardo Usbeck, Michael Roder, and Axel-Cyrille Ngonga Ngomo SPARQL Querying Benchmarks Tutorial at International Semantic Web Conference (ISWC), 2015 2. Ethem Cem Ozkan, Muhammad Saleem, Erdogan Dogdu, and Axel-Cyrille Ngonga Ngomo UPSP: Unique Predicate-based Source Selection for SPARQL Endpoint Federation PROFILES at Extended Semantic Web Conference (ESWC), 2016 95
  • 96. PUBLICATIONS 2015 1. Muhammad Saleem, Yasar Khan, Ali Hasnain, Ivan Ermilov, and Axel-Cyrille Ngonga Ngomo A Fine- Grained Evaluation of SPARQL Endpoint Federation Systems Semantic Web Journal, 2015 2. Muhammad Saleem, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo FEASIBLE: A Featured-Based SPARQL Benchmark Generation Framework International Semantic Web Conference (ISWC), 2015 3. Muhammad Saleem, Muhammad Intizar Ali, Ruben Verborgh, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo LSQ: The Linked SPARQL Queries Dataset International Semantic Web Conference (ISWC), 2015 4. Muhammad Saleem, Muhammad Intizar Ali, Ruben Verborgh, andAxel-Cyrille Ngonga Ngomo Federated Query Processing over Linked Data Tutorial at International Semantic Web Conference (ISWC), 2015 5. Muhammad Saleem, Intizar Ali, Aidan Hogan,Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo LSQ: The Linked SPARQL Queries Dataset Technical Report LSQ Technical Report 6. Muhammad Saleem, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo Automatic SPARQL Benchmark Generation Using FEASIBLE Demo at International Semantic Web Conference (ISWC), 2015 7. Muhammad Saleem, Muhammad Intizar Ali, Aidan Hogan, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo The LSQ Dataset: Querying for Queries Demo at International Semantic Web Conference (ISWC), 2015 8. Syeda Sana e Zainab, Ali Hasnain, Muhammad Saleem, Qaiser Mehmood, Durre Zehra, and Stefan Decker SPARQL Query Formulation and Execution using FedViz Demo at International Semantic Web Conference (ISWC), 2015 9. Syeda Sana e Zainab, Ali Hasnain, Muhammad Saleem, Qaiser Mehmood, Durre Zehra, and Stefan Decker FedViz: A Visual Interface for SPARQL Queries Formulation and Execution VOILA 96
  • 97. PUBLICATIONS 2014
    1. Yasar Khan, Muhammad Saleem, Aftab Iqbal, Muntazir Mehdi, Aidan Hogan, Panagiotis Hasapis, Axel-Cyrille Ngonga Ngomo, Stefan Decker, and Ratnesh Sahay. SAFE: Policy Aware SPARQL Query Federation Over RDF Data Cubes. Semantic Web Applications and Tools for Life Sciences (SWAT4LS), 2014
    2. Nur Aini Rakhmawati, Muhammad Saleem, Sarasi Lalithsena, and Stefan Decker. QFed: Query Set For Federated SPARQL Query Benchmark. 16th International Conference on Information Integration and Web-based Applications & Services (iiWAS), 2014
    3. Lorenz Bühmann, Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Muhammad Saleem, Andreas Both, Valter Crescenzi, Paolo Merialdo, and Disheng Qiu. Web-Scale Extension of RDF Knowledge Bases from Templated Websites. International Semantic Web Conference (ISWC), 2014
    4. Muhammad Saleem and Axel-Cyrille Ngonga Ngomo. HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation. Extended Semantic Web Conference (ESWC), 2014
    5. Maulik R. Kamdar, Aftab Iqbal, Muhammad Saleem, Helena F. Deus, and Stefan Decker. GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research. CSHALS, 2014 (Best paper award)
    6. Muhammad Saleem, Shanmukha Sampath, Axel-Cyrille Ngonga Ngomo, Aftab Iqbal, Jonas Almeida, and Helena Deus. TopFed: TCGA Tailored Federated Query Processing and Linking to LOD. Journal of Biomedical Semantics, 2014
    7. Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo. Big Linked Cancer Data: Integrating Linked TCGA and PubMed. Journal of Web Semantics, 2014
  • 98. PUBLICATIONS 2009-2013
    1. Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo. Fostering Serendipity through Big Linked Data. Semantic Web Challenge at International Semantic Web Conference (ISWC), 2013, Semantic Web Challenge (Big Data Track) Winner
    2. Muhammad Saleem, Shanmukha S Padmanabhuni, Axel-Cyrille Ngonga Ngomo, Jonas S Almeida, Stefan Decker, and Helena Deus. Linked Cancer Genome Atlas Database. In Linked Data Cup, I-Semantics 2013, I-CHALLENGE (Linked Data Cup) Winner
    3. Muhammad Saleem, Axel-Cyrille Ngonga Ngomo, Josian Xavier Pariera, Helena F. Deus, and Manfred Hauswirth. DAW: Duplicate-AWare Federated Query Processing over the Web of Data. International Semantic Web Conference (ISWC), 2013
    4. Muhammad Saleem, Ali Zahir, Yasir Ismail, and Bilal Saeed. Enhanced Generic Information Services Using Mobile Messaging. Grid and Pervasive Computing (GPC), 2010
    5. Muhammad Saleem and Kyung-Goo Doh. Generic Information System Using SMS Gateway. The Fourth International Conference on Computer Sciences and Convergence Information Technology (ICCIT), 2009
    6. Muhammad Saleem, Rasheed Hussain, Yasir Ismail, and Shaikh Mohsin. Cost Effective Software Engineering using Program Slicing Techniques. The 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human (ICIS), 2009
  • 100. STATE-OF-THE-ART: SPARQL FEDERATION APPROACHES  SPARQL Endpoint Federation (SEF)  Linked Data Federation (LDF)  Distributed Hash Tables (DHTs)  Hybrid of SEF+LDF 100 Saleem et al. A Fine-Grained Evaluation of SPARQL Endpoint Federation Systems (Semantic Web Journal, 2015)
  • 101. STATE-OF-THE-ART: SOURCE SELECTION  Index-only  Index-free (SPARQL ASK Queries)  Hybrid (Index+ SPARQL ASK Queries) 101
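The three source selection styles on this slide can be contrasted with a small sketch. Everything named here is an illustrative assumption: the toy `DATA` triples, the predicate-only `INDEX`, and `mock_ask`, which stands in for sending the generated ASK query to a live endpoint. The hybrid variant assumes ASK probes are issued only when the pattern has a bound subject or object, which is one common design and not necessarily what any specific engine does.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# A triple pattern is (subject, predicate, object); None marks a variable.
TriplePattern = Tuple[Optional[str], Optional[str], Optional[str]]
Triple = Tuple[str, str, str]
AskFn = Callable[[str, TriplePattern], bool]

# Hypothetical toy data standing in for three remote endpoints.
DATA: Dict[str, Set[Triple]] = {
    "DBpedia": {
        ("dbpedia:Obama", "rdf:type", "dbpedia:President"),
        ("dbpedia:Obama", "dbpedia:party", "dbpedia:Democratic_Party"),
    },
    "NYT": {
        ("nyt:obama", "nyt:topicPage", "nyt:page1"),
        ("nyt:obama", "owl:sameAs", "dbpedia:Obama"),
    },
    "KEGG": {("kegg:C1", "rdf:type", "kegg:Compound")},
}

# Index-only summary: the set of predicates each source contains.
INDEX: Dict[str, Set[str]] = {
    src: {p for _, p, _ in triples} for src, triples in DATA.items()
}


def ask_query(tp: TriplePattern) -> str:
    """Build the SPARQL ASK query that probes one triple pattern."""
    terms = [t if t is not None else f"?v{i}" for i, t in enumerate(tp)]
    return "ASK { " + " ".join(terms) + " }"


def mock_ask(source: str, tp: TriplePattern) -> bool:
    """Stand-in for sending ask_query(tp) to the endpoint of `source`."""
    return any(
        all(q is None or q == t for q, t in zip(tp, triple))
        for triple in DATA[source]
    )


def index_only(tp: TriplePattern, index: Dict[str, Set[str]]) -> List[str]:
    """Select sources whose index lists the pattern's predicate."""
    predicate = tp[1]
    if predicate is None:
        return sorted(index)
    return sorted(s for s, preds in index.items() if predicate in preds)


def index_free(tp: TriplePattern, sources: Iterable[str], ask: AskFn) -> List[str]:
    """Select sources by probing every endpoint with an ASK query."""
    return sorted(s for s in sources if ask(s, tp))


def hybrid(tp: TriplePattern, index: Dict[str, Set[str]], ask: AskFn) -> List[str]:
    """Use the index first, then ASK only for patterns with bound terms."""
    candidates = index_only(tp, index)
    if tp[0] is None and tp[2] is None:
        return candidates
    return [s for s in candidates if ask(s, tp)]
```

For the pattern `?president rdf:type dbpedia:President`, the predicate index alone keeps both DBpedia and KEGG, while the ASK probe in the hybrid variant discards KEGG at the cost of extra requests.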
  • 104. HIBISCUS: DATA SUMMARIES
    [] a ds:Service ;
       ds:endpointUrl <http://dbpedia.org/sparql> ;
       ds:capability [
          ds:predicate dbpedia:party ;
          ds:sbjAuthority <http://dbpedia.org/> ;
          ds:objAuthority <http://dbpedia.org/> ;
       ] ;
       ds:capability [
          ds:predicate rdf:type ;
          ds:sbjAuthority <http://dbpedia.org/> ;
          ds:objAuthority owl:Thing, dbpedia:President ;   # we store all distinct classes
       ] ;
       ds:capability [
          ds:predicate dbpedia:postalCode ;
          ds:sbjAuthority <http://dbpedia.org/> ;
          # No objAuthority, as the object value for dbpedia:postalCode is a string
       ] ;
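The authorities stored in these summaries can be used to prune sources for joined triple patterns. The following is a simplified sketch of that idea only, not the full hypergraph-based algorithm: `authority`, `prune_join`, and the contents of `SUMMARIES` are hypothetical names and toy values chosen for illustration. A source is kept for one side of a join only if its authorities at the join position overlap with those offered by the other side.

```python
from typing import Dict, List, Set, Tuple
from urllib.parse import urlparse

# source -> predicate -> position ("sbj" or "obj") -> set of URI authorities
Summaries = Dict[str, Dict[str, Dict[str, Set[str]]]]


def authority(uri: str) -> str:
    """Return scheme and host of a URI, e.g. 'http://dbpedia.org/'."""
    parts = urlparse(uri)
    return f"{parts.scheme}://{parts.netloc}/"


# Hypothetical summaries; the authority values are made up for the example.
SUMMARIES: Summaries = {
    "DBpedia": {
        "dbpedia:party": {
            "sbj": {"http://dbpedia.org/"},
            "obj": {"http://dbpedia.org/"},
        },
        "owl:sameAs": {
            "sbj": {"http://dbpedia.org/"},
            "obj": {"http://sws.geonames.org/"},
        },
    },
    "NYT": {
        "owl:sameAs": {
            "sbj": {"http://data.nytimes.com/"},
            "obj": {"http://dbpedia.org/"},
        },
    },
    "KEGG": {
        "owl:sameAs": {
            "sbj": {"http://bio2rdf.org/"},
            "obj": {"http://bio2rdf.org/"},
        },
    },
}


def prune_join(
    summaries: Summaries,
    sources_a: List[str], pred_a: str, pos_a: str,
    sources_b: List[str], pred_b: str, pos_b: str,
) -> Tuple[List[str], List[str]]:
    """Drop sources whose join-position authorities cannot match the other side."""

    def auths(src: str, pred: str, pos: str) -> Set[str]:
        return summaries[src].get(pred, {}).get(pos, set())

    all_a = set().union(*(auths(s, pred_a, pos_a) for s in sources_a))
    all_b = set().union(*(auths(s, pred_b, pos_b) for s in sources_b))
    kept_a = [s for s in sources_a if auths(s, pred_a, pos_a) & all_b]
    kept_b = [s for s in sources_b if auths(s, pred_b, pos_b) & all_a]
    return kept_a, kept_b


# Join on ?president between "?x owl:sameAs ?president" (object position)
# and "?president dbpedia:party ?party" (subject position).
kept_same_as, kept_party = prune_join(
    SUMMARIES,
    ["DBpedia", "NYT", "KEGG"], "owl:sameAs", "obj",
    ["DBpedia"], "dbpedia:party", "sbj",
)
```

With these toy summaries, only NYT survives for the `owl:sameAs` pattern, because it is the only source whose `owl:sameAs` objects live under the authority that DBpedia uses for subjects of `dbpedia:party`.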
  • 105. EFFICIENT SOURCE SELECTION
    #TP = total triple pattern-wise sources selected, #AR = number of SPARQL ASK requests, SST = source selection time in msec
    Query | FedX (warm)      | SPLENDID         | DARQ             | ANAPSID          | HiBISCuS (warm)
          | #TP  #AR  SST    | #TP  #AR  SST    | #TP  #AR  SST    | #TP  #AR  SST    | #TP  #AR  SST
    CD    | 78   0    7.33   | 78   99   320.9  | 84   0    7.286  | 36   43   186    | 35   0    30.43
    LS    | 56   0    7.99   | 56   90   307.3  | 77   0    7.571  | 44   63   477.4  | 41   0    23.14
    LD    | 97   0    8.09   | 97   126  279    | 113  0    7.727  | 54   37   803.5  | 47   0    16
    Net   | 231  0    8      | 231  315  299    | 274  0    7.56   | 134  143  554    | 123  0    22
  • 106. DAW: DATA SUMMARIES
    [] a sd:Service ;
       sd:endpointUrl <http://localhost:8890/sparql> ;
       sd:capability [
          sd:predicate diseasome:name ;
          sd:totalTriples 147 ;
          sd:avgSbjSel "0.0068" ;
          sd:avgObjSel "0.0069" ;
          sd:MIPs "-6908232 -7090543 -6892373 -7064247 ..." ;
       ] ;
       sd:capability [
          sd:predicate diseasome:chromosomalLocation ;
          sd:totalTriples 160 ;
          sd:avgSbjSel "0.0062" ;
          sd:avgObjSel "0.0072" ;
          sd:MIPs "-7056448 -7056410 -6845713 -6966021 ..." ;
       ] ;
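The `sd:MIPs` values are min-wise independent permutation vectors, which let a federator estimate how much two sources overlap without transferring their triples. A minimal sketch of that estimation follows, under stated assumptions: the linear hash family, the MD5-based triple identifiers, the vector length, and all function names are illustrative choices and are not taken from DAW's implementation.

```python
import hashlib
import random
from typing import List, Sequence, Tuple

PRIME = (1 << 61) - 1  # modulus for the linear hash family (assumed choice)


def make_permutations(k: int, seed: int = 42) -> List[Tuple[int, int]]:
    """Create k pseudo-random linear permutations x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def mips_vector(items: Sequence[str], perms: List[Tuple[int, int]]) -> List[int]:
    """Keep the minimum hash value under each permutation (items must be non-empty)."""
    ids = [
        int.from_bytes(hashlib.md5(x.encode("utf-8")).digest()[:8], "big")
        for x in items
    ]
    return [min((a * i + b) % PRIME for i in ids) for a, b in perms]


def resemblance(v1: Sequence[int], v2: Sequence[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching positions."""
    return sum(x == y for x, y in zip(v1, v2)) / len(v1)


def estimated_overlap(v1: Sequence[int], size1: int,
                      v2: Sequence[int], size2: int) -> float:
    """Turn the resemblance estimate into an estimated intersection size."""
    r = resemblance(v1, v2)
    return r * (size1 + size2) / (1 + r)


# Two hypothetical sources sharing half of their 100 triples each.
perms = make_permutations(256)
source_a = [f"triple-{i}" for i in range(100)]
source_b = [f"triple-{i}" for i in range(50, 150)]
vec_a = mips_vector(source_a, perms)
vec_b = mips_vector(source_b, perms)
overlap = estimated_overlap(vec_a, len(source_a), vec_b, len(source_b))
```

A duplicate-aware selector could rank sources by how many new triples each is estimated to contribute and skip those that fall under a threshold; the estimate is approximate and its variance shrinks as the vector gets longer.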
  • 107. SOURCE RANKING VS. RECALL [Figure: two panels, Diseasome and Publication, each plotting recall in % against ranked sources for Optimal and DAW]
  • 108. TRIPLE STORE BENCHMARKS
    Synthetic Benchmarks
       Use synthetic queries and/or data
       Suitable for testing scalability
       Often fail to reflect real datasets
       Examples: LUBM, SP2Bench, BSBM, WatDiv
    Query Log Benchmarks
       Use real queries from query logs
       Can be closer to reality
       Scalability can be tested
       Examples: DBPSB, FEASIBLE
  • 109. FEASIBLE: COMPOSITE ERROR ESTIMATION L is the query log, B is the benchmark, and K is the set of all features
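A composite error of this kind can be sketched in code. The exact formula is not part of the slide text, so the form used here is an assumption: the error on means and the error on standard deviations are each taken as the mean squared difference between L and B over the features in K, and the composite error is their harmonic mean. The function name, the use of the population standard deviation, and the toy feature vectors are likewise hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def composite_error(log: Sequence[Sequence[float]],
                    bench: Sequence[Sequence[float]]) -> float:
    """Composite error between a query log L and a benchmark B.

    Each query is a vector of normalized feature values, one per feature in K.
    Assumed form: harmonic mean of the mean squared differences of the
    per-feature means and of the per-feature standard deviations.
    """
    k = len(log[0])
    log_cols: List[List[float]] = [[q[i] for q in log] for i in range(k)]
    bench_cols: List[List[float]] = [[q[i] for q in bench] for i in range(k)]

    err_mean = mean(
        (mean(l) - mean(b)) ** 2 for l, b in zip(log_cols, bench_cols)
    )
    err_std = mean(
        (pstdev(l) - pstdev(b)) ** 2 for l, b in zip(log_cols, bench_cols)
    )
    if err_mean + err_std == 0:
        return 0.0
    return 2 * err_mean * err_std / (err_mean + err_std)


# Hypothetical log with two queries and a benchmark containing only one of them.
log_queries = [[0.0, 0.0], [1.0, 1.0]]
benchmark = [[0.0, 0.0]]
error = composite_error(log_queries, benchmark)
```

Under this assumed form, a benchmark that reproduces the log's per-feature means and standard deviations exactly scores 0, and larger values mean the benchmark is less representative of the log.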
  • 111. SPARQL 1.1 QUERIES RUNTIME RESULTS [Figure: average runtime in msec (log scale, 1.E+00 to 1.E+06) for queries S1 to S14, comparing FedX (100% cached) and ANAPSID] ANAPSID  FedX 8/14
  • 112. SPARQL 1.1 QUERIES RUNTIME RESULTS [Figure: average runtime in msec (log scale, 1.E+00 to 1.E+06) for queries C1 to C10, comparing FedX (100% cached) and ANAPSID; annotations mark two runtime errors and four timeouts] FedX  ANAPSID 6/8
  • 113. CONCLUSION
     HIBISCUS: Hyper graph-based source selection
       FedX: 20/25 queries, net improvement 24.61%
       SPLENDID: 24/25 queries, net improvement 82.75%
       DARQ: 20/20 queries, net improvement 92.22%
     DAW: Duplicate-aware source selection
       FedX: 63/79 queries, net improvement 9.79%
       SPLENDID: 66/79 queries, net improvement 11.11%
       DARQ: 70/79 queries, net improvement 16.46%
     SAFE: Policy-aware source selection
       FedX: 12/12 queries, net improvement 84%
     TopFed: Data distribution-aware selection
       FedX: 10/10 queries, net improvement 68%
     Join-aware source selection leads to efficient query planning, reduced intermediate results, and decreased overall runtime
     FEASIBLE: Triple Store Benchmark
       FEASIBLE's composite error is 55% smaller than that of DBPSB
       New insights on performance of triple stores
     LargeRDFBench
       Benchmarks with only simple queries are not sufficient
       Ranking changes from simple to complex queries