1. SQCFramework: SPARQL Query Containment Benchmark Generation Framework
Muhammad Saleem, Claus Stadler, Qaiser Mehmood, Jens Lehmann, Axel-Cyrille Ngonga Ngomo
(K-Cap 2017, Austin, USA)
AKSW, University of Leipzig, Germany
DICE, University of Paderborn, Germany
SDA, University of Bonn, Germany
2. Query containment
Why SQCFramework?
SQCFramework
Input queries
Important query features
Benchmark generation
Benchmark personalization
Evaluation and results
Conclusion
3. Query containment: deciding whether the result set of one query is included in the result set of another.
Formally:
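The formal definition did not survive extraction; the standard formulation of query containment (which the slide presumably showed) is: a query Q1 is contained in Q2 if, over every possible RDF graph G, the results of Q1 are a subset of the results of Q2.

```latex
Q_1 \sqsubseteq Q_2 \;\Longleftrightarrow\; \forall G:\; \llbracket Q_1 \rrbracket_G \subseteq \llbracket Q_2 \rrbracket_G
```

Here G ranges over RDF graphs and [[Q]]_G denotes the result set of evaluating Q over G.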
7. Input queries are either:
Manually provided by the user, or
Selected from LSQ (Linked SPARQL Queries) datasets:
Extracted from endpoint query logs
Enriched with structural and data-driven statistics
20 datasets available from http://hobbitdata.informatik.uni-leipzig.de/lsq-dumps/
8. Query features used for clustering:
Number of entailments/sub-queries
Number of projection variables
Number of BGPs
Number of triple patterns
Max. number of BGP triple patterns
Min. number of BGP triple patterns
Number of join vertices
Mean join vertex degree
Number of LSQ features
9. Benchmark generation steps:
1. Selection of super-queries
2. Normalization of feature vectors
3. Generation of clusters
4. Selection of the most representative queries
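The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming plain k-means for step 3 (SQCFramework itself supports several clustering algorithms, e.g. KMeans++ and DBSCAN); all function names here are illustrative, not the framework's API.

```python
import math
import random

def normalize(vectors):
    """Step 2: divide each feature by its maximum over all queries (F/M),
    so every dimension falls into [0, 1]."""
    maxima = [max(v[i] for v in vectors) or 1 for i in range(len(vectors[0]))]
    return [[v[i] / maxima[i] for i in range(len(v))] for v in vectors]

def kmeans(points, k, iters=50, seed=42):
    """Step 3: plain k-means; returns a list of k centroids."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[idx].append(p)
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def select_benchmark(queries, vectors, k):
    """Steps 2-4: normalize, cluster, pick the query nearest each centroid."""
    norm = normalize(vectors)
    picked = set()
    for c in kmeans(norm, k):
        ranked = sorted(range(len(norm)), key=lambda i: math.dist(norm[i], c))
        # take the closest query not already selected
        picked.add(next(i for i in ranked if i not in picked))
    return [queries[i] for i in sorted(picked)]
```

Selecting the query closest to each cluster centroid is what makes the resulting benchmark representative: each selected query stands in for one region of the log's feature space.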
11. Example: normalizing a feature vector (F = raw feature values of one query, M = maximum value of each feature over the log, F/M = normalized vector):

Feature                               F     M     F/M
Number of entailments/sub-queries     2     10    0.2
Number of projection variables        2     8     0.25
Number of BGPs                        1     6     0.16
Number of triple patterns             5     12    0.41
Max. number of BGP triple patterns    5     5     1
Min. number of BGP triple patterns    5     10    0.5
Number of join vertices               3     10    0.33
Mean join vertex degree               2.3   5     0.46
Number of LSQ features                2     30    0.06
18. Example personalization constraints:
The number of projection variables in the super-queries should be at most 2
The number of BGPs should be greater than 1, or the number of triple patterns should be greater than 3
The benchmark should be selected from the 1000 most recently executed queries
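Constraints like these can be expressed as a simple predicate over per-query features. The sketch below uses a dict-based representation purely for illustration; in SQCFramework itself, personalization is expressed as user-defined selection criteria over the LSQ datasets, not through this hypothetical helper.

```python
def matches_constraints(q):
    """q is a dict of LSQ-style features for one query.
    Encodes the three example constraints from the slide."""
    return (
        q["projection_vars"] <= 2
        and (q["bgps"] > 1 or q["triple_patterns"] > 3)
        and q["recency_rank"] <= 1000  # among the 1000 most recent queries
    )

def personalize(query_log):
    """Keep only the super-queries satisfying all constraints."""
    return [q for q in query_log if matches_constraints(q)]
```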
19. Evaluation metrics for benchmark generation:
Similarity error
Diversity score
where L is the query log, B is the benchmark, and k is the set of all features
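The exact formulas did not survive extraction. Both metrics are computed from the normalized feature vectors of L and B, so one plausible instantiation, offered only as an illustration and not as the paper's definitions, is:

```python
import math

def mean_vector(vectors):
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def similarity_error(log_vectors, bench_vectors):
    """Distance between the mean normalized feature vector of the full
    log L and that of the benchmark B, averaged over the |k| features;
    lower means B mirrors L more closely.
    NOTE: an illustrative stand-in for the metric on the slide."""
    mu_l, mu_b = mean_vector(log_vectors), mean_vector(bench_vectors)
    return sum(abs(a - b) for a, b in zip(mu_l, mu_b)) / len(mu_l)

def diversity_score(bench_vectors):
    """Average per-feature standard deviation within B; higher means the
    benchmark queries differ more from one another.
    NOTE: likewise an illustrative stand-in."""
    mu = mean_vector(bench_vectors)
    n = len(bench_vectors)
    return sum(
        math.sqrt(sum((v[i] - mu[i]) ** 2 for v in bench_vectors) / n)
        for i in range(len(mu))
    ) / len(mu)
```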
We compared:
FEASIBLE
FEASIBLE-Exemplars
KMeans++
DBSCAN+KMeans++
Random selection
Number of containment tests (#T)
Benchmark generation time (G) in seconds
20. Evaluation metrics for containment solvers:
Query Mixes per Hour (QMpH)
Number of handled test cases
Number of timed-out test cases
We compared:
TreeSolver
AFMU
SPARQL-Algebra
JSAC
We generated benchmarks using the Semantic Web Dog Food (SWDF) and DBpedia query logs
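QMpH is the standard throughput metric in SPARQL benchmarking: the number of complete query mixes (one run through every benchmark query) a system finishes per hour. A minimal computation sketch:

```python
def qmph(mixes_completed, elapsed_seconds):
    """Query Mixes per Hour: completed mixes scaled to a one-hour window."""
    return mixes_completed * 3600.0 / elapsed_seconds
```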
21. [Figure: similarity error vs. number of super-queries for SWDF (left, 15-125 super-queries) and DBpedia (right, 2-15 super-queries), comparing FEASIBLE, FEASIBLE-Exemplars, KMeans++, DBSCAN+KMeans++, and Random selection]
• Similarity error is, in general, inversely proportional to benchmark size
• Random selection in general generates benchmarks with smaller similarity errors
22. [Figure: diversity score vs. number of super-queries for SWDF (left, 15-125 super-queries) and DBpedia (right, 2-15 super-queries), comparing the same five selection methods]
• Diversity score is, in general, inversely proportional to benchmark size
• FEASIBLE-Exemplars generates the most diverse benchmarks
25. [Figure: QMpH (scale 0-2) for TreeSolver, AFMU, JSAC, and SPARQL-Algebra]
• JSAC correctly handled all test cases with reasonable QMpH

Solver          Total Tests   #Handled Tests   #Correct Tests   #Timeout Tests
TreeSolver      1192          5                5                2
AFMU            1192          5                5                12
SPARQL-Algebra  1192          0                0                0
JSAC            1192          1192             1192             0
26. Conclusion
SQCFramework:
Based on real data and real log queries
Flexible
Customizable
Use-case specific
Similarity error is, in general, inversely proportional to benchmark size
Random selection in general generates benchmarks with smaller similarity errors
Diversity score is, in general, inversely proportional to benchmark size
FEASIBLE-Exemplars generates the most diverse benchmarks
JSAC correctly handled all test cases with reasonable QMpH
SQCFramework is available at https://github.com/dice-group/sqcframework