7. Scientific method applied to analysis of algorithms
• A framework for predicting performance and comparing algorithms.
• Scientific method
– Observe some feature of the natural world.
– Hypothesize a model that is consistent with the observations.
– Predict events using the hypothesis.
– Verify the predictions by making further observations.
– Validate by repeating until the hypothesis and observations agree.
• Principles
– Experiments must be reproducible.
– Hypotheses must be falsifiable.
• Feature of the natural world. Computer itself.
Slide credit: Robert Sedgewick
8. Example: 3-Sum
• 3-SUM. Given N distinct integers, how many triples sum to exactly zero?
• 3-SUM brute-force algorithm. Check all the possible triples.
• How much time does it take?
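The brute-force approach can be sketched as three nested loops over index triples i < j < k. This is a minimal illustration; the function name and the sample input are assumptions chosen for the example, not taken from the slides.

```python
def count_three_sum(values):
    """Count index triples i < j < k whose values sum to exactly zero."""
    n = len(values)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if values[i] + values[j] + values[k] == 0:
                    count += 1
    return count

# Illustrative input with 8 distinct integers.
print(count_three_sum([30, -40, -20, -10, 40, 0, 10, 5]))  # prints 4
```

The loops examine about N³/6 triples, which is why the running time is expected to grow roughly with the cube of N.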
9. Data analysis
• Standard plot. Plot running time T(N) vs. input size N.
10. Data analysis
• Log-log plot. Plot running time lg(T(N)) vs. input size lg N.
• Regression. Fit straight line through data points: $a N^b$.
• Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$ seconds.
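A straight-line fit in log-log space can be sketched with ordinary least squares: the slope gives the exponent b and the intercept gives lg a. The timings below are synthetic values generated from an assumed cubic law, not the measurements behind the slide.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of lg T = b * lg N + lg a; returns (a, b)."""
    xs = [math.log2(n) for n in sizes]
    ys = [math.log2(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                   # slope of the line = exponent
    a = 2 ** (mean_y - b * mean_x)  # intercept lg a, converted back to a
    return a, b

# Synthetic timings from T(N) = 1e-10 * N^3 (illustrative, not measured data).
sizes = [1000, 2000, 4000, 8000]
times = [1e-10 * n ** 3 for n in sizes]
a, b = fit_power_law(sizes, times)
print(f"a = {a:.3e}, b = {b:.3f}")  # a = 1.000e-10, b = 3.000
```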
11. Prediction and validation
• Hypothesis. The running time is about $1.006 \times 10^{-10} \times N^{2.999}$ seconds.
• Predictions.
– 51.0 seconds for N = 8000.
– 408.1 seconds for N = 16000.
• Observations. Validates the hypothesis
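The two predictions follow directly from plugging N into the hypothesized power law. A short check, with the function name chosen only for illustration:

```python
def predicted_seconds(n, a=1.006e-10, b=2.999):
    """Evaluate the hypothesized running time a * N^b in seconds."""
    return a * n ** b

for n in (8000, 16000):
    print(n, round(predicted_seconds(n), 1))
# 8000 51.0
# 16000 408.1
```

Doubling N multiplies the predicted time by 2^2.999, which is just under 8.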
12. Understanding performance of database queries
• Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning.
• Gupta et al. use machine learning for predicting query execution time ranges.
Ganapathi et al.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Proc. of the 2009 IEEE ICDE
Gupta et al.: PQR: Predicting query execution times for autonomous workload management. In Proc. of the 2008 ICAC
13. Predicting SPARQL query execution time
• Key challenge. Feature engineering
– Representing SPARQL queries as feature vectors
• Each dimension of the vector is a feature
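To make the idea of "one dimension per feature" concrete, here is a minimal sketch that maps a query string to a vector of keyword counts. The keyword list is a hypothetical stand-in chosen for illustration; it is not the feature set used in the experiments.

```python
import re

# Hypothetical feature list (an assumption): one vector dimension per keyword.
FEATURES = ["SELECT", "DISTINCT", "OPTIONAL", "UNION", "FILTER", "LIMIT"]

def to_feature_vector(query):
    """Count whole-word occurrences of each keyword, ignoring case."""
    text = query.upper()
    return [len(re.findall(r"\b" + name + r"\b", text)) for name in FEATURES]

query = "SELECT DISTINCT ?uri WHERE { dbpedia:1549_Mikko ?p ?uri . ?uri rdf:type ?x }"
print(to_feature_vector(query))  # [1, 1, 0, 0, 0, 0]
```

Every query, whatever its length, ends up as a vector of the same dimensionality, which is what a regression model needs as input.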
14. Configuration
• Apache Jena TDB
– With DBpedia 3.8 dataset
• Training, validation, and test queries: randomly selected from DBpedia SPARQL Benchmark (DBPSB) query dataset
– 3600 training, 1200 validation, 1200 test
15. Jena ARQ query processing
• A SPARQL query in ARQ goes through several stages of processing:
– String to Query (parsing)
– Translation from Query to a SPARQL algebra expression
– Optimization of the algebra expression
– Query plan determination and low-level optimization
– Evaluation of the query plan
16. SPARQL algebra features
• SPARQL Algebra [1]
[1] http://www.w3.org/TR/sparql11-query/#sparqlQuery
18. Experiment 1
• Model: Support Vector Machine regression
• Evaluation measure: $R^2$
• Measures how well future samples are likely to be predicted by the model.
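For reference, the usual definition of the coefficient of determination is $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below assumes this standard definition is the one behind the reported scores.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)           # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0, perfect predictions
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0, no better than predicting the mean
```

A score near 0 means the model does little better than always predicting the mean execution time; a score near 1 means the predictions track the actual values closely.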
19. Experiment 1
• Test dataset $R^2$ = 0.004492
Log-scale plot of predicted vs. actual execution times for the test queries.
20. Experiment 1
Some of the long running queries share structurally similar basic graph patterns.
{ dbpedia:1549_Mikko ?p ?uri . ?uri rdf:type ?x }
Challenge. How do we represent basic graph patterns as vectors?
21. Basic Graph Pattern Features
• Infinite number of possibilities to write a basic graph pattern (BGP)
• Using only the set of literal values and the set of resources appearing in the RDF graph
– Exponential number of possibilities
– A graph with n triples has $2^n$ subsets of triples
• Feature vector with exponential number of dimensions
– Not feasible
22. Basic Graph Pattern Features
• Pattern graph = RDF graph constructed from all the BGPs in a query
– Replace variables with a fixed symbol ‘?’
• Cluster the training queries based on pattern graph similarities
• Create a vector with similarity scores between the pattern graph of the query and the queries in the cluster centers.
23. • Graph Edit Distance
– Minimum amount of distortion needed to transform one graph to another
– Compute similarity by inverting the distance
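The pattern-graph and similarity steps can be sketched as follows. Two parts are assumptions made for illustration: the distance is a crude stand-in (the number of triples that differ) rather than a true graph edit distance, and the inversion 1 / (1 + d) is one plausible form, since the slides do not give the exact formula.

```python
def pattern_graph(triples):
    """Replace every variable (a term starting with '?') by the fixed symbol '?'."""
    return {tuple("?" if term.startswith("?") else term for term in t) for t in triples}

def crude_distance(g1, g2):
    """Stand-in for graph edit distance: triples to delete plus triples to insert."""
    return len(g1 ^ g2)

def similarity(g1, g2):
    """Assumed inversion 1 / (1 + d): equals 1.0 for identical graphs, decreases with distance."""
    return 1.0 / (1.0 + crude_distance(g1, g2))

def bgp_features(query_graph, center_graphs):
    """One similarity score per cluster center gives a fixed-length vector."""
    return [similarity(query_graph, center) for center in center_graphs]

mikko = pattern_graph([("dbpedia:1549_Mikko", "?p", "?uri"), ("?uri", "rdf:type", "?x")])
sabo = pattern_graph([("dbpedia:Radu_Sabo", "?p", "?uri"), ("?uri", "rdf:type", "?x")])
print(bgp_features(mikko, [mikko, sabo]))  # [1.0, 0.3333333333333333]
```

With K cluster centers the vector always has K dimensions, which avoids the exponential blow-up of enumerating all possible patterns.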
24. • Graph Edit Distance
– Usually computed using A* search
• Exponential running time
– Bipartite matching based approximated graph edit distance with
• Previous research shows very accurate results with classification problems
25. • Clustering Training Queries
– K-medoids clustering algorithm with approximated edit distance as distance function
• Selects data points as cluster centers
• Arbitrary distance function
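A minimal k-medoids sketch shows the two properties listed above: the centers are always actual data points, and any distance function can be plugged in. The alternating assign/update scheme below is one common variant, not necessarily the implementation used in the experiments, and the 1-D data is purely illustrative.

```python
import random

def k_medoids(points, k, distance, max_iterations=100, seed=0):
    """Return k medoids (actual data points) under an arbitrary distance function."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # indices of the current centers
    for _ in range(max_iterations):
        # Assignment step: attach every point to its closest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            closest = min(medoids, key=lambda m: distance(p, points[m]))
            clusters[closest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to all members of its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # possible with duplicate points; keep the old center
                new_medoids.append(m)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(distance(points[c], points[j]) for j in members)))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]

centers = k_medoids([0, 1, 2, 100, 101, 102], k=2, distance=lambda a, b: abs(a - b))
print(sorted(centers))  # [1, 101]
```

Because only pairwise distances are needed, the same function would accept pattern graphs as points and an approximated edit distance as `distance`.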
26. Experiment 2
• Model: Support Vector Machine regression
• Test dataset $R^2$ = 0.124204
• K = 10
Figure panels: Algebra features; Algebra + BGP features
27. Multiple Regressions
• We train different SVM regressions for different time ranges.
• The variance along the y-axis is smaller for each regression, so it is easier to fit a curve.
28. • Different time ranges
– Clustering the execution time ranges
• We use the x-means clustering algorithm, which automatically estimates the number of clusters
– 5 clusters found in the training dataset
– Each cluster contains queries with similar execution times
29. • Predicting execution time range
– Predict the corresponding clusters for unseen queries.
– How?
• Train an SVM classifier with the found clusters as labels
• Classify unseen queries: accuracy of 96% for the test dataset
• This means we can accurately predict time ranges
30. • Predicting execution time
– Different SVM regressions for different time ranges.
– For an unseen query, use the regression that corresponds to its predicted time range cluster
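The two-stage scheme can be sketched as a simple dispatch: first predict the time range, then apply the regression trained for that range. The classifier and regressors below are toy stand-ins written as plain functions, not trained SVM models, and all names are hypothetical.

```python
def predict_execution_time(features, classify_range, regressors):
    """Pick the regression for the predicted time range, then apply it."""
    cluster = classify_range(features)    # stage 1: time range cluster label
    return regressors[cluster](features)  # stage 2: regression for that range

# Toy stand-ins (assumptions): one feature, two time ranges.
def classify_range(features):
    return "short" if features[0] < 10 else "long"

regressors = {
    "short": lambda f: 0.5 * f[0],
    "long": lambda f: 5.0 + 0.5 * f[0],
}

print(predict_execution_time([4], classify_range, regressors))   # 2.0
print(predict_execution_time([20], classify_range, regressors))  # 15.0
```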
31. Experiment 3
• Test dataset $R^2$ = 0.83862
Figure panels: Algebra + BGP features; Multiple regressions
32. Predicting with nearest neighbors regression
• The k-nearest neighbors algorithm (k-NN) is often successful in cases where the decision boundary is irregular.
• We train a k-NN with
– Euclidean distance as the distance function
– Distance weighting: weighted by the inverse of the distance
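A minimal sketch of k-NN regression with Euclidean distance and inverse-distance weighting follows. It scans all training points by brute force, and the training data is invented for illustration.

```python
import math

def knn_predict(train_x, train_y, query, k=2):
    """Predict as the inverse-distance-weighted mean of the k nearest targets."""
    neighbors = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))[:k]
    # An exact match has zero distance; return its target to avoid dividing by zero.
    for d, y in neighbors:
        if d == 0.0:
            return y
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

train_x = [(0.0,), (2.0,), (10.0,)]
train_y = [1.0, 3.0, 50.0]
print(knn_predict(train_x, train_y, (0.5,)))  # 1.5
```

In the example the neighbor at distance 0.5 gets three times the weight of the neighbor at distance 1.5, so the prediction lands closer to its target.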
33. • k-dimensional tree (k-d tree) data structure to search the nearest neighbors
– a space-partitioning data structure for organizing points in a k-dimensional space
• Complexity of a search: O(log N) operations on average
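A compact k-d tree sketch: the tree is built by splitting on the median along alternating axes, and the search skips a subtree whenever the splitting plane is farther away than the best match found so far. The dictionary-based node layout and the sample points are choices made for this illustration.

```python
import math

def build_kd_tree(points, depth=0):
    """Recursively split the points at the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return the stored point closest to query (Euclidean distance)."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    if best is None or math.dist(query, point) < math.dist(query, best):
        best = point
    diff = query[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(query, best):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(points)
print(nearest(tree, (9, 2)))  # (8, 1)
```

The pruning step is what brings a typical search down to logarithmic time on well-distributed points; in the worst case the search can still visit every node.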
34. Experiment 4
• Test dataset $R^2$ = 0.837
• k = 2 for k-NN (selected by cross validation)
Figure panels: Multiple regressions; k-NN
35. • Future work
– Training data with broad coverage
• DBpedia SPARQL benchmark query templates
– Berlin: 5 templates
– DBPSB: 20 templates
– Fine tuning with more cross validation
37. Suggesting SPARQL queries based on query history
• Use the same features
• Construct a k-d tree for nearest neighbor search
• Top M neighbors for a query are the top M suggestions for that query
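The suggestion step can be sketched as a top-M nearest neighbor lookup over the feature vectors of past queries. For brevity this sketch scans the whole history instead of using a k-d tree, and the query labels and vectors are invented for illustration.

```python
import heapq
import math

def suggest(query_vector, history, m=3):
    """Return the texts of the m past queries closest to query_vector.

    history is a list of (query_text, feature_vector) pairs.
    """
    closest = heapq.nsmallest(m, history, key=lambda item: math.dist(item[1], query_vector))
    return [text for text, _ in closest]

# Hypothetical history entries with 2-dimensional feature vectors.
history = [
    ("q1", (1.0, 0.0)),
    ("q2", (0.0, 1.0)),
    ("q3", (5.0, 5.0)),
    ("q4", (0.9, 0.1)),
]
print(suggest((1.0, 0.0), history, m=2))  # ['q1', 'q4']
```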
38. Example
SELECT DISTINCT ?uri WHERE { dbpedia:1549_Mikko ?p ?uri . ?uri rdf:type ?x }
Suggestion 1
SELECT DISTINCT ?uri WHERE { dbpedia:Radu_Sabo ?p ?uri . ?uri rdf:type ?x }
Suggestion 2
SELECT DISTINCT ?uri WHERE { dbpedia:Hafar_Al-Batin ?p ?uri . ?uri rdf:type ?x }
Suggestion 3
SELECT DISTINCT ?uri WHERE { dbpedia:Maurice_D._G._Scott ?p ?uri . ?uri rdf:type ?x }
39. • Future work
– Query construction and refinement workflow
• How to use the query suggestions?
– Evaluating the suggestions
• User study