17. Ideas are Wrong
• Too Many Back-ends (aka Tinkerpop is wrong)
• No lessons applied from Relational Databases
• API is incomplete (bulk)
• Query Languages are Incompetent
18. Implementations are Wrong
• Nodes as Objects sucks
• No internal algebras
• Incompetent Query Optimizers
• Incompetent Query Executors
• Incompetent Engineering
• A short clip of the talk
31. Peter Suggests:
https://homepages.cwi.nl/~boncz/edbt2022.pdf
1. Row Storage for Properties of Nodes/Relationships
2. Less Indexing
3. Less Joins
4. Be more Relational then add Graph Functionality
5. Don’t rely on the query optimizer
6. Don’t allow generic recursive queries
7. Limit the query language
51. Problem with Joins
Table 1 IDs: 0, 1, 3, 4, 5, 6, 7, 8, 9, 11
Table 2 IDs: 0, 2, 6, 7, 8, 9
Table 3 IDs: 2, 4, 5, 8, 10
Intermediate Results (Table 1 and Table 2): 0, 6, 7, 8, 9
Results (Table 1, Table 2, Table 3): 8, 8, 8
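The tables above can be replayed in a few lines to see where the wasted work comes from. This is a minimal sketch, assuming a pairwise join plan that joins Table 1 with Table 2 first; the helper name `join_on_id` is hypothetical.

```python
# IDs taken from the three tables on the slide.
table1 = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
table2 = [0, 2, 6, 7, 8, 9]
table3 = [2, 4, 5, 8, 10]


def join_on_id(left, right):
    """Equi-join two ID lists on ID (hypothetical helper for illustration)."""
    right_ids = set(right)
    return [i for i in left if i in right_ids]


# A pairwise plan materialises five intermediate rows...
intermediate = join_on_id(table1, table2)
# ...only to throw four of them away in the second join.
result = join_on_id(intermediate, table3)

print(intermediate)  # [0, 6, 7, 8, 9]
print(result)        # [8]
```

Five intermediate rows are produced for a single result row; on real data the intermediate result can be far larger than both the inputs and the output.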
52. Worst Case Optimal Joins
● Worst-Case Optimal Join Algorithms: Techniques, Results, and
Open Problems. Ngo. (Gems of PODS 2018)
● Worst-Case Optimal Join Algorithms: Techniques, Results, and
Open Problems. Ngo, Porat, Re, Rudra. (Journal of the ACM
2018)
● What do Shannon-type inequalities, submodular width, and
disjunctive datalog have to do with one another? Abo Khamis,
Ngo, Suciu, (PODS 2017 - Invited to Journal of ACM)
● Computing Join Queries with Functional Dependencies. Abo
Khamis, Ngo, Suciu. (PODS 2017)
● Joins via Geometric Resolutions: Worst-case and Beyond. Abo
Khamis, Ngo, Re, Rudra. (PODS 2015, Invited to TODS 2015)
● Beyond Worst-Case Analysis for Joins with Minesweeper. Abo
Khamis, Ngo, Re, Rudra. (PODS 2014)
● Leapfrog Triejoin: A Simple Worst-Case Optimal Join Algorithm.
Veldhuizen (ICDT 2014 - Best Newcomer)
● Skew Strikes Back: New Developments in the Theory of Join
Algorithms. Ngo, Re, Rudra. (Invited to SIGMOD Record 2013)
● Worst Case Optimal Join Algorithms. Ngo, Porat, Re,
Rudra. (PODS 2012 – Best Paper)
54. More than 3 Tables
[Figure: seeks across the sorted keys of four tables (Brand, Category, Retailer, Rating), in order: 1) seek c, 2) seek c, 3) seek f, 4) seek g, 5) seek m, 6) seek m, 7) seek m]
Worst-Case Optimal Joins take advantage of sorted keys and gaps in the data to
eliminate intermediate results, speed up queries and get rid of the Join problem.
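The seek-based idea can be sketched for the simplest case: intersecting several sorted key lists on one attribute. This is only an illustration in the spirit of the leapfrog join, not the full Leapfrog Triejoin from the literature; the function name `leapfrog_intersect` is hypothetical and the inputs are assumed to be sorted and duplicate-free.

```python
from bisect import bisect_left


def leapfrog_intersect(*sorted_lists):
    """Intersect sorted, duplicate-free key lists by seeking past gaps."""
    if not sorted_lists or any(len(keys) == 0 for keys in sorted_lists):
        return []
    positions = [0] * len(sorted_lists)
    result = []
    # No key smaller than the largest first key can be in every list.
    candidate = max(keys[0] for keys in sorted_lists)
    while True:
        agreed = True
        for i, keys in enumerate(sorted_lists):
            # Seek: jump straight to the first key >= candidate.
            positions[i] = bisect_left(keys, candidate, positions[i])
            if positions[i] == len(keys):
                return result  # one list is exhausted, nothing more can match
            if keys[positions[i]] != candidate:
                # Gap found: the larger key becomes the new candidate.
                candidate = keys[positions[i]]
                agreed = False
                break
        if agreed:
            result.append(candidate)
            first = sorted_lists[0]
            positions[0] += 1
            if positions[0] == len(first):
                return result
            candidate = first[positions[0]]


table1 = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
table2 = [0, 2, 6, 7, 8, 9]
table3 = [2, 4, 5, 8, 10]
print(leapfrog_intersect(table1, table2, table3))  # [8]
```

All three lists are consulted together, so no two-table intermediate result is ever materialised.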
59. Reduce the Search Space
[Figure: seeks across the sorted keys of four tables (Airport, Day, Flight, Destination), in order: 1) seek c, 2) seek c, 3) seek f, 4) seek g, 5) seek m, 6) seek m, 7) seek m]
What if you wanted to earn miles on your frequent flyer program and filter by Airline? No
problem here, the more joins the merrier.
64. What’s wrong with NULL?
SELECT *
FROM parts
WHERE (price <= 99) OR (price > 99)
SELECT *
FROM parts
WHERE (price <= 99) OR (price > 99) OR (price IS NULL)
SELECT AVG(height)
FROM parts
SELECT orders.id, parts.id
FROM orders LEFT OUTER JOIN
parts ON parts.id = orders.part_id
SELECT orders.id, parts.id
FROM parts LEFT OUTER JOIN
orders ON parts.id = orders.part_id
●(a OR NOT(a)) != True
●Aggregation requires special cases
●Outer Joins are not commutative
a x b != b x a
Query Optimizers hate Nulls. Three-valued
logic causes major headaches.
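The three complaints can be reproduced with Python's built-in `sqlite3` module. This is a small sketch under assumed data: the rows in `parts` and `orders` are made up for illustration, and standard `IS NULL` is used for the null test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (id INTEGER, price REAL, height REAL)")
conn.execute("CREATE TABLE orders (id INTEGER, part_id INTEGER)")
# Hypothetical rows: part 3 has no price or height, order 101 has no matching part.
conn.executemany("INSERT INTO parts VALUES (?, ?, ?)",
                 [(1, 50.0, 10.0), (2, 150.0, 20.0), (3, None, None)])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(100, 1), (101, 9)])


def scalar(sql):
    """Return the single value produced by a one-row, one-column query."""
    return conn.execute(sql).fetchone()[0]


# "price <= 99 OR price > 99" looks like a tautology but skips the NULL row.
rows_without_null_check = scalar(
    "SELECT COUNT(*) FROM parts WHERE (price <= 99) OR (price > 99)")
rows_with_null_check = scalar(
    "SELECT COUNT(*) FROM parts "
    "WHERE (price <= 99) OR (price > 99) OR (price IS NULL)")

# AVG silently ignores NULLs: it averages two rows, not three.
average_height = scalar("SELECT AVG(height) FROM parts")

# Swapping the sides of a LEFT OUTER JOIN changes the result.
orders_left_parts = set(conn.execute(
    "SELECT orders.id, parts.id FROM orders "
    "LEFT OUTER JOIN parts ON parts.id = orders.part_id"))
parts_left_orders = set(conn.execute(
    "SELECT orders.id, parts.id FROM parts "
    "LEFT OUTER JOIN orders ON parts.id = orders.part_id"))

print(rows_without_null_check, rows_with_null_check, average_height)  # 2 3 15.0
print(orders_left_parts == parts_left_orders)  # False
```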
67. Sets vs Bags
Sets: {1,2,3}, {8,3,4}
Bags: {1,2,2,3}, {3, 3, 3, 3}
Sets have Unique Values
Bags allow Duplicate Values
●Queries that use only ANDs (no ORs)
are called “conjunctive queries”
●Conjunctive Queries under Set
Semantics are Much Easier to Optimize
Query Optimizers hate Bags. Duplicates cause
major headaches.
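One concrete reason duplicates hurt: rewrite rules that hold for sets stop holding for bags. A minimal sketch, using `collections.Counter` as a stand-in for a bag (an illustrative choice, not how any particular engine stores data):

```python
from collections import Counter

set_a, set_b = {1, 2, 3}, {8, 3, 4}
bag_a, bag_b = Counter([1, 2, 2, 3]), Counter([3, 3, 3, 3])

# Set union collapses the shared value 3 into one element.
set_union = set_a | set_b
# Additive bag union keeps every copy: 3 now occurs five times.
bag_union = bag_a + bag_b

# Idempotence (A union A = A) holds for sets, so an optimizer may drop
# a redundant branch. It fails for bags, so the same rewrite is unsafe.
set_is_idempotent = (set_a | set_a) == set_a
bag_is_idempotent = (bag_a + bag_a) == bag_a

print(sorted(set_union))  # [1, 2, 3, 4, 8]
print(bag_union[3])       # 5
print(set_is_idempotent, bag_is_idempotent)  # True False
```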
73. Math
You learned this in middle school
• 1 + (2 + 3) = (1 + 2) + 3
• 3 + 4 = 4 + 3
• 3 + 0 = 3
• 1 + (-1) = 0
• 2 x (3 x 4) = (2 x 3) x 4
• 2 x 5 = 5 x 2
• 2 x 1 = 2
• 2 x 0.5 = 1
• 2 x (3 + 4) = (2 x 3) + (2 x 4)
• (3 + 4) x 2 = (3 x 2) + (4 x 2)
74. Math
You learned this in high school
• a + (b + c) = (a + b) + c
• a + b = b + a
• a + 0 = a
• a + (-a) = 0
• a x (b x c) = (a x b) x c
• a x b = b x a
• a x 1 = a
• a x a^-1 = 1, a != 0
• a x (b + c) = (a x b) + (a x c)
• (a + b) x c = (a x c) + (b x c)
75. Math
You forgot this in high school
• Addition:
• Associativity:
• a ⊕ (b ⊕ c) = (a ⊕ b) ⊕ c
• Commutativity:
• a ⊕ b = b ⊕ a
• Identity: a ⊕ ō = a
• Inverse: a ⊕ (-a) = ō
• Multiplication
• Associativity:
• a ⊗ (b ⊗ c) = (a ⊗ b) ⊗ c
• Commutativity:
• a ⊗ b = b ⊗ a
• Identity: a ⊗ ī = a
• Inverse: a ⊗ a^-1 = ī
• Distribution of Multiplication over Addition:
• a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c)
• (a ⊕ b) ⊗ c = (a ⊗ c) ⊕ (b ⊗ c)
76. Example 1
Query: find the count of the combined rows a, b, c in tables R, S and T
def result = count[a,b,c: R(a) and S(b) and T(c)]
Mathematical Representation:
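The slide's formula is not present in this transcript. One plausible way to write the query, assuming R, S and T are read as 0/1 indicator functions (an assumption, not necessarily the slide's notation), is a triple sum that the distributive law factors into a product of three single sums:

```latex
\[
\text{result}
  = \sum_{a}\sum_{b}\sum_{c} R(a)\cdot S(b)\cdot T(c)
  = \Big(\sum_{a} R(a)\Big)\Big(\sum_{b} S(b)\Big)\Big(\sum_{c} T(c)\Big)
\]
```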
80. Example 1
Query: count the number of combined rows a, b, c in tables R, S and T
def result = count[a,b,c: R(a) and S(b) and T(c)]
Optimized Query:
def result = count[R] * count[S] * count[T]
Enumerating n^3 combinations is much slower than counting 3n rows
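The same rewrite in Python, as a sketch with made-up table contents: the naive form enumerates every combination, the optimized form only measures each table once.

```python
from itertools import product

# Hypothetical table contents; only the sizes matter here.
R = ["r1", "r2", "r3"]
S = ["s1", "s2"]
T = ["t1", "t2", "t3", "t4"]

# Naive: enumerate every (a, b, c) combination and count them.
naive_count = sum(1 for _ in product(R, S, T))

# Optimized: count each table once and multiply.
optimized_count = len(R) * len(S) * len(T)

print(naive_count, optimized_count)  # 24 24
```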
81. Example 2
Query: find the minimum sum of rows a, b, c in tables R, S and T:
def result = min[a,b,c,v: v = R[a] + S[b] + T[c]]
Mathematical Representation:
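The slide's formula is not present in this transcript. A plausible reconstruction, assuming min plays the role of ⊕ and + the role of ⊗ (an assumption): because + distributes over min, the minimum over all combinations splits into three independent minima.

```latex
\[
\text{result}
  = \min_{a,b,c}\big(R[a] + S[b] + T[c]\big)
  = \min_{a} R[a] + \min_{b} S[b] + \min_{c} T[c]
\]
```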
83. Example 2
Query: find the minimum sum of rows a, b, c in tables R, S and T:
def result = min[a,b,c,v: v = R[a] + S[b] + T[c]]
Optimized Query:
def result = min[R] + min[S] + min[T]
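A quick check of this rewrite in Python, with made-up values for the three tables:

```python
from itertools import product

# Hypothetical tables mapping a row key to a numeric value.
R = {"a1": 7, "a2": 3}
S = {"b1": 5, "b2": 9, "b3": 4}
T = {"c1": 2, "c2": 8}

# Naive: form every combination, add the values, take the minimum.
naive_min = min(R[a] + S[b] + T[c] for a, b, c in product(R, S, T))

# Optimized: one pass per table, then a single addition.
optimized_min = min(R.values()) + min(S.values()) + min(T.values())

print(naive_min, optimized_min)  # 9 9
```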
84. Shortest Path from A to F
[Figure: weighted graph with edges A-B = 1, B-C = 2, C-D = 3, B-D = 6, D-F = 5, A-E = 9, E-F = 4]
AEF = 9 + 4 = 13
ABDF = 1 + 6 + 5 = 12
ABCDF = 1 + 2 + 3 + 5 = 11
min{13,12,11} = 11
85. Maximum Reliability from A to F
[Figure: same graph with edge reliabilities A-B = 0.9, B-C = 0.9, C-D = 1.0, B-D = 0.2, D-F = 0.7, A-E = 0.4, E-F = 0.8]
AEF = 0.4 x 0.8 = 0.32
ABDF = 0.9 x 0.2 x 0.7 = 0.126
ABCDF = 0.9 x 0.9 x 1.0 x 0.7 = 0.567
max{0.32,0.126,0.567} = 0.567
86. Words from A to F
[Figure: same graph with edge labels A-B = T, B-C = I, C-D = M, B-D = H, D-F = E, A-E = A, E-F = T]
AEF = A · T = AT
ABDF = T · H · E = THE
ABCDF = T · I · M · E = TIME
union{at, the, time} = at the time
87. Math
You skipped this in college
• min { (9 + 4), (1 + 6 + 5), ( 1 + 2 + 3 + 5 ) }
• max { (0.4 x 0.8), (0.9 x 0.2 x 0.7), (0.9 x 0.9 x 1.0 x 0.7) }
• union { (A · T), (T · H · E), (T · I · M · E) }
88. Math
You skipped this in college
• ⊕ { (9 ⊗ 4), (1 ⊗ 6 ⊗ 5), ( 1 ⊗ 2 ⊗ 3 ⊗ 5 ) }
• ⊕ { (0.4 ⊗ 0.8), (0.9 ⊗ 0.2 ⊗ 0.7), (0.9 ⊗ 0.9 ⊗ 1.0 ⊗ 0.7) }
• ⊕ { (A ⊗ T), (T ⊗ H ⊗ E), (T ⊗ I ⊗ M ⊗ E) }
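The three path problems differ only in which operations are plugged in for ⊕ and ⊗. A minimal sketch of that idea follows; the `Semiring` class and `aggregate_paths` function are hypothetical names, and the paths are listed by hand rather than discovered from a graph.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]  # ⊕ combines alternative paths
    mul: Callable[[Any, Any], Any]  # ⊗ combines edges along one path


def aggregate_paths(semiring, paths):
    """⊗ the edge values along each path, then ⊕ the per-path results."""
    path_values = [reduce(semiring.mul, edges) for edges in paths]
    return reduce(semiring.add, path_values)


shortest_path = Semiring(add=min, mul=lambda x, y: x + y)
max_reliability = Semiring(add=max, mul=lambda x, y: x * y)
# Words: values are sets of strings, ⊗ concatenates, ⊕ is set union.
words = Semiring(add=lambda x, y: x | y,
                 mul=lambda x, y: {p + q for p in x for q in y})

# Paths AEF, ABDF and ABCDF, with edge values taken from the slides.
distance = aggregate_paths(shortest_path, [[9, 4], [1, 6, 5], [1, 2, 3, 5]])
reliability = aggregate_paths(
    max_reliability, [[0.4, 0.8], [0.9, 0.2, 0.7], [0.9, 0.9, 1.0, 0.7]])
vocabulary = aggregate_paths(
    words, [[{"A"}, {"T"}], [{"T"}, {"H"}, {"E"}], [{"T"}, {"I"}, {"M"}, {"E"}]])

print(distance)            # 11
print(sorted(vocabulary))  # ['AT', 'THE', 'TIME']
```

The aggregation code never changes; only the pair of operations does, which is what lets an optimizer apply one set of algebraic rewrites to all three problems.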
89. Example 3
Query: count the number of 3-hop paths per node in a graph
def path3(a, b, c, d) = edge(a,b) and edge(b,c) and edge(c,d)
def result[a] = count[path3[a]]
Mathematical Representation:
[Figure: A, B, C, D]
91. Example 3
Query: count the number of 3-hop paths per node in a graph
def path3(a, b, c, d) = edge(a,b) and edge(b,c) and edge(c,d)
def result[a] = count[path3[a]]
Optimized Query:
def path1[c] = count[edge[c]]
def path2[b] = sum[path1[c] for c in edge[b]]
def result[a] = sum[path2[b] for b in edge[a]]
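The optimized query pushes the counting inward, one hop at a time. A Python sketch of both versions on a small made-up directed graph (the edge list is an assumption for illustration):

```python
# Hypothetical directed edges.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "A")]

successors = {}
for source, target in edges:
    successors.setdefault(source, []).append(target)
nodes = sorted({node for edge in edges for node in edge})


def naive_count(a):
    """Enumerate every 3-hop path a -> b -> c -> d and count them."""
    return sum(1
               for b in successors.get(a, ())
               for c in successors.get(b, ())
               for d in successors.get(c, ()))


# Factored form: 1-hop counts feed 2-hop counts, which feed 3-hop counts.
path1 = {c: len(successors.get(c, ())) for c in nodes}
path2 = {b: sum(path1[c] for c in successors.get(b, ())) for b in nodes}
result = {a: sum(path2[b] for b in successors.get(a, ())) for a in nodes}

print(result)  # {'A': 2, 'B': 2, 'C': 1, 'D': 2}
```

Each of the three passes touches every edge once, instead of walking every 3-hop path individually.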
92. Semantic Query Optimizer
It knows math!
• Compute Discrete Fourier Transform in Fast Fourier Transform-time
• Junction Tree Algorithm for inference in Probabilistic Graphical Models
• Message passing, belief propagation
• Viterbi Algorithm, forward/backward for Hidden Markov Models most probable
paths
• Counting sub-graph patterns (motifs)
• Yannakakis Algorithm for acyclic conjunctive queries in Polynomial Time
• Fractional hypertree-width time algorithm for Constraint Satisfaction Problems
• Best known results for Conjunctive Queries and Quantified Conjunctive Queries
93. Semantic Query Optimizer
It knows math!
• This optimizer produces much better code than the average developer
because it knows a ton more math than the average developer.
• Maryam Mirzakhani
• Terence Tao
• Ramanujan
• Katherine Goble
• Good Will Hunting
104. Betweenness Centrality
Graph Algorithms
One of many graph centrality measures that are
useful for assessing the importance of a node.
High Level Definition: Number of times a node
appears on shortest paths within a network
Why it’s Useful: Identify which nodes control
information flow between different areas of the
graph; also called “Bridge Nodes”
Business Use-Cases:
Communication Analysis: Identify important
people who communicate across different
groups
Retail Purchase Analysis: Which products
introduce customers to new categories
105. Betweenness Centrality
Computation
Brandes Algorithm is applied as follows:
1. For each pair of nodes, compute all shortest paths and record the nodes (excluding endpoints) on those paths
2. For each pair of nodes with n shortest paths, give every node on each path a contribution of 1/n (a value of one when there is only one shortest path)
3. Sum the values from step 2 for each node; this is the Betweenness Centrality
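The three steps describe the quantity being computed. Brandes' algorithm is usually implemented as one breadth-first search per source followed by a backward accumulation pass, which yields the same sums without listing every path. A minimal sketch for unweighted, undirected graphs, assuming the graph is passed as a dict `adjacency` (a hypothetical input format) in which every node appears as a key:

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Brandes-style betweenness for an unweighted, undirected graph."""
    centrality = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Forward pass: BFS that counts shortest paths (sigma) to each node.
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        preds = {v: [] for v in adjacency}
        sigma[source], dist[source] = 1, 0
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward pass: push fractional contributions toward the source.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Undirected graph: every pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in centrality.items()}


path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
square = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
print(betweenness_centrality(path_graph))  # {'a': 0.0, 'b': 1.0, 'c': 0.0}
```

In the square, opposite corners are joined by two shortest paths, so each intermediate node receives the fractional contribution 1/2 from step 2.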
106. Betweenness Centrality Implementation
// Shortest path between s and t when they are the same is 0.
def shortest_path[s, t] = Min[
v, w:
(shortest_path(s, t, w) and v = 1) or
(w = shortest_path[s,v] +1 and E(v, t))
]
// When s and t are the same, there is only one shortest path between
// them, namely the one with length 0.
def nb_shortest(s, t, n) = V(s) and V(t) and s = t and n = 1
// When s and t are *not* the same, it is the sum of the number of shortest
// paths between s and v for all the v's adjacent to t and on the shortest
// path between s and t.
def nb_shortest(s, t, n) =
s != t and
n = sum[v, m:
shortest_path[s, v] + 1 = shortest_path[s, t] and E(v, t) and
nb_shortest(s, v, m)
]
// sum over all t's such that there is an edge between v and t,
// and v is on the shortest path between s and t
def C[s, v] = sum[t, r:
E(v, t) and shortest_path[s, t] = shortest_path[s, v] + 1 and
(
a = C[s, t] or
not C(s, t, _) and a = 0.0
) and
r = (nb_shortest[s, v] / nb_shortest[s, t]) * (1 + a)
] from a
// Note that below we divide by 2 because we are double counting every edge.
def betweenness_centrality_brandes[v] =
sum[s, p : s != v and C[s, v] = p]/2
107. Betweenness Centrality ReComputation
After an incremental update to the data, recomputing
Betweenness Centrality takes only a few seconds,
whereas other systems must recompute over
the entire graph.
110. Incremental Maintenance
1. Dependency tracking to figure out which views are affected by a change.
2. Demand-driven execution to only compute what users are actively interested in.
3. Differential computation to incrementally maintain even general recursion.
4. Semantic optimization to recover better maintenance algorithms where possible.
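As a toy illustration of the general idea (not of differential computation itself, and not of any particular engine), the sketch below maintains a derived view, the number of 2-hop paths a -> b -> c, by recomputing only the contributions of the nodes an inserted edge touches. The class name `TwoHopPathCounter` and its methods are hypothetical.

```python
from collections import defaultdict


class TwoHopPathCounter:
    """Keeps the count of paths a -> b -> c up to date as edges arrive."""

    def __init__(self):
        self.edges = set()
        self.in_degree = defaultdict(int)
        self.out_degree = defaultdict(int)
        self.total = 0  # the maintained view

    def _contribution(self, node):
        # Paths passing through `node` as the middle vertex.
        return self.in_degree[node] * self.out_degree[node]

    def add_edge(self, u, v):
        if (u, v) in self.edges:
            return
        affected = {u, v}  # only these nodes' contributions can change
        self.total -= sum(self._contribution(n) for n in affected)
        self.edges.add((u, v))
        self.out_degree[u] += 1
        self.in_degree[v] += 1
        self.total += sum(self._contribution(n) for n in affected)

    def recompute(self):
        """Full recomputation over all edge pairs, for comparison only."""
        return sum(1 for (_, b) in self.edges
                   for (b2, _) in self.edges if b == b2)


counter = TwoHopPathCounter()
history = []
for edge in [(1, 2), (2, 3), (2, 4), (4, 4)]:
    counter.add_edge(*edge)
    history.append((counter.total, counter.recompute()))
print(history)  # [(0, 0), (1, 1), (2, 2), (4, 4)]
```

Each insertion costs a constant amount of work here, while `recompute` revisits every pair of edges.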