Outrageous Ideas for Graph Databases

Data Day Texas - June 13, 2022

@maxdemarzi

maxdemarzi.com

GitHub.com/maxdemarzi

Max De Marzi

Ideas are Wrong
• Too Many Back-ends (aka Tinkerpop is wrong)

• No lessons applied from Relational Databases

• API is incomplete (bulk)

• Query Languages are Incompetent

Implementations are Wrong
• Nodes as Objects sucks

• No internal algebras

• Incompetent Query Optimizers

• Incompetent Query Executors

• Incompetent Engineering

• A short clip of the talk

https://homepages.cwi.nl/~boncz/edbt2022.pdf

Peter Suggests:
1. Row Storage for Properties of Nodes/Relationships

2. Less Indexing

3. Less Joins

4. Be more Relational then add Graph Functionality

5. Don’t rely on the query optimizer

6. Don’t allow generic recursive queries

7. Limit the query language

Completely Sensible Ideas for Graph Databases

How many Paths are there from the top left node to the bottom right node?
2 Paths
6 Paths

14x14 = 11 minutes

15x15 = 10 Hours

20x20 = Nope
How many Paths are there?

20x20 = 10 Minutes
137 Billion

-[*]-
Death Star Queries
Blows up Alderaaning Servers

20 x 20 in

0.41 Seconds
137 Billion

Graph Normal Form

Narrow Tables
Key-Value or Key-Key

Composite Index Explosion
Dual Indexed Narrow Tables = Dynamic Composite Indexes

Problem with Joins
Table 1 IDs: 0, 1, 3, 4, 5, 6, 7, 8, 9, 11
Table 2 IDs: 0, 2, 6, 7, 8, 9
Table 3 IDs: 2, 4, 5, 8, 10
Intermediate Results (Table 1 and Table 2): 0, 6, 7, 8, 9
Results (Table 1, Table 2, Table 3): 8, 8, 8

Worst Case Optimal Joins
● Worst-Case Optimal Join Algorithms: Techniques, Results, and
Open Problems. Ngo. (Gems of PODS 2018)
● Worst-Case Optimal Join Algorithms: Techniques, Results, and
Open Problems. Ngo, Porat, Re, Rudra. (Journal of the ACM
2018)
● What do Shannon-type inequalities, submodular width, and
disjunctive datalog have to do with one another? Abo Khamis,
Ngo, Suciu, (PODS 2017 - Invited to Journal of ACM)
● Computing Join Queries with Functional Dependencies. Abo
Khamis, Ngo, Suciu. (PODS 2017)
● Joins via Geometric Resolutions: Worst-case and Beyond. Abo
Khamis, Ngo, Re, Rudra. (PODS 2015, Invited to TODS 2015)
● Beyond Worst-Case Analysis for Joins with Minesweeper. Abo
Khamis, Ngo, Re, Rudra. (PODS 2014)
● Leapfrog Triejoin: A Simple Worst-Case Optimal Join Algorithm.
Veldhuizen (ICDT 2014 - Best Newcomer)
● Skew Strikes Back: New Developments in the Theory of Join
Algorithms. Ngo, Re, Rudra. (Invited to SIGMOD Record 2013)
● Worst Case Optimal Join Algorithms. Ngo, Porat, Re,
Rudra. (PODS 2012 – Best Paper)

LeapFrog Join
Table 1 IDs: 0, 1, 3, 4, 5, 6, 7, 8, 9, 11
Table 2 IDs: 0, 2, 6, 7, 8, 9
Table 3 IDs: 2, 4, 5, 8, 10

Table 1   Table 2   Table 3   Action
0         0         2         Table 1: Seek 2
8         8         8         Emit, Table 3: Next
8         8         10        Table 1: Seek 10
11        8         10        Table 2: Seek 11, END

Results (Table 1, Table 2, Table 3): 8, 8, 8
[Diagram: cursor jumps from Start to End, labeled Seek 2, Seek 3, Seek 6, Seek 8, Seek 8, Next, Seek 10, Seek 11]
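The seek-and-emit trace above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular engine: the function name `leapfrog_join` is hypothetical, each table is assumed to be a sorted list of unique IDs, and `bisect_left` stands in for an index seek.

```python
from bisect import bisect_left

def leapfrog_join(*tables):
    """Intersect sorted, duplicate-free ID lists by leapfrogging cursors."""
    if not tables or any(not t for t in tables):
        return []
    pos = [0] * len(tables)              # one cursor per table
    target = max(t[0] for t in tables)   # largest key any cursor sits on
    results = []
    i = 0        # table whose cursor moves next
    agree = 0    # consecutive tables found sitting on target
    while True:
        t = tables[i]
        # Seek: jump to the first key >= target, skipping the gap in between.
        pos[i] = bisect_left(t, target, pos[i])
        if pos[i] == len(t):
            return results               # one table is exhausted: END
        if t[pos[i]] == target:
            agree += 1
            if agree >= len(tables):
                results.append(target)   # Emit: every table holds target
                pos[i] += 1              # Next: step past the match
                if pos[i] == len(t):
                    return results
                target = t[pos[i]]
                agree = 1
        else:
            target = t[pos[i]]           # overshoot becomes the new target
            agree = 1
        i = (i + 1) % len(tables)

table1 = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
table2 = [0, 2, 6, 7, 8, 9]
table3 = [2, 4, 5, 8, 10]
print(leapfrog_join(table1, table2, table3))
```

No intermediate result (such as the five rows shared by Table 1 and Table 2) is ever materialized; the cursors only move forward.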

More than 3 Tables
[Figure: leapfrog seeks across four sorted key columns (Brand, Category, Retailer, Rating); seek order: 1) seek c, 2) seek c, 3) seek f, 4) seek g, 5) seek m, 6) seek m, 7) seek m]
Worst-Case Optimal Joins take advantage of sorted keys and gaps in the data to
eliminate intermediate results, speed up queries and get rid of the Join problem.

How do you model Flight Data in Legacy GraphDBs?

Don’t we care about Flights only on particular Days?

Group Destinations together!

OMG WAT!

Reduce the Search Space
[Figure: leapfrog seeks across four sorted key columns (Airport, Day, Flight, Destination); seek order: 1) seek c, 2) seek c, 3) seek f, 4) seek g, 5) seek m, 6) seek m, 7) seek m]
What if you wanted to earn miles on your frequent flyer program and filter by Airline? No
problem here, the more joins the merrier.

Relational Databases: Vision vs. Reality

What’s wrong with NULL?
● (a OR NOT(a)) != True
SELECT *
FROM parts
WHERE (price <= 99) OR (price > 99)

SELECT *
FROM parts
WHERE (price <= 99) OR (price > 99) OR price IS NULL
● Aggregation requires special cases
SELECT AVG(height)
FROM parts
● Outer Joins are not commutative: a x b != b x a
SELECT orders.id, parts.id
FROM orders LEFT OUTER JOIN
parts ON parts.id = orders.part_id

SELECT orders.id, parts.id
FROM parts LEFT OUTER JOIN
orders ON parts.id = orders.part_id
Query Optimizers hate Nulls. The 3-valued logic causes major headaches.
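The first two NULL problems can be reproduced with Python's built-in `sqlite3` module. This is a small sketch under assumptions: the `parts` columns and the three sample rows are made up for illustration, and the NULL test is written in standard SQL as `price IS NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (id INTEGER PRIMARY KEY, price REAL, height REAL)")
# Hypothetical rows: one part has no price, another has no height.
conn.executemany(
    "INSERT INTO parts (id, price, height) VALUES (?, ?, ?)",
    [(1, 50.0, 10.0), (2, 150.0, None), (3, None, 30.0)],
)

def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]

total = scalar("SELECT COUNT(*) FROM parts")
# A row with a NULL price satisfies neither comparison, so it drops out.
either = scalar("SELECT COUNT(*) FROM parts WHERE (price <= 99) OR (price > 99)")
with_null = scalar(
    "SELECT COUNT(*) FROM parts "
    "WHERE (price <= 99) OR (price > 99) OR price IS NULL"
)
# AVG skips NULL heights instead of treating them as zero.
avg_height = scalar("SELECT AVG(height) FROM parts")

print(total, either, with_null, avg_height)
```

The "a OR NOT(a)" filter returns fewer rows than the table holds, which is exactly the kind of rewrite an optimizer would like to treat as a no-op but cannot.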

Sets vs Bags
Sets: {1,2,3}, {8,3,4}
Bags: {1,2,2,3}, {3, 3, 3, 3}
Sets have Unique Values
Bags allow Duplicate Values
●Queries that use only ANDs (no ORs)
are called “conjunctive queries”
●Conjunctive Queries under Set
Semantics are Much Easier to Optimize
Query Optimizers hate Bags. Duplicates cause
major headaches.

Traditional Query Optimizers
• Predicate pushdown (push selection through join)

• Projection pushdown (push projection through join)

• Aggregation pushdown

• Their “pull up” counterparts

• Split conjunctive predicates (split AND statements)

• Replace cartesian products (use inner joins with predicates)

• (Un)Nesting Sub-Queries

• Etc.

Semantic Query Optimizer
[Diagram labels: Query, Math, Semantic Optimizer, Optimized Query, Equivalent Query, Data, Answer]

Math
You learned this in middle school
• 1 + (2 + 3) = (1 + 2) + 3

• 3 + 4 = 4 + 3

• 3 + 0 = 3

• 1 + (-1) = 0
• 2 x (3 x 4) = (2 x 3) x 4

• 2 x 5 = 5 x 2

• 2 x 1 = 2

• 2 x 0.5 = 1
• 2 x (3 + 4) = (2 x 3) + (2 x 4)

• (3 + 4) x 2 = (3 x 2) + (4 x 2)

Math
You learned this in high school
• a + (b + c) = (a + b) + c

• a + b = b + a

• a + 0 = a

• a + (-a) = 0
• a x (b x c) = (a x b) x c

• a x b = b x a

• a x 1 = a

• a x a^-1 = 1, a != 0
• a x (b + c) = (a x b) + (a x c)

• (a + b) x c = (a x c) + (b x c)

Math
You forgot this in high school
• Addition:
  • Associativity: a ⊕ (b ⊕ c) = (a ⊕ b) ⊕ c
  • Commutativity: a ⊕ b = b ⊕ a
  • Identity: a ⊕ ō = a
  • Inverse: a ⊕ (-a) = ō

• Multiplication:
  • Associativity: a ⊗ (b ⊗ c) = (a ⊗ b) ⊗ c
  • Commutativity: a ⊗ b = b ⊗ a
  • Identity: a ⊗ ī = a
  • Inverse: a ⊗ a^-1 = ī

• Distribution of Multiplication over Addition:
  • a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c)
  • (a ⊕ b) ⊗ c = (a ⊗ c) ⊕ (b ⊗ c)

Example 1
Query: count the number of combined rows a, b, c in tables R, S and T 
 

def result = count[a,b,c: R(a) and S(b) and T(c)]
Optimized Query:
def result = count[R] * count[S] * count[T]
Enumerating all n^3 combinations is much slower than three counts over n rows each (3n).
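The rewrite can be checked on toy data. In this sketch the relations R, S and T are hypothetical Python lists; the naive version walks every combined row, while the optimized version only needs the three sizes.

```python
from itertools import product

# Hypothetical unary relations.
R = ["r1", "r2", "r3"]
S = ["s1", "s2"]
T = ["t1", "t2", "t3", "t4"]

# Naive: enumerate every (a, b, c) combination and count them.
naive_count = sum(1 for _ in product(R, S, T))

# Optimized: the count of a cross product is the product of the counts.
optimized_count = len(R) * len(S) * len(T)

print(naive_count, optimized_count)
```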

Example 2
Query: find the minimum sum of rows a, b, c in tables R, S and T: 
 

def result = min[a,b,c,v: v = R[a] + S[b] + T[c]]
Optimized Query:
def result = min[R] + min[S] + min[T]
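The same check works for the minimum, because addition distributes over min. The relations below are hypothetical key-to-value dictionaries chosen only for illustration.

```python
from itertools import product

# Hypothetical relations mapping a key to a numeric value.
R = {"a1": 7, "a2": 3, "a3": 9}
S = {"b1": 4, "b2": 8}
T = {"c1": 6, "c2": 2, "c3": 5}

# Naive: try every combination of rows and keep the smallest sum.
naive_min = min(R[a] + S[b] + T[c] for a, b, c in product(R, S, T))

# Optimized: min distributes over +, so each relation is scanned once.
optimized_min = min(R.values()) + min(S.values()) + min(T.values())

print(naive_min, optimized_min)
```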

Shortest Path from A to F
[Graph: A-B = 1, B-C = 2, C-D = 3, B-D = 6, D-F = 5, A-E = 9, E-F = 4]
AEF = 9 + 4 = 13

ABDF = 1 + 6 + 5 = 12

ABCDF = 1 + 2 + 3 + 5 = 11
min{13,12,11} = 11

Maximum Reliability from A to F
[Graph: A-B = 0.9, B-C = 0.9, C-D = 1.0, B-D = 0.2, D-F = 0.7, A-E = 0.4, E-F = 0.8]
AEF = 0.4 x 0.8 = 0.32

ABDF = 0.9 x 0.2 x 0.7 = 0.126

ABCDF = 0.9 x 0.9 x 1.0 x 0.7 = 0.567
max{0.32,0.126,0.567} = 0.567

Words from A to F
[Graph: A-B = T, B-C = I, C-D = M, B-D = H, D-F = E, A-E = A, E-F = T]
AEF = A · T = AT

ABDF = T · H · E = THE

ABCDF = T · I · M · E = TIME
union{at, the, time} = at the time

Math
You skipped this in college
• min { (9 + 4), (1 + 6 + 5), ( 1 + 2 + 3 + 5 ) }

• max { (0.4 x 0.8), (0.9 x 0.2 x 0.7), (0.9 x 0.9 x 1.0 x 0.7) }

• union { (A · T), (T · H · E), (T · I · M · E) }

Math
You skipped this in college
• ⊕ { (9 ⊗ 4), (1 ⊗ 6 ⊗ 5), ( 1 ⊗ 2 ⊗ 3 ⊗ 5 ) }

• ⊕ { (0.4 ⊗ 0.8), (0.9 ⊗ 0.2 ⊗ 0.7), (0.9 ⊗ 0.9 ⊗ 1.0 ⊗ 0.7) }

• ⊕ { (A ⊗ T), (T ⊗ H ⊗ E), (T ⊗ I ⊗ M ⊗ E) }
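All three examples are the same computation with a different pair of operators plugged in. The sketch below makes that explicit; the `aggregate` helper and the edge-keyed dictionaries are illustrative names, and the edge weights are read off the path sums in the three examples.

```python
from functools import reduce

# The three A-to-F paths from the slides, as lists of edges.
PATHS = [
    [("A", "E"), ("E", "F")],
    [("A", "B"), ("B", "D"), ("D", "F")],
    [("A", "B"), ("B", "C"), ("C", "D"), ("D", "F")],
]

def aggregate(weights, otimes, oplus):
    """Combine edge weights along each path with otimes, then paths with oplus."""
    per_path = [reduce(otimes, (weights[e] for e in path)) for path in PATHS]
    return reduce(oplus, per_path)

distance = {("A", "E"): 9, ("E", "F"): 4, ("A", "B"): 1, ("B", "D"): 6,
            ("D", "F"): 5, ("B", "C"): 2, ("C", "D"): 3}
reliability = {("A", "E"): 0.4, ("E", "F"): 0.8, ("A", "B"): 0.9, ("B", "D"): 0.2,
               ("D", "F"): 0.7, ("B", "C"): 0.9, ("C", "D"): 1.0}
# Each letter is wrapped in a set so that oplus can be set union.
letters = {("A", "E"): {"A"}, ("E", "F"): {"T"}, ("A", "B"): {"T"}, ("B", "D"): {"H"},
           ("D", "F"): {"E"}, ("B", "C"): {"I"}, ("C", "D"): {"M"}}

def concat(xs, ys):
    """otimes for words: concatenate every string in xs with every string in ys."""
    return {x + y for x in xs for y in ys}

shortest = aggregate(distance, lambda x, y: x + y, min)           # (min, +)
most_reliable = aggregate(reliability, lambda x, y: x * y, max)   # (max, x)
words = aggregate(letters, concat, lambda xs, ys: xs | ys)        # (union, concat)

print(shortest, most_reliable, sorted(words))
```

Only the two operators change between the three calls, which is what lets an optimizer apply the same algebraic rewrites to all of them.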

Example 3
Query: count the number of 3-hop paths per node in a graph

def path3(a, b, c, d) = edge(a,b) and edge(b,c) and edge(c,d)

def result[a] = count[path3[a]]
Optimized Query:
def path1[c] = count[edge[c]]

def path2[b] = sum[path1[c] for c in edge[b]]

def result[a] = sum[path2[b] for b in edge[a]]
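A Python rendering of both versions shows why the rewrite pays off: the naive query enumerates every (b, c, d) triple, while the optimized one pushes the counts backwards one hop at a time. The edge list is a made-up example and `naive_path3_count` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical directed edge set.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 1)]
out = defaultdict(list)
for a, b in edges:
    out[a].append(b)
nodes = sorted({n for e in edges for n in e})

def naive_path3_count(a):
    """Enumerate every (b, c, d) reachable in three hops from a."""
    return sum(1 for b in out[a] for c in out[b] for d in out[c])

# Optimized: each node's count is the sum of its successors' counts.
path1 = {c: len(out[c]) for c in nodes}
path2 = {b: sum(path1[c] for c in out[b]) for b in nodes}
result = {a: sum(path2[b] for b in out[a]) for a in nodes}

print(result)
```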

It knows math!
• Compute Discrete Fourier Transform in Fast Fourier Transform-time

• Junction Tree Algorithm for inference in Probabilistic Graphical Models

• Message passing, belief propagation

• Viterbi Algorithm, forward/backward for Hidden Markov Models most probable
paths

• Counting sub-graph patterns (motifs)

• Yannakakis Algorithm for acyclic conjunctive queries in Polynomial Time

• Fractional hypertree-width time algorithm for Constraint Satisfaction Problems

• Best known results for Conjunctive Queries and Quantified Conjunctive Queries

It knows math!
• This optimizer produces much better code than the average developer
because it knows a ton more math than the average developer.
• Maryam Mirzakhani

• Terence Tao

• Ramanujan

• Katherine Goble

• Good Will Hunting

Recursion
def reachable = edge; reachable.edge

def number_of_paths_of_length(node_number, path_length, path_count) =
    node_number=1, path_length=0, path_count=1

def number_of_paths_of_length[node_number, path_length] =
    sum[other_node, paths_of_length : paths_of_length =
        number_of_paths_of_length[other_node, path_length - 1]
        and edge(other_node, node_number)]

def output = number_of_paths_of_length[number_of_nodes, 2 * lattice_size]
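The recursive definition translates directly into a small dynamic program. This is a sketch rather than the engine's evaluation strategy: it assumes nodes are numbered row by row starting at 1 and that edges point right and down, and `count_lattice_paths` is a hypothetical name.

```python
from math import comb

def count_lattice_paths(lattice_size):
    """Count paths from the top left node to the bottom right node."""
    side = lattice_size + 1                  # nodes per row and per column
    number_of_nodes = side * side
    # incoming[node] lists every other_node with edge(other_node, node).
    incoming = {n: [] for n in range(1, number_of_nodes + 1)}
    for n in range(1, number_of_nodes + 1):
        if n % side != 0:                    # not in the last column: edge right
            incoming[n + 1].append(n)
        if n + side <= number_of_nodes:      # not in the last row: edge down
            incoming[n + side].append(n)
    # Base case: one path of length 0 that ends at node 1.
    paths = {(1, 0): 1}
    for length in range(1, 2 * lattice_size + 1):
        for node, sources in incoming.items():
            total = sum(paths.get((other, length - 1), 0) for other in sources)
            if total:
                paths[(node, length)] = total
    return paths.get((number_of_nodes, 2 * lattice_size), 0)

print(count_lattice_paths(20))
```

Every (node, length) pair is computed once, so the work grows with the number of nodes times the path length rather than with the number of paths.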

@function @transient
def :_intermediate#0(other_node#1, path_length#0, _t#0) =
reduce[(_x#0, _y#0, _z#0) : :rel_primitive_add(_x#0, _y#0, _z#0),
(x#8, paths_of_length#1) :
:number_of_paths_of_length(other_node#1, x#8, paths_of_length#1) and
:rel_primitive_add(1, x#8, path_length#0),
(_no_init#0) : false](_t#0)

def :_intermediate#1(node_number#0, path_length#0, path_count#0) =
(other_node#1, _t#0) :
:edge(other_node#1, node_number#0) and
:_intermediate#0(other_node#1, path_length#0, _t#0),
(_no_init#1) : false](path_count#0)

def :number_of_paths_of_length(node_number#0, path_length#0, path_count#0) =
:_base_case#0(node_number#0, path_length#0, path_count#0) or
:_intermediate#1(node_number#0, path_length#0, path_count#0)
Naive recursion, iteration 1
Evaluating `_intermediate#0`:
(1, 1) => (1,)
Evaluating `_intermediate#1`:
(2, 1) => (1,), (4, 1) => (1,)
Evaluating `number_of_paths_of_length`:
(1, 0, 1), (2, 1, 1), (4, 1, 1)

Naive recursion, iteration 2
Evaluating `_intermediate#0`:
(1, 1) => (1,), (2, 2) => (1,), (4, 2) => (1,)
Evaluating `_intermediate#1`:
(2, 1) => (1,), (3, 2) => (1,), (4, 1) => (1,), (5, 2) => (2,), (7, 2) => (1,)
Evaluating `number_of_paths_of_length`:
(1, 0, 1), (2, 1, 1), (3, 2, 1), (4, 1, 1), (5, 2, 2), (7, 2, 1)

Naive recursion, iteration 3
Evaluating `_intermediate#0`:
(1, 1) => (1,), (2, 2) => (1,), (3, 3) => (1,), (4, 2) => (1,), (5, 3) => (2,), (7, 3) => (1,)
Evaluating `_intermediate#1`:
(2, 1) => (1,), (3, 2) => (1,), (4, 1) => (1,), (5, 2) => (2,), (6, 3) => (3,), (7, 2) => (1,), (8, 3) => (3,)
Evaluating `number_of_paths_of_length`:
(1, 0, 1), (2, 1, 1), (3, 2, 1), (4, 1, 1), (5, 2, 2), (6, 3, 3), (7, 2, 1), (8, 3, 3)

Naive recursion, iteration 4
Evaluating `_intermediate#0`:
(1, 1) => (1,), (2, 2) => (1,), (3, 3) => (1,), (4, 2) => (1,), (5, 3) => (2,), (6, 4) => (3,), (7, 3) => (1,), (8, 4) => (3,)
Evaluating `_intermediate#1`:
(2, 1) => (1,), (3, 2) => (1,), (4, 1) => (1,), (5, 2) => (2,), (6, 3) => (3,), (7, 2) => (1,), (8, 3) => (3,), (9, 4) => (6,)
Evaluating `number_of_paths_of_length`:
(1, 0, 1), (2, 1, 1), (3, 2, 1), (4, 1, 1), (5, 2, 2), (6, 3, 3), (7, 2, 1), (8, 3, 3), (9, 4, 6)

Graph Analytics
module graph_analytics[G] 
with G use node, edge 
 
def neighbor(x, y) = edge(x, y) or edge(y, x) 
def outdegree[x] = count[edge[x]] 
def degree[x] = count[neighbor[x]] 
def cn[x, y] = count[intersect[neighbor[x], neighbor[y]]] // Count of Common Neighbors 
 
def reachable = edge; reachable.edge 
def reachable_undirected = neighbor; reachable_undirected.neighbor 
 
def scc[x] = min[v: reachable(x, v) and reachable(v, x)] // Strongly Connected Component 
def wcc[x] = min[reachable_undirected[x]] // Weakly Connected Component 
 
def cosine_sim[x, y] = cn[x, y] / sqrt[degree[x] * degree[y]] 
def jaccard_sim[x, y] = cn[x, y] / (count[neighbor[x]] + count[neighbor[y]] - cn[x, y])

…

end
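The neighborhood measures in this module are easy to cross-check in Python. The sketch below uses a made-up edge set and plain sets; the Jaccard denominator is the size of the union of the two neighborhoods, that is |N(x)| + |N(y)| minus the common neighbors.

```python
from math import sqrt

# Hypothetical directed edge set.
edges = {(1, 2), (1, 3), (2, 3), (3, 4)}

def neighbors(x):
    """Nodes adjacent to x, ignoring edge direction."""
    return {b for a, b in edges if a == x} | {a for a, b in edges if b == x}

def cn(x, y):
    """Count of common neighbors."""
    return len(neighbors(x) & neighbors(y))

def cosine_sim(x, y):
    return cn(x, y) / sqrt(len(neighbors(x)) * len(neighbors(y)))

def jaccard_sim(x, y):
    common = cn(x, y)
    # Intersection size over union size.
    return common / (len(neighbors(x)) + len(neighbors(y)) - common)

print(cn(1, 2), cosine_sim(1, 2), jaccard_sim(1, 2))
```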

Betweenness Centrality

Graph Algorithms
One of many graph centrality measures that are
useful for assessing the importance of a node.

High Level Definition: Number of times a node
appears on shortest paths within a network

Why it’s Useful: Identify which nodes control
information flow between different areas of the
graph; also called “Bridge Nodes”

Business Use-Cases:

Communication Analysis: Identify important
people who communicate across different
groups

Retail Purchase Analysis: Which products
introduce customers to new categories

Betweenness Centrality

Computation
Brandes Algorithm is applied as follows:

1. For each pair of nodes, compute all shortest paths and record the nodes (excluding the endpoints) on each path.

2. For each pair of nodes, give each node along a shortest path a value of one if there is only one shortest path, or a fractional contribution of 1/n for each of the n shortest paths it lies on.

3. Sum the values from step 2 for each node; this is its Betweenness Centrality.
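For comparison, here is a compact Python sketch of Brandes' algorithm. It assumes an undirected, unweighted graph given as a hypothetical adjacency dictionary, so breadth-first search finds the shortest paths and the final scores are halved because every pair is visited from both ends.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Brandes' algorithm for an undirected, unweighted graph."""
    centrality = {v: 0.0 for v in adjacency}
    for s in adjacency:
        order = []                              # nodes in BFS visiting order
        preds = {v: [] for v in adjacency}      # predecessors on shortest paths
        sigma = {v: 0 for v in adjacency}       # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:               # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:      # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate fractional contributions from the farthest nodes inward.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                centrality[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in centrality.items()}

path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
square = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
print(betweenness_centrality(path_graph))
print(betweenness_centrality(square))
```

In the square, opposite corners are joined by two shortest paths, so each intermediate node receives the fractional contribution of one half.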

Betweenness Centrality Implementation
// Shortest path between s and t when they are the same is 0.
def shortest_path[s, t] = Min[
    v, w:
    (shortest_path(s, t, w) and v = 1) or
    (w = shortest_path[s,v] +1 and E(v, t))
]

// When s and t are the same, there is only one shortest path between
// them, namely the one with length 0.
def nb_shortest(s, t, n) = V(s) and V(t) and s = t and n = 1

// When s and t are *not* the same, it is the sum of the number of shortest
// paths between s and v for all the v's adjacent to t and on the shortest
// path between s and t.
def nb_shortest(s, t, n) =
    s != t and
    n = sum[v, m:
        shortest_path[s, v] + 1 = shortest_path[s, t] and E(v, t) and
        nb_shortest(s, v, m)
    ]

// sum over all t's such that there is an edge between v and t,
// and v is on the shortest path between s and t
def C[s, v] = sum[t, r:
    E(v, t) and shortest_path[s, t] = shortest_path[s, v] + 1 and
    (
        a = C[s, t] or
        not C(s, t, _) and a = 0.0
    ) and
    r = (nb_shortest[s, v] / nb_shortest[s, t]) * (1 + a)
] from a

// Note that below we divide by 2 because we are double counting every edge.
def betweenness_centrality_brandes[v] =
    sum[s, p : s != v and C[s, v] = p]/2

Betweenness Centrality ReComputation
After an incremental update to the data, recomputing Betweenness Centrality takes only a few seconds, whereas other systems must recompute it over the entire graph.

Algorithm Change ReComputation
Incremental updates to the code are also recomputed incrementally, whereas other systems must rerun the entire algorithm.

Incremental Maintenance
1. Dependency tracking to figure out which views are affected by a change.

2. Demand-driven execution to only compute what users are actively interested in.

3. Differential computation to incrementally maintain even general recursion.

4. Semantic optimization to recover better maintenance algorithms where possible.
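As a toy illustration of the underlying idea only, the sketch below keeps an out-degree view up to date by applying each edge change as a delta instead of recounting every edge. The class `OutDegreeView` is hypothetical and far simpler than the four techniques listed; it shows nothing about dependency tracking, demand-driven execution or recursion.

```python
from collections import Counter

class OutDegreeView:
    """Maintains outdegree[x] = count of edges leaving x under inserts and deletes."""

    def __init__(self):
        self.edges = set()
        self.outdegree = Counter()

    def insert(self, a, b):
        if (a, b) not in self.edges:
            self.edges.add((a, b))
            self.outdegree[a] += 1       # only the affected key is touched

    def delete(self, a, b):
        if (a, b) in self.edges:
            self.edges.remove((a, b))
            self.outdegree[a] -= 1
            if self.outdegree[a] == 0:
                del self.outdegree[a]    # keep the view free of empty entries

    def recompute(self):
        """Full recomputation, used here only to check the incremental view."""
        return Counter(a for a, _ in self.edges)

view = OutDegreeView()
view.insert(1, 2)
view.insert(1, 3)
view.insert(2, 3)
view.delete(1, 2)
print(dict(view.outdegree))
```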
