Complex queries in a distributed multi-model database

Complex queries in a
distributed multi-model
database
Max Neunhöﬀer
Tech Talk Geekdom, 16 March 2015
www.arangodb.com

Documents and collections
{
"_key": "123456",
"_id": "chars/123456",
"name": "Duck",
"firstname": "Donald",
"dob": "1934-11-13",
"hobbies": ["Golf",
"Singing",
"Running"],
"home":
{"town": "Duck town",
"street": "Lake Road",
"number": 17},
"species": "duck"
}
1

{
"_key": "123456",
"_id": "chars/123456",
"name": "Duck",
"dob": "1934-11-13",
"hobbies": ["Golf",
"Singing",
"Running"],
"home":
"number": 17},
"species": "duck"
}
When I say “document”,
I mean “JSON”.
1

{
"_key": "123456",
"_id": "chars/123456",
"name": "Duck",
"dob": "1934-11-13",
"hobbies": ["Golf",
"Singing",
"Running"],
"home":
"number": 17},
"species": "duck"
}
I mean “JSON”.
A “collection” is a set of
documents in a DB.
1

{
"_key": "123456",
"_id": "chars/123456",
"name": "Duck",
"dob": "1934-11-13",
"hobbies": ["Golf",
"Singing",
"Running"],
"home":
"number": 17},
"species": "duck"
}
I mean “JSON”.
documents in a DB.
The DB can inspect the
values, allowing for
secondary indexes.
1

{
"_key": "123456",
"_id": "chars/123456",
"name": "Duck",
"dob": "1934-11-13",
"hobbies": ["Golf",
"Singing",
"Running"],
"home":
"number": 17},
"species": "duck"
}
I mean “JSON”.
documents in a DB.
secondary indexes.
Or one can just treat the
DB as a key/value store.
1

{
"_key": "123456",
"_id": "chars/123456",
"name": "Duck",
"dob": "1934-11-13",
"hobbies": ["Golf",
"Singing",
"Running"],
"home":
"number": 17},
"species": "duck"
}
I mean “JSON”.
documents in a DB.
secondary indexes.
Or one can just treat the
DB as a key/value store.
Sharding: the data of a
collection is distributed
between multiple servers.
1

Graphs
A
B
D
E
F
G
C
"likes"
"hates"
2

Graphs
A
B
D
E
F
G
C
"likes"
"hates"
A graph consists of vertices and
edges.
2

Graphs
A
B
D
E
F
G
C
"likes"
"hates"
edges.
Graphs model relations, can be
directed or undirected.
2

Graphs
A
B
D
E
F
G
C
"likes"
"hates"
edges.
Vertices and edges are
documents.
2

Graphs
A
B
D
E
F
G
C
"likes"
"hates"
edges.
documents.
Every edge has a _from and a _to
attribute.
2

Graphs
A
B
D
E
F
G
C
"likes"
"hates"
edges.
documents.
attribute.
The database oﬀers queries and
transactions dealing with graphs.
2

Graphs
A
B
D
E
F
G
C
"likes"
"hates"
edges.
documents.
attribute.
The database oﬀers queries and
transactions dealing with graphs.
For example, paths in the graph
are interesting.
2

Query 1
Fetch all documents in a collection
FOR p IN people
RETURN p
3

Query 1
FOR p IN people
RETURN p
[ { "name": "Schmidt", "firstname": "Helmut",
"hobbies": ["Smoking"]},
{ "name": "Neunhöffer", "firstname": "Max",
"hobbies": ["Piano", "Golf"]},
...
]
3

Query 1
FOR p IN people
RETURN p
[ { "name": "Schmidt", "firstname": "Helmut",
"hobbies": ["Smoking"]},
{ "name": "Neunhöffer", "firstname": "Max",
"hobbies": ["Piano", "Golf"]},
...
]
(Actually, a cursor is returned.)
3

Query 2
Use ﬁltering, sorting and limit
FOR p IN people
FILTER p.age >= @minage
SORT p.name, p.firstname
LIMIT @nrlimit
RETURN { name: CONCAT(p.name, ", ",
p.firstname),
age : p.age }
4

Query 2
Use ﬁltering, sorting and limit
FOR p IN people
FILTER p.age >= @minage
SORT p.name, p.firstname
LIMIT @nrlimit
RETURN { name: CONCAT(p.name, ", ",
p.firstname),
age : p.age }
[ { "name": "Neunhöffer, Max", "age": 44 },
{ "name": "Schmidt, Helmut", "age": 95 },
...
]
4

Query 3
Aggregation and functions
FOR p IN people
COLLECT a = p.age INTO L
FILTER a >= @minage
RETURN { "age": a, "number": LENGTH(L) }
5

Query 3
Aggregation and functions
FOR p IN people
COLLECT a = p.age INTO L
FILTER a >= @minage
RETURN { "age": a, "number": LENGTH(L) }
[ { "age": 18, "number": 10 },
{ "age": 19, "number": 17 },
{ "age": 20, "number": 12 },
...
]
5

Query 4
Joins
FOR p IN @@peoplecollection
FOR h IN houses
FILTER p._key == h.owner
SORT h.streetname, h.housename
RETURN { housename: h.housename,
streetname: h.streetname,
owner: p.name,
value: h.value }
6

Query 4
Joins
FOR p IN @@peoplecollection
FOR h IN houses
FILTER p._key == h.owner
SORT h.streetname, h.housename
RETURN { housename: h.housename,
streetname: h.streetname,
owner: p.name,
value: h.value }
[ { "housename": "Firlefanz",
"streetname": "Meyer street",
"owner": "Hans Schmidt", "value": 423000
},
...
]
6

Query 5
Modifying data
FOR e IN events
FILTER e.timestamp<"2014-09-01T09:53+0200"
INSERT e IN oldevents
FOR e IN events
FILTER e.timestamp<"2014-09-01T09:53+0200"
REMOVE e._key IN events
7

Query 6
Graph queries
FOR x IN GRAPH_SHORTEST_PATH(
"routeplanner", "germanCity/Cologne",
"frenchCity/Paris", {weight: "distance"} )
RETURN { begin : x.startVertex,
end : x.vertex,
distance : x.distance,
nrPaths : LENGTH(x.paths) }
8

Query 6
Graph queries
FOR x IN GRAPH_SHORTEST_PATH(
"routeplanner", "germanCity/Cologne",
"frenchCity/Paris", {weight: "distance"} )
RETURN { begin : x.startVertex,
end : x.vertex,
distance : x.distance,
nrPaths : LENGTH(x.paths) }
[ { "begin": "germanCity/Cologne",
"end" : {"_id": "frenchCity/Paris", ... },
"distance": 550,
"nrPaths": 10 },
...
]
8

Life of a query
1. Text and query parameters come from user
9

Life of a query
2. Parse text, produce abstract syntax tree (AST)
9

Life of a query
3. Substitute query parameters
9

Life of a query
4. First optimisation: constant expressions, etc.
9

Life of a query
5. Translate AST into an execution plan (EXP)
9

Life of a query
6. Optimise one EXP, produce many, potentially better EXPs
9

Life of a query
7. Reason about distribution in cluster
9

Life of a query
8. Optimise distributed EXPs
9

Life of a query
9. Estimate costs for all EXPs, and sort by ascending cost
9

Life of a query
10. Instanciate “cheapest” plan, i.e. set up execution engine
9

Life of a query
11. Distribute and link up engines on diﬀerent servers
9

Life of a query
11. Distribute and link up engines on diﬀerent servers
12. Execute plan, provide cursor API
9

Execution plans
FOR a IN collA
RETURN {x: a.x, z: b.z}
EnumerateCollection a
EnumerateCollection b
Calculation xx == b.y
Filter xx == b.y
Singleton
Calculation xx
Return {x: a.x, z: b.z}
Calc {x: a.x, z: b.z}
FILTER xx == b.y
FOR b IN collB
LET xx = a.x
Query → EXP
10

Execution plans
FOR a IN collA
Filter xx == b.y
Singleton
Calculation xx
FILTER xx == b.y
FOR b IN collB
LET xx = a.x
Query → EXP
Black arrows are
dependencies
10

Execution plans
FOR a IN collA
Filter xx == b.y
Singleton
Calculation xx
FILTER xx == b.y
FOR b IN collB
LET xx = a.x
Query → EXP
Black arrows are
dependencies
Think of a pipeline
10

Execution plans
FOR a IN collA
Filter xx == b.y
Singleton
Calculation xx
FILTER xx == b.y
FOR b IN collB
LET xx = a.x
Query → EXP
Black arrows are
dependencies
Think of a pipeline
Each node provides
a cursor API
10

Execution plans
FOR a IN collA
Filter xx == b.y
Singleton
Calculation xx
FILTER xx == b.y
FOR b IN collB
LET xx = a.x
Query → EXP
Black arrows are
dependencies
Think of a pipeline
Each node provides
a cursor API
Blocks of “Items”
travel through the
pipeline
10

Execution plans
FOR a IN collA
Filter xx == b.y
Singleton
Calculation xx
FILTER xx == b.y
FOR b IN collB
LET xx = a.x
Query → EXP
Black arrows are
dependencies
Think of a pipeline
Each node provides
a cursor API
Blocks of “Items”
travel through the
pipeline
What is an “item”???
10

Pipeline and items
FOR a IN collA EnumerateCollection a
Singleton
Calculation xx
FOR b IN collB
LET xx = a.x Items have vars a, xx
Items have no vars
Items are the thingies traveling through the pipeline.
11

Pipeline and items
Singleton
Calculation xx
FOR b IN collB
Items have no vars
An item holds values of those variables in the current frame
11

Pipeline and items
Singleton
Calculation xx
FOR b IN collB
Items have no vars
Thus: Items look diﬀerently in diﬀerent parts of the plan
11

Pipeline and items
Singleton
Calculation xx
FOR b IN collB
Items have no vars
Thus: Items look diﬀerently in diﬀerent parts of the plan
We always deal with blocks of items for performance reasons
11

Execution plans
FOR a IN collA
Filter xx == b.y
Singleton
Calculation xx
FILTER xx == b.y
FOR b IN collB
LET xx = a.x
12

Move ﬁlters up
FOR a IN collA
FOR b IN collB
FILTER a.x == 10
FILTER a.u == b.v
RETURN {u:a.u,w:b.w}
Singleton
EnumColl a
EnumColl b
Calc a.x == 10
Return {u:a.u,w:b.w}
Filter a.u == b.v
Calc a.u == b.v
Filter a.x == 10
13

Move ﬁlters up
FOR a IN collA
FOR b IN collB
FILTER a.x == 10
FILTER a.u == b.v
The result and behaviour does not
change, if the ﬁrst FILTER is pulled
out of the inner FOR.
Singleton
EnumColl a
EnumColl b
Calc a.x == 10
Filter a.u == b.v
Calc a.u == b.v
Filter a.x == 10
13

Move ﬁlters up
FOR a IN collA
FILTER a.x < 10
FOR b IN collB
FILTER a.u == b.v
However, the number of items trave-
ling in the pipeline is decreased.
Singleton
EnumColl a
Filter a.u == b.v
Calc a.u == b.v
Calc a.x == 10
EnumColl b
Filter a.x == 10
13

Move ﬁlters up
FOR a IN collA
FILTER a.x < 10
FOR b IN collB
FILTER a.u == b.v
However, the number of items trave-
ling in the pipeline is decreased.
Note that the two FOR statements
could be interchanged!
Singleton
EnumColl a
Filter a.u == b.v
Calc a.u == b.v
Calc a.x == 10
EnumColl b
Filter a.x == 10
13

Remove unnecessary calculations
FOR a IN collA
LET L = LENGTH(a.hobbies)
FOR b IN collB
FILTER a.u == b.v
RETURN {h:a.hobbies,w:b.w}
Singleton
EnumColl a
Calc L = ...
EnumColl b
Calc a.u == b.v
Filter a.u == b.v
Return {...}
14

FOR a IN collA
LET L = LENGTH(a.hobbies)
FOR b IN collB
FILTER a.u == b.v
The Calculation of L is unnecessary!
Singleton
EnumColl a
Calc L = ...
EnumColl b
Calc a.u == b.v
Filter a.u == b.v
Return {...}
14

FOR a IN collA
FOR b IN collB
FILTER a.u == b.v
(since it cannot throw an exception).
Singleton
EnumColl a
EnumColl b
Calc a.u == b.v
Filter a.u == b.v
Return {...}
14

FOR a IN collA
FOR b IN collB
FILTER a.u == b.v
(since it cannot throw an exception).
Therefore we can just leave it out.
Singleton
EnumColl a
EnumColl b
Calc a.u == b.v
Filter a.u == b.v
Return {...}
14

Use index for FILTER and SORT
FOR a IN collA
FILTER a.x > 17 &&
a.x <= 23 &&
a.y == 10
SORT a.y, a.x
RETURN a
Singleton
EnumColl a
Filter ...
Calc ...
Sort a.y, a.x
Return a
15

FOR a IN collA
FILTER a.x > 17 &&
a.x <= 23 &&
a.y == 10
SORT a.y, a.x
RETURN a
Assume collA has a skiplist index on “y”
and “x” (in this order),
Singleton
EnumColl a
Filter ...
Calc ...
Sort a.y, a.x
Return a
15

FOR a IN collA
FILTER a.x > 17 &&
a.x <= 23 &&
a.y == 10
SORT a.y, a.x
RETURN a
and “x” (in this order), then we can read
oﬀ the half-open interval between
{ y: 10, x: 17 } and
{ y: 10, x: 23 }
from the skiplist index.
Singleton
Sort a.y, a.x
Return a
IndexRange a
15

FOR a IN collA
FILTER a.x > 17 &&
a.x <= 23 &&
a.y == 10
SORT a.y, a.x
RETURN a
and “x” (in this order), then we can read
oﬀ the half-open interval between
{ y: 10, x: 17 } and
{ y: 10, x: 23 }
from the skiplist index.
The result will automatically be sorted by
y and then by x.
Singleton
Return a
IndexRange a
15

Data distribution in a cluster
Requests
DBserver DBserver DBserver
CoordinatorCoordinator
4 2 5 3 11
The shards of a collection are distributed across the DB
servers.
16

Data distribution in a cluster
Requests
DBserver DBserver DBserver
CoordinatorCoordinator
4 2 5 3 11
The shards of a collection are distributed across the DB
servers.
The coordinators receive queries and organise their
execution
16

Scatter/gather
EnumerateCollection
17

Scatter/gather
Remote
EnumShard
Remote Remote
EnumShard
Remote
Concat/Merge
Remote
EnumShard
Remote
Scatter
17

Modifying queries
Fortunately:
There can be at most one modifying node in each query.
There can be no modifying nodes in subqueries.
18

Modifying queries
Fortunately:
Modifying nodes
The modifying node in a query
is executed on the DBservers,
18

Modifying queries
Fortunately:
Modifying nodes
to this end, we either scatter the items to all DBservers,
or, if possible, we distribute each item to the shard
that is responsible for the modiﬁcation.
18

Modifying queries
Fortunately:
Modifying nodes
to this end, we either scatter the items to all DBservers,
or, if possible, we distribute each item to the shard
that is responsible for the modiﬁcation.
Sometimes, we can even optimise away a gather/scatter
combination and parallelise completely.
18

Complex queries in a distributed multi-model database

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (20)

Similar a Complex queries in a distributed multi-model database

Similar a Complex queries in a distributed multi-model database (20)

Más de Max Neunhöffer

Más de Max Neunhöffer (6)

Último

Último (20)

Complex queries in a distributed multi-model database