Max Neunhöffer – Joins and aggregations in a distributed NoSQL DB - NoSQL matters Barcelona 2014

NoSQL databases are renowned for their good horizontal scalability, and these days sharding is an essential feature for every self-respecting DB. However, most systems choose to offer fewer features with respect to joins and aggregations in queries than traditional relational DBMSs do. In this talk I report on the joys and pains of (re-)implementing the powerful query language AQL with joins and aggregations for ArangoDB. I will cover (distributed) execution plans, query optimisation and data locality issues.

  1. 1. Joins and aggregations in a distributed NoSQL DB. NoSQLmatters, Barcelona, 22 November 2014. Max Neunhöffer, www.arangodb.com
  7. 7. Documents and collections
     { "_key": "123456", "_id": "chars/123456", "name": "Duck", "firstname": "Donald", "dob": "1934-11-13", "hobbies": ["Golf", "Singing", "Running"], "home": {"town": "Duck town", "street": "Lake Road", "number": 17}, "species": "duck" }
     When I say “document”, I mean “JSON”.
     A “collection” is a set of documents in a DB.
     The DB can inspect the values, allowing for secondary indexes.
     Or one can just treat the DB as a key/value store.
     Sharding: the data of a collection is distributed between multiple servers.
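
The sharding remark can be illustrated with a minimal Python sketch that assigns documents to shards by hashing `_key`. The shard count, the helper names and the use of CRC32 are assumptions made for illustration; the slides do not say how ArangoDB maps keys to shards.

```python
import zlib

NUM_SHARDS = 5  # assumption: illustrative shard count


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a document key to a shard number with a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def distribute(docs, num_shards: int = NUM_SHARDS):
    """Split a collection (a list of JSON-like dicts) into per-shard lists."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(doc["_key"], num_shards)].append(doc)
    return shards


# Hypothetical sample collection
collection = [{"_key": str(k), "name": f"doc{k}"} for k in range(100)]
shards = distribute(collection)
```

Because the hash is stable, a lookup by `_key` only needs to contact one shard, whereas a query on any other attribute has to ask all of them.
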
  14. 14. Graphs
      [Diagram: vertices A, B, C, D, E, F, G connected by edges labelled "likes" and "hates".]
      A graph consists of vertices and edges.
      Graphs model relations and can be directed or undirected.
      Vertices and edges are documents.
      Every edge has a _from and a _to attribute.
      The database offers queries and transactions dealing with graphs.
      For example, paths in the graph are interesting.
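
As a small illustration of "edges are documents with `_from` and `_to`", the following Python sketch stores edges as plain dicts and finds a shortest path by breadth-first search. The vertex ids, the `type` attribute and the edge list are hypothetical; only `_from` and `_to` come from the slides, and this is not how the database itself implements path queries.

```python
from collections import deque

# Hypothetical edge documents; only _from and _to are taken from the slides.
edges = [
    {"_from": "v/A", "_to": "v/B", "type": "likes"},
    {"_from": "v/B", "_to": "v/C", "type": "likes"},
    {"_from": "v/A", "_to": "v/D", "type": "hates"},
    {"_from": "v/D", "_to": "v/C", "type": "likes"},
    {"_from": "v/C", "_to": "v/E", "type": "likes"},
]


def shortest_path(edges, start, goal):
    """Breadth-first search along directed edges; returns vertex ids or None."""
    outgoing = {}
    for e in edges:
        outgoing.setdefault(e["_from"], []).append(e["_to"])
    previous = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            path = []
            while v is not None:  # walk back to the start vertex
                path.append(v)
                v = previous[v]
            return path[::-1]
        for w in outgoing.get(v, []):
            if w not in previous:
                previous[w] = v
                queue.append(w)
    return None
```
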
  17. 17. Query 1: Fetch all documents in a collection
      FOR p IN people
        RETURN p
      Result:
      [ { "name": "Schmidt", "firstname": "Helmut", "hobbies": ["Smoking"]},
        { "name": "Neunhöffer", "firstname": "Max", "hobbies": ["Piano", "Golf"]},
        ... ]
      (Actually, a cursor is returned.)
  19. 19. Query 2: Use filtering, sorting and limit
      FOR p IN people
        FILTER p.age >= @minage
        SORT p.name, p.firstname
        LIMIT @nrlimit
        RETURN { name: CONCAT(p.name, ", ", p.firstname), age : p.age }
      Result:
      [ { "name": "Neunhöffer, Max", "age": 44 },
        { "name": "Schmidt, Helmut", "age": 95 },
        ... ]
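
For readers who do not know AQL, here is a rough Python equivalent of Query 2. It is only a sketch of the semantics: the sample data is hypothetical (apart from the two rows shown in the slide output), and Python sorts strings by code point, which may differ from the collation the database uses.

```python
# Hypothetical sample data; the third person is invented for the example.
people = [
    {"name": "Schmidt", "firstname": "Helmut", "age": 95},
    {"name": "Neunhöffer", "firstname": "Max", "age": 44},
    {"name": "Duck", "firstname": "Donald", "age": 12},
]


def query2(people, minage, nrlimit):
    """FILTER by age, SORT by name and firstname, LIMIT, then build the result."""
    kept = [p for p in people if p["age"] >= minage]       # FILTER
    kept.sort(key=lambda p: (p["name"], p["firstname"]))   # SORT
    return [
        {"name": p["name"] + ", " + p["firstname"], "age": p["age"]}
        for p in kept[:nrlimit]                            # LIMIT + RETURN
    ]
```
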
  21. 21. Query 3: Aggregation and functions
      FOR p IN people
        COLLECT a = p.age INTO L
        FILTER a >= @minage
        RETURN { "age": a, "number": LENGTH(L) }
      Result:
      [ { "age": 18, "number": 10 },
        { "age": 19, "number": 17 },
        { "age": 20, "number": 12 },
        ... ]
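
The grouping performed by COLLECT ... INTO can be sketched in Python as follows. The sample ages are hypothetical, and the groups are sorted by age here only to mirror the ordering of the example output on the slide.

```python
from collections import defaultdict


def query3(people, minage):
    """Group people by age, keep groups with age >= minage, count the members."""
    groups = defaultdict(list)          # COLLECT a = p.age INTO L
    for p in people:
        groups[p["age"]].append(p)
    return [
        {"age": a, "number": len(members)}   # RETURN {age, LENGTH(L)}
        for a, members in sorted(groups.items())
        if a >= minage                       # FILTER a >= @minage
    ]


# Hypothetical sample data
sample_people = [{"age": a} for a in (18, 18, 19, 17, 20)]
```
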
  23. 23. Query 4: Joins
      FOR p IN @@peoplecollection
        FOR h IN houses
          FILTER p._key == h.owner
          SORT h.streetname, h.housename
          RETURN { housename: h.housename, streetname: h.streetname, owner: p.name, value: h.value }
      Result:
      [ { "housename": "Firlefanz", "streetname": "Meyer street", "owner": "Hans Schmidt", "value": 423000 },
        ... ]
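
Read literally, the two nested FOR loops with a FILTER describe a nested-loop join. The Python sketch below spells that out; the `_key` and `owner` values are hypothetical sample data, and a real optimiser may pick a different strategy (for example an index lookup on `h.owner`).

```python
def join_people_houses(people, houses):
    """Nested-loop join on p._key == h.owner, sorted by street and house name."""
    pairs = []
    for p in people:                      # FOR p IN @@peoplecollection
        for h in houses:                  # FOR h IN houses
            if p["_key"] == h["owner"]:   # FILTER p._key == h.owner
                pairs.append((p, h))
    pairs.sort(key=lambda ph: (ph[1]["streetname"], ph[1]["housename"]))
    return [
        {"housename": h["housename"], "streetname": h["streetname"],
         "owner": p["name"], "value": h["value"]}
        for p, h in pairs
    ]


# Hypothetical sample data; keys and the second house are invented.
sample_owners = [{"_key": "1", "name": "Hans Schmidt"}, {"_key": "2", "name": "Max"}]
sample_houses = [
    {"owner": "1", "housename": "Firlefanz", "streetname": "Meyer street", "value": 423000},
    {"owner": "3", "housename": "Nobody", "streetname": "A street", "value": 1},
]
```
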
  24. 24. Query 5: Modifying data
      FOR e IN events
        FILTER e.timestamp < "2014-09-01T09:53+0200"
        INSERT e IN oldevents
      FOR e IN events
        FILTER e.timestamp < "2014-09-01T09:53+0200"
        REMOVE e._key IN events
  26. 26. Query 6: Graph queries
      FOR x IN GRAPH_SHORTEST_PATH( "routeplanner", "germanCity/Cologne", "frenchCity/Paris", {weight: "distance"} )
        RETURN { begin : x.startVertex, end : x.vertex, distance : x.distance, nrPaths : LENGTH(x.paths) }
      Result:
      [ { "begin": "germanCity/Cologne", "end" : {"_id": "frenchCity/Paris", ... }, "distance": 550, "nrPaths": 10 },
        ... ]
  38. 38. Life of a query
      Text and query parameters come from the user.
      Parse the text, produce an abstract syntax tree (AST).
      Substitute query parameters.
      First optimisation: constant expressions, etc.
      Translate the AST into an execution plan (EXP).
      Optimise one EXP, produce many, potentially better EXPs.
      Reason about distribution in the cluster.
      Optimise distributed EXPs.
      Estimate costs for all EXPs, and sort by ascending cost.
      Instantiate the “cheapest” plan, i.e. set up an execution engine.
      Distribute and link up engines on different servers.
      Execute the plan, provide a cursor API.
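
The "estimate costs, sort ascending, instantiate the cheapest" steps can be sketched in a few lines of Python. The `Plan` class, the node names in the sample plans and the toy cost model (one unit per item a node handles) are all assumptions for illustration; the slides do not describe the actual cost function.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    nodes: list  # hypothetical: (node_type, estimated_items) pairs


def estimate_cost(plan):
    """Toy cost model (assumption): a node costs one unit per item it handles."""
    return sum(items for _node_type, items in plan.nodes)


def choose_plan(plans):
    """Sort candidate plans by ascending estimated cost and pick the first."""
    ranked = sorted(plans, key=estimate_cost)
    return ranked[0], ranked


# Hypothetical candidate plans for the same query
candidates = [
    Plan("full scan", [("EnumerateCollection", 1000), ("Filter", 1000), ("Return", 10)]),
    Plan("index", [("IndexRange", 10), ("Return", 10)]),
]
cheapest, ranked = choose_plan(candidates)
```
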
  44. 44. Execution plans
      Query → EXP
      FOR a IN collA
        LET xx = a.x
        FOR b IN collB
          FILTER xx == b.y
          RETURN {x: a.x, z: b.z}
      Plan nodes: Singleton; EnumerateCollection a; Calculation xx; EnumerateCollection b; Calculation xx == b.y; Filter xx == b.y; Calc {x: a.x, z: b.z}; Return {x: a.x, z: b.z}
      Black arrows are dependencies.
      Think of a pipeline.
      Each node provides a cursor API.
      Blocks of “items” travel through the pipeline.
      What is an “item”???
  48. 48. Pipeline and items
      Singleton (items have no vars)
      EnumerateCollection a (FOR a IN collA)
      Calculation xx (LET xx = a.x; items have vars a, xx)
      EnumerateCollection b (FOR b IN collB)
      Items are the thingies traveling through the pipeline.
      An item holds the values of those variables in the current frame.
      Thus: items look different in different parts of the plan.
      We always deal with blocks of items for performance reasons.
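
The pipeline idea can be made concrete with Python generators: each node pulls blocks from the node it depends on, and an item is simply a dict of the variables bound so far. This is a minimal sketch under my own assumptions (function names, block size, and the Calculation and Filter for `xx == b.y` folded into one predicate); it is not the ArangoDB implementation.

```python
def singleton():
    """Yield one block holding a single item with no variables."""
    yield [{}]


def enumerate_collection(upstream, var, docs, block_size=2):
    """For each incoming item, emit one item per document, binding `var`."""
    block = []
    for in_block in upstream:
        for item in in_block:
            for doc in docs:
                block.append({**item, var: doc})
                if len(block) == block_size:
                    yield block
                    block = []
    if block:
        yield block


def calculation(upstream, var, fn):
    """Add a computed variable to every item of every block."""
    for in_block in upstream:
        yield [{**item, var: fn(item)} for item in in_block]


def filter_items(upstream, pred):
    """Drop items failing the predicate; skip blocks that become empty."""
    for in_block in upstream:
        out = [item for item in in_block if pred(item)]
        if out:
            yield out


def run(coll_a, coll_b):
    p = singleton()
    p = enumerate_collection(p, "a", coll_a)            # FOR a IN collA
    p = calculation(p, "xx", lambda it: it["a"]["x"])   # LET xx = a.x
    p = enumerate_collection(p, "b", coll_b)            # FOR b IN collB
    p = filter_items(p, lambda it: it["xx"] == it["b"]["y"])
    p = calculation(p, "out", lambda it: {"x": it["a"]["x"], "z": it["b"]["z"]})
    return [it["out"] for block in p for it in block]   # RETURN


# Hypothetical sample collections
coll_a = [{"x": 1}, {"x": 2}, {"x": 3}]
coll_b = [{"y": 2, "z": "p"}, {"y": 3, "z": "q"}, {"y": 9, "z": "r"}]
```

Note how the set of keys in an item grows from `{}` to `a, xx` to `a, xx, b` as it moves down the pipeline.
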
  49. 49. Execution plans
      FOR a IN collA
        LET xx = a.x
        FOR b IN collB
          FILTER xx == b.y
          RETURN {x: a.x, z: b.z}
      Plan nodes: Singleton; EnumerateCollection a; Calculation xx; EnumerateCollection b; Calculation xx == b.y; Filter xx == b.y; Calc {x: a.x, z: b.z}; Return {x: a.x, z: b.z}
  51. 51. Move filters up
      FOR a IN collA
        FOR b IN collB
          FILTER a.x == 10
          FILTER a.u == b.v
          RETURN {u:a.u,w:b.w}
      Plan nodes: Singleton; EnumColl a; EnumColl b; Calc a.x == 10; Filter a.x == 10; Calc a.u == b.v; Filter a.u == b.v; Return {u:a.u,w:b.w}
      The result and behaviour do not change if the first FILTER is pulled out of the inner FOR.
  53. 53. Move filters up
      FOR a IN collA
        FILTER a.x == 10
        FOR b IN collB
          FILTER a.u == b.v
          RETURN {u:a.u,w:b.w}
      Plan nodes: Singleton; EnumColl a; Calc a.x == 10; Filter a.x == 10; EnumColl b; Calc a.u == b.v; Filter a.u == b.v; Return {u:a.u,w:b.w}
      However, the number of items traveling in the pipeline is decreased.
      Note that the two FOR statements could be interchanged!
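
To see why moving the filter matters, the following Python sketch runs both variants and counts how many (a, b) items the inner loop produces. The data and function names are hypothetical; only the shape of the query is taken from the slides.

```python
def filter_inside(coll_a, coll_b):
    """Both FILTERs inside the inner FOR: every (a, b) pair is produced."""
    produced, result = 0, []
    for a in coll_a:
        for b in coll_b:
            produced += 1
            if a["x"] == 10 and a["u"] == b["v"]:
                result.append({"u": a["u"], "w": b["w"]})
    return result, produced


def filter_moved_up(coll_a, coll_b):
    """FILTER a.x == 10 applied before the inner FOR is entered."""
    produced, result = 0, []
    for a in coll_a:
        if a["x"] != 10:
            continue
        for b in coll_b:
            produced += 1
            if a["u"] == b["v"]:
                result.append({"u": a["u"], "w": b["w"]})
    return result, produced


# Hypothetical data: 100 documents in collA (5 with x == 10), 50 in collB
docs_a = [{"x": i % 20, "u": i % 5} for i in range(100)]
docs_b = [{"v": j % 5, "w": j} for j in range(50)]
res_inside, n_inside = filter_inside(docs_a, docs_b)
res_moved, n_moved = filter_moved_up(docs_a, docs_b)
```

With this data the result is identical, but the inner loop handles 250 items instead of 5000.
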
  55. 55. Remove unnecessary calculations
      FOR a IN collA
        LET L = LENGTH(a.hobbies)
        FOR b IN collB
          FILTER a.u == b.v
          RETURN {h:a.hobbies,w:b.w}
      Plan nodes: Singleton; EnumColl a; Calc L = ...; EnumColl b; Calc a.u == b.v; Filter a.u == b.v; Return {...}
      The calculation of L is unnecessary!
  57. 57. Remove unnecessary calculations
      FOR a IN collA
        FOR b IN collB
          FILTER a.u == b.v
          RETURN {h:a.hobbies,w:b.w}
      Plan nodes: Singleton; EnumColl a; EnumColl b; Calc a.u == b.v; Filter a.u == b.v; Return {...}
      The calculation of L is unnecessary (since it cannot throw an exception).
      Therefore we can just leave it out.
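
A rule like this can be sketched as a backwards pass over the plan that tracks which variables are still needed. The plan representation below (dicts with `type`, `sets`, `uses` and `can_throw`) is a hypothetical encoding chosen for the example, not the optimiser's real data structure.

```python
def remove_unused_calculations(plan):
    """Drop Calculation nodes whose result is never read and that cannot throw."""
    needed = set()
    kept = []
    for node in reversed(plan):  # walk from Return back to Singleton
        is_dead = (
            node["type"] == "Calculation"
            and node["sets"] not in needed
            and not node.get("can_throw", False)
        )
        if is_dead:
            continue
        needed.update(node.get("uses", ()))
        kept.append(node)
    return kept[::-1]


# Hypothetical encoding of the plan from the slide
plan = [
    {"type": "Singleton"},
    {"type": "EnumerateCollection", "sets": "a"},
    {"type": "Calculation", "sets": "L", "uses": ["a"]},
    {"type": "EnumerateCollection", "sets": "b"},
    {"type": "Calculation", "sets": "cond", "uses": ["a", "b"]},
    {"type": "Filter", "uses": ["cond"]},
    {"type": "Calculation", "sets": "out", "uses": ["a", "b"]},
    {"type": "Return", "uses": ["out"]},
]
optimised = remove_unused_calculations(plan)
```

The `can_throw` flag reflects the caveat on the slide: a calculation that might raise an exception cannot be dropped without changing behaviour.
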
  59. 59. Use index for FILTER and SORT
      FOR a IN collA
        FILTER a.x > 17 && a.x <= 23 && a.y == 10
        SORT a.y, a.x
        RETURN a
      Plan nodes: Singleton; EnumColl a; Calc ...; Filter ...; Sort a.y, a.x; Return a
      Assume collA has a skiplist index on “y” and “x” (in this order),
  60. 60. Use index for FILTER and SORT
      Plan nodes: Singleton; IndexRange a; Sort a.y, a.x; Return a
      then we can read off the half-open interval between { y: 10, x: 17 } and { y: 10, x: 23 } from the skiplist index.
  61. 61. Use index for FILTER and SORT
      Plan nodes: Singleton; IndexRange a; Return a
      The result will automatically be sorted by y and then by x.
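
The half-open interval lookup can be imitated in Python with a sorted list of `(y, x)` pairs standing in for the skiplist index and the standard `bisect` module for the range search. This is a sketch of the idea only; a sorted list is my substitute for a skiplist, and the sample data is hypothetical.

```python
import bisect


def index_range(index, y, x_low, x_high):
    """Entries with the given y and x_low < x <= x_high, from a list sorted by (y, x)."""
    lo = bisect.bisect_right(index, (y, x_low))    # skip everything <= (y, x_low)
    hi = bisect.bisect_right(index, (y, x_high))   # include everything <= (y, x_high)
    return index[lo:hi]


# Hypothetical collection and its (y, x) index
docs = [{"y": y, "x": x} for y in (9, 10, 11) for x in range(15, 26)]
index = sorted((d["y"], d["x"]) for d in docs)
hits = index_range(index, 10, 17, 23)
```

Because the slice comes straight out of a structure ordered by `(y, x)`, no separate sort step is needed, which is why the Sort node disappears from the plan.
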
  63. 63. Data distribution in a cluster
      [Diagram: requests arrive at two Coordinators, which sit in front of three DBservers holding numbered shards.]
      The shards of a collection are distributed across the DB servers.
      The coordinators receive queries and organise their execution.
  64. 64. Scatter/gather
      [Diagram: a single EnumerateCollection node.]
  66. 66. Scatter/gather
      [Diagram: a Scatter node, three parallel Remote / EnumShard / Remote chains, and a Concat/Merge node.]
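
One plausible reading of the diagram is: send the same request to every shard, let each shard enumerate its own documents, then concatenate or merge the partial results. The Python sketch below simulates that with plain function calls instead of network hops and uses `heapq.merge` for the merge step; the shard contents and function names are hypothetical.

```python
import heapq


def enum_shard(shard, predicate):
    """Stands in for one DBserver: return its matching documents, sorted by name."""
    return sorted((d for d in shard if predicate(d)), key=lambda d: d["name"])


def scatter_gather(shards, predicate):
    """Scatter the request to all shards, then merge the sorted partial results."""
    partials = [enum_shard(s, predicate) for s in shards]           # scatter
    return list(heapq.merge(*partials, key=lambda d: d["name"]))    # gather


# Hypothetical shard contents
cluster = [
    [{"name": "Schmidt", "age": 95}, {"name": "Adams", "age": 30}],
    [{"name": "Neunhöffer", "age": 44}],
    [{"name": "Duck", "age": 12}],
]
adults = scatter_gather(cluster, lambda d: d["age"] >= 18)
```

If the query does not ask for a sort order, plain concatenation of the partial results would be enough, which is presumably why the node is labelled Concat/Merge.
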
  70. 70. Modifying queries
      Fortunately:
      There can be at most one modifying node in each query.
      There can be no modifying nodes in subqueries.
      Modifying nodes:
      The modifying node in a query is executed on the DBservers. To this end, we either scatter the items to all DBservers or, if possible, distribute each item to the shard that is responsible for the modification.
      Sometimes, we can even optimise away a gather/scatter combination and parallelise completely.
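
The difference between "scatter to all DBservers" and "distribute to the responsible shard" can be sketched by counting messages for a REMOVE. In this Python example shards are plain dicts, the key-to-shard mapping uses CRC32, and all names are hypothetical; it only illustrates why distributing is cheaper when the responsible shard can be computed from the item.

```python
import zlib


def responsible_shard(key, num_shards):
    """Assumed mapping from key to shard (CRC32 modulo the shard count)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def make_shards(num_shards, keys):
    shards = [{} for _ in range(num_shards)]
    for key in keys:
        shards[responsible_shard(key, num_shards)][key] = {"_key": key}
    return shards


def remove_scatter(shards, keys):
    """Scatter: every key goes to every shard; each removes what it holds."""
    messages = 0
    for shard in shards:
        for key in keys:
            messages += 1
            shard.pop(key, None)
    return messages


def remove_distribute(shards, keys):
    """Distribute: each key goes only to the shard responsible for it."""
    messages = 0
    for key in keys:
        messages += 1
        shards[responsible_shard(key, len(shards))].pop(key, None)
    return messages


all_keys = [str(k) for k in range(30)]
to_remove = all_keys[:10]
shards_a = make_shards(3, all_keys)
shards_b = make_shards(3, all_keys)
msgs_scatter = remove_scatter(shards_a, to_remove)
msgs_distribute = remove_distribute(shards_b, to_remove)
```
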
