Frank Celler – Processing large-scale graphs with Google(TM) Pregel
Many popular graph databases are optimized to run on a single machine, using efficient traversals to query the stored graphs. This boosts the performance of algorithms that start at a single vertex and iterate through the graph, e.g. finding shortest paths or neighbors. However, graphs are getting bigger, and traversals perform poorly if they require a large depth. If you need to distribute a large-scale graph across several machines, traversals won't be the best choice, in terms of performance, for processing the graph. For this reason Google developed its Pregel framework, which offers an environment for querying distributed graphs; Pregel is also known as the map-reduce for graphs. In this talk I want to present the architecture and requirements of the Pregel framework and introduce you to the different mindset required to write a Pregel algorithm. Furthermore, I will give a short introduction to three implementations of Pregel: Giraph, TinkerPop3 and ArangoDB.
2. About
about us
Frank Celler (@fceller) working on the ArangoDB core
Michael Hackstein (@mchacki) started an experimental
implementation of Pregel
about the talk
different kinds of graph algorithms
Pregel example
Pregel mind set aka Framework
more examples
4. Pregel at ArangoDB
Started as a side project in free hack time
Experimental, running on the operational database
Implemented as an alternative to traversals
Make use of the flexibility of JavaScript:
No strict type system
No pre-compilation, on-the-fly queries
Native JSON documents
Really fast development
5. Graph Algorithms
Pattern matching
Search through the entire graph
Identify similar components
⇒ Touch all vertices and their neighbourhoods
Traversals
Define a specific start point
Iteratively explore the graph
⇒ History of steps is known
Global measurements
Compute one value for the graph, based on all its vertices
or edges
Compute one value for each vertex or edge
⇒ Often requires a global view of the graph
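
The difference between a traversal and a global measurement can be sketched in a few lines of Python. This is only an illustration on a hypothetical toy graph (the `GRAPH` dictionary and all function names are assumptions, not part of the talk): the traversal starts at one vertex and keeps the history of steps, while the global measurements touch every vertex.

```python
from collections import deque

# Hypothetical toy graph: vertex -> list of outbound neighbours.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}


def traverse(graph, start):
    """Traversal: start at one vertex and remember the path taken to each vertex."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbour in graph[vertex]:
            if neighbour not in paths:
                # The history of steps is known: extend the path of the parent.
                paths[neighbour] = paths[vertex] + [neighbour]
                queue.append(neighbour)
    return paths


def out_degrees(graph):
    """Global measurement: one value per vertex, computed by touching every vertex."""
    return {vertex: len(neighbours) for vertex, neighbours in graph.items()}


def edge_count(graph):
    """Global measurement: one value for the whole graph."""
    return sum(len(neighbours) for neighbours in graph.values())
```

The traversal only ever looks at vertices reachable from `start`, whereas both measurements need every vertex, which is the kind of workload the rest of the talk is concerned with.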
8. Pregel
A framework to query distributed, directed graphs.
Known as “Map-Reduce” for graphs
Uses same phases
Has several iterations
Aims at:
Operate all servers at full capacity
Reduce network traffic
Good at calculations touching all vertices
Bad at calculations touching a very small number of vertices
24. Worker ≙ Map
“Map” a user-defined algorithm over all vertices
Output: set of messages to other vertices
Available parameters:
The current vertex and its outbound edges
All incoming messages
Global values
Allow modifications on the vertex:
Attach a result to this vertex and its outgoing edges
Delete the vertex and its outgoing edges
Deactivate the vertex
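
As a rough sketch of what such a user-defined worker step might look like, here is a hypothetical Python function that propagates the smallest label seen so far. The vertex layout (`out_edges`, `result`, `active`) and the `step` entry in the global values are assumptions made for this illustration; they are not the API of ArangoDB, Giraph or TinkerPop3.

```python
def compute(vertex, messages, global_values):
    """Hypothetical per-vertex step: keep and forward the smallest label seen.

    Parameters mirror the slide: the current vertex with its outbound edges,
    all incoming messages, and a dictionary of global values.
    Returns the output of the step: a list of (target, message) pairs.
    """
    candidate = min(messages, default=vertex["result"])
    outgoing = []
    if global_values["step"] == 0 or candidate < vertex["result"]:
        # Attach the (possibly improved) result to this vertex ...
        vertex["result"] = min(candidate, vertex["result"])
        # ... and tell every outbound neighbour about it.
        outgoing = [(target, vertex["result"]) for target in vertex["out_edges"]]
    # Deactivate the vertex; in this sketch a later message would wake it up again.
    vertex["active"] = False
    return outgoing


# Usage with a hypothetical vertex "b" that points at "c".
example = {"id": "b", "out_edges": ["c"], "result": "b", "active": True}
first = compute(example, [], {"step": 0})
second = compute(example, ["a"], {"step": 1})
third = compute(example, ["z"], {"step": 2})
```

In the first step the vertex announces its own label, in the second it adopts the smaller label `"a"` and forwards it, and in the third it receives nothing better and stays silent.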
25. Combine ≙ Reduce
“Reduce” all generated messages
Output: An aggregated message for each vertex.
Executed on sender as well as receiver.
Available parameters:
One new message for a vertex
The stored aggregate for this vertex
Typical combiners are SUM, MIN or MAX
Reduces network traffic
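
A minimal sketch of a MIN combiner, assuming messages are plain `(target, value)` pairs and using two hypothetical sending servers. The function names are illustrative only. It shows why running the same combiner on the sender and on the receiver cuts down the number of messages that cross the network.

```python
def min_combiner(stored, new_message):
    """Fold one new message into the stored aggregate for a vertex."""
    return new_message if stored is None else min(stored, new_message)


def combine_all(messages, combiner):
    """Reduce (target, value) pairs to one aggregated value per target vertex."""
    aggregated = {}
    for target, value in messages:
        aggregated[target] = combiner(aggregated.get(target), value)
    return aggregated


# Sender side: each hypothetical server pre-combines its outgoing messages.
server_1 = combine_all([("v", 7), ("v", 3), ("w", 5)], min_combiner)
server_2 = combine_all([("v", 4), ("v", 9)], min_combiner)

# Only the combined messages travel over the network (3 instead of 5).
arrived = list(server_1.items()) + list(server_2.items())

# Receiver side: the same combiner merges what arrives from both senders.
received = combine_all(arrived, min_combiner)
```

This only works because MIN, like SUM and MAX, gives the same result regardless of how the messages are grouped, so combining early on the sender does not change what the receiving vertex sees.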
26. Activity ≙ Termination
Execute several rounds of Map/Reduce
Count active vertices and messages
Start next round if one of the following is true:
At least one vertex is active
At least one message is sent
Terminate if no vertex is active and no messages were sent
Store all non-deleted vertices and edges as resulting graph
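
The round structure and the termination rule can be sketched as a single-process loop. This is a simplified illustration, not a distributed implementation: the graph is a hypothetical dictionary of outbound edges, the per-vertex step is a minimum-label propagation chosen for the example, a MIN combiner is applied inline, and no vertex is ever deleted.

```python
def run(graph, max_rounds=100):
    """Run rounds until no vertex is active and no message was sent."""
    labels = {vertex: vertex for vertex in graph}
    active = set(graph)  # every vertex starts active
    inbox = {}           # vertex -> combined (MIN) message from the last round
    rounds = 0
    # Start the next round if a vertex is active or a message was sent.
    while (active or inbox) and rounds < max_rounds:
        outbox = {}
        for vertex in active | set(inbox):  # a message reactivates its target
            incoming = inbox.get(vertex)
            changed = incoming is not None and incoming < labels[vertex]
            if changed:
                labels[vertex] = incoming
            if rounds == 0 or changed:
                for target in graph[vertex]:
                    # MIN combiner: keep one message per target vertex.
                    previous = outbox.get(target, labels[vertex])
                    outbox[target] = min(previous, labels[vertex])
        active = set()   # each vertex deactivates itself after its step
        inbox = outbox
        rounds += 1
    return labels, rounds


# Hypothetical directed graph with two cycles: 1 -> 2 -> 3 -> 1 and 4 <-> 5.
GRAPH = {1: [2], 2: [3], 3: [1], 4: [5], 5: [4]}
labels, rounds = run(GRAPH)
```

After the loop ends, `labels` holds the smallest vertex id of each cycle, and the loop stopped by itself because the last round produced no messages and left no vertex active.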