The document summarizes preliminary results from a project comparing the performance of open source implementations of Pregel and related graph processing systems (Hama, Giraph, GPS) on single-source shortest path (SSSP) and PageRank algorithms. Initial results show that Hama does not scale well to larger graphs, while Giraph and GPS scale better. Further analysis of memory usage, network traffic, additional systems like GraphLab and Signal/Collect, and using Green-Marl to generate code for Giraph and GPS is still in progress.
Comparing Pregel-related systems
1. Comparison and Evaluation of Open Source Implementations of Pregel and Related Systems
December 2, 2013
Joshua Woo, Prashant Raghav, Vishnu Prathish
David R. Cheriton School of Computer Science
University of Waterloo
3. Motivation
Recall: Pregel
● Large-scale graph processing system
● Fault-tolerant framework for graph algorithms
● MapReduce for graph operations?
● Vertex-centric model (“think like a vertex”)
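To make "think like a vertex" concrete, here is a minimal single-machine sketch of a vertex-centric SSSP loop. It is not taken from the slides or from any of the listed systems; the function name `vertex_centric_sssp` and the adjacency-dict input are assumptions made for illustration. Each pass of the loop plays the role of one superstep: vertices read their incoming messages, update their own value, and send messages along their out-edges.

```python
import math
from collections import defaultdict


def vertex_centric_sssp(out_edges, source):
    """Illustrative sketch: out_edges maps each vertex to a list of (neighbor, weight).

    Every vertex, including sinks, must appear as a key in out_edges.
    """
    dist = {v: math.inf for v in out_edges}
    # Superstep 0: only the source receives a message (distance 0 to itself).
    inbox = {source: [0.0]}
    while inbox:  # the computation halts when no messages are in flight
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:
                # The vertex improved its value, so it notifies its neighbors.
                dist[v] = best
                for w, weight in out_edges[v]:
                    outbox[w].append(best + weight)
        inbox = outbox  # synchronization barrier between supersteps
    return dist


graph = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 2.0)], 2: []}
distances = vertex_centric_sssp(graph, source=0)
```

A vertex with no incoming messages does nothing in a superstep, which mirrors the "vote to halt" behavior of the Pregel model.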
4. Motivation
● Pregel is proprietary
● Many open source graph processing systems
○ Pregel clones
○ Pregel-inspired
○ BSP
6. Motivation
System          Impl. Language   Type
Apache Hama     Java             Pure BSP framework
Signal/Collect  Scala            Pregel-inspired
Apache Giraph   Java             Pregel clone
GPS             Java             Advanced Pregel clone
GraphLab        C++              Pregel-inspired
Phoebus         Erlang           Pregel clone
GoldenOrb       Java             Pregel clone
HipG            Java             Advanced Pregel clone
Mizan           C++              Advanced Pregel clone
7. Motivation
● How do these systems compare?
○ In terms of performance (runtime)?
○ In terms of memory footprint?
○ In terms of network utilization (num. messages)?
○ Variables:
■ Algorithm
■ Graph size (number of vertices)
■ Cluster size
9. Our Project
● Measure the runtime of at least two algorithms on each system
○ PageRank
■ Fixed number of supersteps = 30
○ Single Source Shortest Path (SSSP)
○ k-means clustering
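As a rough illustration of what "fixed number of supersteps = 30" means for PageRank, the sketch below runs a synchronous PageRank update exactly 30 times instead of iterating to convergence. The function name `pagerank_fixed_supersteps`, the damping factor of 0.85, and the uniform initial rank are assumptions; the slides specify none of them, and the systems under test may differ in these details.

```python
def pagerank_fixed_supersteps(out_edges, supersteps=30, damping=0.85):
    """Illustrative sketch: out_edges maps each vertex to a list of out-neighbors.

    Every vertex must appear as a key. Rank held by vertices with no
    out-edges is simply dropped, which is a simplification.
    """
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        # "Message passing": each vertex splits its rank among its out-neighbors.
        incoming = {v: 0.0 for v in out_edges}
        for v, neighbors in out_edges.items():
            if neighbors:
                share = rank[v] / len(neighbors)
                for w in neighbors:
                    incoming[w] += share
        # Every vertex computes its new rank from the messages it received.
        rank = {v: (1.0 - damping) / n + damping * incoming[v] for v in out_edges}
    return rank


cycle = {0: [1], 1: [2], 2: [0]}
ranks = pagerank_fixed_supersteps(cycle)
```

Fixing the superstep count keeps the amount of work per run comparable across systems, at the cost of not checking whether the ranks have actually converged.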
11. Setup
● Experiments on AWS
○ 5 runs per dataset per algorithm per cluster
■ 35 runs per algorithm per cluster
■ 70 runs per cluster
■ 140 runs in total (single-node, 4-node)
● TODO: another 70 runs (8-node)
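The run counts above imply a few numbers the slide does not state directly. The short calculation below derives them; the dataset and algorithm counts are inferred from the stated totals rather than given in the slides, and whether the counts are per system is not stated.

```python
RUNS_PER_DATASET = 5                 # stated: 5 runs per dataset per algorithm per cluster
RUNS_PER_ALGORITHM_PER_CLUSTER = 35  # stated
RUNS_PER_CLUSTER = 70                # stated
COMPLETED_CLUSTERS = ["single-node", "4-node"]  # stated; 8-node is still TODO

# Inferred: 35 runs / 5 runs per dataset = 7 datasets
datasets = RUNS_PER_ALGORITHM_PER_CLUSTER // RUNS_PER_DATASET
# Inferred: 70 runs / 35 runs per algorithm = 2 algorithms
algorithms = RUNS_PER_CLUSTER // RUNS_PER_ALGORITHM_PER_CLUSTER

completed_runs = RUNS_PER_CLUSTER * len(COMPLETED_CLUSTERS)
planned_runs = completed_runs + RUNS_PER_CLUSTER  # after adding the 8-node cluster

print(datasets, algorithms, completed_runs, planned_runs)
```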
14. Setup
● Input Graph
○ Source files converted into a format suitable for each system
■ Time for this conversion excluded from results:
● Conversion done before algorithms are run (pre-processing?)
● Negligible for largeEWD (1,000,000 vertices, 15,172,126 edges)
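The conversion step might look roughly like the sketch below. It assumes the source files use an edge-weighted digraph text layout (vertex count, then edge count, then one `src dst weight` triple per edge); the slides do not describe the layout, so this is a guess based on the dataset name. The output format, one tab-separated adjacency line per vertex, is hypothetical and is not the actual input format of Hama, Giraph, or GPS.

```python
def parse_ewd(text):
    """Parse the assumed layout: V, then E, then E triples of 'src dst weight'."""
    tokens = text.split()
    num_vertices, num_edges = int(tokens[0]), int(tokens[1])
    adjacency = {v: [] for v in range(num_vertices)}
    for i in range(num_edges):
        base = 2 + 3 * i
        src, dst = int(tokens[base]), int(tokens[base + 1])
        weight = float(tokens[base + 2])
        adjacency[src].append((dst, weight))
    return adjacency


def to_adjacency_lines(adjacency):
    """Emit one line per vertex in a hypothetical 'id<TAB>dst:weight ...' format."""
    lines = []
    for v in sorted(adjacency):
        targets = " ".join(f"{dst}:{weight}" for dst, weight in adjacency[v])
        lines.append(f"{v}\t{targets}")
    return lines


sample = "3\n2\n0 1 0.5\n1 2 1.25\n"
converted = to_adjacency_lines(parse_ewd(sample))
```

Because this runs once before any algorithm starts, it can be timed separately and left out of the reported runtimes, as the slide describes.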
19. Preliminary Analysis
● Resource crunch point
○ No significant change in performance until resources become the bottleneck
● Hama does not scale well (vertices ~10^4)
● Giraph and GPS scale better
● In general, PageRank runtime > SSSP runtime
● GPS input reader does not guarantee true partitioning for large datasets
● Which ‘knobs’ to keep constant? Optimization vs. comparability