1. Introducing Apache Giraph for Large Scale Graph Processing
Sebastian Schelter
PhD student at the Database Systems and Information
Management Group of TU Berlin
Committer and PMC member at Apache Mahout and Apache Giraph
mail ssc@apache.org blog http://ssc.io
2. Graph recap
graph: abstract representation of a set of objects
(vertices), where some pairs of these objects are
connected by links (edges), which can be directed or
undirected
Graphs can be used to model arbitrary things like
road networks, social networks, flows of goods, etc.
The majority of graph algorithms are iterative and traverse the graph in some way
[Figure: a small example graph with vertices A, B, C, D]
3. Real world graphs are really large!
• the World Wide Web has several billion pages
with several billion links
• Facebook's social graph had more than 700
million users and more than 68 billion
friendships in 2011
• Twitter's social graph has billions of follower
relationships
4. Why not use MapReduce/Hadoop?
• Example: PageRank, Google's famous algorithm for
measuring the authority of a webpage based on the
underlying network of hyperlinks
• defined recursively: each vertex distributes its
authority to its neighbors in equal proportions:

  p_i = Σ_{(j,i)} p_j / d_j

  where the sum runs over all in-neighbors j of vertex i
  and d_j is the out-degree of vertex j
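To make the per-vertex update concrete — p_i is the sum of each in-neighbor's rank p_j divided by that neighbor's out-degree d_j — here is a plain-Java sketch. This is an illustration only, not Giraph API; `update`, `neighborRanks`, and `neighborOutDegrees` are hypothetical names.

```java
import java.util.List;

public class PageRankUpdate {
    // p_i = sum over in-neighbors j of p_j / d_j,
    // where d_j is the out-degree of neighbor j
    public static double update(List<Double> neighborRanks,
                                List<Integer> neighborOutDegrees) {
        double rank = 0.0;
        for (int j = 0; j < neighborRanks.size(); j++) {
            rank += neighborRanks.get(j) / neighborOutDegrees.get(j);
        }
        return rank;
    }
}
```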
5. Textbook approach to
PageRank in MapReduce
• PageRank p is the principal eigenvector of the Markov matrix
M defined by the transition probabilities between web pages
• it can be obtained by iteratively multiplying an initial
PageRank vector by M (power iteration):

  p = lim_{k→∞} M^k · p_0

[Figure: each iteration is a matrix-vector multiplication — every
row of M is multiplied with the current vector p_i to produce the
next vector p_i+1]
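The power iteration can be sketched in a few lines of plain Java, assuming M is given as a dense row-major array — a toy illustration of the math, not how one would handle a web-scale matrix:

```java
public class PowerIteration {
    // one step: p_{i+1} = M * p_i (each row of M dotted with p_i)
    static double[] multiply(double[][] m, double[] p) {
        double[] next = new double[p.length];
        for (int row = 0; row < m.length; row++) {
            double sum = 0.0;
            for (int col = 0; col < p.length; col++) {
                sum += m[row][col] * p[col];
            }
            next[row] = sum;
        }
        return next;
    }

    // k iterations of p <- M * p, starting from p0
    static double[] powerIteration(double[][] m, double[] p0, int k) {
        double[] p = p0.clone();
        for (int i = 0; i < k; i++) {
            p = multiply(m, p);
        }
        return p;
    }
}
```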
6. Drawbacks
• Not intuitive: only crazy scientists
think in matrices and eigenvectors
• Unnecessarily slow: Each iteration is scheduled
as separate MapReduce job with lots of overhead
– the graph structure is read from disk
– the map output is spilled to disk
– the intermediary result is written to HDFS
• Hard to implement: a join has to be implemented
by hand, lots of work, best strategy is data
dependent
7. Google Pregel
• distributed system especially developed for
large scale graph processing
• intuitive API that lets you 'think like a vertex'
• Bulk Synchronous Parallel (BSP) as execution
model
• fault tolerance by checkpointing
8. Bulk Synchronous Parallel (BSP)
[Figure: in each superstep, all processors perform local
computation, then communicate, then wait at a barrier
synchronization point]
9. Vertex-centric BSP
• each vertex has an id, a value, a list of its adjacent vertex ids and the
corresponding edge values
• each vertex is invoked in each superstep, can recompute its value and
send messages to other vertices, which are delivered over superstep
barriers
• advanced features : termination votes, combiners, aggregators, topology
mutations
[Figure: vertices 1–3 exchanging messages across the barriers
between supersteps i, i+1, and i+2]
10. Master-slave architecture
• vertices are partitioned and
assigned to workers
– default: hash-partitioning
– custom partitioning possible
• master assigns and coordinates,
while workers execute vertices
and communicate with each
other
[Figure: one master coordinating workers 1–3, which communicate
with each other]
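The default hash-partitioning mentioned above boils down to assigning each vertex id to a worker by hashing. A plain-Java sketch of the idea (a hypothetical helper, not the actual Giraph partitioner class):

```java
public class HashPartitioner {
    // Pregel/Giraph-style default assignment:
    // worker = hash(vertexId) mod numWorkers.
    // floorMod keeps the result non-negative for any hash value.
    public static int workerFor(long vertexId, int numWorkers) {
        return Math.floorMod(Long.hashCode(vertexId), numWorkers);
    }
}
```

Because the mapping is deterministic, any worker can compute where a message's destination vertex lives without consulting the master.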
11. PageRank in Pregel
class PageRankVertex {
  void compute(Iterator messages) {
    if (getSuperstep() > 0) {
      // recompute own PageRank from the neighbors' messages
      pageRank = sum(messages);
      setVertexValue(pageRank);
    }
    if (getSuperstep() < k) {
      // send updated PageRank to each neighbor
      sendMessageToAllNeighbors(pageRank / getNumOutEdges());
    } else {
      voteToHalt(); // terminate
    }
  }
}
12. PageRank toy example
[Figure: a three-vertex input graph A, B, C; the PageRank values
at each vertex are shown converging over supersteps 0, 1, and 2]
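The superstep loop behind such a toy example can be simulated outside of Giraph in plain Java. This is a sketch under two simplifying assumptions: every vertex has at least one out-edge, and no damping factor is used — so the total rank is conserved across supersteps.

```java
public class PregelPageRankSim {
    // out[v] lists the targets of vertex v's out-edges
    static double[] run(int[][] out, double[] initial, int supersteps) {
        double[] rank = initial.clone();
        for (int s = 0; s < supersteps; s++) {
            double[] incoming = new double[rank.length];
            // "send messages": each vertex distributes its rank
            // in equal shares to its out-neighbors
            for (int v = 0; v < rank.length; v++) {
                double share = rank[v] / out[v].length;
                for (int target : out[v]) {
                    incoming[target] += share;
                }
            }
            // "superstep barrier": every vertex replaces its value
            // by the sum of the messages it received
            rank = incoming;
        }
        return rank;
    }
}
```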
13. Cool, where can I download it?
• Pregel is proprietary, but:
– Apache Giraph is an open source
implementation of Pregel
– runs on standard Hadoop infrastructure
– computation is executed in memory
– can be a job in a pipeline (MapReduce, Hive)
– uses Apache ZooKeeper for synchronization
15. Anatomy of an execution
Setup
• load the graph from disk
• assign vertices to workers
• validate workers' health

Compute
• assign messages to workers
• iterate on active vertices
• call vertices' compute()

Synchronize
• send messages to workers
• compute aggregators
• checkpoint

Teardown
• write back result
• write back aggregators
16. Who is doing what?
• ZooKeeper: responsible for computation state
– partition/worker mapping
– global state: #superstep
– checkpoint paths, aggregator values, statistics
• Master: responsible for coordination
– assigns partitions to workers
– coordinates synchronization
– requests checkpoints
– aggregates aggregator values
– collects health statuses
• Worker: responsible for vertices
– invokes active vertices' compute() function
– sends, receives and assigns messages
– computes local aggregation values
17. What do you have to implement?
• your algorithm as a Vertex
– Subclass one of the many existing implementations:
BasicVertex, MutableVertex, EdgeListVertex,
HashMapVertex, LongDoubleFloatDoubleVertex,...
• a VertexInputFormat to read your graph
– e.g. from a text file with adjacency lists like
<vertex> <neighbor1> <neighbor2> ...
• a VertexOutputFormat to write back the result
– e.g. <vertex> <pageRank>
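Parsing one such adjacency-list line might look like the following — a hypothetical helper only; the real VertexInputFormat API works with input splits and readers, which this sketch omits:

```java
public class AdjacencyLineParser {
    // parses a line of the form "<vertex> <neighbor1> <neighbor2> ...";
    // ids[0] is the vertex id, the remaining entries are its neighbors
    public static long[] parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        long[] ids = new long[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            ids[i] = Long.parseLong(tokens[i]);
        }
        return ids;
    }
}
```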
18. Starting a Giraph job
• no difference to starting a Hadoop job:
$ hadoop jar giraph-0.1-jar-with-dependencies.jar
o.a.g.GiraphRunner o.a.g.examples.ConnectedComponentsVertex
--inputFormat o.a.g.examples.IntIntNullIntTextInputFormat
--inputPath hdfs:///wikipedia/pagelinks.txt
--outputFormat o.a.g.examples.ComponentOutputFormat
--outputPath hdfs:///wikipedia/results/
--workers 89
--combiner o.a.g.examples.MinimumIntCombiner
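The MinimumIntCombiner passed above illustrates why combiners pay off: in connected components, a vertex only cares about the smallest candidate component id it receives, so all messages bound for the same vertex can be collapsed into one before they cross the network. A minimal sketch of that idea (not the actual Giraph class):

```java
public class MinimumIntCombiner {
    // collapse all messages destined for one vertex into a single
    // message carrying only the minimum value
    public static int combine(int[] messages) {
        int min = Integer.MAX_VALUE;
        for (int m : messages) {
            min = Math.min(min, m);
        }
        return min;
    }
}
```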
20. What's to come?
• Current and future work in Giraph
– graduation from the incubator
– out-of-core messaging
– algorithms library
• 2-day workshop after Berlin Buzzwords
– topic: 'Parallel Processing beyond MapReduce'
– meet the developers of Giraph and Stratosphere
http://berlinbuzzwords.de/content/workshops-berlin-buzzwords
22. Further resources
• Apache Giraph homepage
http://incubator.apache.org/giraph
• Claudio Martella: “Apache Giraph: Distributed
Graph Processing in the Cloud”
http://prezi.com/9ake_klzwrga/apache-giraph-distributed-graph-
processing-in-the-cloud/
• Malewicz et al.: "Pregel – a system for large scale
graph processing", PODC '09
http://dl.acm.org/citation.cfm?id=1582723