1. Processing Over a Billion Edges
on Apache Giraph
Hadoop Summit 2012
Avery Ching
Software Engineer
6/14/2012
2. Agenda
1 Motivation and Background
2 Giraph Concepts/API
3 Example Applications
4 Architecture Overview
5 Recent/Future Improvements
3. What is Apache Giraph?
• Loose implementation of Google’s Pregel that runs
as a map-only job on Hadoop
• “Think like a vertex”: each vertex can send messages to any
other vertex in the graph, using the bulk synchronous
parallel programming model
• An in-memory scalable system*
▪ Will be enhanced with out-of-core messages/vertices to handle
larger problem sets.
4. What (social) graphs are we targeting?
• 3/2012 LinkedIn has 161 million users
• 6/2012 Twitter discloses 140 million MAU
• 4/2012 Facebook declares 901 million MAU
5. Example applications
• Ranking
▪ Popularity, importance, etc.
• Label Propagation
▪ Location, school, gender, etc.
• Community
▪ Groups, interests
6. Bulk synchronous parallel
• Supersteps
▪ A global epoch during which components do concurrent
computation and send messages, followed by a global barrier
• Point-to-point messages (i.e. vertex to vertex)
▪ Sent during a superstep from one component to another and
then delivered in the following superstep
• Computation complete when all components
complete
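A minimal, self-contained sketch of the model above (the names and the ring topology are purely illustrative, not Giraph API): each component consumes the messages delivered to it, computes, and sends messages that only become visible after the barrier.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BspSketch {
  public static void main(String[] args) {
    int[] values = {3, 6, 2};                               // one value per "vertex"
    Map<Integer, List<Integer>> inbox = new HashMap<>();    // messages delivered this superstep
    Map<Integer, List<Integer>> outbox = new HashMap<>();   // delivered in the next superstep
    for (int superstep = 0; superstep < 3; superstep++) {
      for (int v = 0; v < values.length; v++) {             // concurrent computation (serial here)
        for (int msg : inbox.getOrDefault(v, new ArrayList<>())) {
          values[v] = Math.max(values[v], msg);             // consume delivered messages
        }
        int neighbor = (v + 1) % values.length;             // send along a ring
        outbox.computeIfAbsent(neighbor, k -> new ArrayList<>()).add(values[v]);
      }
      inbox = outbox;                                       // global barrier: messages are delivered
      outbox = new HashMap<>();
    }
    System.out.println(Arrays.toString(values));            // prints [6, 6, 6]
  }
}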
7. Computation + Communication
[Diagram: BSP supersteps. Each processor computes and communicates during a superstep; a global barrier separates supersteps along the time axis.]
8. MapReduce -> Giraph
“Think like a vertex”, not a key-value pair!
MapReduce:
public class Mapper<
    KEYIN,
    VALUEIN,
    KEYOUT,
    VALUEOUT> {
  void map(KEYIN key,
           VALUEIN value,
           Context context)
      throws IOException,
             InterruptedException;
}

Giraph:
public class Vertex<
    I extends WritableComparable,
    V extends Writable,
    E extends Writable,
    M extends Writable> {
  void compute(
      Iterator<M> msgIterator);
}
9. Basic Giraph API
Methods available to compute()
Immediate effect/access:
  I getVertexId()
  V getVertexValue()
  void setVertexValue(V vertexValue)
  Iterator<I> iterator()
  E getEdgeValue(I targetVertexId)
  boolean hasEdge(I targetVertexId)
  boolean addEdge(I targetVertexId, E edgeValue)
  E removeEdge(I targetVertexId)
  void voteToHalt()
  boolean isHalted()

Next superstep:
  void sendMsg(I id, M msg)
  void sendMsgToAllEdges(M msg)
  void addVertexRequest(BasicVertex<I, V, E, M> vertex)
  void removeVertexRequest(I vertexId)
  void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
  void removeEdgeRequest(I sourceVertexId, I destVertexId)
10. Why not implement Giraph with multiple
MapReduce jobs?
• Too much disk, no in-memory caching, a superstep
becomes a job!
[Diagram: one MapReduce job per superstep: input format → map tasks → intermediate files → reduce tasks → output format, with input splits feeding the map tasks and output files written by the reducers.]
11. Giraph is a single Map-only job in
Hadoop
• Hadoop is purely a resource manager for Giraph; all
communication is done through Netty-based IPC
[Diagram: a single map-only job: vertex input format → map tasks → vertex output format, with no intermediate files and no reduce phase.]
12. Maximum vertex value implementation
public class MaxValueVertex extends EdgeListVertex<
    IntWritable, IntWritable, IntWritable, IntWritable> {
  @Override
  public void compute(Iterator<IntWritable> msgIterator) {
    boolean changed = false;
    // Take the maximum of the current value and all incoming messages.
    while (msgIterator.hasNext()) {
      IntWritable msgValue = msgIterator.next();
      if (msgValue.get() > getVertexValue().get()) {
        setVertexValue(msgValue);
        changed = true;
      }
    }
    // Propagate the value on the first superstep or whenever it grows;
    // otherwise this vertex is done and votes to halt.
    if (getSuperstep() == 0 || changed) {
      sendMsgToAllEdges(getVertexValue());
    } else {
      voteToHalt();
    }
  }
}
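A rough sketch of a driver that could submit MaxValueVertex as a Giraph job. The GiraphJob package and method names here are assumptions based on the 0.1-era API and may differ between releases; MyVertexInputFormat and MyVertexOutputFormat are hypothetical placeholders, not real Giraph classes.

// Package of GiraphJob may differ across Giraph versions (assumption).
import org.apache.giraph.graph.GiraphJob;

public class MaxValueRunner {
  public static void main(String[] args) throws Exception {
    GiraphJob job = new GiraphJob("MaxValueVertex");
    // Vertex logic plus hypothetical input/output formats for loading and
    // storing the graph.
    job.setVertexClass(MaxValueVertex.class);
    job.setVertexInputFormatClass(MyVertexInputFormat.class);
    job.setVertexOutputFormatClass(MyVertexOutputFormat.class);
    // min workers, max workers, minimum % of workers that must respond
    job.setWorkerConfiguration(10, 10, 100.0f);
    System.exit(job.run(true) ? 0 : -1);
  }
}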
13. Maximum vertex value
[Diagram: the maximum-value example running on two processors. Starting values of 1, 2, and 5 are exchanged superstep by superstep, with a barrier after each superstep, until every vertex holds the maximum value 5.]
14. PageRank implementation
public class SimplePageRankVertex extends EdgeListVertex<LongWritable,
    DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(Iterator<DoubleWritable> msgIterator) {
    if (getSuperstep() >= 1) {
      // Sum the PageRank contributions received from in-edges.
      double sum = 0;
      while (msgIterator.hasNext()) {
        sum += msgIterator.next().get();
      }
      setVertexValue(new DoubleWritable(
          (0.15f / getNumVertices()) + 0.85f * sum));
    }
    if (getSuperstep() < 30) {
      // Distribute this vertex's PageRank evenly across its out-edges.
      long edges = getNumOutEdges();
      sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
    } else {
      voteToHalt();
    }
  }
}
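The update in compute() above is the standard damped PageRank recurrence, with damping factor 0.85 over the graph's N vertices:

  PR(v) = \frac{0.15}{N} + 0.85 \sum_{u \to v} \frac{PR(u)}{\mathrm{outdeg}(u)}

Each vertex sends PR(u) / outdeg(u) along its out-edges during one superstep, and the sum of those contributions arrives as messages in the next.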
16. Giraph components
• Master – Application coordinator
▪ One active master at a time
▪ Assigns partition owners to workers prior to each superstep
▪ Synchronizes supersteps
• Worker – Computation & messaging
▪ Loads the graph from input splits
▪ Does the computation/messaging of its assigned partitions
• ZooKeeper
▪ Maintains global application state
17. Graph distribution
• Master graph partitioner
▪ Create initial partitions, generate partition owner changes
between supersteps
• Worker graph partitioner
▪ Determine which partition a vertex belongs to
▪ Create/modify the partition stats (can split/merge partitions)
• Default is hash partitioning via hashCode() (see the sketch below)
▪ Range-based partitioning is also possible on a per-type basis
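A minimal sketch of what default hash partitioning amounts to (illustrative only, not the actual Giraph partitioner classes): a vertex is assigned to a partition by its id's hashCode(), and the master in turn assigns partitions to workers.

import org.apache.hadoop.io.LongWritable;

public class HashPartitionSketch {
  // Map a vertex id to one of partitionCount partitions by its hashCode().
  static int partitionOf(LongWritable vertexId, int partitionCount) {
    return Math.abs(vertexId.hashCode() % partitionCount);
  }

  public static void main(String[] args) {
    // With e.g. 4 partitions, vertex ids spread across partitions regardless of range.
    for (long id : new long[] {1L, 42L, 1000000L}) {
      System.out.println(id + " -> partition " + partitionOf(new LongWritable(id), 4));
    }
  }
}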
19. Customizable fault tolerance
• No single point of failure from Giraph threads
▪ With multiple master threads, if the current master dies, a new one will automatically
take over.
▪ If a worker thread dies, the application is rolled back to a previously checkpointed
superstep. The next superstep will begin with the new set of workers
▪ If a ZooKeeper server dies, as long as a quorum remains, the application can proceed
• Hadoop single points of failure still exist
▪ Namenode, jobtracker
▪ Restarting manually from a checkpoint is always possible
20. Master thread fault tolerance
[Diagram: before the failure of active master 0, master 0 holds the active master state while masters 1 and 2 stand by as spares; after master 0 fails, master 1 takes over the active master state and master 2 remains a spare.]
• One active master, with spare masters taking over in the event of an active master
failure
• All active master state is stored in ZooKeeper so that a spare master can
immediately step in when an active master fails
• “Active” master implemented as a queue in ZooKeeper
21. Worker thread fault tolerance
[Diagram: superstep timeline with checkpoints. A worker failure during an uncheckpointed superstep rolls execution back to the most recent checkpointed superstep; a second failure after a later checkpoint completes restarts from that checkpoint, and the application then runs to completion.]
• A single worker death fails the superstep
• Application reverts to the last committed superstep automatically
▪ Master detects worker failure during any superstep with a ZooKeeper “health”
znode
▪ Master chooses the last committed superstep and sends a command through
ZooKeeper for all workers to restart from that superstep
22. Optional features
• Combiners
▪ Similar to MapReduce combiners
▪ Users implement a combine() method that can reduce the
amount of messages sent and received (sketched below)
▪ Run on both the client side (memory, network) and server side
(memory)
• Aggregators
▪ Similar to MPI aggregation routines (e.g. max, min, sum, etc.)
▪ Commutative and associative operations that are performed
globally
▪ Examples include global communication, monitoring, and
statistics
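As a sketch of the combiner idea referenced above, the class below sums DoubleWritable messages headed for the same vertex (e.g. the PageRank contributions from slide 14). The class and method shape follow the MapReduce-combiner analogy and are an assumption, not the exact Giraph combiner API.

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;

public class DoubleSumCombinerSketch {
  // Collapse all messages destined for one vertex into a single message,
  // cutting memory use and network traffic on both sender and receiver.
  public DoubleWritable combine(LongWritable targetVertexId,
                                Iterable<DoubleWritable> messages) {
    double sum = 0;
    for (DoubleWritable msg : messages) {
      sum += msg.get();
    }
    return new DoubleWritable(sum);
  }
}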
23. Recent Netty IPC implementation
• Big improvement over the Hadoop RPC implementation
• 10-39% overall performance improvement
• Still need more Netty tuning
[Chart: run time in seconds for the Netty and Hadoop RPC implementations at 10, 30, and 50 workers, with the percentage improvement of Netty over Hadoop RPC.]
24. Recent benchmarks
• Test cluster of 80 machines
▪ Facebook Hadoop (https://github.com/facebook/hadoop-20)
▪ 72 cores, 64+ GB of memory
▪ org.apache.giraph.benchmark.PageRankBenchmark
▪ 5 supersteps
▪ No checkpointing
▪ 10 edges per vertex
27. Worker / edge scalability
[Chart: run time in seconds and number of edges in billions versus number of workers (10, 30, and 50).]
28. Apache Giraph has graduated as of
5/2012
• Incubated for less than a year (entered the incubator 9/2011)
• Committers from Hortonworks, Twitter, LinkedIn,
Facebook, Trend Micro, and various schools (VU
Amsterdam, TU Berlin, Korea University)
• Released 0.1 as of 2/6/2012; 0.2 will be released
within a few months
29. Future improvements
• Out-of-core messages/graph
▪ Under memory pressure, dump messages/portions of the graph
to local disk
▪ Ability to run applications without having all needed memory
• Performance improvements
▪ Netty is a good step in the right direction, but messaging
performance still needs tuning, as it takes up a majority of the run time
▪ Scale back use of ZooKeeper to only be for health registration,
rather than implementing aggregators and coordination
30. More future improvements
• Adding a master#compute() method
▪ Arbitrary master computation that sends results to workers prior
to a superstep to simplify certain algorithms
▪ GIRAPH-127
• Handling skew
▪ Some vertices have a large number of edges and we need to
break them up and handle them differently to provide better
scalability