WOOster: A Map-Reduce based Platform for Graph Mining

Aravindan Raghuveer
Yahoo! Inc, Bangalore.

Introduction

“If you squint the right way, graphs
are everywhere” [1]
@ Yahoo! :
• The WOO Graph: All knowledge
assimilated from the web.
- http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Industry/WOO_ISWC.pptx
[1] http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
Yahoo! Confidential

The What and Why?
What?
• Family of Graph Query Algorithms
• Framework:
• For graph storage and invoking the query algorithms
• Hosted solution on Hadoop

Why?
• Family of Graph Query Algorithms: Present-day algorithms do not scale to billion-edge, billion-vertex graphs.
• Framework:
• Optimizes storage layout to suit graph query algorithms
• Improves throughput of the queries

Outline of the talk

• MapReduce 101
• Graph Mining Approaches
• Brief overview of WOOster architecture
• Graph query algorithms in WOOster:
• Sub Graph Matching
• Reachability Query
• Experiments
• Conclusion


MapReduce 101

• Switch to slides from "Cloud Computing with MapReduce and Hadoop"
• www.cs.berkeley.edu/~matei/talks/2009/parlab_bootcamp_clouds.ppt


MapReduce Programming Model

• Data type: key-value records

• Map function:
(K_in, V_in) → list(K_inter, V_inter)

• Reduce function:
(K_inter, list(V_inter)) → list(K_out, V_out)

Example: Word Count

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum the counts collected for one word.
    yield (key, sum(values))

Word Count Execution

Input → Map → Shuffle & Sort → Reduce → Output

Input splits and map output:
• "the quick brown fox" → (the, 1), (quick, 1), (brown, 1), (fox, 1)
• "the fox ate the mouse" → (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1)
• "how now brown cow" → (how, 1), (now, 1), (brown, 1), (cow, 1)

Reduce output:
• Reducer 1: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3)
• Reducer 2: (ate, 1), (cow, 1), (mouse, 1), (quick, 1)
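The execution flow above can be reproduced in a single process. The sketch below is only an illustration: `run_mapreduce` is a hypothetical in-memory driver, not a Hadoop API, and it stands in for the shuffle and sort phase with a dictionary that groups values by key.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word, 1

def reducer(key, values):
    """Reduce: sum all counts collected for one word."""
    yield key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, shuffle & sort, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # sort: keys reach the reducer in order
        output.extend(reduce_fn(key, groups[key]))
    return output

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = dict(run_mapreduce(lines, mapper, reducer))
```

On a real cluster the grouping step is distributed across machines, which is why the choice of key determines how well a job parallelizes.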

Graph Mining Approaches: Two Schools
• School-1: Invent a new platform:
- Map-reduce is not best suited for graph mining
- BSP, PRAM models: circa 1980s
- Pregel from Google [1]; HaLoop (University of Washington)
• School-2: Ride on Map-Reduce
- MR has wide adoption, open source tools, industry support.
- No need to invest in one more computing infrastructure.
- Apache Giraph: http://incubator.apache.org/giraph/ (BSP on Hadoop)
- Efforts in open source / academia on the same lines:
• Pegasus, CMU [2]
• Graph Mining in Apache Mahout [3]
• Raytheon's Graph Mining [4]
[1] SIGMOD 2010, http://dl.acm.org/citation.cfm?id=1807184
[2] http://www.cs.cmu.edu/~pegasus/
[3] http://www.robust-project.eu/news/robust-project-pushes-large-scale-graph-mining-with-hadoop-apache
[4] http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/

WOOster Architecture
• User submits a query.
• Planner periodically scans for newly arrived queries.
• Planner creates an M-R plan that re-uses computation / IO across queries. (Batching)
• Executor executes the M-R plan.
• Result notified to the user.

Figure: WOOster Web UI & WebService APIs; Planner; Executor; Graph Indices; Jobs D/B; WOO Graph (Hosted Solution); Grid.


The Sub-Graph Match Query

Find all instances of query Q in graph G.
• Vertices have attributes (e.g. age: 31).
• Vertices and edges have constraints (e.g. age < 40).
• Edges have relationship labels.

Figure notation: query vertex, graph vertex, matched graph vertex.

Why Sub-Graph Match (Exact Sub-Graph Isomorphism)?
• A popular and expressive graph query, useful for mining patterns.
• To our knowledge, no existing large-scale algorithm operates on a billion-vertex graph.

Overview of the Solution

Step-0. Data Layout on HDFS

Step-1. Query Graph Partitioning

Step-2. Edge Selection

Step-3. Query Partition Matching

Step-4. Query Partition Merging

Data Layout on HDFS

• How to store a large scale graph?
• Adjacency List like solution:
• Each row/line has information about a vertex:
• Vertex attributes
• Vertex neighbors and the labels associated with each edge.

Implications:
• Enables early pruning of non-matching edges and vertices.
• Each vertex has information about itself and its immediate neighbors only.
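A minimal sketch of such a per-vertex row, assuming a tab-separated text format that is purely illustrative: the field order, the `k=v` and `neighbor:label` encodings, and the names `VertexRecord` and `parse_vertex_line` are hypothetical and not taken from WOOster.

```python
from dataclasses import dataclass, field

@dataclass
class VertexRecord:
    vertex_id: str
    attributes: dict = field(default_factory=dict)
    neighbors: list = field(default_factory=list)  # (neighbor_id, edge_label) pairs

def parse_vertex_line(line):
    """Parse 'id<TAB>k=v,k=v<TAB>nbr:label,nbr:label' (an assumed format)."""
    vertex_id, attr_part, nbr_part = line.rstrip("\n").split("\t")
    attributes = dict(item.split("=", 1) for item in attr_part.split(",") if item)
    neighbors = [tuple(item.split(":", 1)) for item in nbr_part.split(",") if item]
    return VertexRecord(vertex_id, attributes, neighbors)

record = parse_vertex_line("g1\tage=31,type=person\tg2:friend,g3:colleague")
```

Because one row carries the vertex attributes and its labeled edges together, a mapper that reads the row can reject a vertex or an edge without looking at any other row.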


Step-1: Query Graph Partitioning

Why? Parallelized solving of independent sub-problems.
How? Find the minimum number of partitions such that the diameter of each partition = 2.
Figure: query graph partitions with their pivot vertices marked.
Intuition:
• In a spanning tree of diameter 2, there is one vertex that is connected to all other vertices → pivot vertex
• Will use this property in steps 2, 3.
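One way to picture the partitioning is as a split of the query edges into stars, each centered on a pivot. The sketch below is a greedy heuristic under that assumption; it is not the algorithm from the paper, it does not guarantee the minimum number of partitions, and it assumes an undirected query graph without self-loops. The function name `partition_query_graph` is hypothetical.

```python
from collections import defaultdict

def partition_query_graph(edges):
    """Greedy heuristic: split undirected query edges into star-shaped partitions.

    Returns a list of (pivot, leaves) pairs. Not guaranteed to be minimal.
    """
    remaining = {frozenset(e) for e in edges}
    partitions = []
    while remaining:
        degree = defaultdict(int)
        for edge in remaining:
            for vertex in edge:
                degree[vertex] += 1
        # Pick the vertex covering the most uncovered edges; ties break by name.
        pivot = max(sorted(degree), key=lambda v: degree[v])
        star = {edge for edge in remaining if pivot in edge}
        leaves = sorted(next(iter(edge - {pivot})) for edge in star)
        partitions.append((pivot, leaves))
        remaining -= star
    return partitions

query_edges = [("q1", "q2"), ("q1", "q3"), ("q1", "q4"), ("q4", "q5"), ("q4", "q6")]
partitions = partition_query_graph(query_edges)
```

Every partition produced this way has one vertex adjacent to all the others, so any two vertices in it are at most two hops apart.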


Step-2: Edge Selection
• What: Select a subset of edges from G that match at least one edge in Q.
• How:
1. g1: current vertex in the mapper.
2a. Mapper emits all of g1's edges if vertex and edge constraints are met.
2b. For every neighbor of q1, there exists a corresponding neighbor for g1.
3. g1-g2 emitted: g1 mapped to a query vertex.
4. g1-g2 emitted from g2's mapper.
5. Reducer emits an edge if a pair is found.

Figure: graph vertex g1 with neighbors g2, g3, g4 passes through the Map logic; edge g1-g2 reaches the Reduce logic.
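The steps above can be sketched as a map and a reduce function run in memory. Everything here is an assumption made for illustration: the toy graph, the query (two vertices joined by a `friend` edge, both with `age < 40`), the undirected edge key, and all function names. The paper's actual algorithm may differ in its keys and checks.

```python
from collections import defaultdict

# Hypothetical toy data: {vertex id: (attributes, [(neighbor, edge label), ...])}.
GRAPH = {
    "g1": ({"age": 31}, [("g2", "friend"), ("g3", "friend")]),
    "g2": ({"age": 35}, [("g1", "friend")]),
    "g3": ({"age": 52}, [("g1", "friend")]),
}
# Hypothetical query: q1 -friend- q2, both constrained to age < 40.
QUERY_VERTICES = {"q1": lambda a: a["age"] < 40, "q2": lambda a: a["age"] < 40}
QUERY_NEIGHBORS = {"q1": [("q2", "friend")], "q2": [("q1", "friend")]}

def edge_selection_mapper(g, attributes, neighbors):
    """Emit (edge, (g, q, q_other)) for edges of g that may match a query edge."""
    labels_at_g = {label for _, label in neighbors}
    for q, constraint in QUERY_VERTICES.items():
        if not constraint(attributes):
            continue  # vertex constraint not met
        # Every query neighbor of q needs a corresponding neighbor of g.
        if any(label not in labels_at_g for _, label in QUERY_NEIGHBORS[q]):
            continue
        for g_other, label in neighbors:
            for q_other, q_label in QUERY_NEIGHBORS[q]:
                if label == q_label:  # edge constraint met
                    yield tuple(sorted((g, g_other))), (g, q, q_other)

def edge_selection_reducer(edge, values):
    """Emit the edge when both endpoints proposed complementary mappings."""
    a, b = edge
    proposals = set(values)
    for g, q, q_other in sorted(proposals):
        if g == a and (b, q_other, q) in proposals:
            yield edge, {a: q, b: q_other}

def select_edges(graph):
    """Local stand-in for the map, shuffle, and reduce phases."""
    grouped = defaultdict(list)
    for g, (attributes, neighbors) in graph.items():
        for key, value in edge_selection_mapper(g, attributes, neighbors):
            grouped[key].append(value)
    return [out for key in sorted(grouped)
            for out in edge_selection_reducer(key, grouped[key])]

selected = select_edges(GRAPH)
```

In this toy run the edge g1-g3 is dropped: g1 proposes it, but g3 fails the age constraint and never emits the matching half of the pair.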

Step-3: Query Partition Matching
Edge Selection:
• Associates a graph vertex with the possible query vertices it could map to.
• Associates the graph vertex with its "pivot" graph vertex.
• A pivot graph vertex is a graph vertex that is mapped to a pivot query vertex: g1 in this example.

Steps:
1. Mapper emits the pivot graph vertex as key and the edge as value.
2. Reducer receives all edges with the same pivot graph vertex.
3. Reducer forms the partition.

Figure: edge selection output (g1-g2, g1-g3, g1-g4) passes through the Map and Reduce logic; the reducer forms the partition of g1 with g2, g3, g4.
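A minimal sketch of this step, assuming one query partition with pivot `q1` and leaves `q2`, `q3`, `q4`. The input records, the constants, and the function names are hypothetical, and the reducer keeps only one graph vertex per query leaf, which is a simplification.

```python
from collections import defaultdict

# Hypothetical edge-selection output: (graph edge, {graph vertex: query vertex}).
PIVOT_QUERY_VERTEX = "q1"
REQUIRED_LEAVES = {"q2", "q3", "q4"}
SELECTED_EDGES = [
    (("g1", "g2"), {"g1": "q1", "g2": "q2"}),
    (("g1", "g3"), {"g1": "q1", "g3": "q3"}),
    (("g1", "g4"), {"g1": "q1", "g4": "q4"}),
    (("g5", "g6"), {"g5": "q1", "g6": "q2"}),
]

def matching_mapper(edge, mapping):
    """Emit the pivot graph vertex as key and the edge (with its mapping) as value."""
    for g in edge:
        if mapping[g] == PIVOT_QUERY_VERTEX:
            yield g, (edge, mapping)

def matching_reducer(pivot, values):
    """Form a partition match if the pivot's edges cover every query leaf."""
    leaves = {}
    for edge, mapping in values:
        for g in edge:
            if g != pivot:
                leaves[mapping[g]] = g   # query leaf -> graph vertex
    if REQUIRED_LEAVES.issubset(leaves):
        yield pivot, leaves

grouped = defaultdict(list)
for edge, mapping in SELECTED_EDGES:
    for key, value in matching_mapper(edge, mapping):
        grouped[key].append(value)      # shuffle: group edges by pivot graph vertex

matches = [m for pivot in sorted(grouped)
           for m in matching_reducer(pivot, grouped[pivot])]
```

Here g1 yields a partition match, while g5 does not because it only covers the leaf `q2`.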

Step-4: Query Partition Merging
• Merges partitions one after another to form a query match
• More details in paper.
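Since the slides defer the merging details to the paper, the following is only a guess at the general idea: two partition matches can be joined when they agree on every query vertex they share and never map two query vertices to the same graph vertex. The function names are hypothetical, and the nested loop is a local stand-in for what would be a keyed join in Map-Reduce.

```python
def merge_matches(left, right):
    """Join two partial matches ({query vertex: graph vertex}); None if they conflict."""
    for q in left.keys() & right.keys():
        if left[q] != right[q]:
            return None          # shared query vertex mapped to different graph vertices
    merged = {**left, **right}
    if len(set(merged.values())) != len(merged):
        return None              # two query vertices mapped to one graph vertex
    return merged

def merge_partitions(left_matches, right_matches):
    """Pairwise join of two lists of partition matches."""
    merged_all = []
    for left in left_matches:
        for right in right_matches:
            merged = merge_matches(left, right)
            if merged is not None:
                merged_all.append(merged)
    return merged_all

p1 = [{"q1": "g1", "q2": "g2", "q4": "g4"}]
p2 = [{"q4": "g4", "q5": "g5"}, {"q4": "g9", "q5": "g5"}, {"q4": "g4", "q5": "g2"}]
full = merge_partitions(p1, p2)
```

Only the first candidate in `p2` survives: the second disagrees on `q4`, and the third would reuse `g2` for two query vertices.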

Take-away from Steps 1-4 (also for any scalable Map-Reduce program):
• The mapper/reducer keys are chosen such that the number of keys is proportional to the number of matches of query Q in the graph.
• Hence the algorithm scales well for large graphs and complex queries.

Results

Figure: Time (sec) versus Number of Reducers (100, 150, 200, 250) for three stages: Edge Selection, Query Partition Matching, Query Partition Merging.

• Graph of 10 million vertices and 50 million edges
• Complex query of 24 vertices
• Note that the edge selection time decreases as the number of reducers increases.

In the paper…

• Detailed map-reduce algorithms for sub-graph match and reachability
• Theoretical analysis for scalability
• Construction of the synthetic dataset
• Methodology and more experiments
• Reachability query: examples, map-reduce algorithm
• Related work


Future Work

• Indexing structure for graphs suited to M-R jobs
• Compare with a Giraph-based approach.
• Better batching strategies.
• The right interface for plugging in custom graph algorithms while WOOster provides automatic batching.
• Implement more graph mining algorithms.


Questions / Comments
