WOOster: A Map-Reduce based
Platform for Graph Mining
  Aravindan Raghuveer
  Yahoo! Inc, Bangalore.
Introduction

           “If you squint the right way, graphs
             are everywhere” [1]
           @ Yahoo! :
                      • The WOO Graph: All knowledge
                        assimilated from the web.
                      - http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Industry/WOO_ISWC.pptx
     [1] http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
Yahoo! Confidential
The What and Why?
 What?
                      • Family of Graph Query Algorithms.
                      • Framework:
                          • For graph storage and invoking the query algorithms
                          • Hosted Solution on Hadoop

  Why?
                      • Family of Graph Query Algorithms: Present-day
                      algorithms do not scale to graphs with billions of
                      vertices and edges.
                      • Framework:
                          • Optimizes the storage layout to suit graph query
                          algorithms.
                          • Improves the throughput of the queries.
Outline of the talk

      •     MapReduce 101
      •     Graph Mining Approaches
      •     Brief overview of WOOster architecture
      •     Graph query algorithms in WOOster:
             • Sub Graph Matching
             • Reachability Query
      •     Experiments
      •     Conclusion


Map Reduce 101


             Switch to slides from Cloud Computing
              with MapReduce and Hadoop
             www.cs.berkeley.edu/~matei/talks/2009/parlab_bootcamp_clouds.ppt




MapReduce Programming Model

• Data type: key-value records

• Map function:
            (K_in, V_in) → list(K_inter, V_inter)

• Reduce function:
         (K_inter, list(V_inter)) → list(K_out, V_out)
Example: Word Count

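# output() stands for the framework's emit call.
def mapper(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.split():
        output(word, 1)


def reducer(key, values):
    # values holds every count emitted for this word; emit their sum.
    output(key, sum(values))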
Word Count Execution

  Input → Map → Shuffle & Sort → Reduce → Output

  Input line                 Map output
  the quick brown fox        the, 1; quick, 1; brown, 1; fox, 1
  the fox ate the mouse      the, 1; fox, 1; ate, 1; the, 1; mouse, 1
  how now brown cow          how, 1; now, 1; brown, 1; cow, 1

  Reduce output (reducer 1): brown, 2; fox, 2; how, 1; now, 1; the, 3
  Reduce output (reducer 2): ate, 1; cow, 1; mouse, 1; quick, 1
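
The word count flow can be reproduced on a single machine. The sketch below is only a stand-in for Hadoop: `run_mapreduce` is a hypothetical helper that imitates the map, shuffle & sort, and reduce phases in memory, and the mapper and reducer use `yield` in place of the slide's `output` call.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word, 1


def reducer(key, values):
    """Sum the counts collected for one word."""
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for map, shuffle & sort, and reduce."""
    groups = defaultdict(list)
    for record in records:                  # map phase
        for key, value in mapper(record):
            groups[key].append(value)       # shuffle: group values by key
    output = []
    for key in sorted(groups):              # sort, then reduce each key
        output.extend(reducer(key, groups[key]))
    return output


lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = dict(run_mapreduce(lines, mapper, reducer))
print(counts)
```

On a real cluster the groups are spread across many reducers; here they all live in one dictionary, which is enough to check the logic of a mapper/reducer pair before submitting a job.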
Graph Mining Approaches : Two Schools
           School-1: Invent a new platform:
             - Map-Reduce is not best suited for graph mining.
             - BSP, PRAM models: circa 1980s
             - Pregel from Google [1]; HaLoop from the University of Washington
           School-2: Ride on Map-Reduce
             -    MR has wide adoption, open source tools, industry support.
             -    Avoids investing in one more computing infrastructure.
             -    Apache Giraph: http://incubator.apache.org/giraph/ (BSP on Hadoop)
             -    Efforts in open source / academia on the same lines:
                    • Pegasus CMU [2]
                    • Graph Mining in Apache Mahout[3]
                    • Raytheon’s Graph Mining [4]
    [1] SIGMOD 2010, http://dl.acm.org/citation.cfm?id=1807184
    [2] http://www.cs.cmu.edu/~pegasus/
    [3] http://www.robust-project.eu/news/robust-project-pushes-large-scale-graph-mining-with-hadoop-apache
    [4] http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/
WOOster Architecture
  [Diagram components: WOOster Web UI & WebService APIs; Planner; Executor; Graph Indices; Jobs D/B; WOO Graph; all hosted on the Grid]

  •   User submits a query.
  •   Planner periodically scans for newly arrived queries.
  •   Planner creates an M-R plan that re-uses computation and I/O
      across queries (batching).
  •   Executor executes the M-R plan.
  •   Result is notified to the user (hosted solution).
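
The benefit of batching can be illustrated with a toy scan. This is a minimal sketch, not the WOOster planner: `batched_scan`, the query ids, and the record shape are all hypothetical. It only shows the idea of sharing I/O, where several pending queries are evaluated during one pass over the graph instead of one pass per query.

```python
def batched_scan(graph_records, queries):
    """Evaluate every pending query's vertex filter in one pass over the graph.

    `queries` maps a hypothetical query id to a predicate over a vertex record.
    """
    results = {query_id: [] for query_id in queries}
    records_read = 0
    for record in graph_records:            # the graph is read once, not once per query
        records_read += 1
        for query_id, predicate in queries.items():
            if predicate(record):
                results[query_id].append(record["id"])
    return results, records_read


graph = [
    {"id": "v1", "age": 31},
    {"id": "v2", "age": 45},
    {"id": "v3", "age": 28},
]
pending = {
    "q_under_40": lambda v: v["age"] < 40,
    "q_over_30": lambda v: v["age"] > 30,
}
results, records_read = batched_scan(graph, pending)
print(results, records_read)
```

Two queries are answered while each of the three records is read only once; without batching the same work would read six records.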


The Sub-Graph Match Query

     Find all instances of query Q in graph G.
       • Vertices have attributes (e.g. age: 31).
       • Vertices and edges have constraints (e.g. age < 40).
       • Edges have relationship labels.
     [Figure notation: query vertex, graph vertex, matched graph vertex]


       Why Sub-Graph Match (Exact Sub-Graph Isomorphism)?
       • A popular and expressive graph query useful to mine patterns.
       • To our knowledge, no large-scale algorithm exists that operates on a
       billion-vertex graph.
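
To make the query semantics concrete, here is a brute-force reference matcher. It is not the WOOster algorithm and only works on tiny graphs, since it tries every assignment of graph vertices to query vertices. The data shapes, the directed reading of edges, and the name `match_subgraph` are assumptions made for this illustration.

```python
from itertools import permutations


def match_subgraph(graph_vertices, graph_edges, query_vertices, query_edges):
    """Brute-force reference matcher; only practical for tiny graphs.

    graph_vertices: {vertex id: attribute dict}
    graph_edges:    set of (source, label, target)
    query_vertices: {query vertex: predicate over an attribute dict}
    query_edges:    list of (query source, label, query target)
    """
    names = list(query_vertices)
    matches = []
    for candidate in permutations(graph_vertices, len(names)):
        mapping = dict(zip(names, candidate))
        vertices_ok = all(
            query_vertices[q](graph_vertices[g]) for q, g in mapping.items()
        )
        edges_ok = all(
            (mapping[s], label, mapping[t]) in graph_edges
            for s, label, t in query_edges
        )
        if vertices_ok and edges_ok:
            matches.append(mapping)
    return matches


graph_vertices = {"g1": {"age": 31}, "g2": {"age": 52}, "g3": {"age": 35}}
graph_edges = {("g1", "friend", "g2"), ("g3", "friend", "g2"), ("g2", "friend", "g1")}
query_vertices = {"q1": lambda a: a["age"] < 40, "q2": lambda a: True}
query_edges = [("q1", "friend", "q2")]
matches = match_subgraph(graph_vertices, graph_edges, query_vertices, query_edges)
print(matches)
```

The number of candidate assignments grows factorially with the query size, which is why a billion-vertex graph needs the partitioned approach described in the following slides.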
Overview of the Solution

    Step-0. Data Layout on HDFS


    Step-1. Query Graph Partitioning


   Step-2. Edge Selection


   Step-3. Query Partition Matching


   Step-4. Query Partition Merging
Data Layout on HDFS

        •      How to store a large scale graph?
        •      Adjacency List like solution:
                • Each row/line has information about a vertex:
                      • Vertex attributes
                      • Vertex neighbors and the labels associated with each edge.


        Implications:
        •Enables early pruning of non-matching edges and vertices.
        •Each vertex has information about itself and its immediate
        neighbors only.
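
One possible row format for such a layout is sketched below. The slides do not give the actual on-disk encoding, so the tab-separated fields, the JSON payloads, and the helper names are assumptions. The last helper shows why the layout permits early pruning: a row can be rejected without reading any other row.

```python
import json


def encode_vertex(vertex_id, attributes, neighbors):
    """Serialize one vertex as a single line: id, attributes, labeled neighbors."""
    return "\t".join(
        [vertex_id, json.dumps(attributes, sort_keys=True), json.dumps(neighbors)]
    )


def decode_vertex(line):
    """Parse a line back into (id, attributes, [(label, neighbor id), ...])."""
    vertex_id, attributes, neighbors = line.rstrip("\n").split("\t")
    return vertex_id, json.loads(attributes), [tuple(n) for n in json.loads(neighbors)]


def can_match(line, vertex_constraint, required_label):
    """Early pruning: decide from one row alone whether the vertex is a candidate."""
    _, attributes, neighbors = decode_vertex(line)
    return vertex_constraint(attributes) and any(
        label == required_label for label, _ in neighbors
    )


row = encode_vertex("g1", {"age": 31}, [("friend", "g2"), ("colleague", "g3")])
print(row)
print(can_match(row, lambda a: a["age"] < 40, "friend"))
```

Because a row knows nothing about its neighbors' attributes, any constraint on the far end of an edge has to be checked later, when records from both endpoints meet in a reducer.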

Step-1: Query Graph Partitioning

        Why? Independent sub-problems can be solved in parallel.
       How?
       Find the minimum number of partitions such that the
       diameter of each partition = 2.
       [Figure: query graph partitions with their pivot vertices highlighted]
       Intuition:
       • In a spanning tree of diameter 2, there is one vertex that is
       connected to all other vertices → the pivot vertex.
       • This property is used in Steps 2 and 3.
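
A simple way to obtain star-shaped partitions is a greedy heuristic: repeatedly pick the query vertex with the most unassigned edges as a pivot and give it all of those edges. This sketch is an assumption about how partitioning could be done, not the method from the paper, and it does not guarantee the minimum number of partitions. A pivot left with a single edge yields a partition of diameter 1 rather than 2.

```python
def partition_query(edges):
    """Greedily split query edges into star-shaped partitions.

    Returns a list of (pivot, edges) pairs. Heuristic: may use more
    partitions than the true minimum.
    """
    remaining = set(edges)
    partitions = []
    while remaining:
        degree = {}
        for u, v in remaining:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        # Highest remaining degree wins; ties are broken by name for determinism.
        pivot = min(degree, key=lambda x: (-degree[x], x))
        star = {e for e in remaining if pivot in e}
        partitions.append((pivot, sorted(star)))
        remaining -= star
    return partitions


query_edges = [("q1", "q2"), ("q1", "q3"), ("q1", "q4"), ("q4", "q5"), ("q4", "q6")]
partitions = partition_query(query_edges)
print(partitions)
```

Every edge in a partition touches its pivot, so a matching graph partition can be assembled by a single reducer keyed on the graph vertex that plays the pivot.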


Step-2: Edge Selection
        •     What: Select a subset of edges from G that match at least one
              edge in Q.
        •     How:
              1.  g1: current vertex in the mapper.
              2a. For every neighbor of q1, there exists a corresponding
                  neighbor for g1.
              2b. Mapper emits all edges if the vertex and edge constraints
                  are met.
              3.  g1-g2 emitted: g1 mapped to a query vertex.
              4.  g1-g2 emitted from g2's mapper.
              5.  Reducer emits an edge if a pair is found.
              [Figure: graph vertex g1 with neighbors g2, g3, g4 → Map logic → g1-g2 edge records → Reduce logic]
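
The steps above can be sketched as a mapper/reducer pair run locally. This is one reading of the slide, under several assumptions: edges are undirected, `query_adj` lists each query vertex's (label, neighbor) pairs, and a "pair" means that both endpoints proposed compatible query roles for the same edge. All function names are hypothetical.

```python
from collections import defaultdict


def edge_selection_map(g1, attrs, neighbors, query_vertices, query_adj):
    """Mapper for one adjacency-list row; yields (edge key, (vertex, role, neighbor role))."""
    labels_here = {label for label, _ in neighbors}
    for q, constraint in query_vertices.items():
        if not constraint(attrs):
            continue
        # 2a: every query neighbor of q needs a same-labeled neighbor of g1.
        if not all(label in labels_here for label, _ in query_adj[q]):
            continue
        # 2b/3: emit each edge of g1 that could play a query edge of q.
        for q_label, q_nbr in query_adj[q]:
            for label, g2 in neighbors:
                if label == q_label:
                    yield (tuple(sorted((g1, g2))), label), (g1, q, q_nbr)


def edge_selection_reduce(key, values):
    """Reducer: keep the edge only if both endpoints proposed compatible roles."""
    proposals = set(values)
    for g, q, q_nbr in proposals:
        for other, other_q, other_nbr in proposals:
            if g < other and other_q == q_nbr and other_nbr == q:
                yield key, {g: q, other: other_q}


def select_edges(graph, query_vertices, query_adj):
    """Local driver that imitates the shuffle between mapper and reducer."""
    groups = defaultdict(list)
    for g1, (attrs, neighbors) in graph.items():
        for key, value in edge_selection_map(g1, attrs, neighbors, query_vertices, query_adj):
            groups[key].append(value)
    selected = []
    for key in sorted(groups):
        selected.extend(edge_selection_reduce(key, groups[key]))
    return selected


graph = {
    "g1": ({"age": 31}, [("friend", "g2")]),
    "g2": ({"age": 52}, [("friend", "g1"), ("friend", "g3")]),
    "g3": ({"age": 60}, [("friend", "g2")]),
}
query_vertices = {"q1": lambda a: a["age"] < 40, "q2": lambda a: True}
query_adj = {"q1": [("friend", "q2")], "q2": [("friend", "q1")]}
selected = select_edges(graph, query_vertices, query_adj)
print(selected)
```

The edge g2-g3 is dropped because neither endpoint can play q1 (both fail age < 40); a mapper cannot see that on its own, since a row carries no attributes of its neighbors, so the decision is made in the reducer.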
Step-3: Query Partition Matching
   Edge Selection:
           • Associates a graph vertex with the possible query vertices it could map to.
           • Associates the graph vertex with its “pivot” graph vertex.
           • A pivot graph vertex is a graph vertex that is mapped to a pivot query vertex: g1 in this example.

           1. Mapper emits the pivot graph vertex as key and the edge as value.
           2. Reducer receives all edges with the same pivot graph vertex.
           3. Reducer forms the partition.
           [Figure: Edge Selection output → Map logic → (g1, g2), (g1, g3), (g1, g4) → Reduce logic → partition with pivot g1 and vertices g2, g3, g4]
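
A minimal local sketch of this step follows, assuming that edge selection hands over each edge together with the query role of both endpoints. The record shape and the function names are hypothetical, and the check that a partition covers every query edge of its pivot is left out for brevity.

```python
from collections import defaultdict


def partition_map(edge, roles, pivot_query_vertices):
    """Mapper: key each selected edge by its pivot graph vertex."""
    for g, q in roles.items():
        if q in pivot_query_vertices:
            yield g, edge


def partition_reduce(pivot, edges):
    """Reducer: all edges sharing a pivot graph vertex form one candidate partition."""
    return pivot, sorted(edges)


def match_partitions(selected_edges, pivot_query_vertices):
    """Local driver that groups mapper output by key before reducing."""
    groups = defaultdict(list)
    for edge, roles in selected_edges:
        for pivot, value in partition_map(edge, roles, pivot_query_vertices):
            groups[pivot].append(value)
    return dict(partition_reduce(p, e) for p, e in sorted(groups.items()))


selected_edges = [
    (("g1", "g2"), {"g1": "q1", "g2": "q2"}),
    (("g1", "g3"), {"g1": "q1", "g3": "q3"}),
    (("g1", "g4"), {"g1": "q1", "g4": "q4"}),
]
partitions = match_partitions(selected_edges, pivot_query_vertices={"q1"})
print(partitions)
```

Keying on the pivot works because every edge of a star-shaped query partition touches the pivot, so one reducer call sees the whole candidate partition.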
Step-4: Query Partition Merging
        •     Merges partitions one after another to form a query match.
        •     More details in the paper.
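
The paper has the actual merging algorithm. As a rough illustration only, merging can be viewed as a join of partial matches that agree on their shared query vertices. The sketch below assumes each partial match is a dictionary from query vertex to graph vertex; `merge_matches` is a hypothetical helper, not the paper's procedure.

```python
def merge_matches(left, right):
    """Join two lists of partial matches ({query vertex: graph vertex})."""
    merged = []
    for a in left:
        for b in right:
            shared = a.keys() & b.keys()
            if any(a[q] != b[q] for q in shared):
                continue                        # disagree on a shared query vertex
            combined = {**a, **b}
            if len(set(combined.values())) == len(combined):
                merged.append(combined)         # keep only one-to-one mappings
    return merged


partition_1 = [
    {"q1": "g1", "q2": "g2", "q4": "g4"},
    {"q1": "g7", "q2": "g8", "q4": "g9"},
]
partition_2 = [{"q4": "g4", "q5": "g5"}, {"q4": "g4", "q5": "g2"}]
full = merge_matches(partition_1, partition_2)
print(full)
```

The second candidate from `partition_2` is rejected because it would map both q2 and q5 to g2. In a Map-Reduce setting the nested loop would be replaced by keying both sides on the shared vertex assignment.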




         Take-away from Steps 1-4 (also applies to any scalable Map-Reduce
           program):
        • The mapper/reducer keys are chosen such that the number of keys is
       proportional to the number of matches of query Q in the graph.
        • Hence the algorithm scales well for large graphs and complex
       queries.
Results
   [Figure: time (sec, axis 0 to 160) versus number of reducers (100, 150, 200, 250) for Edge Selection, Query Partition Matching, and Query Partition Merging]

             • Graph of 10 million vertices and 50 million edges.
             • Complex query of 24 vertices.
             • Note that the edge selection time decreases as the
               number of reducers increases.
In the paper…

             • Detailed Map-Reduce algorithms for sub-graph match and
               reachability
             • Theoretical analysis for scalability
             • Construction of the synthetic dataset
             • Methodology and more experiments
             • Reachability query: examples, Map-Reduce algorithm
             • Related work




Future Work

        •     Indexing structure for graphs suited for M-R jobs.
                • Compare with a Giraph-based approach.
        •     Better batching strategies.
        •     The right interface for plugging in custom graph algorithms
              while WOOster provides automatic batching.
        •     Implement more graph mining algorithms.



Questions / Comments