Hadoop and Graph Data Management:
   Challenges and Opportunities

                 Daniel Abadi
     Assistant Professor at Yale University
           Chief Scientist at Hadapt
            Tuesday, November 8th

      (Collaboration with Jiewen Huang)
Two Major Trends
 Hadoop
   Becoming de facto standard for large scale data
    processing
   Becoming more than just MapReduce
   Ecosystem growing rapidly --- lots of great tools around it
 Graph data is proliferating; huge value if analyzed
 and processed
     Social graphs
     Linked data
     Telecom
     Advertising
     Defense
Use Hadoop to Process Graph Data?
 Of course!
 BUT:
   Little automatic optimization (non-declarative
    interface)
   It’s possible to do graphs with Hadoop:
     Really badly
     Suboptimally
     Really well
   Case in point: subgraph pattern-matching
   benchmark on three different Hadoop-centered
   solutions:
     ~1000s,
     ~100s,
     < ~10s
Case study: Linked Data
Example Linked Data
 Entities (or “resources”)
  are nodes in the graph
 Relationships between
  entities are labeled
  directed edges in the
  graph
 (Resources typically have unique URI identifiers to allow for global references --- not shown in example to the right)
 Any resource can
  connect to any other
  resource
Graph Representation
 Linked data graph can be parsed into a series of vertex-edge-vertex triples
 First entity referred to as the subject; second as the object; edge connecting them as the predicate
 We will call these “RDF triples”
Querying Linked Data
 Linked Data is typically queried in SPARQL
 Processing a SPARQL query is basically
 equivalent to the subgraph pattern matching
 problem
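
To make the equivalence concrete, here is a minimal sketch using the Python rdflib library; the namespace and triples are invented for illustration. The WHERE clause of a SPARQL query is itself a small graph, and evaluating the query means finding every subgraph of the data that matches it:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# A tiny linked-data graph: each g.add() is one subject-predicate-object
# ("RDF triple") edge. The data is invented for illustration.
g.add((EX.messi, EX.type, EX.footballer))
g.add((EX.messi, EX.playsFor, EX.FC_Barcelona))
g.add((EX.xavi, EX.type, EX.footballer))
g.add((EX.xavi, EX.playsFor, EX.FC_Barcelona))

# The WHERE clause is a two-edge subgraph pattern; evaluating the
# query finds every binding of ?player that matches both edges.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?player
    WHERE { ?player ex:type     ex:footballer .
            ?player ex:playsFor ex:FC_Barcelona . }
""")
for row in results:
    print(row.player)
```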
Scalable SPARQL Processing
 Single-node options are abundant, e.g.,
     Sesame
     Jena
     RDF-3X
     3store
 Fewer options that can scale across multiple
 machines
   This is where Hadoop comes in!
     One cool solution: SHARD (presented by Kurt Rohloff at
      HadoopWorld 2010)
       Uses HDFS to store graph, MapReduce jobs for
        subgraph pattern matching
       Much nicer than a naïve Hadoop solution
Another Example: Twitter Data

  [Figure: a Twitter graph. @joe_hellerstein, @daniel_abadi, and @mikeolson are linked by "follows" edges; @hadoop_is_my_life, @super_hadooper, and @hadoop_is_the_answer are linked to them by "retweeted" edges.]
Example Query over Twitter Graph

Who has retweeted both @daniel_abadi and @mikeolson?

  [Query pattern: an unknown vertex ??? with "retweeted" edges to both @daniel_abadi and @mikeolson]
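
In triple form the query is just two "retweeted" edges that share an unknown subject, so answering it is a join. A minimal Python sketch over an invented edge list (the handles and data below are hypothetical):

```python
# Invented retweet edges as (subject, predicate, object) triples.
triples = [
    ("@hadoop_is_my_life",    "retweeted", "@daniel_abadi"),
    ("@super_hadooper",       "retweeted", "@daniel_abadi"),
    ("@super_hadooper",       "retweeted", "@mikeolson"),
    ("@hadoop_is_the_answer", "retweeted", "@mikeolson"),
]

# Match { ?who retweeted @daniel_abadi . ?who retweeted @mikeolson }
# by joining the two edge sets on the shared subject ?who.
retweeted_abadi = {s for s, p, o in triples if o == "@daniel_abadi"}
retweeted_olson = {s for s, p, o in triples if o == "@mikeolson"}
print(retweeted_abadi & retweeted_olson)   # -> {'@super_hadooper'}
```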
Slide From Kurt Rohloff Presentation at HadoopWorld 2010

Slide From Kurt Rohloff Presentation at HadoopWorld 2010
Issues With Hadoop Defaults
 Hadoop does not default to pipelined algorithms
 between MapReduce jobs
   SHARD matches each clause of the SPARQL query in a separate MapReduce job
     Requires a full barrier synchronization between jobs, which is unnecessary
 Hadoop does not default to pipelined algorithms
 between Map and Reduce
   Each job performs a join by vertex, where the object
    of one triple is joined with the subject of another
    triple
   Joins work much faster if you can pipeline between
    Map and Reduce (e.g. pipelined hash join)
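
As a rough sketch of the shape of one such job (an in-memory imitation, not SHARD's actual code): the map phase keys each triple by its join vertex, and the reduce phase pairs up triples that met on that vertex. Because Hadoop materializes all map output before any reduce starts, every clause pays a barrier:

```python
from collections import defaultdict

def mapreduce_vertex_join(left_triples, right_triples):
    """Imitate one reduce-side join: the object of a left triple is
    joined with the subject of a right triple."""
    buckets = defaultdict(lambda: ([], []))
    # "Map": emit each triple keyed by its join vertex.
    for s, p, o in left_triples:
        buckets[o][0].append((s, p, o))
    for s, p, o in right_triples:
        buckets[s][1].append((s, p, o))
    # Barrier: in Hadoop, all map output is shuffled to disk here
    # before any reducer runs -- the pipelining opportunity lost.
    # "Reduce": pair up triples that share the join vertex.
    for _, (ls, rs) in buckets.items():
        for left in ls:
            for right in rs:
                yield left, right
```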
Issues with Hadoop Defaults
(cont.)
 Hadoop hash partitions data across nodes
   Data in graph is randomly partitioned across nodes
   Data close in the graph could be on a physically
    distant node
   Therefore each join requires a complete
    redistribution of the entire data set across the
    network (and there are many joins)
 Hadoop defaults to replicating all data 3 times
  with no preferences for what is replicated or
  where it is replicated to
 Hadoop defaults to using the Hadoop Distributed
  File System (HDFS) for storage, which is not
  optimized for graph data
All is not lost! Don’t throw away
Hadoop!
 All we have to do is change the defaults and add
 to it a little
System Architecture
Partitioning
 Graphs can be represented as vertex1-edge-
  vertex2 triples
 Hash partitioning by vertex1 is straightforward
 Great for queries like:

Query: Find the names of the strikers that play for FC Barcelona.

SELECT ?name
WHERE { ?player type     footballer .
        ?player name     ?name .
        ?player position striker .
        ?player playsFor FC_Barcelona . }
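
A sketch of why this query is easy (the function is illustrative, not from the talk): hash partitioning places each triple by its subject, and all four patterns above share the subject ?player, so the star join needs no data movement at all:

```python
import hashlib

def machine_for(triple, num_machines):
    subject, predicate, obj = triple
    # Stable hash of the subject (Python's built-in hash() is salted
    # per process, so use hashlib for deterministic placement).
    digest = hashlib.md5(subject.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_machines

# Every triple about a given ?player lands on one machine, so the
# four-way star join above is evaluated entirely locally.
```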
The problem with hash partitioning
 …

Find football players playing for clubs in a populous region where they were born.
Graph Partitioning
 Data close together in the graph should be
  physically close together (preferably on the same
  machine)
 Subgraph pattern matching can be done without requiring huge amounts of communication via joins across the cluster
Graph Partitioning

Graph Partitioning

  [Figure: the example graph partitioned across Machine 1, Machine 2, and Machine 3]
Edge/Triple Placement
●   Minimizing data exchange
    ●   Allowing data overlap
●   N-hop guarantee
    ●   The extent of data overlap
    ●   If a vertex is assigned to a machine, any
        vertex that is within n-hop of this vertex is
        also stored in this machine
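
A minimal sketch of computing the guarantee, assuming an in-memory adjacency map and treating edges as undirected (both simplifications): breadth-first search expands each machine's base partition by n hops, and every triple touching the expanded vertex set is stored on that machine:

```python
from collections import deque

def n_hop_closure(base_vertices, adjacency, n):
    """Vertices within n hops of a machine's base partition.

    adjacency: dict mapping a vertex to its neighbors (undirected
    here for simplicity). Every triple incident to the returned set
    would be replicated onto this machine.
    """
    seen = set(base_vertices)
    frontier = deque((v, 0) for v in base_vertices)
    while frontier:
        vertex, dist = frontier.popleft()
        if dist == n:          # reached the edge of the guarantee
            continue
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen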
0 Hop of Machine 1
1 Hop of Machine 1
2 Hop of Machine 1
0 Hop of Machine 3
1 Hop of Machine 3
2 Hop of Machine 3

  [Figure sequence: the partitioned example graph on Machine 1, Machine 2, and Machine 3, showing how the subgraph stored on Machine 1 and on Machine 3 grows as the hop guarantee increases from 0 to 2]
High Degree Vertexes
●   Problem: High-degree vertexes make the
    graph well-connected and difficult to
    partition
●   Solution: Ignore them during graph
    partitioning

●   Problem: High-degree vertexes cause
    data explosion with an n-hop guarantee
●   Solution: Selectively weaken the n-hop
    guarantee
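
A sketch of the first fix (the degree threshold is an invented knob, not a value from the talk): set hub vertices aside before handing the graph to the partitioner, then place them separately afterwards. The same idea weakens the n-hop guarantee, since expansion simply never walks through a hub:

```python
def drop_hubs(adjacency, degree_cap=1000):
    """Remove high-degree vertices before graph partitioning.

    degree_cap is illustrative. Returns the trimmed graph plus the
    set of hubs, which can be placed after partitioning.
    """
    hubs = {v for v, nbrs in adjacency.items() if len(nbrs) > degree_cap}
    trimmed = {v: [w for w in nbrs if w not in hubs]
               for v, nbrs in adjacency.items() if v not in hubs}
    return trimmed, hubs
```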
Query Processing
  ●   Query execution is more efficient if
      pushed to optimized storage (RDF-
      stores)
      ●   Minimizing the number of Hadoop jobs
      ●   The larger the hop guarantee, the more
          work is done in RDF-stores
To Exchange, or not to Exchange?
  ●   Given a query and n-hop guarantee, is
      data exchange (Hadoop job) between
      nodes needed?
      ●   Choose the “center” of the query graph
      ●   Calculate the distance from the “center” to
          the furthest edge
      ●   If distance > n, data exchange is needed;
          not needed otherwise
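
A sketch of that test, treating the query as an undirected graph and the distance to an edge as the distance to its far endpoint (both simplifications): compute the eccentricity of the best "center" vertex and compare it against n:

```python
from collections import defaultdict, deque

def _eccentricity(adj, src):
    # BFS distance from src to every vertex of the query graph.
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def needs_exchange(query_edges, n):
    """True if evaluating the query needs a Hadoop data-exchange job."""
    adj = defaultdict(set)
    for u, v in query_edges:
        adj[u].add(v)
        adj[v].add(u)
    # The "center" is the vertex whose furthest query edge is nearest.
    center_distance = min(_eccentricity(adj, v) for v in adj)
    return center_distance > n
```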
Data Exchange

Find football players playing for clubs in a populous region where they were born.
Experimental Setup
●   20-machine cluster
●   Lehigh University Benchmark (LUBM):
    270 million triples
●   Things to compare:
    ●   Single-node RDF-3X
    ●   SHARD
    ●   Graph partitioning (the proposed system)
    ●   Hash partitioning on subjects
Performance Comparison
Speedup
●   Better than linear speedup
Analysis
 Difference between fastest implementation and
 slowest implementation was a factor of 1340!
   Using Hadoop does not mean that performance is
   fixed
 More improvements are possible
   Experiments used MapReduce whenever data
   communication was necessary
     NextGen Hadoop allows other programming paradigms
     besides MR
      MPI is a good candidate

   Still need to fix the data pipelining problem
 Factor of 1340 possible via focusing on storage --- similar in theme to Hadapt
How this fits with Hadapt

  [Architecture diagram: Flexible Query Interface (Full SQL Support, MR, JDBC) on top of Split Query Execution (Patent Pending) over Hadoop, on top of the Hadapt Storage Engine (Relational + HDFS)]

 Full SQL interface, MapReduce, and JDBC connector
 10x-50x faster than Hadoop and Hive
   Queries go from hours to minutes, and minutes to seconds
 Analytics across structured and unstructured data in one platform
 3.5 Patents Pending
 $9.5M Series A financing, led by Norwest Venture Partners and Bessemer Venture Partners
Optimized Storage Matters
 HDFS appropriate for unstructured data
 Relational storage appropriate for relational data
 Graph storage appropriate for graph data
 Hadapt allows for pluggable storage inside
 Hadoop (amongst other things)




 Bottom line: Hadoop can be used for scalable
 graph processing, but it might need some
 Hadapting ;)
