Processing Over a Billion Edges
on Apache Giraph
Hadoop Summit 2012



Avery Ching
Software Engineer
6/14/2012
Agenda

1   Motivation and Background

2   Giraph Concepts/API

3   Example Applications

4   Architecture Overview

5   Recent/Future Improvements
What is Apache Giraph?

•  Loose implementation of Google’s Pregel that runs
   as a map-only job on Hadoop

•  “Think like a vertex” that can send messages to any
   other vertex in the graph using the bulk synchronous
   parallel programming model

•  An in-memory scalable system*
 ▪    Will be enhanced with out-of-core messages/vertices to handle
      larger problem sets.
What (social) graphs are we targeting?

•  3/2012 LinkedIn has 161 million users

•  6/2012 Twitter discloses 140 million MAU

•  4/2012 Facebook declares 901 million MAU
Example applications

•  Ranking
 ▪    Popularity, importance, etc.

•  Label Propagation
 ▪    Location, school, gender, etc.

•  Community
 ▪    Groups, interests
Bulk synchronous parallel

•  Supersteps
 ▪    A global epoch followed by a global barrier where components
      do concurrent computation and send messages

•  Point-to-point messages (i.e. vertex to vertex)
 ▪    Sent during a superstep from one component to another and
      then delivered in the following superstep

•  Computation complete when all components
   complete
[Figure: each processor performs computation and then communication within a superstep; a global barrier separates consecutive supersteps along the time axis]
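The superstep/barrier cycle can be sketched with plain Java threads. This is only an illustration of the bulk synchronous parallel idea, not Giraph code: the class name BspBarrierSketch and the method run are hypothetical, and the "computation" is left as a comment.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of BSP supersteps; not part of Giraph. */
final class BspBarrierSketch {

  /** Runs `workers` threads through `supersteps` phases; returns the number of barriers crossed. */
  static int run(int workers, int supersteps) {
    AtomicInteger barriersCrossed = new AtomicInteger();
    // The barrier action runs once per superstep, after every worker has arrived.
    CyclicBarrier barrier = new CyclicBarrier(workers, barriersCrossed::incrementAndGet);
    Thread[] threads = new Thread[workers];
    for (int w = 0; w < workers; w++) {
      threads[w] = new Thread(() -> {
        try {
          for (int s = 0; s < supersteps; s++) {
            // Local computation and message sending would happen here.
            barrier.await(); // global barrier: no worker starts superstep s+1 early
          }
        } catch (Exception e) {
          throw new IllegalStateException(e);
        }
      });
      threads[w].start();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return barriersCrossed.get();
  }

  public static void main(String[] args) {
    System.out.println("barriers crossed: " + run(4, 3));
  }
}
```

Each worker blocks in await() until all workers have finished the current superstep, which is the property that lets messages sent in one superstep be delivered together in the next.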
MapReduce -> Giraph
“Think like a vertex”, not a key-value pair!

MapReduce

public class Mapper<
    KEYIN,
    VALUEIN,
    KEYOUT,
    VALUEOUT> {
  void map(KEYIN key,
      VALUEIN value,
      Context context)
      throws IOException,
      InterruptedException;
}

Giraph

public class Vertex<
    I extends WritableComparable,
    V extends Writable,
    E extends Writable,
    M extends Writable> {
  void compute(
      Iterator<M> msgIterator);
}
Basic Giraph API
  Methods available to compute()

Immediate effect/access
  I getVertexId()
  V getVertexValue()
  void setVertexValue(V vertexValue)
  Iterator<I> iterator()
  E getEdgeValue(I targetVertexId)
  boolean hasEdge(I targetVertexId)
  boolean addEdge(I targetVertexId, E edgeValue)
  E removeEdge(I targetVertexId)
  void voteToHalt()
  boolean isHalted()

Next superstep
  void sendMsg(I id, M msg)
  void sendMsgToAllEdges(M msg)
  void addVertexRequest(BasicVertex<I, V, E, M> vertex)
  void removeVertexRequest(I vertexId)
  void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
  void removeEdgeRequest(I sourceVertexId, I destVertexId)
Why not implement Giraph with multiple
MapReduce jobs?
•  Too much disk, no in-memory caching, a superstep
   becomes a job!

[Figure: input format (splits 0-3) -> map tasks -> intermediate files -> reduce tasks -> output format (outputs 0-1)]
Giraph is a single Map-only job in
Hadoop
•  Hadoop is purely a resource manager for Giraph; all
   communication is done through Netty-based IPC

[Figure: vertex input format (splits 0-3) -> map tasks -> vertex output format (outputs 0-1)]
Maximum vertex value implementation
public class MaxValueVertex extends EdgeListVertex<
    IntWritable, IntWritable, IntWritable, IntWritable> {
  @Override
  public void compute(Iterator<IntWritable> msgIterator) {
    boolean changed = false;
    while (msgIterator.hasNext()) {
      IntWritable msgValue = msgIterator.next();
      if (msgValue.get() > getVertexValue().get()) {
        setVertexValue(msgValue);
        changed = true;
      }
    }
    if (getSuperstep() == 0 || changed) {
      sendMsgToAllEdges(getVertexValue());
    } else {
      voteToHalt();
    }
  }
}
Maximum vertex value

[Figure: three vertices with initial values 5, 1, and 2 on two processors; across four supersteps separated by three barriers, the value 5 propagates until every vertex holds 5]
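The propagation above can be replayed without a cluster. The sketch below is a single-threaded simulation in plain Java, not Giraph code: the class MaxValueSimulation, the array-based graph, and the chain topology 5 - 1 - 2 are assumptions made for illustration. It follows the same rule as MaxValueVertex: send in superstep 0 or when the value changed, otherwise vote to halt, and wake up on an incoming message.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-process replay of the max-value algorithm; not Giraph code. */
final class MaxValueSimulation {

  /** Updates `values` in place; edges[v] lists v's out-neighbors. Returns supersteps executed. */
  static int run(int[] values, int[][] edges) {
    int n = values.length;
    List<List<Integer>> inbox = emptyBoxes(n);
    boolean[] halted = new boolean[n];
    int superstep = 0;
    while (true) {
      // Messages sent now are buffered and only become visible in the next superstep.
      List<List<Integer>> outbox = emptyBoxes(n);
      boolean anyActive = false;
      for (int v = 0; v < n; v++) {
        // A halted vertex is reactivated only by an incoming message.
        if (halted[v] && inbox.get(v).isEmpty()) {
          continue;
        }
        halted[v] = false;
        anyActive = true;
        boolean changed = false;
        for (int msg : inbox.get(v)) {
          if (msg > values[v]) {
            values[v] = msg;
            changed = true;
          }
        }
        if (superstep == 0 || changed) {
          for (int target : edges[v]) {
            outbox.get(target).add(values[v]);
          }
        } else {
          halted[v] = true; // voteToHalt()
        }
      }
      if (!anyActive) {
        return superstep; // every vertex halted and no messages in flight
      }
      inbox = outbox; // the barrier: deliver this superstep's messages
      superstep++;
    }
  }

  private static List<List<Integer>> emptyBoxes(int n) {
    List<List<Integer>> boxes = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      boxes.add(new ArrayList<>());
    }
    return boxes;
  }

  public static void main(String[] args) {
    int[] values = {5, 1, 2};
    int[][] edges = {{1}, {0, 2}, {1}}; // assumed chain: 5 - 1 - 2
    int supersteps = run(values, edges);
    System.out.println(supersteps + " supersteps, values " + java.util.Arrays.toString(values));
  }
}
```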
Page rank implementation
public class SimplePageRankVertex extends EdgeListVertex<LongWritable,
    DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(Iterator<DoubleWritable> msgIterator) {
    if (getSuperstep() >= 1) {
      // Sum the rank contributions received from incoming edges.
      double sum = 0;
      while (msgIterator.hasNext()) {
        sum += msgIterator.next().get();
      }
      setVertexValue(
          new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum));
    }
    if (getSuperstep() < 30) {
      // Split the current rank evenly across the out-edges.
      long edges = getNumOutEdges();
      sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
    } else {
      voteToHalt();
    }
  }
}
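The update rule in compute() can be checked on a small graph outside Giraph. The sketch below is plain Java under stated assumptions: the class PageRankSketch is hypothetical, every vertex is assumed to have at least one out-edge, and ranks are assumed to start at 1/N (the vertex above takes its initial value from the input instead).

```java
import java.util.Arrays;

/** Hypothetical stand-alone PageRank loop; plain Java, not Giraph code. */
final class PageRankSketch {

  /** outEdges[v] lists the targets of v's out-edges; every vertex needs at least one. */
  static double[] run(int[][] outEdges, int updateRounds) {
    int n = outEdges.length;
    double[] rank = new double[n];
    Arrays.fill(rank, 1.0 / n); // assumed initial value
    for (int round = 0; round < updateRounds; round++) {
      // "Messages": each vertex sends rank / outDegree along every out-edge.
      double[] incoming = new double[n];
      for (int v = 0; v < n; v++) {
        for (int target : outEdges[v]) {
          incoming[target] += rank[v] / outEdges[v].length;
        }
      }
      // Same formula as the vertex: 0.15 / N + 0.85 * sum of incoming messages.
      for (int v = 0; v < n; v++) {
        rank[v] = 0.15 / n + 0.85 * incoming[v];
      }
    }
    return rank;
  }

  public static void main(String[] args) {
    int[][] graph = {{1, 2}, {2}, {0}};
    System.out.println(Arrays.toString(run(graph, 30)));
  }
}
```

With no dangling vertices the ranks keep summing to 1 after every round, which is a quick sanity check on any implementation of this formula.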
Giraph In MapReduce
Giraph components
•  Master – Application coordinator
 ▪    One active master at a time
 ▪    Assigns partition owners to workers prior to each superstep
 ▪    Synchronizes supersteps

•  Worker – Computation & messaging
 ▪    Loads the graph from input splits
 ▪    Does the computation/messaging of its assigned partitions

•  ZooKeeper
 ▪    Maintains global application state
Graph distribution
•  Master graph partitioner
 ▪    Create initial partitions, generate partition owner changes
      between supersteps

•  Worker graph partitioner
 ▪    Determine which partition a vertex belongs to
 ▪    Create/modify the partition stats (can split/merge partitions)

•  Default is hash partitioning (hashCode())
 ▪    Range-based partitioning is also possible on a per-type basis
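A minimal sketch of hash partitioning, assuming a fixed partition count and an even, contiguous assignment of partitions to workers (as in the 8-partition, 4-worker example). The class HashPartitionSketch and both method names are hypothetical and do not reproduce Giraph's partitioner classes.

```java
/** Hypothetical hash partitioner; illustrates the idea only, not Giraph's implementation. */
final class HashPartitionSketch {

  /** Maps a vertex id to a partition using its hashCode(); the result is never negative. */
  static int partitionFor(Object vertexId, int partitionCount) {
    return Math.floorMod(vertexId.hashCode(), partitionCount);
  }

  /** Assigns contiguous blocks of partitions to workers; assumes an even split. */
  static int workerFor(int partition, int partitionCount, int workerCount) {
    if (partitionCount % workerCount != 0) {
      throw new IllegalArgumentException("sketch assumes partitions divide evenly");
    }
    return partition / (partitionCount / workerCount);
  }

  public static void main(String[] args) {
    int partition = partitionFor(13L, 8);
    System.out.println("vertex 13 -> partition " + partition
        + " -> worker " + workerFor(partition, 8, 4));
  }
}
```

Because every worker can compute partitionFor locally, a message for any vertex can be routed without a lookup table of individual vertices; only the partition-to-worker ownership has to be shared.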
Graph distribution example


          Partition 0              Load/Store   Stats 0
                        Worker 0    Compute
          Partition 1              Messages     Stats 1

          Partition 2              Load/Store   Stats 2
 Master                 Worker 1    Compute
          Partition 3              Messages     Stats 3

          Partition 4              Load/Store   Stats 4
                        Worker 2    Compute
          Partition 5              Messages     Stats 5

          Partition 6              Load/Store   Stats 6
                        Worker 3    Compute
          Partition 7              Messages     Stats 7
Customizable fault tolerance
•  No single point of failure from Giraph threads
 ▪    With multiple master threads, if the current master dies, a new one will automatically
      take over.
 ▪    If a worker thread dies, the application is rolled back to a previously checkpointed
      superstep. The next superstep will begin with the new number of workers.
 ▪    If a ZooKeeper server dies, as long as a quorum remains, the application can proceed

•  Hadoop single points of failure still exist
 ▪    NameNode, JobTracker

 ▪    Restarting manually from a checkpoint is always possible




Master thread fault tolerance
 Before failure of active master 0            After failure of active master 0
  “Active”                                       “Active”
  Master 0                                       Master 0
                            Active                                       Active
                            Master                                       Master
  “Spare”                   State                “Active”                State
  Master 1                                       Master 1

  “Spare”                                        “Spare”
  Master 2                                       Master 2

•  One active master, with spare masters taking over in the event of an active master
   failure

•  All active master state is stored in ZooKeeper so that a spare master can
   immediately step in when an active master fails

•  “Active” master implemented as a queue in ZooKeeper

Worker thread fault tolerance
  Superstep i         Superstep i+1       Superstep i+2
(no checkpoint)        (checkpoint)      (no checkpoint)

                                         Worker failure!


                      Superstep i+1       Superstep i+2       Superstep i+3
                       (checkpoint)      (no checkpoint)       (checkpoint)
                                                            Worker failure after
                                                           checkpoint complete!


                                                              Superstep i+3        Application
                                                             (no checkpoint)       Complete
•  A single worker death fails the superstep

•  Application reverts to the last committed superstep automatically
 ▪    Master detects worker failure during any superstep with a ZooKeeper “health”
      znode
 ▪    Master chooses the last committed superstep and sends a command through
      ZooKeeper for all workers to restart from that superstep
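The choice of restart point can be sketched as a lookup of the latest committed checkpoint at or before the failed superstep. This is an in-memory illustration only: the class CheckpointRollbackSketch and its methods are hypothetical, no ZooKeeper is involved, and returning an empty result when nothing has been committed is an assumption about what the caller would then do (for example, reload the input).

```java
import java.util.NavigableSet;
import java.util.OptionalLong;
import java.util.TreeSet;

/** Hypothetical bookkeeping for committed checkpoints; not Giraph code. */
final class CheckpointRollbackSketch {
  private final NavigableSet<Long> committed = new TreeSet<>();

  /** Records that the checkpoint for this superstep was written completely. */
  void commitCheckpoint(long superstep) {
    committed.add(superstep);
  }

  /** Returns the last committed superstep at or before the failure, if any. */
  OptionalLong restartPoint(long failedSuperstep) {
    Long last = committed.floor(failedSuperstep);
    return last == null ? OptionalLong.empty() : OptionalLong.of(last);
  }

  public static void main(String[] args) {
    CheckpointRollbackSketch state = new CheckpointRollbackSketch();
    state.commitCheckpoint(1);
    System.out.println("failure in superstep 2 restarts from " + state.restartPoint(2));
  }
}
```

In the timeline above, a failure in superstep i+2 falls back to the checkpoint at i+1, while a failure after the checkpoint at i+3 has completed restarts from i+3.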
Optional features
•  Combiners
 ▪    Similar to MapReduce combiners
 ▪    Users implement a combine() method that can reduce the
      number of messages sent and received
 ▪    Run on both the client side (memory, network) and server side
      (memory)

•  Aggregators
 ▪    Similar to MPI aggregation routines (e.g. max, min, sum)
 ▪    Commutative and associative operations that are performed
      globally
 ▪    Examples include global communication, monitoring, and
      statistics
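A minimal sketch of what a combiner buys, assuming messages are doubles keyed by a long target vertex id. The class CombinerSketch and its methods are hypothetical and do not mirror Giraph's combiner interface; the point is only that messages for the same target collapse into one before they are sent or stored.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Hypothetical sender-side message buffer with a combiner; not Giraph's API. */
final class CombinerSketch {
  private final Map<Long, Double> pending = new HashMap<>();
  private final BinaryOperator<Double> combine;

  /** The operator should be commutative and associative, e.g. sum or max. */
  CombinerSketch(BinaryOperator<Double> combine) {
    this.combine = combine;
  }

  /** Buffers a message, merging it with any message already queued for the same target. */
  void send(long targetVertexId, double message) {
    pending.merge(targetVertexId, message, combine);
  }

  /** Messages that would actually go over the network: at most one per target. */
  Map<Long, Double> pending() {
    return pending;
  }

  public static void main(String[] args) {
    CombinerSketch buffer = new CombinerSketch(Double::sum);
    buffer.send(7L, 0.25);
    buffer.send(7L, 0.5);
    buffer.send(9L, 1.0);
    System.out.println(buffer.pending().size() + " messages after combining 3 sends");
  }
}
```

A sum combiner fits PageRank, where only the total of the incoming contributions matters; a max combiner (Math::max) would fit the maximum vertex value example.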
Recent Netty IPC implementation
•  Big improvement over the Hadoop RPC implementation

•  10-39% overall performance improvement

•  Still need more Netty tuning

[Figure: run time in seconds (left axis, 0-300) for Netty and Hadoop RPC, and % improvement (right axis, 0-50), at 10, 30, and 50 workers]
Recent benchmarks
•  Test cluster of 80 machines
 ▪    Facebook Hadoop (https://github.com/facebook/hadoop-20)
 ▪    72 cores, 64+ GB of memory

•  org.apache.giraph.benchmark.PageRankBenchmark
 ▪    5 supersteps
 ▪    No checkpointing
 ▪    10 edges per vertex
Worker scalability
[Figure: run time in seconds (axis 0-3000) versus number of workers (10, 20, 30, 40, 45, 50)]
Edge Scalability
[Figure: run time in seconds (axis 0-5000) versus number of edges (1-5 billion)]
Worker / edge scalability
[Figure: run time in seconds (left axis, 0-2000) and edges in billions (right axis, 0-8) versus number of workers (10, 30, 50); series: Run Time, Workers/Edges]
Apache Giraph has graduated as of
5/2012
•  Incubated for less than a year (entered incubator
   9/12)

•  Committers from HortonWorks, Twitter, LinkedIn,
   Facebook, TrendMicro and various schools (VU
   Amsterdam, TU Berlin, Korea University)

•  Released 0.1 on 2/6/2012; 0.2 will be released
   within a few months
Future improvements
•  Out-of-core messages/graph
 ▪    Under memory pressure, dump messages/portions of the graph
      to local disk
 ▪    Ability to run applications without having all needed memory

•  Performance improvements
 ▪    Netty is a good step in the right direction, but need to tune
      messaging performance as it takes up a majority of the time
 ▪    Scale back use of ZooKeeper to only be for health registration,
      rather than implementing aggregators and coordination
More future improvements
•  Adding a master#compute() method
 ▪    Arbitrary master computation that sends results to workers prior
      to a superstep to simplify certain algorithms
 ▪    GIRAPH-127

•  Handling skew
 ▪    Some vertices have a large number of edges and we need to
      break them up and handle them differently to provide better
      scalability
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0
More Related Content

What's hot

Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
QGISプログラミング入門 2016Osaka編
QGISプログラミング入門 2016Osaka編QGISプログラミング入門 2016Osaka編
QGISプログラミング入門 2016Osaka編Kosuke Asahi
 
QGIS はじめてのラスタ解析
QGIS はじめてのラスタ解析QGIS はじめてのラスタ解析
QGIS はじめてのラスタ解析Mayumit
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 
Dv Pmysqluc Federation At Flickr Doing Billions Of Queries Per Day
Dv Pmysqluc Federation At Flickr Doing Billions Of Queries Per DayDv Pmysqluc Federation At Flickr Doing Billions Of Queries Per Day
Dv Pmysqluc Federation At Flickr Doing Billions Of Queries Per Daywebtel125
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Blosc Talk by Francesc Alted from PyData London 2014
Blosc Talk by Francesc Alted from PyData London 2014Blosc Talk by Francesc Alted from PyData London 2014
Blosc Talk by Francesc Alted from PyData London 2014PyData
 
GeoPackageを使ってみた(おざき様)
GeoPackageを使ってみた(おざき様)GeoPackageを使ってみた(おざき様)
GeoPackageを使ってみた(おざき様)OSgeo Japan
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVROairisData
 
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.ioTHE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.ioDevOpsDays Tel Aviv
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAPEDB
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
QGISセミナー中級編(V2.4)
QGISセミナー中級編(V2.4)QGISセミナー中級編(V2.4)
QGISセミナー中級編(V2.4)IWASAKI NOBUSUKE
 
Improving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTokImproving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTokAlluxio, Inc.
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonTimothy Spann
 

What's hot (20)

Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
QGISプログラミング入門 2016Osaka編
QGISプログラミング入門 2016Osaka編QGISプログラミング入門 2016Osaka編
QGISプログラミング入門 2016Osaka編
 
QGIS はじめてのラスタ解析
QGIS はじめてのラスタ解析QGIS はじめてのラスタ解析
QGIS はじめてのラスタ解析
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
Dv Pmysqluc Federation At Flickr Doing Billions Of Queries Per Day
Dv Pmysqluc Federation At Flickr Doing Billions Of Queries Per DayDv Pmysqluc Federation At Flickr Doing Billions Of Queries Per Day
Dv Pmysqluc Federation At Flickr Doing Billions Of Queries Per Day
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Blosc Talk by Francesc Alted from PyData London 2014
Blosc Talk by Francesc Alted from PyData London 2014Blosc Talk by Francesc Alted from PyData London 2014
Blosc Talk by Francesc Alted from PyData London 2014
 
GeoPackageを使ってみた(おざき様)
GeoPackageを使ってみた(おざき様)GeoPackageを使ってみた(おざき様)
GeoPackageを使ってみた(おざき様)
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.ioTHE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
QGISセミナー中級編(V2.4)
QGISセミナー中級編(V2.4)QGISセミナー中級編(V2.4)
QGISセミナー中級編(V2.4)
 
Improving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTokImproving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTok
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with Python
 

Viewers also liked

2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - HortonworksAvery Ching
 
Tokyo nlp #8 label propagation
Tokyo nlp #8 label propagationTokyo nlp #8 label propagation
Tokyo nlp #8 label propagationYo Ehara
 
Core Messages in Job Hunting
Core Messages in Job HuntingCore Messages in Job Hunting
Core Messages in Job HuntingChrisSteed
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangaloreappaji intelhunt
 
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014鉄平 土佐
 
Deploying WebRTC successfully – A web developer perspective
Deploying WebRTC successfully – A web developer perspectiveDeploying WebRTC successfully – A web developer perspective
Deploying WebRTC successfully – A web developer perspectiveDialogic Inc.
 
Fast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNFast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNDataWorks Summit
 
Introduction of apache giraph project
Introduction of apache giraph projectIntroduction of apache giraph project
Introduction of apache giraph projectChun Cheng Lin
 
大規模グラフデータ処理
大規模グラフデータ処理大規模グラフデータ処理
大規模グラフデータ処理maruyama097
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...rhatr
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Netty Notes Part 2 - Transports and Buffers
Netty Notes Part 2 - Transports and BuffersNetty Notes Part 2 - Transports and Buffers
Netty Notes Part 2 - Transports and BuffersRick Hightower
 
Initiation à Neo4j
Initiation à Neo4jInitiation à Neo4j
Initiation à Neo4jNeo4j
 
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例Junichi Noda
 
Spark GraphX で始めるグラフ解析
Spark GraphX で始めるグラフ解析Spark GraphX で始めるグラフ解析
Spark GraphX で始めるグラフ解析Yosuke Mizutani
 
Building day 2 upload Building the Internet of Things with Thingsquare and ...
Building day 2   upload Building the Internet of Things with Thingsquare and ...Building day 2   upload Building the Internet of Things with Thingsquare and ...
Building day 2 upload Building the Internet of Things with Thingsquare and ...Adam Dunkels
 
SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?Venu Anuganti
 
GraphX によるグラフ分析処理の実例と入門
GraphX によるグラフ分析処理の実例と入門GraphX によるグラフ分析処理の実例と入門
GraphX によるグラフ分析処理の実例と入門鉄平 土佐
 

Viewers also liked (20)

2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
 
Tokyo nlp #8 label propagation
Tokyo nlp #8 label propagationTokyo nlp #8 label propagation
Tokyo nlp #8 label propagation
 
Core Messages in Job Hunting
Core Messages in Job HuntingCore Messages in Job Hunting
Core Messages in Job Hunting
 
GETTING YOUR MESSAGE RIGHT
GETTING YOUR MESSAGE RIGHTGETTING YOUR MESSAGE RIGHT
GETTING YOUR MESSAGE RIGHT
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014
GraphXはScalaエンジニアにとってのブルーオーシャン @ Scala Matsuri 2014
 
Deploying WebRTC successfully – A web developer perspective
Deploying WebRTC successfully – A web developer perspectiveDeploying WebRTC successfully – A web developer perspective
Deploying WebRTC successfully – A web developer perspective
 
Fast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNFast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARN
 
Introduction of apache giraph project
Introduction of apache giraph projectIntroduction of apache giraph project
Introduction of apache giraph project
 
大規模グラフデータ処理
大規模グラフデータ処理大規模グラフデータ処理
大規模グラフデータ処理
 
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Netty Notes Part 2 - Transports and Buffers
Netty Notes Part 2 - Transports and BuffersNetty Notes Part 2 - Transports and Buffers
Netty Notes Part 2 - Transports and Buffers
 
Initiation à Neo4j
Initiation à Neo4jInitiation à Neo4j
Initiation à Neo4j
 
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例
Spark Streaming と Spark GraphX を使用したTwitter解析による レコメンドサービス例
 
Spark GraphX で始めるグラフ解析
Spark GraphX で始めるグラフ解析Spark GraphX で始めるグラフ解析
Spark GraphX で始めるグラフ解析
 
Building day 2 upload Building the Internet of Things with Thingsquare and ...
Building day 2   upload Building the Internet of Things with Thingsquare and ...Building day 2   upload Building the Internet of Things with Thingsquare and ...
Building day 2 upload Building the Internet of Things with Thingsquare and ...
 
Best Practices to Build a Multichannel Campaign Plan
Best Practices to Build a Multichannel Campaign Plan Best Practices to Build a Multichannel Campaign Plan
Best Practices to Build a Multichannel Campaign Plan
 
SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?
 
GraphX によるグラフ分析処理の実例と入門
GraphX によるグラフ分析処理の実例と入門GraphX によるグラフ分析処理の実例と入門
GraphX によるグラフ分析処理の実例と入門
 

Similar to Processing edges on apache giraph

2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users GroupNitay Joffe
 
2013 06-03 berlin buzzwords
2013 06-03 berlin buzzwords2013 06-03 berlin buzzwords
2013 06-03 berlin buzzwordsNitay Joffe
 
Mining quasi bicliques using giraph
Mining quasi bicliques using giraphMining quasi bicliques using giraph
Mining quasi bicliques using giraphHsiao-Fei Liu
 
Java Review
Java ReviewJava Review
Java Reviewpdgeorge
 
Graph processing
Graph processingGraph processing
Graph processingyeahjs
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph ProcessingVasia Kalavri
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDataWorks Summit
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopMax Tepkeev
 
OpenMI Developers Training
OpenMI Developers TrainingOpenMI Developers Training
OpenMI Developers TrainingJan Gregersen
 
OpenMI Developers Training
OpenMI Developers TrainingOpenMI Developers Training
OpenMI Developers TrainingJan Gregersen
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Vasia Kalavri
 
Introduction to hadoop
Introduction to hadoopIntroduction to hadoop
Introduction to hadoopRon Sher
 
XPDS16: Xen Live Patching - Updating Xen Without Rebooting - Konrad Wilk, Ora...
XPDS16: Xen Live Patching - Updating Xen Without Rebooting - Konrad Wilk, Ora...XPDS16: Xen Live Patching - Updating Xen Without Rebooting - Konrad Wilk, Ora...
XPDS16: Xen Live Patching - Updating Xen Without Rebooting - Konrad Wilk, Ora...The Linux Foundation
 
Reactive Thinking in Java
Reactive Thinking in JavaReactive Thinking in Java
Reactive Thinking in JavaYakov Fain
 
Groovy concurrency
Groovy concurrencyGroovy concurrency
Groovy concurrencyAlex Miller
 

Similar to Processing edges on apache giraph (20)

2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group
 
2013 06-03 berlin buzzwords
2013 06-03 berlin buzzwords2013 06-03 berlin buzzwords
2013 06-03 berlin buzzwords
 
Mining quasi bicliques using giraph
Mining quasi bicliques using giraphMining quasi bicliques using giraph
Mining quasi bicliques using giraph
 
Java Review
Java ReviewJava Review





Processing edges on Apache Giraph

  • 1. Processing Over a Billion Edges on Apache Giraph Hadoop Summit 2012 Avery Ching Software Engineer 6/14/2012
  • 2. Agenda 1 Motivation and Background 2 Giraph Concepts/API 3 Example Applications 4 Architecture Overview 5 Recent/Future Improvements
  • 3. What is Apache Giraph? •  Loose implementation of Google’s Pregel that runs as a map-only job on Hadoop •  “Think like a vertex” that can send messages to any other vertex in the graph using the bulk synchronous parallel programming model •  An in-memory scalable system* ▪  Will be enhanced with out-of-core messages/vertices to handle larger problem sets.
  • 4. What (social) graphs are we targeting? •  3/2012 LinkedIn has 161 million users •  6/2012 Twitter discloses 140 million MAU •  4/2012 Facebook declares 901 million MAU
  • 5. Example applications •  Ranking ▪  Popularity, importance, etc. •  Label Propagation ▪  Location, school, gender, etc. •  Community ▪  Groups, interests
  • 6. Bulk synchronous parallel •  Supersteps ▪  A global epoch followed by a global barrier where components do concurrent computation and send messages •  Point-to-point messages (i.e. vertex to vertex) ▪  Sent during a superstep from one component to another and then delivered in the following superstep •  Computation complete when all components complete
  • 7. Computation + Superstep Communication Processors Time Barrier
  • 8. MapReduce -> Giraph “Think like a vertex”, not a key-value pair! MapReduce: public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { void map(KEYIN key, VALUEIN value, Context context) throws IOException, InterruptedException; } Giraph: public class Vertex<I extends WritableComparable, V extends Writable, E extends Writable, M extends Writable> { void compute(Iterator<M> msgIterator); }
  • 9. Basic Giraph API Methods available to compute() Immediate effect/access: I getVertexId(); V getVertexValue(); void setVertexValue(V vertexValue); Iterator<I> iterator(); E getEdgeValue(I targetVertexId); boolean hasEdge(I targetVertexId); boolean addEdge(I targetVertexId, E edgeValue); E removeEdge(I targetVertexId); void voteToHalt(); boolean isHalted() Next superstep: void sendMsg(I id, M msg); void sendMsgToAllEdges(M msg); void addVertexRequest(BasicVertex<I, V, E, M> vertex); void removeVertexRequest(I vertexId); void addEdgeRequest(I sourceVertexId, Edge<I, E> edge); void removeEdgeRequest(I sourceVertexId, I destVertexId)
  • 10. Why not implement Giraph with multiple MapReduce jobs? •  Too much disk, no in-memory caching, a superstep becomes a job! [Diagram: input format (splits 0-3), map tasks, intermediate files, reduce tasks, output format (outputs 0-1)]
  • 11. Giraph is a single Map-only job in Hadoop •  Hadoop is purely a resource manager for Giraph, all communication is done through Netty-based IPC [Diagram: vertex input format (splits 0-3), map tasks, vertex output format (outputs 0-1)]
  • 12. Maximum vertex value implementation public class MaxValueVertex extends EdgeListVertex< IntWritable, IntWritable, IntWritable, IntWritable> { @Override public void compute(Iterator<IntWritable> msgIterator) { boolean changed = false; while (msgIterator.hasNext()) { IntWritable msgValue = msgIterator.next(); if (msgValue.get() > getVertexValue().get()) { setVertexValue(msgValue); changed = true; } } if (getSuperstep() == 0 || changed) { sendMsgToAllEdges(getVertexValue()); } else { voteToHalt(); } } }
  • 13. Maximum vertex value [Figure: vertex values held on Processor 1 (5) and Processor 2 (1, 2) over time; after successive barriers every vertex holds the value 5]
  • 14. Page rank implementation public class SimplePageRankVertex extends EdgeListVertex<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> { public void compute(Iterator<DoubleWritable> msgIterator) { if (getSuperstep() >= 1) { double sum = 0; while (msgIterator.hasNext()) { sum += msgIterator.next().get(); } setVertexValue(new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum)); } if (getSuperstep() < 30) { long edges = getNumOutEdges(); sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges)); } else { voteToHalt(); } } }
  • 16. Giraph components •  Master – Application coordinator ▪  One active master at a time ▪  Assigns partition owners to workers prior to each superstep ▪  Synchronizes supersteps •  Worker – Computation & messaging ▪  Loads the graph from input splits ▪  Does the computation/messaging of its assigned partitions •  ZooKeeper ▪  Maintains global application state
  • 17. Graph distribution •  Master graph partitioner ▪  Create initial partitions, generate partition owner changes between supersteps •  Worker graph partitioner ▪  Determine which partition a vertex belongs to ▪  Create/modify the partition stats (can split/merge partitions) •  Default is hash partitioning (hashCode()) ▪  Range-based partitioning is also possible on a per-type basis
  • 18. Graph distribution example [Diagram: the master coordinates workers 0-3; each worker loads/stores, computes, and exchanges messages for two partitions and their stats: worker 0 holds partitions 0-1, worker 1 partitions 2-3, worker 2 partitions 4-5, worker 3 partitions 6-7]
  • 19. Customizable fault tolerance •  No single point of failure from Giraph threads ▪  With multiple master threads, if the current master dies, a new one will automatically take over. ▪  If a worker thread dies, the application is rolled back to a previously checkpointed superstep. The next superstep will begin with the new number of workers. ▪  If a ZooKeeper server dies, as long as a quorum remains, the application can proceed •  Hadoop single points of failure still exist ▪  Namenode, jobtracker ▪  Restarting manually from a checkpoint is always possible
  • 20. Master thread fault tolerance [Diagram: before failure of active master 0, master 0 is “active” and masters 1 and 2 are “spare”; after the failure, master 1 is “active” and master 2 remains “spare”; the active master state sits alongside the masters in both panels] •  One active master, with spare masters taking over in the event of an active master failure •  All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails •  “Active” master implemented as a queue in ZooKeeper
  • 21. Worker thread fault tolerance [Diagram: superstep i (no checkpoint), superstep i+1 (checkpoint), superstep i+2 (no checkpoint), worker failure!; restart at superstep i+1 (checkpoint), superstep i+2 (no checkpoint), superstep i+3 (checkpoint), worker failure after checkpoint complete!; restart at superstep i+3 (no checkpoint), application complete] •  A single worker death fails the superstep •  Application reverts to the last committed superstep automatically ▪  Master detects worker failure during any superstep with a ZooKeeper “health” znode ▪  Master chooses the last committed superstep and sends a command through ZooKeeper for all workers to restart from that superstep
  • 22. Optional features •  Combiners ▪  Similar to Map-Reduce combiners ▪  Users implement a combine() method that can reduce the number of messages sent and received ▪  Run on both the client side (memory, network) and server side (memory) •  Aggregators ▪  Similar to MPI aggregation routines (e.g. max, min, sum, etc.) ▪  Commutative and associative operations that are performed globally ▪  Examples include global communication, monitoring, and statistics
  • 23. Recent Netty IPC implementation •  Big improvement over the Hadoop RPC implementation •  10-39% overall performance improvement •  Still need more Netty tuning [Chart: time (seconds) and % improvement vs. workers (10, 30, 50) for Netty and Hadoop RPC]
  • 24. Recent benchmarks •  Test cluster of 80 machines ▪  Facebook Hadoop (https://github.com/facebook/hadoop-20) ▪  72 cores, 64+ GB of memory ▪  org.apache.giraph.benchmark.PageRankBenchmark ▪  5 supersteps ▪  No checkpointing ▪  10 edges per vertex
  • 25. Worker scalability [Chart: time (seconds) vs. workers (10 to 50)]
  • 26. Edge Scalability [Chart: time (seconds) vs. edges (1 to 5 billion)]
  • 27. Worker / edge scalability [Chart: run time (seconds) and edges (billions) vs. workers (10, 30, 50)]
  • 28. Apache Giraph has graduated as of 5/2012 •  Incubated for less than a year (entered incubator 9/12) •  Committers from HortonWorks, Twitter, LinkedIn, Facebook, TrendMicro and various schools (VU Amsterdam, TU Berlin, Korea University) •  Released 0.1 as of 2/6/2012, will release 0.2 within a few months
  • 29. Future improvements •  Out-of-core messages/graph ▪  Under memory pressure, dump messages/portions of the graph to local disk ▪  Ability to run applications without having all needed memory •  Performance improvements ▪  Netty is a good step in the right direction, but need to tune messaging performance as it takes up a majority of the time ▪  Scale back use of ZooKeeper to only be for health registration, rather than implementing aggregators and coordination
  • 30. More future improvements •  Adding a master#compute() method ▪  Arbitrary master computation that sends results to workers prior to a superstep to simplify certain algorithms ▪  GIRAPH-127 •  Handling skew ▪  Some vertices have a large number of edges and we need to break them up and handle them differently to provide better scalability
  • 31. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0
  • 32. Sessions will resume at 4:30pm
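
The maximum-vertex-value example on slides 12 and 13 can be traced without a cluster. The following is a minimal sketch in plain Java that imitates the superstep, barrier, and vote-to-halt behaviour described in the slides. It does not use any Giraph or Hadoop classes: the class name `MaxValueBspSketch`, the `run` method, and the array-based graph encoding are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java imitation of the max-value slide; hypothetical names, no Giraph classes. */
class MaxValueBspSketch {

    /**
     * initialValues[i] is the starting value of vertex i and edges[i] lists its out-neighbours.
     * Runs supersteps until every vertex has halted and no messages are in flight,
     * then returns the final vertex values.
     */
    static int[] run(int[] initialValues, int[][] edges) {
        int n = initialValues.length;
        int[] values = Arrays.copyOf(initialValues, n);
        boolean[] halted = new boolean[n];
        List<List<Integer>> inbox = emptyInboxes(n);

        for (long superstep = 0; ; superstep++) {
            List<List<Integer>> nextInbox = emptyInboxes(n);
            boolean anyActive = false;
            for (int v = 0; v < n; v++) {
                List<Integer> messages = inbox.get(v);
                // A halted vertex is only woken up by an incoming message.
                if (halted[v] && messages.isEmpty()) {
                    continue;
                }
                halted[v] = false;
                anyActive = true;

                // Same logic as compute() on slide 12.
                boolean changed = false;
                for (int msg : messages) {
                    if (msg > values[v]) {
                        values[v] = msg;
                        changed = true;
                    }
                }
                if (superstep == 0 || changed) {
                    for (int target : edges[v]) {
                        nextInbox.get(target).add(values[v]); // delivered next superstep
                    }
                } else {
                    halted[v] = true; // voteToHalt()
                }
            }
            if (!anyActive) {
                return values;
            }
            inbox = nextInbox; // global barrier: messages become visible only now
        }
    }

    private static List<List<Integer>> emptyInboxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }
}
```

Calling `MaxValueBspSketch.run(new int[] {5, 1, 2}, new int[][] {{1}, {0, 2}, {1}})` on a three-vertex chain leaves every vertex holding 5, which matches the values shown in the slide 13 figure.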
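
The PageRank update on slide 14 (new value = 0.15 / N + 0.85 × sum of incoming messages, for 30 supersteps) can be checked the same way. This is a sketch under stated assumptions rather than the benchmark's code: the class `PageRankBspSketch`, the initial value of 1/N per vertex, and the guard that skips vertices with no out-edges are all choices made here for illustration and are not taken from the slides.

```java
import java.util.Arrays;

/** Plain-Java imitation of the slide 14 update rule; hypothetical names, no Giraph classes. */
class PageRankBspSketch {

    /** edges[i] lists the out-neighbours of vertex i; returns the value of each vertex. */
    static double[] run(int[][] edges, int maxSupersteps) {
        int n = edges.length;
        double[] value = new double[n];
        Arrays.fill(value, 1.0 / n);     // assumed starting value, not stated on the slide
        double[] inbox = new double[n];  // sum of the messages delivered to each vertex

        for (int superstep = 0; superstep <= maxSupersteps; superstep++) {
            double[] nextInbox = new double[n];
            for (int v = 0; v < n; v++) {
                if (superstep >= 1) {
                    value[v] = 0.15 / n + 0.85 * inbox[v];
                }
                // Assumption: vertices without out-edges send nothing (avoids dividing by zero).
                if (superstep < maxSupersteps && edges[v].length > 0) {
                    double share = value[v] / edges[v].length;
                    for (int target : edges[v]) {
                        nextInbox[target] += share;
                    }
                }
            }
            inbox = nextInbox; // barrier: messages are read in the following superstep
        }
        return value;
    }
}
```

On a three-vertex cycle, `PageRankBspSketch.run(new int[][] {{1}, {2}, {0}}, 30)` keeps every vertex at roughly 1/3, since each vertex receives exactly what it sends.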
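
Slide 17 states that the default graph distribution is hash partitioning based on `hashCode()`. As a rough illustration of the idea only, a vertex ID can be mapped to a partition index as below. The class `HashPartitionSketch` and the method `partitionFor` are hypothetical, and the use of `Math.floorMod` is a choice made here to keep the index non-negative; it is not claimed to be how Giraph computes the index.

```java
/** Hypothetical helper showing hashCode()-based partition selection. */
class HashPartitionSketch {

    /** Maps a vertex ID to a partition index in the range [0, partitionCount). */
    static int partitionFor(Object vertexId, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(vertexId.hashCode(), partitionCount);
    }
}
```

With the eight partitions drawn on slide 18, `HashPartitionSketch.partitionFor(Long.valueOf(10L), 8)` yields partition 2, because `Long.valueOf(10L).hashCode()` is 10.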