Comparison and Evaluation of Open Source 
Implementations of Pregel and Related Systems 
December 2, 2013 
Joshua Woo, Prashant Raghav, Vishnu Prathish 
David R. Cheriton School of Computer Science 
University of Waterloo
Outline 
● Motivation 
● Our Project 
● Setup 
● Preliminary Results 
● Preliminary Analysis 
● In-Progress 
● References
Motivation 
Recall: Pregel 
● Large-scale graph processing system 
● Fault-tolerant framework for graph algorithms 
● MapReduce for graph operations? 
● Vertex-centric model (“think like a vertex”)
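To make "think like a vertex" concrete, the following is a minimal single-process sketch in Python of synchronous supersteps, using shortest paths as the per-vertex logic. It is not Pregel's API or that of any system in this deck; the function name `sssp_supersteps` and the dictionary-based graph layout are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def sssp_supersteps(edges, source):
    """Single-process sketch of vertex-centric SSSP with synchronous supersteps.

    edges: dict mapping vertex -> list of (neighbor, weight) out-edges.
    Returns (distances, number_of_supersteps).
    """
    # Collect every vertex, including those that only appear as edge targets.
    vertices = set(edges)
    for outs in edges.values():
        vertices.update(neighbor for neighbor, _ in outs)

    dist = {v: math.inf for v in vertices}
    inbox = {source: [0.0]}  # messages visible at the start of superstep 0
    supersteps = 0

    while inbox:  # the computation halts once no messages are in flight
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:  # a vertex only acts when a message improves it
                dist[v] = best
                for neighbor, weight in edges.get(v, []):
                    outbox[neighbor].append(best + weight)
        inbox = outbox  # messages sent now are delivered in the next superstep
        supersteps += 1

    return dist, supersteps
```

On a three-vertex graph with edges a→b (1.0), a→c (4.0), and b→c (2.0), the sketch settles on a distance of 3.0 for c after three supersteps. Messages sent in one superstep only become visible in the next, which is the barrier behaviour the BSP model prescribes.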
Motivation 
● Pregel is proprietary 
● Many open source graph processing systems 
○ Pregel clones 
○ Pregel-inspired 
○ BSP
Motivation 
● Apache Hama 
● Signal/Collect 
● Apache Giraph 
● GPS 
● GraphLab 
● Phoebus 
● GoldenOrb 
● HipG 
● Mizan
Motivation 
System          Impl. language  Type
Apache Hama     Java            Pure BSP framework
Signal/Collect  Scala           Pregel-inspired
Apache Giraph   Java            Pregel clone
GPS             Java            Advanced Pregel clone
GraphLab        C++             Pregel-inspired
Phoebus         Erlang          Pregel clone
GoldenOrb       Java            Pregel clone
HipG            Java            Advanced Pregel clone
Mizan           C++             Advanced Pregel clone
Motivation 
● How do these systems compare? 
○ In terms of performance (runtime)? 
○ In terms of memory footprint? 
○ In terms of network utilization (num. messages)? 
○ Variables: 
■ Algorithm 
■ Graph size (number of vertices) 
■ Cluster size
Our Project 
● Compare at least 3 systems 
○ Apache Hama - general BSP framework 
○ Apache Giraph - Hadoop Map-only job, Facebook 
○ GPS - +dynamic repartitioning, +multi vertex-centric 
○ Signal/Collect - +edges, +async computations 
○ GraphLab 
○ Mizan
Our Project 
● Measure the runtime of at least two algorithms on each system 
○ PageRank 
■ Fixed number of supersteps = 30 
○ Single Source Shortest Path (SSSP) 
○ k-means clustering
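As a point of reference for the PageRank workload, here is a small Python sketch that runs PageRank for a fixed number of synchronous supersteps instead of iterating to convergence. Only the superstep count of 30 comes from the slide; the damping factor of 0.85, the function name, and the handling of dangling vertices are assumptions, and the benchmarked systems may differ on all three.

```python
def pagerank_fixed_supersteps(out_edges, supersteps=30, damping=0.85):
    """PageRank for a fixed number of synchronous supersteps.

    out_edges: dict mapping vertex -> list of out-neighbors. Every vertex
    must appear as a key. Rank held by dangling vertices (no out-edges) is
    simply dropped in this sketch (an assumption, not a recommendation).
    """
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}

    for _ in range(supersteps):
        # "Messages": each vertex splits its rank evenly over its out-edges.
        incoming = {v: 0.0 for v in out_edges}
        for v, outs in out_edges.items():
            if outs:
                share = rank[v] / len(outs)
                for u in outs:
                    incoming[u] += share
        # Every vertex updates from the messages of the previous superstep.
        rank = {v: (1.0 - damping) / n + damping * incoming[v] for v in out_edges}

    return rank
```

Fixing the superstep count keeps the amount of work per run identical across systems, so runtime differences reflect the systems and not differing convergence checks.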
Setup 
● Experiments on AWS 
○ Ubuntu 12.04 m1.medium EC2 instances 
■ 2 ECUs, 1 vCPU, 3.7 GiB memory, moderate network performance 
■ 8 GiB EBS volume per instance 
○ Cluster sizes: 
■ Single-node cluster 
■ 4-node cluster 
■ 8-node cluster
Setup 
● Experiments on AWS 
○ 5 runs per dataset per algorithm per cluster 
■ 35 runs per algorithm per cluster 
■ 70 runs per cluster 
■ 140 runs in total (single-node, 4-node) 
● TODO: another 70 runs (8-node)
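The run counts above follow from simple multiplication, reproduced here as a short Python sketch. The function and parameter names are hypothetical, and the slide does not say whether the totals are per system, so the sketch only mirrors the arithmetic as given.

```python
def run_counts(runs_per_dataset=5, datasets=7, algorithms=2, clusters=2):
    """Reproduce the run-count arithmetic from the slide.

    algorithms=2 corresponds to PageRank and SSSP; clusters=2 corresponds
    to the single-node and 4-node clusters already measured.
    """
    per_algorithm_per_cluster = runs_per_dataset * datasets
    per_cluster = per_algorithm_per_cluster * algorithms
    return {
        "per_algorithm_per_cluster": per_algorithm_per_cluster,
        "per_cluster": per_cluster,
        "total": per_cluster * clusters,
    }
```

Calling `run_counts(clusters=3)` gives the total once the outstanding 8-node runs are included.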
Setup 
● Dataset 
○ 7 datasets 
■ tinyEWD: 8 vertices 15 edges 
■ mediumEWD: 250 vertices 2,546 edges 
■ 1000EWD: 1,000 vertices 16,866 edges 
■ rome99: 3,353 vertices 8,870 edges 
■ 10000EWD: 10,000 vertices 123,462 edges 
■ NYC: 264,346 vertices 733,846 edges 
■ largeEWD: 1,000,000 vertices 15,172,126 edges 
○ Source: http://algs4.cs.princeton.edu/44sp/
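The EWD files from the source above use the algs4 edge-weighted digraph text layout: the vertex count, then the edge count, then one `tail head weight` triple per edge. A minimal Python reader for that layout might look like the sketch below; the function name `parse_ewd` is hypothetical, and datasets distributed in other formats would need their own reader.

```python
def parse_ewd(text):
    """Parse the algs4 edge-weighted digraph text format.

    Line 1: number of vertices V. Line 2: number of edges E.
    Then E whitespace-separated triples "tail head weight".
    Returns (V, list of (tail, head, weight)).
    """
    tokens = text.split()
    num_vertices = int(tokens[0])
    num_edges = int(tokens[1])

    edges = []
    for i in range(num_edges):
        base = 2 + 3 * i  # skip the two header tokens, then 3 tokens per edge
        tail = int(tokens[base])
        head = int(tokens[base + 1])
        weight = float(tokens[base + 2])
        edges.append((tail, head, weight))

    return num_vertices, edges
```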
Setup 
● Systems 
○ Hama 
■ Hadoop 1.03.0 
■ Hama 0.6.3 
○ Giraph 
■ Hadoop 0.20.203rc1 
■ Giraph (trunk@37bc2c80564b45d7e4ce95db76f5411a6b8bdb3a) 
○ GPS 
■ Hadoop 0.20.203rc1 
■ GPS (trunk@Revision 112)
Setup 
● Input Graph 
○ Source files converted into format suitable for each system 
■ Time for this conversion excluded from results: 
● Conversion done before algorithms are run (pre-processing?) 
● Negligible for largeEWD (1,000,000 vertices, 15,172,126 edges)
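The slides do not say which input format each system was given. As one illustration of the conversion step, the Python sketch below turns an edge list into one JSON line per vertex, `[id, value, [[target, weight], ...]]`, a layout modeled on the JSON adjacency input used in Giraph's quick-start examples. The function name, the initial vertex value, and the choice of target format are all assumptions.

```python
import json
from collections import defaultdict


def to_json_adjacency_lines(num_vertices, edges, initial_value=0.0):
    """Convert an edge list into one JSON adjacency line per vertex.

    edges: iterable of (tail, head, weight) with integer vertex ids
    in the range 0..num_vertices-1.
    Returns a list of strings, one per vertex, in id order.
    """
    adjacency = defaultdict(list)
    for tail, head, weight in edges:
        adjacency[tail].append([head, weight])

    # Vertices without out-edges still get a line, with an empty edge list.
    return [
        json.dumps([v, initial_value, adjacency[v]])
        for v in range(num_vertices)
    ]
```

Because this runs once before any algorithm starts, timing it separately from the benchmark runs is straightforward.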
Preliminary Results 
Average SSSP runtime on 4-node cluster (in seconds) 
Dataset        Hama      Giraph       GPS
tinyEWD       14.17       41.60     14.40
mediumEWD     16.36       44.00     36.00
1000EWD       18.06       48.80     46.60
rome99        22.95       66.00     50.00
10000EWD      25.32       67.40     55.00
NYC          165.01      267.00    310.00
largeEWD   6,109.20      602.80    618.70
Preliminary Results 
[Chart: SSSP runtime vs. graph size (num. vertices)]
Preliminary Results 
Average PageRank (30 supersteps) runtime on 4-node cluster (in seconds) 
Dataset        Hama      Giraph       GPS
tinyEWD       29.36       49.40     58.57
mediumEWD     30.26       53.40     60.42
1000EWD       37.86       54.60     61.03
rome99        29.35       56.20     61.80
10000EWD     302.33       61.80     64.80
NYC        1,001.24      134.40     68.69
largeEWD     Failed    2,100.00  1,213.56
Preliminary Results 
[Chart: PageRank runtime vs. graph size (num. vertices)]
Preliminary Analysis 
● Resource crunch point 
○ Runtime changes little with graph size until that point, then rises sharply 
● Hama does not scale well (vertices ~10^4) 
● Giraph and GPS scale better 
● In general, PageRank runtime > SSSP runtime 
● GPS input reader does not guarantee true partitioning for large datasets 
● Which ‘knobs’ to keep constant? - Optimization vs. Comparability
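One way to locate the "resource crunch" point in the measurements is to look for the first dataset whose runtime jumps by more than some factor over the previous, smaller dataset. The Python sketch below applies that idea to the 4-node PageRank runtimes for Hama and Giraph reported earlier in the deck. The threshold of 2.0 and the function name `first_jump` are arbitrary choices for illustration, not part of the authors' analysis.

```python
def first_jump(points, factor=2.0):
    """Return the first (size, runtime) whose runtime exceeds `factor`
    times the runtime of the previous point, or None if there is none.

    points: list of (num_vertices, runtime_seconds), sorted by size.
    """
    for (_, prev_runtime), (size, runtime) in zip(points, points[1:]):
        if runtime > factor * prev_runtime:
            return size, runtime
    return None


# 4-node PageRank runtimes from the results table (Hama failed on largeEWD).
HAMA_PAGERANK_4NODE = [
    (8, 29.36), (250, 30.26), (1000, 37.86),
    (3353, 29.35), (10000, 302.33), (264346, 1001.24),
]
GIRAPH_PAGERANK_4NODE = [
    (8, 49.40), (250, 53.40), (1000, 54.60), (3353, 56.20),
    (10000, 61.80), (264346, 134.40), (1000000, 2100.00),
]
```

With this threshold, `first_jump(HAMA_PAGERANK_4NODE)` points at the 10,000-vertex dataset, in line with the observation about Hama above, while Giraph's first jump only appears at the NYC graph.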
In-Progress 
● Output validation 
● Memory footprint 
● Network utilization (num. messages) 
● GraphLab and Signal/Collect 
● Green-Marl? 
○ (DSL) → [Compiler] → (Giraph, GPS)
Questions?
Extras
Preliminary Results 
Number of supersteps for SSSP 
Dataset     Hama  Giraph  GPS
tinyEWD       10       7    7
mediumEWD     16      13   18
1000EWD       27      25   23
rome99       105     102   18
10000EWD      85      80   64
NYC          671     905  438
largeEWD     806     670  730
Preliminary Results 
[Chart: Number of supersteps for SSSP]
Really, really Preliminary 
PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated 
Dataset      Native    Green-Marl generated
tinyEWD       58.57                   60.20
mediumEWD     60.42                   60.11
1000EWD       61.03                   62.30
rome99        61.80                   62.32
10000EWD      64.80                   65.78
NYC           68.69                   71.34
largeEWD   1,213.56                       -
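To read the table above as relative overhead, the Python sketch below computes (generated − native) / native for each dataset that has both measurements. The figures are copied from the table; the names `GPS_PAGERANK_SECONDS` and `relative_overhead` are hypothetical, and with results this preliminary the output is a rough indication only.

```python
# (native, Green-Marl generated) PageRank runtimes on GPS, in seconds.
# largeEWD is omitted because no generated-code measurement is reported.
GPS_PAGERANK_SECONDS = {
    "tinyEWD": (58.57, 60.20),
    "mediumEWD": (60.42, 60.11),
    "1000EWD": (61.03, 62.30),
    "rome99": (61.80, 62.32),
    "10000EWD": (64.80, 65.78),
    "NYC": (68.69, 71.34),
}


def relative_overhead(native, generated):
    """Fractional slowdown of generated code relative to native code."""
    return (generated - native) / native


overheads = {
    name: relative_overhead(native, generated)
    for name, (native, generated) in GPS_PAGERANK_SECONDS.items()
}
```

A negative value, as for mediumEWD, means the generated code was measured as slightly faster than the native implementation on that dataset.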
Really, really Preliminary 
[Chart: PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated]
References 
● Our Project Proposal 
● http://algs4.cs.princeton.edu/44sp/ 
● https://github.com/apache/hadoop-common 
● https://github.com/apache/giraph 
● https://subversion.assembla.com/svn/phd-projects/gps/trunk/ 
● http://ppl.stanford.edu/main/green_marl.html

Comparing Pregel related systems

  • 1. Comparison and Evaluation of Open Source Implementations of Pregel and Related Systems December 2, 2013 Joshua Woo, Prashant Raghav, Vishnu Prathish David R. Cheriton School of Computer Science University of Waterloo
  • 2. Outline ● Motivation ● Our Project ● Setup ● Preliminary Results ● Preliminary Analysis ● In-Progress ● References
  • 3. Motivation Recall: Pregel ● Large-scale graph processing system ● Fault-tolerant framework for graph algorithms ● MapReduce for graph operations? ● Vertex-centric model (“think like a vertex”)
  • 4. Motivation ● Pregel is proprietary ● Many open source graph processing systems ○ Pregel clones ○ Pregel-inspired ○ BSP
  • 5. Motivation ● Apache Hama ● Signal/Collect ● Apache Giraph ● GPS ● GraphLab ● Phoebus ● GoldenOrb ● HipG ● Mizan
  • 6. Motivation
      System          Impl. Language  Type
      Apache Hama     Java            Pure BSP framework
      Signal/Collect  Scala           Pregel inspired
      Apache Giraph   Java            Pregel clone
      GPS             Java            Advanced Pregel clone
      GraphLab        C++             Pregel inspired
      Phoebus         Erlang          Pregel clone
      GoldenOrb       Java            Pregel clone
      HipG            Java            Advanced Pregel clone
      Mizan           C++             Advanced Pregel clone
  • 7. Motivation ● How do these systems compare? ○ In terms of performance (runtime)? ○ In terms of memory footprint? ○ In terms of network utilization (num. messages)? ○ Variables: ■ Algorithm ■ Graph size (number of vertices) ■ Cluster size
  • 8. Our Project ● Compare at least 3 systems ○ Apache Hama - general BSP framework ○ Apache Giraph - Hadoop Map-only job, Facebook ○ GPS - +dynamic repartitioning, +multi vertex-centric ○ Signal/Collect - +edges, +async computations ○ GraphLab ○ Mizan
  • 9. Our Project ● Measure the runtime of at least two algorithms on each system ○ PageRank ■ Fixed number of supersteps = 30 ○ Single Source Shortest Path (SSSP) ○ k-means clustering
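
As a rough illustration of the vertex-centric model and the fixed 30-superstep PageRank run described above, here is a minimal single-machine sketch. It is not the API of Hama, Giraph, or GPS. The function name `pagerank_bsp`, the damping factor of 0.85, and the uniform initial rank are assumptions made for the example, and every vertex is assumed to appear as a key in `out_edges`.

```python
from collections import defaultdict


def pagerank_bsp(out_edges, supersteps=30, damping=0.85):
    """Simulate vertex-centric PageRank for a fixed number of supersteps.

    out_edges maps each vertex to the list of vertices it links to.
    damping=0.85 is an assumed value, not taken from the slides.
    """
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}  # assumed uniform start
    inbox = defaultdict(list)               # messages delivered this superstep

    for step in range(supersteps):
        outbox = defaultdict(list)          # messages for the next superstep
        for v, targets in out_edges.items():
            # From superstep 1 on, each vertex folds in the rank shares
            # that its in-neighbours sent during the previous superstep.
            if step > 0:
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            # Every superstep except the last sends rank along out-edges.
            if step < supersteps - 1 and targets:
                share = rank[v] / len(targets)
                for t in targets:
                    outbox[t].append(share)
        inbox = outbox                      # barrier: swap message buffers
    return rank


if __name__ == "__main__":
    # Hypothetical three-vertex cycle, used only to exercise the sketch.
    print(pagerank_bsp({0: [1], 1: [2], 2: [0]}))
```

Swapping the message buffers once per loop iteration plays the role of the BSP barrier: no vertex sees a message until the superstep after it was sent.
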
  • 10. Setup ● Experiments on AWS ○ Ubuntu 12.04 m1.medium EC2 instances ■ 2 ECUs, 1 vCPU, 3.7 GiB memory, moderate network performance ■ 8 GiB EBS volume per instance ○ Cluster sizes: ■ Single-node cluster ■ 4-node cluster ■ 8-node cluster
  • 11. Setup ● Experiments on AWS ○ 5 runs per dataset per algorithm per cluster ■ 35 runs per algorithm per cluster ■ 70 runs per cluster ■ 140 runs in total (single-node, 4-node) ● TODO: another 70 runs (8-node)
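
The run counts on this slide follow from simple multiplication. The sketch below reproduces them, assuming the two algorithms are the ones with reported results (SSSP and PageRank). The constant names are made up for the example.

```python
RUNS_PER_DATASET = 5
DATASETS = 7
ALGORITHMS = 2          # assumed: SSSP and PageRank
CLUSTERS_DONE = 2       # single-node and 4-node
CLUSTERS_PLANNED = 3    # adds the 8-node cluster still marked TODO

runs_per_algorithm_per_cluster = RUNS_PER_DATASET * DATASETS        # 35
runs_per_cluster = runs_per_algorithm_per_cluster * ALGORITHMS      # 70
runs_so_far = runs_per_cluster * CLUSTERS_DONE                      # 140
runs_planned = runs_per_cluster * CLUSTERS_PLANNED                  # 210

if __name__ == "__main__":
    print(runs_per_algorithm_per_cluster, runs_per_cluster,
          runs_so_far, runs_planned)
```
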
  • 12. Setup ● Dataset ○ 7 datasets
      Dataset      Vertices       Edges
      tinyEWD             8          15
      mediumEWD         250       2,546
      1000EWD         1,000      16,866
      rome99          3,353       8,870
      10000EWD       10,000      16,866
      NYC           264,346     733,846
      largeEWD    1,000,000  15,172,126
    ○ Source: http://algs4.cs.princeton.edu/44sp/
  • 13. Setup ● Systems ○ Hama ■ Hadoop 1.03.0 ■ Hama 0.6.3 ○ Giraph ■ Hadoop 0.20.203rc1 ■ Giraph (trunk@37bc2c80564b45d7e4ce95db76f5411a6b8bdb3a) ○ GPS ■ Hadoop 0.20.203rc1 ■ GPS (trunk@Revision 112)
  • 14. Setup ● Input Graph ○ Source files converted into format suitable for each system ■ Time for this conversion excluded from results: ● Conversion done before algorithms are run (pre-processing?) ● Negligible for largeEWD (1,000,000 vertices, 15,172,126 edges)
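
The conversion step could look roughly like the sketch below. It assumes the source files use the EWD text layout (vertex count, then edge count, then one `v w weight` triple per edge). The output layout, one `vertex<TAB>neighbour:weight ...` line per vertex, is a hypothetical target format, not the actual input format of Hama, Giraph, or GPS, and the function names are made up for the example.

```python
def parse_ewd(text):
    """Parse an edge-weighted digraph: V, E, then E triples 'v w weight'."""
    tokens = text.split()
    num_vertices, num_edges = int(tokens[0]), int(tokens[1])
    triples = tokens[2:]
    if len(triples) != 3 * num_edges:
        raise ValueError("edge count does not match the header")

    adjacency = {v: [] for v in range(num_vertices)}
    for i in range(0, len(triples), 3):
        src, dst = int(triples[i]), int(triples[i + 1])
        weight = float(triples[i + 2])
        adjacency[src].append((dst, weight))
    return adjacency


def to_adjacency_lines(adjacency):
    """Emit one line per vertex in the hypothetical target format."""
    lines = []
    for v in sorted(adjacency):
        neighbours = " ".join(f"{dst}:{w}" for dst, w in adjacency[v])
        # Vertices without out-edges are written as the bare vertex id.
        lines.append(f"{v}\t{neighbours}".rstrip())
    return lines


if __name__ == "__main__":
    sample = "3\n2\n0 1 0.5\n1 2 0.25\n"  # made-up three-vertex graph
    print("\n".join(to_adjacency_lines(parse_ewd(sample))))
```
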
  • 15. Preliminary Results Average SSSP runtime on 4-node cluster (in seconds)
      Dataset           Hama    Giraph       GPS
      tinyEWD          14.17     41.60     14.40
      mediumEWD        16.36     44.00     36.00
      1000EWD          18.06     48.80     46.60
      rome99           22.95     66.00     50.00
      10000EWD         25.32     67.40     55.00
      NYC             165.01    267.00    310.00
      largeEWD      6,109.20    602.80    618.70
  • 16. Preliminary Results SSSP runtime vs. graph size (num. vertices)
  • 17. Preliminary Results Average PageRank (30 supersteps) runtime on 4-node cluster (in seconds)
      Dataset           Hama    Giraph       GPS
      tinyEWD          29.36     49.40     58.57
      mediumEWD        30.26     53.40     60.42
      1000EWD          37.86     54.60     61.03
      rome99           29.35     56.20     61.80
      10000EWD        302.33     61.80     64.80
      NYC           1,001.24    134.40     68.69
      largeEWD        Failed  2,100.00  1,213.56
  • 18. Preliminary Results PageRank runtime vs. graph size (num. vertices)
  • 19. Preliminary Analysis ● Resource crunch point ○ Runtime changes little with graph size until resources run short, then rises sharply ● Hama does not scale well beyond ~10^4 vertices ● Giraph and GPS scale better ● In general, PageRank runtime > SSSP runtime ● GPS input reader does not guarantee true partitioning for large datasets ● Which ‘knobs’ to keep constant? - Optimization vs. Comparability
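
One way to make the scaling observation concrete is to compute the Hama-to-Giraph runtime ratio per dataset from the 4-node averages on slides 15 and 17 and find the first dataset where Hama is slower. This is a sketch for illustration only. The helper names are made up, and a failed run (`None`) is treated as slower by assumption.

```python
DATASETS = ["tinyEWD", "mediumEWD", "1000EWD", "rome99",
            "10000EWD", "NYC", "largeEWD"]

# Average runtimes in seconds on the 4-node cluster, copied from the slides.
SSSP = {
    "Hama":   [14.17, 16.36, 18.06, 22.95, 25.32, 165.01, 6109.20],
    "Giraph": [41.60, 44.00, 48.80, 66.00, 67.40, 267.00, 602.80],
}
PAGERANK = {
    "Hama":   [29.36, 30.26, 37.86, 29.35, 302.33, 1001.24, None],  # None = Failed
    "Giraph": [49.40, 53.40, 54.60, 56.20, 61.80, 134.40, 2100.00],
}


def hama_to_giraph(table):
    """Return {dataset: Hama runtime / Giraph runtime}, None if Hama failed."""
    ratios = {}
    for name, hama, giraph in zip(DATASETS, table["Hama"], table["Giraph"]):
        ratios[name] = None if hama is None else hama / giraph
    return ratios


def first_slower(ratios):
    """First dataset (in size order) where Hama is slower than Giraph."""
    for name in DATASETS:
        ratio = ratios[name]
        if ratio is None or ratio > 1.0:   # assumption: a failed run counts
            return name
    return None


if __name__ == "__main__":
    for label, table in (("SSSP", SSSP), ("PageRank", PAGERANK)):
        print(label, first_slower(hama_to_giraph(table)))
```

On these numbers Hama is faster than Giraph on the small graphs for both algorithms, and the crossover comes earlier for PageRank than for SSSP.
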
  • 20. In-Progress ● Output validation ● Memory footprint ● Network utilization (num. messages) ● GraphLab and Signal/Collect ● Green-Marl? ○ (DSL) → [Compiler] → (Giraph, GPS)
  • 23. Preliminary Results Number of supersteps for SSSP
      Dataset       Hama  Giraph   GPS
      tinyEWD         10       7     7
      mediumEWD       16      13    18
      1000EWD         27      25    23
      rome99         105     102    18
      10000EWD        85      80    64
      NYC            671     905   438
      largeEWD       806     670   730
  • 24. Preliminary Results Number of supersteps for SSSP
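
To show where a superstep count for SSSP comes from, the sketch below simulates a message-passing SSSP on one machine and counts supersteps until no messages remain in flight. It is not the implementation used by any of the measured systems. The counting convention (the step in which the source processes its seed message is counted) and the toy graph are assumptions, and edge weights are assumed non-negative.

```python
import math
from collections import defaultdict


def sssp_bsp(adjacency, source):
    """Vertex-centric SSSP; returns (distances, number of supersteps).

    adjacency maps every vertex to a list of (target, weight) pairs.
    """
    dist = {v: math.inf for v in adjacency}
    inbox = {source: [0.0]}          # seed message wakes up the source
    supersteps = 0

    while inbox:                     # stop when no messages are in flight
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            best = min(messages)
            # A vertex only propagates when it learns a shorter distance.
            if best < dist[v]:
                dist[v] = best
                for target, weight in adjacency[v]:
                    outbox[target].append(best + weight)
        inbox = outbox               # barrier between supersteps
        supersteps += 1
    return dist, supersteps


if __name__ == "__main__":
    # Hypothetical four-vertex graph, not one of the benchmark datasets.
    toy = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.0)], 2: [(3, 1.0)], 3: []}
    print(sssp_bsp(toy, source=0))
```

The count depends on how far improved distances have to propagate, not only on graph size, and a different convention for counting the first or last superstep shifts it by a constant.
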
  • 25. Really, really Preliminary PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated
      Dataset         Native  Green-Marl generated
      tinyEWD          58.57                 60.20
      mediumEWD        60.42                 60.11
      1000EWD          61.03                 62.30
      rome99           61.80                 62.32
      10000EWD         64.80                 65.78
      NYC              68.69                 71.34
      largeEWD      1,213.56                     -
  • 26. Really, really Preliminary PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated
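
The native and Green-Marl generated runtimes on slide 25 can be compared as a relative overhead per dataset. The sketch below does that arithmetic. The function name is made up, and the missing largeEWD measurement is kept as `None` instead of being filled in.

```python
# PageRank runtimes on GPS in seconds, copied from slide 25.
NATIVE = {
    "tinyEWD": 58.57, "mediumEWD": 60.42, "1000EWD": 61.03, "rome99": 61.80,
    "10000EWD": 64.80, "NYC": 68.69, "largeEWD": 1213.56,
}
GENERATED = {
    "tinyEWD": 60.20, "mediumEWD": 60.11, "1000EWD": 62.30, "rome99": 62.32,
    "10000EWD": 65.78, "NYC": 71.34, "largeEWD": None,  # not reported
}


def overhead_percent(native, generated):
    """Relative runtime difference of generated vs. native code, in percent."""
    result = {}
    for name, base in native.items():
        other = generated[name]
        result[name] = None if other is None else 100.0 * (other - base) / base
    return result


if __name__ == "__main__":
    for name, pct in overhead_percent(NATIVE, GENERATED).items():
        print(name, "n/a" if pct is None else f"{pct:+.2f}%")
```
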
  • 27. References ● Our Project Proposal ● http://algs4.cs.princeton.edu/44sp/ ● https://github.com/apache/hadoop-common ● https://github.com/apache/giraph ● https://subversion.assembla.com/svn/phd-projects/gps/trunk/ ● http://ppl.stanford.edu/main/green_marl.html