Spark GraphX & Pregel
Challenges and Best Practices
Ashutosh Trivedi (IIIT Bangalore)
Kaushik Ranjan (IIIT Bangalore)
Sigmoid-Meetup Bangalore
https://github.com/anantasty/SparkAlgorithms
Agenda
• Introduction to GraphX
– How to describe a graph
– RDDs to store Graph
– Algorithms available
• Application in graph algorithms
– Feedback Vertex Set of a Graph
– Identifying parallel parts of the solution.
• Challenges we faced
• Best practices
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
Graph Representation
class Graph[V, E] {
  // Conceptual constructor: a graph is a vertex table plus an edge table
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])
}
• VertexRDD[A] extends RDD[(VertexId, A)] and adds the constraint that
each VertexId occurs only once.
• VertexRDD[A] therefore represents a set of vertices, each with an
attribute of type A.
• EdgeRDD[ED] extends RDD[Edge[ED]].
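As a minimal sketch of the two tables above, the following builds a small property graph from a vertex RDD and an edge RDD. It assumes it is pasted into spark-shell, where `sc` is the provided SparkContext; the vertex names and edge weights are made up for illustration.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumption: `sc` is the SparkContext that spark-shell provides.
// Vertex table: (id, attribute). The names are illustrative only.
val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "A"), (2L, "B"), (3L, "C"), (4L, "D")))

// Edge table: Edge(srcId, dstId, attribute).
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))

val graph: Graph[String, Int] = Graph(vertices, edges)

// graph.vertices is a VertexRDD[String]; graph.edges is an EdgeRDD[Int].
val vertexCount = graph.vertices.count()
val edgeCount = graph.edges.count()
```

Both views are ordinary RDDs, so the usual RDD operations (map, filter, join) remain available on them.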
GraphX - Representation
Vertex and Edges
[Figure: a vertex and an edge, labelled Vertex and Edge]
Triplets Join Vertices and Edges
• The triplets operator joins vertices and edges:
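Vertices: A, B, C, D
Edges (source, destination): (A, B), (A, C), (B, C), (C, D)
Triplets: one per edge, each holding the edge together with its source and destination vertices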
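The join performed by the triplets operator can be sketched as follows. This assumes spark-shell (`sc` is its SparkContext); the numeric ids and the edge labels e1 to e4 are invented to mirror the A, B, C, D example.

```scala
import org.apache.spark.graphx._

// Assumption: `sc` is the spark-shell SparkContext; ids and labels are illustrative.
val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"), (4L, "D")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "e1"), Edge(1L, 3L, "e2"), Edge(2L, 3L, "e3"), Edge(3L, 4L, "e4")))
val graph = Graph(vertices, edges)

// Each EdgeTriplet carries srcId/srcAttr, dstId/dstAttr and the edge attribute attr.
// Map to plain strings before collecting, then sort for a stable order.
val described: Array[String] = graph.triplets
  .map(t => s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}")
  .collect()
  .sorted

described.foreach(println)
```

Mapping each triplet to a plain value before `collect()` is a deliberate choice: it avoids holding on to the triplet objects themselves on the driver.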
Triplets elements
Subgraphs
Predicates vpred and epred
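A small sketch of how the two predicates combine, assuming spark-shell (`sc`) and made-up vertex names and edge weights: `vpred` decides which vertices stay, `epred` decides which edges stay, and an edge is kept only if it passes `epred` and both of its endpoints pass `vpred`.

```scala
import org.apache.spark.graphx._

// Assumption: `sc` is the spark-shell SparkContext; the data are made up.
val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"), (4L, "D")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 5), Edge(1L, 3L, 1), Edge(2L, 3L, 7), Edge(3L, 4L, 2)))
val graph = Graph(vertices, edges)

// Keep edges with weight >= 2 and drop vertex "D" (and therefore the edge into it).
val sub = graph.subgraph(
  epred = t => t.attr >= 2,
  vpred = (id, name) => name != "D")

val keptEdges = sub.edges.map(e => (e.srcId, e.dstId)).collect().sorted
val keptVertexCount = sub.vertices.count()
```

Here the edge (3, 4) passes `epred` but is still dropped, because its destination fails `vpred`.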
Feedback Vertex Set
• A feedback vertex set of a graph is a set of vertices whose removal
leaves a graph without cycles.
• Equivalently, a feedback vertex set contains at least one vertex of
every cycle in the graph.
• The feedback vertex set problem is NP-complete.
• Naive approach: enumerate each simple cycle.
Strongly Connected Components
[Figure: directed graph on vertices 1 to 10, split into strongly connected components]
Each strongly connected component can be processed in parallel,
because every cycle lies entirely inside one component.
SC1 – (1), SC2 – (5), SC3 – (8), SC4 – (9)
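The decomposition can be sketched with GraphX's built-in `stronglyConnectedComponents`. The toy graph below is an assumption for illustration (it is not the 10-vertex graph from the slide), and `sc` is assumed to be the spark-shell SparkContext.

```scala
import org.apache.spark.graphx._

// Assumption: `sc` is the spark-shell SparkContext. Toy graph with two cycles
// joined by a one-way edge: 1 -> 2 -> 3 -> 1,  3 -> 4,  4 -> 5 -> 4
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1),
  Edge(3L, 4L, 1),
  Edge(4L, 5L, 1), Edge(5L, 4L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Each vertex attribute becomes the lowest vertex id in its component.
val scc: Graph[VertexId, Int] = graph.stronglyConnectedComponents(numIter = 10)

// Group vertex ids by component id and bring the (small) result to the driver.
val components: Map[VertexId, Seq[VertexId]] = scc.vertices
  .map { case (id, componentId) => (componentId, id) }
  .groupByKey()
  .mapValues(_.toSeq.sorted)
  .collect()
  .toMap
```

The edge 3 -> 4 connects the two cycles in one direction only, so no cycle passes through it and the two components can be handled independently.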
FVS Algorithm
# Greedy recursive solution
FVS(G)
  sccGraph = scc(G)
  for each graph in sccGraph
    for each vertex
      remove the vertex and recompute scc
    vertexV = the vertex whose removal gives the maximum number of SCCs
              # which means it breaks the most cycles
    subGraph = subgraph(remove vertexV)
    FVS(subGraph)
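One greedy step of the pseudocode above can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes spark-shell (`sc`), uses a made-up 4-vertex graph, loops over candidate vertices on the driver, and introduces a hypothetical helper named `sccCount`. It is only practical for small graphs, since it recomputes the SCCs once per vertex.

```scala
import org.apache.spark.graphx._

// Assumption: `sc` is the spark-shell SparkContext.
// Toy graph: 1 -> 2 -> 3 -> 1 and 2 -> 4 -> 2 (vertex 2 lies on both cycles).
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1),
  Edge(2L, 4L, 1), Edge(4L, 2L, 1)))
val g: Graph[Int, Int] = Graph.fromEdges(edges, defaultValue = 0)

// Hypothetical helper: number of strongly connected components of a graph.
def sccCount(graph: Graph[Int, Int]): Long =
  graph.stronglyConnectedComponents(numIter = 10)
    .vertices.map { case (_, componentId) => componentId }
    .distinct().count()

// Greedy step: try removing each vertex and keep the one that yields the most SCCs.
val candidates: Array[VertexId] = g.vertices.keys.collect()
val (bestVertex, bestCount) = candidates
  .map { v => (v, sccCount(g.subgraph(vpred = (id, _) => id != v))) }
  .maxBy(_._2)
```

Working it out by hand for this graph: removing vertex 2 leaves three singleton components, while removing 1 or 3 leaves two components and removing 4 leaves one, so vertex 2 is the greedy choice. The full algorithm would add it to the result and recurse on the remaining subgraph.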
[Figure: worked example on a graph with vertices 1 to 4. A table with columns Graph, Iteration and SCC count, with rows Remove 2, Remove 1 and Remove 3.]
Feedback Vertex Set: 1, 5, 8, 9
FVS – Spark Implementation
• sccGraph = scc(G). sccGraph carries one extra property, sccID, on each vertex; extract it.
• For each graph in sccGraph, call the greedy recursive function.
• For each vertex, remove the vertex and recompute scc. Z is the list of SCC counts after removing each vertex.
• vertexV = the vertex that gives the maximum number of SCCs, which means it breaks the most cycles.
• subGraph = subgraph(removeV), then FVS(subGraph).
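The first two steps (extracting the sccID and obtaining one graph per component) might look like the sketch below. It is an assumption-laden illustration, not the code from the slides: `sc` is the spark-shell SparkContext, the graph is a toy example, and the component ids are collected to the driver, which presumes there are not too many of them.

```scala
import org.apache.spark.graphx._

// Assumption: `sc` is the spark-shell SparkContext.
// Toy graph with components {1, 2} and {3, 4}, linked one way by 2 -> 3.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 1L, 1),
  Edge(2L, 3L, 1),
  Edge(3L, 4L, 1), Edge(4L, 3L, 1)))
val g = Graph.fromEdges(edges, defaultValue = 0)

// sccGraph has the same structure as g, with the component id as vertex attribute.
val sccGraph: Graph[VertexId, Int] = g.stronglyConnectedComponents(numIter = 10)

// Distinct component ids, collected to the driver (assumes few components).
val sccIds: Array[VertexId] = sccGraph.vertices.map(_._2).distinct().collect().sorted

// One subgraph per component: keep only the vertices carrying that id.
val perComponent: Map[VertexId, Graph[VertexId, Int]] = sccIds.map { id =>
  id -> sccGraph.subgraph(vpred = (_, sccId) => sccId == id)
}.toMap

val edgesInFirst = perComponent(1L).edges.count()
val edgesInSecond = perComponent(3L).edges.count()
```

The cross-component edge 2 -> 3 disappears from both subgraphs, because its endpoints never satisfy the same vertex predicate.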
Pregel
• Graph DB
– Data Storage
– Data Mining
• Advantages
– Large-scale distributed computations
– Parallel-algorithms for graphs on multiple machines
– Fault tolerance and distributability
Oldest Follower
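What is the age of the oldest follower of each user?
// Assumes each vertex attribute has an Int field named age
val oldestFollowerAge: VertexRDD[Int] = graph.aggregateMessages[Int](
  // map: send the follower's (source's) age to the followed user (destination)
  ctx => ctx.sendToDst(ctx.srcAttr.age),
  // reduce: keep the larger of two ages
  (a, b) => math.max(a, b)
)
// aggregateMessages already returns a VertexRDD, so no .vertices call is needed
Note: mapReduceTriplets is now aggregateMessages.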
New in GraphX
In aggregateMessages:
• An EdgeContext exposes the triplet fields.
• It provides functions to explicitly send messages to the source and
destination vertices.
• The user can indicate which fields of the triplet are actually needed
(TripletFields), so that unused fields do not have to be shipped.
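A runnable variant of the oldest-follower query, sketched under a few assumptions: it is pasted into spark-shell (`sc`), the users and ages are invented, and vertex attributes are plain (name, age) tuples. It also passes `TripletFields.Src` to state that only the source attribute is read.

```scala
import org.apache.spark.graphx._

// Assumption: `sc` is the spark-shell SparkContext; users and ages are made up.
// Vertex attribute: (name, age).
val users = sc.parallelize(Seq(
  (1L, ("alice", 28)), (2L, ("bob", 45)), (3L, ("carol", 33))))
// An edge src -> dst means "src follows dst".
val follows = sc.parallelize(Seq(
  Edge(2L, 1L, 1), Edge(3L, 1L, 1), Edge(1L, 3L, 1)))
val graph = Graph(users, follows)

val oldestFollowerAge: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(ctx.srcAttr._2), // sendMsg: follower's age goes to the followed user
  (a, b) => math.max(a, b),             // mergeMsg: keep the maximum
  TripletFields.Src)                    // only the source attribute is read

val result: Map[VertexId, Int] = oldestFollowerAge.collect().toMap
```

Vertices that receive no message (here bob, who has no followers) are simply absent from the resulting VertexRDD.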
Theory – it’s Good
How it works – that’s awesome
Graphs are recursive data structures: the property of a vertex
depends on the properties of its neighbors, which in turn depend
on the properties of their neighbors.
Architecture
graph.pregel(initialMessage)(
  # message consumption (vertex program)
  (vertexID, currentProperty, message) → compute new property
  ,
  # message generation (runs on each triplet)
  triplet → .. code ..
    Iterator((vertexID, message))   # to send a message
    Iterator.empty                  # to send nothing
  ,
  # message aggregation (merges messages bound for the same vertex)
  (existing message set, new message) → new message set
)
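The three functions can be seen together in a small maximum-propagation sketch. The ring graph and its values are assumptions chosen for illustration (they are not the numbers from the slides), and `sc` is assumed to be the spark-shell SparkContext.

```scala
import org.apache.spark.graphx._

// Assumption: `sc` is the spark-shell SparkContext.
// A directed ring 1 -> 2 -> 3 -> 4 -> 1 with made-up integer values.
val vertices = sc.parallelize(Seq((1L, 10), (2L, 30), (3L, 20), (4L, 5)))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1), Edge(4L, 1L, 1)))
val graph = Graph(vertices, edges)

val maxGraph = graph.pregel(initialMsg = Int.MinValue)(
  // message consumption: keep the larger of the current value and the message
  (id, value, msg) => math.max(value, msg),
  // message generation: notify the destination only if the source knows a larger value
  triplet =>
    if (triplet.srcAttr > triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
    else Iterator.empty,
  // message aggregation: several incoming messages collapse to their maximum
  (a, b) => math.max(a, b))

val result: Map[VertexId, Int] = maxGraph.vertices.collect().toMap
```

The computation stops on its own once no triplet produces a message, which here happens after the value 30 has travelled around the ring.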
[Figure: Pregel supersteps on a graph with vertices 1 to 4. At each step a vertex aggregates its incoming messages with max, for example max[30, 10, 20], max[20] and max[10].]
Example - output
[Figure: final vertex values on the same 4-vertex graph]
Applications - GIS
• Algorithm: compute all vertices in a directed graph that can reach
a given vertex.
• Can be used for watershed delineation in Geographic Information
Systems.
In the example graph, the vertices that can reach E are A and B.
Algorithm
graph.pregel(Seq[vertex IDs])(
  # message consumption
  if vertex.state == 1
    vertex.state → 2
  else if vertex.state == 0
    if (vertex.adjacentVertices ∩ Seq[vertex IDs]) is not empty
      vertex.state → 2
  # message aggregator
  Seq[existing vertex IDs] ∪ Seq[new vertex ID]
)
  # message generation
  for each triplet
    if destinationVertex.state == 1
      message(sourceVertexID, Seq[destinationVertexID])
      message(destinationVertexID, Seq[destinationVertexID])
    else if sourceVertex.state == 1 and destinationVertex.state == 2
      message(sourceVertexID, Seq[destinationVertexID])
    else
      message(empty)
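The same question (which vertices can reach a given vertex) can be sketched with a simplified Pregel program. Note the swap: this is a two-state Boolean variant written for illustration, not the three-state algorithm from the slides. It assumes spark-shell (`sc`) and a made-up graph in which A = 1, B = 2, C = 3, D = 4 and E = 5.

```scala
import org.apache.spark.graphx._

// Assumption: `sc` is the spark-shell SparkContext.
// Made-up graph: A(1) -> B(2) -> E(5),  C(3) -> D(4),  E(5) -> D(4).
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 5L, 1), Edge(3L, 4L, 1), Edge(5L, 4L, 1)))
val target: VertexId = 5L

// Vertex attribute: true once the vertex is known to reach the target.
val start: Graph[Boolean, Int] =
  Graph.fromEdges(edges, defaultValue = false).mapVertices((id, _) => id == target)

val reached = start.pregel(initialMsg = false)(
  // message consumption: once marked, a vertex stays marked
  (id, canReach, msg) => canReach || msg,
  // message generation: a marked destination marks its unmarked source
  t => if (t.dstAttr && !t.srcAttr) Iterator((t.srcId, true)) else Iterator.empty,
  // message aggregation
  (a, b) => a || b)

val canReachTarget: Array[VertexId] = reached.vertices.filter(_._2).keys.collect().sorted
```

Messages travel against the edge direction, from destination to source, so the marking spreads upstream from the target. In this sketch the target itself is included in the result; C and D are not, since no path leads from them to E.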
References
• Fork our repository at https://github.com/anantasty/SparkAlgorithms
• Follow us at https://github.com/codeAshu and https://github.com/kaushikranjan
• GraphX programming guide: https://spark.apache.org/docs/latest/graphx-programming-guide.html
