Distributed Graph Analytics with Gradoop
inovex Meetup Munich
Let's talk about Graph Databases
July 2016
Martin Junghanns (@kc1s)
University of Leipzig – Database Research Group
Motivation
Extended Property Graph Model
Operators
Benchmark
Implementation
Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 3
Motivation
Graph = (Vertices, Edges)
„Graphs are everywhere“
Graph = (Users, Followers)
[Figure: follower graph over the users Alice, Bob, Eve, Dave, Carol, Mallory, Peggy and Trent]
Graph = (Users, Friendships)
[Figure: friendship graph over the users Alice, Bob, Eve, Dave, Carol, Mallory, Peggy and Trent]
„Graphs are heterogeneous“
Graph = (Users ∪ Bands, Friendships ∪ Likes)
[Figure: the users Alice, Bob, Dave, Carol, Mallory and Peggy together with the bands AC/DC and Metallica, connected by friendships and likes]
„Graphs can be analyzed“
Graph = (Users ∪ Bands, Friendships ∪ Likes)
[Figure: the same heterogeneous graph annotated with numeric analysis results (0.2, 0.28, 0.26, 0.33, 0.25, 0.26, 3.6, 2.82)]
Assuming a social network
    • Heterogeneous data
1. Determine subgraph
    • Apply graph transformation
2. Find communities
    • Handle collections of graphs
3. Filter communities
    • Aggregation, Selection
4. Find common subgraph
    • Apply dedicated algorithm
„And let's not forget …“
„… Graphs are large.“
Gradoop: „An open-source framework and research platform for
efficient, distributed and domain independent
management and analytics of heterogeneous graph data.“
[Figure: classes of graph systems positioned by data volume and problem complexity versus ease-of-use: Graph Databases, Graph Processing Systems, and Graph Dataflow Systems (Gelly)]
Distributed Graph Store (Apache HBase)
Apache Flink Operator Implementation
Apache Flink Distributed Operator Execution
Extended Property Graph Model (EPGM)
Graph Analytical Language (GrALa)
I/O
Distributed File System (Apache HDFS)
Extended Property Graph Model
(EPGM)
• Vertices and directed Edges
• Logical Graphs
• Identifiers
• Type Labels
• Properties
[Figure: example graph with vertices 1-5, edges 1-5 and two overlapping logical graphs]
Logical graphs: 1|Community|interest:Heavy Metal and 2|Community|interest:Hard Rock
Vertices: Person {name: Alice, born: 1984}, Band {name: Metallica, founded: 1981}, Person {name: Bob}, Person {name: Eve}, Band {name: AC/DC, founded: 1973}
Edges: likes {since: 2014}, likes {since: 2013}, likes {since: 2015}, knows, likes {since: 2014}
Operators
                | Unary                                                                    | Binary
LogicalGraph    | Aggregation, Pattern Matching, Transformation, Grouping, Subgraph, Call  | Combination, Overlap, Exclusion, Equality
GraphCollection | Limit, Selection, Distinct, Sort, Apply, Reduce, Call                    | Equality, Union, Intersection, Difference
Algorithms: Flink Gelly Library, BTG Extraction, Frequent Subgraphs, Adaptive Partitioning
Basic Binary Operators: Combination, Overlap, Exclusion
[Figure: two logical graphs over the vertices 1-5 and the graphs produced from them by Combination, Overlap and Exclusion]
LogicalGraph graph3 = graph1.combine(graph2);
LogicalGraph graph4 = graph1.overlap(graph2);
LogicalGraph graph5 = graph1.exclude(graph2);
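The three binary operators can be read as set operations on vertices and edges. The following is a minimal sketch in plain Java, not Gradoop code: the class `SketchGraph` and its methods are hypothetical, and the rule that exclusion keeps only edges between the remaining vertices is an assumption about the intended semantics.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a logical graph; not the Gradoop LogicalGraph class. */
final class SketchGraph {
  final Set<Integer> vertices = new HashSet<>();
  /** Edge id mapped to {sourceId, targetId}. */
  final Map<Integer, int[]> edges = new HashMap<>();

  SketchGraph vertex(int id) {
    vertices.add(id);
    return this;
  }

  SketchGraph edge(int id, int source, int target) {
    edges.put(id, new int[] {source, target});
    return this;
  }

  /** Combination: all vertices and edges of both graphs. */
  SketchGraph combine(SketchGraph other) {
    SketchGraph result = new SketchGraph();
    result.vertices.addAll(vertices);
    result.vertices.addAll(other.vertices);
    result.edges.putAll(edges);
    result.edges.putAll(other.edges);
    return result;
  }

  /** Overlap: only vertices and edges contained in both graphs. */
  SketchGraph overlap(SketchGraph other) {
    SketchGraph result = new SketchGraph();
    result.vertices.addAll(vertices);
    result.vertices.retainAll(other.vertices);
    result.edges.putAll(edges);
    result.edges.keySet().retainAll(other.edges.keySet());
    return result;
  }

  /** Exclusion: vertices found only in this graph, plus the edges between them (assumed rule). */
  SketchGraph exclude(SketchGraph other) {
    SketchGraph result = new SketchGraph();
    result.vertices.addAll(vertices);
    result.vertices.removeAll(other.vertices);
    edges.forEach((id, ends) -> {
      if (result.vertices.contains(ends[0]) && result.vertices.contains(ends[1])) {
        result.edges.put(id, ends);
      }
    });
    return result;
  }
}
```

With graph 1 = vertices {1, 2, 3} and graph 2 = vertices {3, 4, 5}, the overlap contains only vertex 3, and the exclusion of graph 2 from graph 1 keeps vertices 1 and 2.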
Aggregation
[Figure: a UDF computes an aggregate value for graph 3 and stores it as a graph property: 3 | vertexCount: 5]
graph3 = graph3.aggregate("vertexCount", new AggregateFunction<Long>() {
  public DataSet<Long> execute(LogicalGraph g) {
    return Count.count(g.getVertices());
  }
});
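The idea behind aggregation, a user-defined function whose result is attached to the graph as a property, can be sketched without Flink. This is an illustration only: `AggGraph` and its fields are hypothetical names, and a plain `java.util.function.Function` stands in for the `AggregateFunction` of the slide.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical in-memory graph that only stores vertex labels and graph properties. */
final class AggGraph {
  final List<String> vertexLabels;
  final Map<String, Object> properties = new HashMap<>();

  AggGraph(List<String> vertexLabels) {
    this.vertexLabels = vertexLabels;
  }

  /** Returns a copy of this graph with the UDF result stored under propertyKey. */
  <T> AggGraph aggregate(String propertyKey, Function<AggGraph, T> udf) {
    AggGraph result = new AggGraph(vertexLabels);
    result.properties.putAll(properties);
    result.properties.put(propertyKey, udf.apply(this));
    return result;
  }

  public static void main(String[] args) {
    AggGraph graph3 = new AggGraph(Arrays.asList("Person", "Band", "Person", "Band", "Person"));
    AggGraph aggregated = graph3.aggregate("vertexCount", g -> (long) g.vertexLabels.size());
    System.out.println(aggregated.properties.get("vertexCount"));
  }
}
```

The input graph is left unchanged; the operator returns a new graph that carries the additional property.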
Transformation
[Figure: a UDF rewrites graph 3: the graph head "3 | vertexCount: 5" becomes "3 | Community | vCount: 5", and the vertex property key name becomes f_name]
graph3 = graph3.transformEdges(new TransformationFunction<Edge>() {
  public Edge execute(Edge e) {
    e.setLabel(e.getLabel().equals("orange") ? "red" : e.getLabel());
    return e;
  }
});
Subgraph
[Figure: UDFs on the vertices and edges of graph 3 select the elements that form the new graph 4]
LogicalGraph graph4 = graph3.subgraph(
  new FilterFunction<Vertex>() {
    public boolean filter(Vertex v) { return v.getLabel().equals("green"); }},
  new FilterFunction<Edge>() {
    public boolean filter(Edge e) { return e.getLabel().equals("orange"); }});
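A plain-Java sketch of the subgraph operator follows. It assumes that an edge is kept only if it passes the edge filter and both of its endpoints pass the vertex filter; the class `SubgraphSketch` and its nested types are hypothetical and stand in for the Gradoop and Flink classes used on the slide.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

final class SubgraphSketch {
  static final class Vertex {
    final int id;
    final String label;
    Vertex(int id, String label) { this.id = id; this.label = label; }
  }

  static final class Edge {
    final int id;
    final String label;
    final int source;
    final int target;
    Edge(int id, String label, int source, int target) {
      this.id = id; this.label = label; this.source = source; this.target = target;
    }
  }

  static final class Result {
    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();
  }

  /** Keeps matching vertices, then matching edges whose endpoints both survived. */
  static Result subgraph(List<Vertex> vertices, List<Edge> edges,
                         Predicate<Vertex> vertexFilter, Predicate<Edge> edgeFilter) {
    Result result = new Result();
    Set<Integer> keptIds = new HashSet<>();
    for (Vertex v : vertices) {
      if (vertexFilter.test(v)) {
        result.vertices.add(v);
        keptIds.add(v.id);
      }
    }
    for (Edge e : edges) {
      if (edgeFilter.test(e) && keptIds.contains(e.source) && keptIds.contains(e.target)) {
        result.edges.add(e);
      }
    }
    return result;
  }
}
```

Checking the endpoints avoids dangling edges in the result, which is why the vertex filter runs first.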
Pattern Matching
[Figure: a pattern is matched against graph 3; the result is a graph collection with one graph per match (graphs 4 and 5)]
GraphCollection collection = graph3.match("(:Green)-[:orange]->(:Orange)");
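For the single-edge pattern used above, matching reduces to a label check on each edge and its two endpoints. The sketch below shows only that special case in plain Java; `EdgePatternSketch` is a hypothetical name, and general pattern matching (larger patterns, property predicates) is not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class EdgePatternSketch {
  static final class Edge {
    final int id;
    final String label;
    final int source;
    final int target;
    Edge(int id, String label, int source, int target) {
      this.id = id; this.label = label; this.source = source; this.target = target;
    }
  }

  /** Returns {sourceId, edgeId, targetId} for every edge matching (:src)-[:edge]->(:tgt). */
  static List<int[]> match(Map<Integer, String> vertexLabels, List<Edge> edges,
                           String sourceLabel, String edgeLabel, String targetLabel) {
    List<int[]> matches = new ArrayList<>();
    for (Edge e : edges) {
      if (edgeLabel.equals(e.label)
          && sourceLabel.equals(vertexLabels.get(e.source))
          && targetLabel.equals(vertexLabels.get(e.target))) {
        matches.add(new int[] {e.source, e.id, e.target});
      }
    }
    return matches;
  }
}
```

Each returned triple corresponds to one graph of the resulting collection.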
Grouping
[Figure: graph 3 is grouped by keys plus aggregates; in the resulting graph 4 the super vertices 6 and 7 carry count:2 and count:3, and the super edges carry max(a) values computed from the edge property a]
LogicalGraph grouped = graph3.groupBy()
  .useVertexLabel()
  .useEdgeLabel()
  .addVertexAggregate(new CountAggregator())
  .addEdgeAggregate(new MaxAggregator("a"));
Apply (e.g. Aggregation)
[Figure: a unary graph operator is applied to every graph of a collection, yielding 1 | vertexCount: 5 and 2 | vertexCount: 4]
collection = collection.apply(new Aggregation<>("vertexCount", new VertexCount()));
Selection
[Figure: a UDF with the predicate vertexCount > 4 keeps graph 1 (vertexCount: 5) and drops graph 2 (vertexCount: 4)]
GraphCollection filtered = collection.select(new FilterFunction<GraphHead>() {
  public boolean filter(GraphHead g) {
    return g.getPropertyValue("vertexCount").getLong() > 4L;
  }
});
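Selection is a filter over the graph heads of a collection. A minimal sketch in plain Java, assuming the property vertexCount was set beforehand (for example by an aggregation); `SelectionSketch` and its nested `GraphHead` are hypothetical and unrelated to the Gradoop classes of the same name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class SelectionSketch {
  /** Hypothetical graph head: an identifier plus graph-level properties. */
  static final class GraphHead {
    final int id;
    final Map<String, Object> properties = new HashMap<>();
    GraphHead(int id, long vertexCount) {
      this.id = id;
      properties.put("vertexCount", vertexCount);
    }
  }

  /** Keeps the graphs whose head satisfies the predicate. */
  static List<GraphHead> select(List<GraphHead> collection, Predicate<GraphHead> predicate) {
    return collection.stream().filter(predicate).collect(Collectors.toList());
  }
}
```

Called with the predicate `g -> (Long) g.properties.get("vertexCount") > 4L` on the two graphs of the figure, only graph 1 remains.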
Call (e.g. Clustering)
[Figure: an algorithm takes graph 1 as input and returns a collection of graphs, here graphs 2 and 3]
GraphCollection clustering = graph.callForCollection(new ClusteringAlgorithm());
Call (e.g. Page Rank)
[Figure: an algorithm takes graph 1 as input and returns graph 2, whose vertices carry a rank property (values between 0.11 and 1.58)]
LogicalGraph pageRankGraph = graph.callForGraph(new PageRankAlgorithm());
Implementation
Apache Flink Gradoop on Flink
„Streaming Dataflow Engine that provides data distribution, communication and fault
tolerance for distributed computations over data streams.“
https://flink.apache.org/
[Figure: Apache Flink stack]
• APIs: DataSet (Batch), DataStream (Stream)
• Libraries and integrations: Hadoop MR, Table, Gelly, FlinkML, Zeppelin, Cascading, MRQL, Dataflow, Storm, SAMOA, GRADOOP
• Runtime: Streaming Dataflow Runtime
• Deployment: Local, Cluster (e.g. YARN), Cloud (e.g. EC2)
• Data Storage (e.g. Files, HDFS, S3, JDBC, HBase, Kafka, …)
• DataSet := Distributed Collection of Data Objects
• Transformation := Operation on DataSets (Higher-order function)
• Flink Program := Composition of Transformations
[Figure: a Flink program drawn as a dataflow in which transformations consume DataSets and produce new DataSets]
Hadoop-like Transformations
• map
• flatMap
• mapPartition
• reduce
• reduceGroup
• coGroup
Special Flink Operations
• iterate
• iterateDelta
SQL-like Transformations
• filter
• project
• cross
• union
• distinct
• first-N (limit)
• groupBy
• aggregate
• join
• leftOuterJoin
• rightOuterJoin
• fullOuterJoin
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

// LineSplitter is a user-defined FlatMapFunction<String, Tuple2<String, Integer>> (not shown).
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> text = env.fromElements( // or env.readTextFile("hdfs://…")
  "He who controls the past controls the future.",
  "He who controls the present controls the past.");

DataSet<Tuple2<String, Integer>> wordCounts = text
  .flatMap(new LineSplitter()) // splits the line and outputs (word, 1) tuples
  .groupBy(0)
  .sum(1);

wordCounts.print(); // trigger execution
flatMap
  input:  "He who controls the past controls the future."
          "He who controls the present controls the past."
  output: (He,1), (who,1), (controls,1), (the,1), (past,1), // ...
groupBy(0)
  output: [(He,1),(He,1)], [(who,1),(who,1)], [(future,1)], [(past,1),(past,1)], [(present,1)], // ...
sum(1)
  output: (He,2), (who,2), (future,1), (past,2), (present,1), // ...
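The same flatMap, groupBy and sum steps can be tried locally with Java streams. This is a single-machine analogy, not Flink code: the class name `WordCountSketch` and the tokenization on non-word characters are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class WordCountSketch {
  /** Splits lines into words (flatMap), groups equal words (groupBy) and sums a 1 per word (sum). */
  static Map<String, Integer> count(List<String> lines) {
    return lines.stream()
        .flatMap(line -> Arrays.stream(line.split("\\W+")))
        .filter(word -> !word.isEmpty())
        .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.summingInt(word -> 1)));
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = count(Arrays.asList(
        "He who controls the past controls the future.",
        "He who controls the present controls the past."));
    counts.forEach((word, n) -> System.out.println("(" + word + "," + n + ")"));
  }
}
```

In Flink the grouping step is where data moves between workers; in this local version it is just a map lookup.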
[Figure: parallel execution of the word count: Source, three parallel flatMap instances, groupBy(0), sum(1), Sink; after groupBy(0) each parallel instance handles a disjoint set of words, e.g. one sums (He,2) and (who,2), another (future,1) and (past,2), a third (present,1)]
EPGM Graph Representation
• EPGMGraphHead (POJO): Id, Label, Properties; stored in DataSet<EPGMGraphHead>
• EPGMVertex (POJO): Id, Label, Properties, Graphs; stored in DataSet<EPGMVertex>
• EPGMEdge (POJO): Id, Label, Properties, SourceId, TargetId, Graphs; stored in DataSet<EPGMEdge>
Field types, shown for EPGMVertex:
• Id: GradoopId := UUID (128-bit)
• Label: String
• Properties: PropertyList := List<Property>, Property := (String, PropertyValue), PropertyValue := byte[]
• Graphs: GradoopIdSet := Set<GradoopId>
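The representation above can be sketched as plain Java classes. This is a simplified illustration, not the Gradoop classes: all names in `EpgmSketch` are hypothetical, property values are kept as `Object` instead of `byte[]`, and a `Map` replaces the property list.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.stream.Collectors;

final class EpgmSketch {
  /** Common part of all elements: identifier, type label, properties. */
  static class Element {
    final UUID id = UUID.randomUUID();
    final String label;
    final Map<String, Object> properties = new HashMap<>();
    Element(String label) { this.label = label; }
  }

  static final class GraphHead extends Element {
    GraphHead(String label) { super(label); }
  }

  /** Vertices and edges additionally store the logical graphs they belong to. */
  static class GraphElement extends Element {
    final Set<UUID> graphIds = new HashSet<>();
    GraphElement(String label) { super(label); }
  }

  static final class Vertex extends GraphElement {
    Vertex(String label) { super(label); }
  }

  static final class Edge extends GraphElement {
    final UUID sourceId;
    final UUID targetId;
    Edge(String label, Vertex source, Vertex target) {
      super(label);
      this.sourceId = source.id;
      this.targetId = target.id;
    }
  }

  /** Returns the elements that are members of the given logical graph. */
  static <T extends GraphElement> List<T> membersOf(GraphHead graph, List<T> elements) {
    return elements.stream()
        .filter(e -> e.graphIds.contains(graph.id))
        .collect(Collectors.toList());
  }
}
```

Because membership is stored on each element, one vertex can belong to several logical graphs without being duplicated.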
DataSet<EPGMGraphHead>
Id | Label     | Properties
1  | Community | {interest:Heavy Metal}
2  | Community | {interest:Hard Rock}

DataSet<EPGMVertex>
Id | Label  | Properties                    | Graphs
1  | Person | {name:Alice, born:1984}       | {1}
2  | Band   | {name:Metallica,founded:1981} | {1}
3  | Person | {name:Bob}                    | {1,2}
4  | Band   | {name:AC/DC,founded:1973}     | {2}
5  | Person | {name:Eve}                    | {2}

DataSet<EPGMEdge>
Id | Label | Source | Target | Properties   | Graphs
1  | likes | 1      | 2      | {since:2014} | {1}
2  | likes | 3      | 2      | {since:2013} | {1}
3  | likes | 3      | 4      | {since:2015} | {2}
4  | knows | 3      | 5      | {}           | {2}
5  | likes | 5      | 4      | {since:2014} | {2}
LogicalGraph grouped = graph1.combine(graph2).groupBy()
.useVertexLabel()
.useEdgeLabel()
.addVertexAggregate(new CountAggregator())
.addEdgeAggregate(new CountAggregator());
[Figure: result of grouping the combined example graph: super vertex 6 (Person, count: 3) and super vertex 7 (Band, count: 2), connected by the super edges likes (count: 4) and knows (count: 1)]
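The counts of this grouping example can be reproduced with a small plain-Java sketch. It groups vertices by label and edges by (source label, edge label, target label) and counts group members; `GroupingSketch` and the string key format are hypothetical and only serve the illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class GroupingSketch {
  static final class Edge {
    final String label;
    final int source;
    final int target;
    Edge(String label, int source, int target) {
      this.label = label; this.source = source; this.target = target;
    }
  }

  /** One super vertex per vertex label, with the number of grouped vertices. */
  static Map<String, Long> superVertices(Map<Integer, String> vertexLabels) {
    return vertexLabels.values().stream()
        .collect(Collectors.groupingBy(label -> label, TreeMap::new, Collectors.counting()));
  }

  /** One super edge per (source label, edge label, target label), with the number of grouped edges. */
  static Map<String, Long> superEdges(Map<Integer, String> vertexLabels, List<Edge> edges) {
    return edges.stream().collect(Collectors.groupingBy(
        e -> vertexLabels.get(e.source) + "-[" + e.label + "]->" + vertexLabels.get(e.target),
        TreeMap::new,
        Collectors.counting()));
  }
}
```

For the five vertices and five edges of the example this yields Person: 3, Band: 2, a likes super edge from Person to Band with count 4, and a knows super edge from Person to Person with count 1.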
Grouping as a Flink dataflow (* requires worker communication)
Vertices (V):
• Map: Extract attributes
  output: (1,[Person],[]), (2,[Band],[]), (3,[Person],[]), (4,[Band],[]), (5,[Person],[])
• GroupBy(1) + GroupReduce*: Assign vertices to groups, compute aggregates, create super vertex tuples, forward updated group members
  output: (-,6,[Person],[3]), (1,6,[],[]), (-,7,[Band],[2]), (2,7,[],[]), (3,6,[],[]), (4,7,[],[]), (5,6,[],[])
• Filter + Map: Extract super vertex tuples, build super vertices
  output: v6, v7
• Filter + Map: Extract group members, reduce memory footprint
  output: (1,6), (2,7), (3,6), (4,7), (5,6)
Edges (E):
• Map: Extract attributes
  output: (1,1,2,[likes],[]), (2,3,2,[likes],[]), (3,3,4,[likes],[]), (4,3,5,[knows],[]), (5,5,4,[likes],[])
• Join*: Replace Source/TargetId with corresponding super vertex id
  output: (1,6,7,[likes],[]), (2,6,7,[likes],[]), (3,6,7,[likes],[]), (4,6,6,[knows],[]), (5,6,7,[likes],[])
• GroupBy(1,2,3) + GC + GR* + Map: Assign edges to groups, compute aggregates, build super edges
  output: e6, e7
class LogicalGraph<G extends EPGMGraphHead,
V extends EPGMVertex,
E extends EPGMEdge> {
fromCollections(...) : LogicalGraph<G, V, E>
fromDataSets(...) : LogicalGraph<G, V, E>
fromGellyGraph(...) : LogicalGraph<G, V, E>
getGraphHead() : DataSet<G>
getVertices() : DataSet<V>
getEdges() : DataSet<E>
aggregate(...) : LogicalGraph<G, V, E>
match(...) : GraphCollection<G, V, E>
groupBy(...) : LogicalGraph<G, V, E>
subgraph(...) : LogicalGraph<G, V, E>
combine(...) : LogicalGraph<G, V, E>
// ...
}
class GraphCollection<G extends EPGMGraphHead,
V extends EPGMVertex,
E extends EPGMEdge > {
fromCollections(...) : GraphCollection<G, V, E>
fromDataSets(...) : GraphCollection<G, V, E>
getGraphHeads() : DataSet<G>
getVertices() : DataSet<V>
getEdges() : DataSet<E>
select(...) : GraphCollection<G, V, E>
distinct( ) : GraphCollection<G, V, E>
sortBy(...) : GraphCollection<G, V, E>
union(...) : GraphCollection<G, V, E>
difference(...) : GraphCollection<G, V, E>
// ...
}
EPGM API (Operators)
interface DataSource<G extends EPGMGraphHead,
V extends EPGMVertex,
E extends EPGMEdge> {
getLogicalGraph(...) : LogicalGraph<G, V, E>
getGraphCollection(...) : GraphCollection<G, V, E>
}
interface DataSink<G extends EPGMGraphHead,
V extends EPGMVertex,
E extends EPGMEdge > {
write(LogicalGraph<G, V, E>) : void
write(GraphCollection<G, V, E>) : void
}
class GraphDataSource<...> implements DataSource<...> { }
class HBaseDataSource<...> implements DataSource<...> { }
class JSONDataSource<...> implements DataSource<...> { }
class TLFDataSource<...> implements DataSource<...> { }
class HBaseDataSink<...> implements DataSink<...> { }
class JSONDataSink<...> implements DataSink<...> { }
class TLFDataSink<...> implements DataSink<...> { }
EPGM API (I/O)
Benchmark
1. Extract the subgraph containing only Persons and knows relations
2. Transform Persons to keep only the necessary information
3. Find communities using Label Propagation
4. Aggregate the vertex count for each community
5. Select communities with more than 50K users
6. Combine the large communities into a single graph
7. Group the graph by the Persons' location and gender
8. Aggregate the vertex and edge count of the grouped graph
http://ldbcouncil.org/
https://git.io/vgozj
Dataset            | # Vertices | # Edges        | Disk size
Graphalytics.1     | 61,613     | 2,026,082      | 570 MB
Graphalytics.10    | 260,613    | 16,600,778     | 4.5 GB
Graphalytics.100   | 1,695,613  | 147,437,275    | 40.2 GB
Graphalytics.1000  | 12,775,613 | 1,363,747,260  | 372 GB
Graphalytics.10000 | 90,025,613 | 10,872,109,028 | 2.9 TB

• 16x Intel(R) Xeon(R) 2.50GHz 6 (12)
• 16x 48 GB RAM
• 1 Gigabit Ethernet
• Hadoop 2.6.0
• Flink 1.0-SNAPSHOT
• slots (per worker) 12
• jobmanager.heap.mb 2048
• taskmanager.heap.mb 40960
[Figure: Runtime [s] (axis 0 to 1200) versus number of workers (1, 2, 4, 8, 16) for Graphalytics.100]
[Figure: Speedup versus number of workers (1, 2, 4, 8, 16) for Graphalytics.100, compared with linear speedup]
[Figure: Runtime [s] per dataset (axis ticks 1, 10, 100, 1000, 10000)]
Summary
Release History
• 0.0.1 First Prototype (May 2015)
– Hadoop MapReduce and Giraph for operator implementations
– Too much complexity
– Performance loss through serialization in HDFS/HBase
• 0.0.2 Using Flink as execution layer (June 2015)
– Basic operators
• 0.1 December 2015
– System-side identifiers (UUID)
– Improved property handling
– More operator implementations (e.g., Equality, Bool operators)
– Code refactoring
• 0.2-SNAPSHOT August 2016
– Graph Pattern Matching
– Frequent Subgraph Mining
– Memory optimization (96-bit ID, Dictionary Encoding, …)
– Refactoring
Contributions welcome!
• Code
– I/O Formats (GraphML, DOT, …)
– Operators and Algorithms
– Tuning (Memory consumption, serialization, …)
– API improvements
• Use cases and data
– Business Intelligence
– Fraud Detection
– Pattern Mining
– …
Summary
• Extended Property Graph Model
– Schema flexible: Type Labels and Properties
– Logical Graphs / Graph Collections
• Graph and Collection Operators
– Combination to analytical workflows
• Implemented on Apache Flink
– Built-in scalability
– Combine with other libraries
www.gradoop.com
[1] Junghanns, M.; Petermann, A.; Teichmann, N.; Gomez, K.; Rahm, E.,
„Analyzing Extended Property Graphs with Apache Flink“,
Int. Workshop on Network Data Analytics (NDA), SIGMOD 2016.
[2] Petermann, A.; Junghanns, M.,
„Scalable Business Intelligence with Graph Collections“,
it – Special Issue on Big Data Analytics, 2016.
[3] Petermann, A.; Junghanns, M.; Müller, M.; Rahm, E.,
„Graph-based Data Integration and Business Intelligence with BIIIG“,
Proc. VLDB Conf. (Demo), 2014.

More Related Content

Viewers also liked

Gut vernetzt: Skalierbares Graph Mining für Business Intelligence
Gut vernetzt: Skalierbares Graph Mining für Business IntelligenceGut vernetzt: Skalierbares Graph Mining für Business Intelligence
Gut vernetzt: Skalierbares Graph Mining für Business IntelligenceMartin Junghanns
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkFlink Forward
 
Distributed processing of large graphs in python
Distributed processing of large graphs in pythonDistributed processing of large graphs in python
Distributed processing of large graphs in pythonJose Quesada (hiring)
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent EngineLeveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent EngineTrey Grainger
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph ProcessingVasia Kalavri
 
Batch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkBatch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkVasia Kalavri
 
Data Pipelines with Apache Kafka
Data Pipelines with Apache KafkaData Pipelines with Apache Kafka
Data Pipelines with Apache KafkaBen Stopford
 
Staying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning WorldStaying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning WorldXavier Amatriain
 

Viewers also liked (8)

Gut vernetzt: Skalierbares Graph Mining für Business Intelligence
Gut vernetzt: Skalierbares Graph Mining für Business IntelligenceGut vernetzt: Skalierbares Graph Mining für Business Intelligence
Gut vernetzt: Skalierbares Graph Mining für Business Intelligence
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
 
Distributed processing of large graphs in python
Distributed processing of large graphs in pythonDistributed processing of large graphs in python
Distributed processing of large graphs in python
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent EngineLeveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph Processing
 
Batch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkBatch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache Flink
 
Data Pipelines with Apache Kafka
Data Pipelines with Apache KafkaData Pipelines with Apache Kafka
Data Pipelines with Apache Kafka
 
Staying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning WorldStaying Shallow & Lean in a Deep Learning World
Staying Shallow & Lean in a Deep Learning World
 

Distributed Graph Analytics with Gradoop

  • 1. Distributed Graph Analytics with Gradoop. inovex Meetup Munich, "Let's talk about Graph Databases", July 2016. Martin Junghanns (@kc1s), University of Leipzig – Database Research Group
  • 2. Motivation, Extended Property Graph Model, Operators, Benchmark, Implementation
  • 3. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 (slide footer; navigation: Motivation, EPGM, Operators, Benchmark, Implementation). Section: Motivation
  • 4. Motivation: Graph = (Vertices, Edges). "Graphs are everywhere"
  • 5. Motivation: Graph = (Users, Followers), with users Alice, Bob, Eve, Dave, Carol, Mallory, Peggy, Trent. "Graphs are everywhere"
  • 6. Motivation: Graph = (Users, Friendships), with users Alice, Bob, Eve, Dave, Carol, Mallory, Peggy, Trent. "Graphs are everywhere"
  • 7. Motivation: Graph = (Users ∪ Bands, Friendships ∪ Likes), with vertices Alice, Bob, AC/DC, Dave, Carol, Mallory, Peggy, Metallica. "Graphs are heterogeneous"
  • 8. Motivation: Graph = (Users ∪ Bands, Friendships ∪ Likes), with vertices Alice, Bob, AC/DC, Dave, Carol, Mallory, Peggy, Metallica. "Graphs can be analyzed"
  • 9. Motivation: Graph = (Users ∪ Bands, Friendships ∪ Likes), with vertices Alice, Bob, AC/DC, Dave, Carol, Mallory, Peggy, Metallica; values shown in the figure: 0.2, 0.28, 0.26, 0.33, 0.25, 0.26, 3.6, 2.82. "Graphs can be analyzed"
  • 10. Motivation: Assuming a social network
  • 11. Motivation: Assuming a social network: 1. Determine subgraph
  • 12. Motivation: Assuming a social network: 1. Determine subgraph
  • 13. Motivation: Assuming a social network: 1. Determine subgraph; 2. Find communities
  • 14. Motivation: Assuming a social network: 1. Determine subgraph; 2. Find communities
  • 15. Motivation: Assuming a social network: 1. Determine subgraph; 2. Find communities; 3. Filter communities
  • 16. Motivation: Assuming a social network: 1. Determine subgraph; 2. Find communities; 3. Filter communities
  • 17. Motivation: Assuming a social network: 1. Determine subgraph; 2. Find communities; 3. Filter communities; 4. Find common subgraph
  • 18. Motivation: Assuming a social network: 1. Determine subgraph; 2. Find communities; 3. Filter communities; 4. Find common subgraph
  • 19. Motivation: Assuming a social network (heterogeneous data): 1. Determine subgraph (apply graph transformation); 2. Find communities (handle collections of graphs); 3. Filter communities (aggregation, selection); 4. Find common subgraph (apply dedicated algorithm)
  • 20-23. Same text as slide 19
  • 24. Motivation: "And let's not forget ..."
  • 25. Motivation: "... Graphs are large."
  • 26. Motivation: "An open-source framework and research platform for efficient, distributed and domain independent management and analytics of heterogeneous graph data."
  • 27. Motivation: [Figure: graph systems positioned by "Data Volume and Problem Complexity" and "Ease-of-use"; labels: Graph Processing Systems, Graph Databases, Graph Dataflow Systems, Gelly]
  • 28. Motivation: [Figure: architecture components: Distributed Graph Store (Apache HBase); Apache Flink Operator Implementation; Apache Flink Distributed Operator Execution; Extended Property Graph Model (EPGM); Graph Analytical Language (GrALa); I/O; Distributed File System (Apache HDFS)]
  • 29. Extended Property Graph Model (EPGM)
  • 30. EPGM: Vertices and directed Edges
  • 31. EPGM: Vertices and directed Edges; Logical Graphs
  • 32. EPGM: Vertices and directed Edges; Logical Graphs; Identifiers (figure: vertices 1-5, edges 1-5, logical graphs 1-2)
  • 33. EPGM: Vertices and directed Edges; Logical Graphs; Identifiers; Type Labels (figure: vertex labels Person, Band, Person, Person, Band; edge labels likes, likes, likes, knows, likes; graphs 1|Community, 2|Community)
  • 34. EPGM: Vertices and directed Edges; Logical Graphs; Identifiers; Type Labels; Properties (figure: Person {name: Alice, born: 1984}; Band {name: Metallica, founded: 1981}; Person {name: Bob}; Person {name: Eve}; Band {name: AC/DC, founded: 1973}; edges likes {since: 2014}, likes {since: 2013}, likes {since: 2015}, knows, likes {since: 2014}; graphs 1|Community|interest: Heavy Metal, 2|Community|interest: Hard Rock)
  • 35. Operators
  • 36. Operators: overview
      LogicalGraph, unary: Aggregation, Pattern Matching, Transformation, Grouping, Subgraph, Call
      LogicalGraph, binary: Combination, Overlap, Exclusion, Equality
      GraphCollection, unary: Limit, Selection, Distinct, Sort, Apply, Reduce, Call
      GraphCollection, binary: Union, Intersection, Difference, Equality
      Algorithms: Flink Gelly Library, BTG Extraction, Frequent Subgraphs, Adaptive Partitioning
  • 37. Operators: Basic Binary Operators (Combination, Overlap, Exclusion)
      LogicalGraph graph3 = graph1.combine(graph2);
      LogicalGraph graph4 = graph1.overlap(graph2);
      LogicalGraph graph5 = graph1.exclude(graph2);
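
The three binary operators can be read as set operations on vertex and edge sets. The following is a minimal, library-free sketch of that reading, not Gradoop's implementation: `SimpleGraph` and its nested `Edge` class are hypothetical stand-ins, vertices are plain integers, and the exclusion semantics (keep vertices found only in the first graph, plus the first graph's edges between them) is an assumption about how the operator is defined.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stand-in for a logical graph: vertex ids plus directed edges. */
final class SimpleGraph {
    /** Directed edge between two vertex ids. */
    static final class Edge {
        final int source;
        final int target;
        Edge(int source, int target) { this.source = source; this.target = target; }
        @Override public boolean equals(Object o) {
            return o instanceof Edge && ((Edge) o).source == source && ((Edge) o).target == target;
        }
        @Override public int hashCode() { return 31 * source + target; }
    }

    final Set<Integer> vertices;
    final Set<Edge> edges;

    SimpleGraph(Set<Integer> vertices, Set<Edge> edges) {
        this.vertices = vertices;
        this.edges = edges;
    }

    /** Combination: union of both vertex sets and both edge sets. */
    SimpleGraph combine(SimpleGraph other) {
        Set<Integer> v = new HashSet<>(vertices);
        v.addAll(other.vertices);
        Set<Edge> e = new HashSet<>(edges);
        e.addAll(other.edges);
        return new SimpleGraph(v, e);
    }

    /** Overlap: vertices and edges contained in both graphs. */
    SimpleGraph overlap(SimpleGraph other) {
        Set<Integer> v = new HashSet<>(vertices);
        v.retainAll(other.vertices);
        Set<Edge> e = new HashSet<>(edges);
        e.retainAll(other.edges);
        return new SimpleGraph(v, e);
    }

    /** Exclusion (assumed semantics): vertices only in this graph, plus this graph's edges between them. */
    SimpleGraph exclude(SimpleGraph other) {
        Set<Integer> v = new HashSet<>(vertices);
        v.removeAll(other.vertices);
        Set<Edge> e = new HashSet<>();
        for (Edge edge : edges) {
            if (v.contains(edge.source) && v.contains(edge.target)) {
                e.add(edge);
            }
        }
        return new SimpleGraph(v, e);
    }

    public static void main(String[] args) {
        SimpleGraph g1 = new SimpleGraph(
            new HashSet<>(Arrays.asList(1, 2, 3)),
            new HashSet<>(Arrays.asList(new Edge(1, 2), new Edge(2, 3))));
        SimpleGraph g2 = new SimpleGraph(
            new HashSet<>(Arrays.asList(3, 4, 5)),
            new HashSet<>(Arrays.asList(new Edge(3, 4), new Edge(4, 5))));
        System.out.println("combine vertices: " + g1.combine(g2).vertices);
        System.out.println("overlap vertices: " + g1.overlap(g2).vertices);
        System.out.println("exclude vertices: " + g1.exclude(g2).vertices);
    }
}
```

The example graphs in `main` are made up for illustration and do not reproduce the figure on the slide.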
  • 38. Operators: Aggregation with a user-defined function (UDF) [Figure: graph 3 annotated with "3 | vertexCount: 5"]
      graph3 = graph3.aggregate("vertexCount", new AggregateFunction<Long>() {
        public DataSet<Long> execute(LogicalGraph g) {
          return Count.count(g.getVertices());
        }
      });
  • 39. Operators: Transformation with a UDF [Figure: graph head "3 | vertexCount: 5" becomes "3 | Community | vCount: 5"; vertex property "name" becomes "f_name"]
      graph3 = graph3.transformEdges(new TransformationFunction<Edge>() {
        public Edge execute(Edge e) {
          e.setLabel(e.getLabel().equals("orange") ? "red" : e.getLabel());
          return e;
        }
      });
  • 40. Operators: Subgraph with vertex and edge UDFs
      LogicalGraph graph4 = graph3.subgraph(
        new FilterFunction<Vertex>() {
          public boolean execute(Vertex v) {
            return v.getLabel().equals("green");
          }
        },
        new FilterFunction<Edge>() {
          public boolean execute(Edge e) {
            return e.getLabel().equals("orange");
          }
        });
  • 41. Operators: Pattern Matching (a pattern applied to graph 3 yields a graph collection)
      GraphCollection collection = graph3.match("(:Green)-[:orange]->(:Orange)");
  • 42. Operators: Grouping by keys plus aggregates [Figure: grouped vertices carry count values (count:2, count:3); grouped edges carry max(a) values (42, 84, 13, 21)]
      LogicalGraph grouped = graph3.groupBy()
        .useVertexLabel()
        .useEdgeLabel()
        .addVertexAggregate(new CountAggregator())
        .addEdgeAggregate(new MaxAggregator("a"));
  • 43. Operators: Apply (e.g. Aggregation) [Figure: an operator applied to each graph of a collection yields "1 | vertexCount: 5" and "2 | vertexCount: 4"]
      collection = collection.apply(new Aggregation<>("vertexCount", new VertexCount()));
  • 44. Operators: Selection with a UDF (vertexCount > 4) [Figure: of "1 | vertexCount: 5" and "2 | vertexCount: 4", only graph 1 remains]
      GraphCollection filtered = collection.select(new FilterFunction<GraphHead>() {
        public boolean filter(GraphHead g) {
          return g.getPropertyValue("vertexCount").getLong() > 4L;
        }
      });
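
Apply and Selection compose naturally: first annotate every graph in a collection with an aggregate, then filter on that annotation. Here is a minimal sketch in plain Java collections rather than the Gradoop API; `CollectionSketch`, `Graph`, `applyVertexCount` and `select` are hypothetical names, and the two graphs mirror the vertex ids shown in the slide figures (five vertices in graph 1, four in graph 2).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class CollectionSketch {
    /** Hypothetical stand-in for one graph of a collection: id, vertex ids, graph-level properties. */
    static final class Graph {
        final int id;
        final List<Integer> vertexIds;
        final Map<String, Long> properties = new HashMap<>();
        Graph(int id, List<Integer> vertexIds) {
            this.id = id;
            this.vertexIds = vertexIds;
        }
    }

    /** Apply: run the vertex-count aggregation on every graph of the collection. */
    static void applyVertexCount(List<Graph> collection) {
        for (Graph g : collection) {
            g.properties.put("vertexCount", (long) g.vertexIds.size());
        }
    }

    /** Selection: keep only the graphs that satisfy the user-defined predicate. */
    static List<Graph> select(List<Graph> collection, Predicate<Graph> predicate) {
        List<Graph> result = new ArrayList<>();
        for (Graph g : collection) {
            if (predicate.test(g)) {
                result.add(g);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Graph> collection = new ArrayList<>();
        collection.add(new Graph(1, List.of(0, 1, 2, 3, 4)));
        collection.add(new Graph(2, List.of(5, 6, 7, 8)));
        applyVertexCount(collection);
        // Same condition as the slide's UDF: vertexCount > 4
        List<Graph> filtered = select(collection, g -> g.properties.get("vertexCount") > 4L);
        for (Graph g : filtered) {
            System.out.println(g.id + " | vertexCount: " + g.properties.get("vertexCount"));
        }
    }
}
```

In this sketch the aggregate is stored on the graph itself, so the selection predicate only needs graph-level data, which is the same idea as filtering on `GraphHead` properties in the slide's code.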
  • 45. Operators: Call (e.g. Clustering) [Figure: an algorithm applied to graph 1 yields a collection of graphs]
      GraphCollection clustering = graph.callForCollection(new ClusteringAlgorithm());
  • 46. Operators: Call (e.g. Page Rank) [Figure: an algorithm applied to graph 1 yields graph 2 with vertices annotated with rank values such as rank:0.11, rank:0.25, rank:0.75, rank:1.29, rank:1.58]
      LogicalGraph pageRankGraph = graph.callForGraph(new PageRankAlgorithm());
  • 47. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 47 Motivation EPGM Operators BenchmarkImplementation 47 Implementation Apache Flink Gradoop on Flink
  • 48. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 48 Motivation EPGM Operators BenchmarkImplementation 48 Implementation Apache Flink Gradoop on Flink „Streaming Dataflow Engine that provides data distribution, communication and fault tolerance for distributed computations over data streams.“ https://flink.apache.org/ Streaming Dataflow Runtime DataSet DataStream HadoopMR Table Gelly FlinkML Table Zeppelin Cascading MRQL Dataflow Storm Dataflow SAMOA GRADOOP Cluster (e.g. YARN)Local Cloud (e.g. EC2) Batch Stream Data Storage (e.g. Files, HDFS, S3, JDBC, HBase, Kafka, …)
  • 49. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 49 Motivation EPGM Operators BenchmarkImplementation 49 Implementation Apache Flink Gradoop on Flink DataSetDataSetDataSet DataSetDataSetDataSet DataSetDataSetDataSet DataSetDataSetDataSet DataSetDataSetDataSet DataSetDataSetDataSet • DataSet := Distributed Collection of Data Objects • Transformation := Operation on DataSets (Higher-order function) • Flink Programm := Composition of Transformations DataSet DataSet DataSet Transformation Transformation DataSet DataSet Transformation DataSet Flink Program
  • 50. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 50 Motivation EPGM Operators BenchmarkImplementation 50 Implementation Apache Flink Gradoop on Flink Hadoop-like Transformations • map • flatMap • mapPartition • reduce • reduceGroup • coGroup Special Flink Operations • iterate • iterateDelta SQL-like Transformations • filter • project • cross • union • distinct • first-N (limit) • groupBy • aggregate • join • leftOuterJoin • rightOuterJoin • fullOuterJoin
  • 51. Implementation (Apache Flink): word count example
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

      DataSet<String> text = env.fromElements( // or env.readTextFile("hdfs://…")
          "He who controls the past controls the future.",
          "He who controls the present controls the past.");

      DataSet<Tuple2<String, Integer>> wordCounts = text
          .flatMap(new LineSplitter()) // splits the line and outputs (word, 1) tuples
          .groupBy(0)                  // groups the tuples by the word (field 0)
          .sum(1);                     // sums the counts (field 1) per group

      wordCounts.print(); // trigger execution

      Intermediate results:
      • flatMap: (He,1) (who,1) (controls,1) (the,1) (past,1) // ...
      • groupBy(0): [(He,1),(He,1)] [(who,1),(who,1)] [(future,1)] [(past,1),(past,1)] [(present,1)] // ...
      • sum(1): (He,2) (who,2) (future,1) (past,2) (present,1) // ...
  • 52. Implementation (Apache Flink): parallel execution of the word count dataflow
      [Figure: Source → three parallel pipelines (flatMap → groupBy(0) → sum(1)) → Sink]
      • Pipeline 1: flatMap (He,1) (who,1) (controls,1) → groupBy(0) [(He,1),(He,1)] [(who,1),(who,1)] → sum(1) (He,2) (who,2)
      • Pipeline 2: flatMap (the,1) (past,1) → groupBy(0) [(future,1)] [(past,1),(past,1)] → sum(1) (future,1) (past,2)
      • Pipeline 3: flatMap (future,1) (past,1) → groupBy(0) [(present,1)] → sum(1) (present,1)
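The flatMap → groupBy(0) → sum(1) pipeline from the word count slides can be tried out without a Flink installation. The following is only a minimal local sketch using `java.util.stream`, not Flink code: the class name `WordCountSketch` and the method `countWords` are hypothetical, and splitting on non-word characters is an assumption, since the slides do not show how `LineSplitter` tokenizes a line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {

    // Local stand-in for flatMap -> groupBy(0) -> sum(1); works on a List, not a distributed DataSet.
    public static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
            // flatMap step: split every line into words (assumption: split on non-word characters)
            .flatMap(line -> Arrays.stream(line.split("\\W+")))
            .filter(word -> !word.isEmpty())
            // groupBy + sum step: group equal words and add 1 per occurrence
            .collect(Collectors.groupingBy(
                word -> word, TreeMap::new, Collectors.summingInt(word -> 1)));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "He who controls the past controls the future.",
            "He who controls the present controls the past.");
        System.out.println(countWords(lines));
    }
}
```

Unlike the Flink program, this version runs in a single thread; the parallel pipelines on the slide correspond to Flink partitioning the grouped data across workers.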
  • 53. Implementation (Gradoop on Flink): EPGM Graph Representation
      • EPGMGraphHead (POJO): Id, Label, Properties; stored as DataSet<EPGMGraphHead>
      • EPGMVertex (POJO): Id, Label, Properties, Graphs; stored as DataSet<EPGMVertex>
      • EPGMEdge (POJO): Id, Label, Properties, SourceId, TargetId, Graphs; stored as DataSet<EPGMEdge>
      Field types (shown for EPGMVertex):
      • Id: GradoopId := UUID 128-bit
      • Label: String
      • Properties: PropertyList := List<Property>, Property := (String, PropertyValue), PropertyValue := byte[]
      • Graphs: GradoopIdSet := Set<GradoopId>
  • 54. Implementation (Gradoop on Flink): example data

      DataSet<EPGMGraphHead>
      Id  Label      Properties
      1   Community  {interest:Heavy Metal}
      2   Community  {interest:Hard Rock}

      DataSet<EPGMVertex>
      Id  Label   Properties                       Graphs
      1   Person  {name:Alice, born:1984}          {1}
      2   Band    {name:Metallica, founded:1981}   {1}
      3   Person  {name:Bob}                       {1,2}
      4   Band    {name:AC/DC, founded:1973}       {2}
      5   Person  {name:Eve}                       {2}

      DataSet<EPGMEdge>
      Id  Label  Source  Target  Properties    Graphs
      1   likes  1       2       {since:2014}  {1}
      2   likes  3       2       {since:2013}  {1}
      3   likes  3       4       {since:2015}  {2}
      4   knows  3       5       {}            {2}
      5   likes  5       4       {since:2014}  {2}

      [Figure: the same data drawn as two overlapping logical graphs, 1|Community|interest:Heavy Metal and 2|Community|interest:Hard Rock]
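The role of the Graphs column can be illustrated with a small local model. This is a sketch under stated assumptions, not Gradoop's implementation: `EpgmSketch`, `Vertex`, and `vertexIdsOfGraph` are hypothetical names, ids are plain `long` values instead of GradoopId, and properties are left out. It shows how graph membership stored on each vertex determines which vertices belong to a logical graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EpgmSketch {

    // Simplified, hypothetical stand-in for an EPGM vertex (no properties, long ids).
    public static final class Vertex {
        final long id;
        final String label;
        final Set<Long> graphs; // ids of the logical graphs that contain this vertex

        Vertex(long id, String label, Long... graphs) {
            this.id = id;
            this.label = label;
            this.graphs = new HashSet<>(Arrays.asList(graphs));
        }
    }

    // The five vertices from the example table on the slide.
    public static List<Vertex> exampleVertices() {
        return Arrays.asList(
            new Vertex(1, "Person", 1L),
            new Vertex(2, "Band", 1L),
            new Vertex(3, "Person", 1L, 2L),
            new Vertex(4, "Band", 2L),
            new Vertex(5, "Person", 2L));
    }

    // Returns the ids of all vertices whose graph set contains the given logical graph id.
    public static List<Long> vertexIdsOfGraph(List<Vertex> vertices, long graphId) {
        List<Long> result = new ArrayList<>();
        for (Vertex v : vertices) {
            if (v.graphs.contains(graphId)) {
                result.add(v.id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(vertexIdsOfGraph(exampleVertices(), 1L)); // vertices of graph 1
        System.out.println(vertexIdsOfGraph(exampleVertices(), 2L)); // vertices of graph 2
    }
}
```

Vertex 3 (Bob) appears in both results because its graph set is {1,2}, which is how the two logical graphs overlap without duplicating the vertex.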
  • 55. Implementation (Gradoop on Flink): grouping example
      LogicalGraph grouped = graph1.combine(graph2).groupBy()
          .useVertexLabel()
          .useEdgeLabel()
          .addVertexAggregate(new CountAggregator())
          .addEdgeAggregate(new CountAggregator());
      • Input: logical graphs 1 (Community, interest:Heavy Metal) and 2 (Community, interest:Hard Rock) with the vertices and edges from slide 54
      • Result: super vertices 6 (Person, count : 3) and 7 (Band, count : 2); super edges likes (count : 4) and knows (count : 1)
  • 56. Implementation (Gradoop on Flink): dataflow of the grouping operator
      Vertices (V):
      • Map: extract attributes
          (1,[Person],[]) (2,[Band],[]) (3,[Person],[]) (4,[Band],[]) (5,[Person],[])
      • GroupBy(1) + GroupReduce*: assign vertices to groups, compute aggregates, create super vertex tuples, forward updated group members
          (-,6,[Person],[3]) (1,6,[],[]) (-,7,[Band],[2]) (2,7,[],[]) (3,6,[],[]) (4,7,[],[]) (5,6,[],[])
      • Filter + Map: extract super vertex tuples, build super vertices
          v6 v7
      • Filter + Map: extract group members, reduce memory footprint
          (1,6) (2,7) (3,6) (4,7) (5,6)
      Edges (E):
      • Map: extract attributes
          (1,1,2,[likes],[]) (2,3,2,[likes],[]) (3,3,4,[likes],[]) (4,3,5,[knows],[]) (5,5,4,[likes],[])
      • Join*: replace Source/TargetId with corresponding super vertex id
          (1,6,7,[likes],[]) (2,6,7,[likes],[]) (3,6,7,[likes],[]) (4,6,6,[knows],[]) (5,6,7,[likes],[])
      • GroupBy(1,2,3) + GC + GR* + Map: assign edges to groups, compute aggregates, build super edges
          e6 e7
      *requires worker communication
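The grouping steps above can be replayed on the example data in plain Java. This is a rough single-machine sketch, not the Flink dataflow from the slide: `GroupingSketch` and its methods are hypothetical names, and the vertex label itself serves as the group key where Gradoop would create super vertex ids (6 and 7 in the example).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupingSketch {

    // Hypothetical minimal edge: source id, target id, label.
    public static final class Edge {
        final long source;
        final long target;
        final String label;

        Edge(long source, long target, String label) {
            this.source = source;
            this.target = target;
            this.label = label;
        }
    }

    // Vertex id -> label, taken from the example graph.
    public static Map<Long, String> exampleVertexLabels() {
        Map<Long, String> labels = new HashMap<>();
        labels.put(1L, "Person");
        labels.put(2L, "Band");
        labels.put(3L, "Person");
        labels.put(4L, "Band");
        labels.put(5L, "Person");
        return labels;
    }

    // The five edges of the example graph.
    public static List<Edge> exampleEdges() {
        return Arrays.asList(
            new Edge(1, 2, "likes"), new Edge(3, 2, "likes"), new Edge(3, 4, "likes"),
            new Edge(3, 5, "knows"), new Edge(5, 4, "likes"));
    }

    // Vertex grouping: one group per label, aggregated with a count.
    public static Map<String, Integer> countVerticesByLabel(Map<Long, String> labels) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String label : labels.values()) {
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    // Edge grouping: replace source and target by their vertex group, then count
    // per (source group, edge label, target group).
    public static Map<String, Integer> countEdgesByGroup(Map<Long, String> labels, List<Edge> edges) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Edge e : edges) {
            String key = labels.get(e.source) + "-[" + e.label + "]->" + labels.get(e.target);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Long, String> labels = exampleVertexLabels();
        System.out.println(countVerticesByLabel(labels));
        System.out.println(countEdgesByGroup(labels, exampleEdges()));
    }
}
```

The lookup `labels.get(e.source)` plays the role of the Join step: in the distributed setting that lookup is what requires worker communication, because an edge and its endpoint vertices may live on different workers.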
  • 57. Implementation (Gradoop on Flink): EPGM API (Operators)
      class LogicalGraph<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
        fromCollections(...) : LogicalGraph<G, V, E>
        fromDataSets(...) : LogicalGraph<G, V, E>
        fromGellyGraph(...) : LogicalGraph<G, V, E>
        getGraphHead() : DataSet<G>
        getVertices() : DataSet<V>
        getEdges() : DataSet<E>
        aggregate(...) : LogicalGraph<G, V, E>
        match(...) : GraphCollection<G, V, E>
        groupBy(...) : LogicalGraph<G, V, E>
        subgraph(...) : LogicalGraph<G, V, E>
        combine(...) : LogicalGraph<G, V, E>
        // ...
      }
      class GraphCollection<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
        fromCollections(...) : GraphCollection<G, V, E>
        fromDataSets(...) : GraphCollection<G, V, E>
        getGraphHeads() : DataSet<G>
        getVertices() : DataSet<V>
        getEdges() : DataSet<E>
        select(...) : GraphCollection<G, V, E>
        distinct() : GraphCollection<G, V, E>
        sortBy(...) : GraphCollection<G, V, E>
        union(...) : GraphCollection<G, V, E>
        difference(...) : GraphCollection<G, V, E>
        // ...
      }
  • 58. Implementation (Gradoop on Flink): EPGM API (I/O)
      interface DataSource<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
        getLogicalGraph(...) : LogicalGraph<G, V, E>
        getGraphCollection(...) : GraphCollection<G, V, E>
      }
      interface DataSink<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
        write(LogicalGraph<G, V, E>) : void
        write(GraphCollection<G, V, E>) : void
      }
      class GraphDataSource<...> implements DataSource<...> { }
      class HBaseDataSource<...> implements DataSource<...> { }
      class JSONDataSource<...> implements DataSource<...> { }
      class TLFDataSource<...> implements DataSource<...> { }
      class HBaseDataSink<...> implements DataSink<...> { }
      class JSONDataSink<...> implements DataSink<...> { }
      class TLFDataSink<...> implements DataSink<...> { }
  • 59. Benchmark
  • 60. Benchmark: analytical workflow (http://ldbcouncil.org/)
      1. Extract subgraph containing only Persons and knows relations
      2. Transform Persons to necessary information
      3. Find communities using Label Propagation
      4. Aggregate vertex count for each community
      5. Select communities with more than 50K users
      6. Combine large communities to a single graph
      7. Group graph by Persons' location and gender
      8. Aggregate vertex and edge count of grouped graph
  • 61. Benchmark: analytical workflow, steps 1 to 8 as on slide 60 (https://git.io/vgozj)
  • 62. Benchmark: datasets and cluster setup

      Dataset              # Vertices    # Edges          Disk size
      Graphalytics.1       61,613        2,026,082        570 MB
      Graphalytics.10      260,613       16,600,778       4.5 GB
      Graphalytics.100     1,695,613     147,437,275      40.2 GB
      Graphalytics.1000    12,775,613    1,363,747,260    372 GB
      Graphalytics.10000   90,025,613    10,872,109,028   2.9 TB

      • 16x Intel(R) Xeon(R) 2.50GHz 6 (12)
      • 16x 48 GB RAM
      • 1 Gigabit Ethernet
      • Hadoop 2.6.0
      • Flink 1.0-SNAPSHOT
      • slots (per worker) 12
      • jobmanager.heap.mb 2048
      • taskmanager.heap.mb 40960
  • 63. Benchmark (datasets and cluster setup as on slide 62)
      [Figure: Runtime [s] (axis 0 to 1200) over Number of workers (1, 2, 4, 8, 16) for Graphalytics.100]
  • 64. Benchmark (datasets and cluster setup as on slide 62)
      [Figure: Speedup (1, 2, 4, 8, 16) over Number of workers (1, 2, 4, 8, 16); series: Graphalytics.100 and Linear]
  • 65. Benchmark (datasets and cluster setup as on slide 62)
      [Figure: Runtime [s] across the Graphalytics datasets; tick labels 1, 10, 100, 1000, 10000]
  • 66. Summary: Release History
      • 0.0.1 First Prototype (May 2015)
          – Hadoop MapReduce and Giraph for operator implementations
          – Too much complexity
          – Performance loss through serialization in HDFS/HBase
      • 0.0.2 Using Flink as execution layer (June 2015)
          – Basic operators
      • 0.1 December 2015
          – System-side identifiers (UUID)
          – Improved property handling
          – More operator implementations (e.g., Equality, Bool operators)
          – Code refactoring
      • 0.2-SNAPSHOT August 2016
          – Graph Pattern Matching
          – Frequent Subgraph Mining
          – Memory optimization (96-bit ID, Dictionary Encoding, …)
          – Refactoring
  • 67. Summary: Contributions welcome!
      • Code
          – I/O Formats (GraphML, DOT, …)
          – Operators and Algorithms
          – Tuning (Memory consumption, serialization, …)
          – API improvements
      • Use cases and data
          – Business Intelligence
          – Fraud Detection
          – Pattern Mining
          – …
  • 68. Summary
      • Extended Property Graph Model
          – Schema flexible: Type Labels and Properties
          – Logical Graphs / Graph Collections
          – Graph and Collection Operators
          – Combination to analytical workflows
      • Implemented on Apache Flink
          – Built-in scalability
          – Combine with other libraries
  • 69. www.gradoop.com