Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Hadoop Summit 2016, Dublin, April 13th
Thomas VIAL, OCTO Technology - tvial@octo.com
Guillaume GERMAINE, EDF R&D - guillaume.germaine@edf.fr
Marie-Luce Picard, Benoit Grossin, Martin Soppé, Michel Lutz
Outline
1. CONTEXT AND PROBLEM DESCRIPTION
2. GRAPHS: KEY CONCEPTS AND TECHNICAL INSIGHTS
3. TWO CHALLENGERS: SPARK GRAPHX AND TITAN
4. QUERYING TIME-VARYING GRAPHS
5. SCALING UP ON BIG GRAPHS
6. AS A CONCLUSION
Brice Richard - Flickr
KC Tan Photography - Flickr
1. CONTEXT AND PROBLEM DESCRIPTION
EDF: A GLOBAL LEADER IN ELECTRICITY
All electricity-related activities
• Generation
• Transmission & Distribution
• Trading and Sales & Marketing
• Energy services
Key figures*
• €72.9 billion in sales
• 38.5 million customers
• 158,161 employees worldwide
• 84.7% of generation does not emit CO2
• Electricity generation: 623.5 TWh
• 2014 investments: €4.5 billion
*as of 2015
Description of the French electrical network
• 2,200 distribution substations
• 618,000 kilometers of medium-voltage network
• 697,000 kilometers of low-voltage network
• A great diversity of components: lines, transformers, switches, line breakers, meters…
• More than 100 million components
[Figure: voltage levels (high, medium, low) across the transmission network and the distribution network]
French electrical network: more than 100 million pieces of equipment to manage
[Figure: tree from a distribution substation down to the meters, repeated × 2,200; 200+ nodes in depth, ~50,000 nodes per tree, 100m+ elements overall]
Some business cases to tackle
BC1 Get a global picture of all powered components, at any given time
BC2 Track the physical evolution of the network over time (new meters, reconfiguration of the network…)
BC3 Compute the electrical energy balance for any subpart of the network, over any period of time
BC4 If a piece of equipment fails, figure out the best way to restore power supply as quickly as possible, given the state of the network at that specific time
How to model the problem?
We have to find a way to:
• define relationships between pieces of technical equipment
• associate some properties with each component
• keep track of events across time
• explore the network, with complex calculations at hand
… it’s all about graphs
2. GRAPHS: KEY CONCEPTS AND TECHNICAL INSIGHTS
A property graph as a data model
[Figure: a property graph made of vertices and edges, with labels and key:value properties (key1:value, key2:value)]
A property graph as a data model
All components of the network are modeled as vertices:
• Distribution substations
• Switches
• Line breakers
• Transformers
• Loads
• Lines
• …
Edges only represent symbolic, undirected links between components
Electrical lines are also modeled as vertices!
Load curves (metering data) are not stored as properties in the graph structure, but in separate storage (HBase in this study)
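To make the data model concrete, here is a minimal Python sketch of a property graph in which lines are vertices and edges are plain undirected links. The class name, vertex identifiers, labels and property keys are hypothetical and only serve the illustration; they are not taken from the study.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory property graph with undirected edges (illustrative only)."""

    def __init__(self):
        self.vertices = {}                  # vertex id -> {"label": ..., "props": {...}}
        self.adjacency = defaultdict(set)   # vertex id -> ids of neighbouring vertices

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = {"label": label, "props": dict(props)}

    def add_edge(self, a, b):
        # Edges carry no direction and no data: they only say "a touches b".
        self.adjacency[a].add(b)
        self.adjacency[b].add(a)

    def neighbors(self, vid):
        return sorted(self.adjacency[vid])


g = PropertyGraph()
g.add_vertex("S1", "SUBSTATION", name="hypothetical substation")
g.add_vertex("L1", "LINE", length_km=1.2)    # a line is a vertex, not an edge
g.add_vertex("SW1", "SWITCH")
g.add_vertex("M1", "LOAD", curve_key="M1")   # the load curve itself lives elsewhere
g.add_edge("S1", "L1")
g.add_edge("L1", "SW1")
g.add_edge("SW1", "M1")
```

Keeping only a key such as `curve_key` on the load vertex mirrors the idea of storing metering data outside the graph and looking it up when needed.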
Graphs and temporal data: not so simple!
A unique graph to record every event that has ever occurred
All events are recorded as vertex properties, in the form of time intervals:
• Validity: track the equipment life cycle (actual period as a working unit)
• Failure: track equipment failures
• SwitchState: track network reconfiguration through switch state changes (open/close)
Each time an event occurs, a time interval is updated or created
The concept: apply graph mutations at each new event, enriching the “maximal” graph
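One possible way to picture "a time interval is updated or created" is sketched below in Python. The helper names, the use of `None` for a still-open interval, the numeric timestamps and the reading of SwitchState as "open" periods are all assumptions made for the example, not the study's actual schema.

```python
import math


def open_interval(intervals, start):
    """Create a new interval that starts at `start` and is still running."""
    intervals.append([start, None])


def close_interval(intervals, end):
    """Update the most recent interval, which must still be open."""
    if not intervals or intervals[-1][1] is not None:
        raise ValueError("no open interval to close")
    intervals[-1][1] = end


def covers(intervals, t):
    """True if some interval [start, end) contains time t."""
    return any(s <= t < (math.inf if e is None else e) for s, e in intervals)


# Vertex properties of a hypothetical switch; each property is a list of intervals.
switch = {"Validity": [], "Failure": [], "SwitchState": []}

open_interval(switch["Validity"], 0)        # commissioned at t=0, still in service
open_interval(switch["SwitchState"], 10)    # switch opened at t=10 ...
close_interval(switch["SwitchState"], 25)   # ... and closed again at t=25
open_interval(switch["Failure"], 40)        # failure starts at t=40, not repaired yet
```

Because events only append or close intervals, the graph keeps its whole history, and a question about time `t` reduces to calls such as `covers(switch["SwitchState"], t)`.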
3. TWO CHALLENGERS: SPARK GRAPHX AND TITAN
Spark GraphX: global architecture
Offers a graph API built on top of Spark
Extends RDD to represent property graphs: VertexRDD and EdgeRDD
Implements a collection of graph algorithms (e.g., PageRank, Triangle Counting, Connected Components…) and also some fundamental operations on graphs (e.g., mapVertices, mapEdges, subgraph, collectNeighbors…)
Spark GraphX: Pregel
Implements a variant of Pregel, a very popular graph processing architecture developed at Google in 2010 (Malewicz et al., “Pregel: A System for Large-Scale Graph Processing”)
Based on BSP (Bulk Synchronous Parallel)
Vertex-centric: edges don't carry any computation
Runs as a sequence of iterations (supersteps)
For each superstep, every vertex can:
• read messages sent to it during the previous superstep
• execute a user-defined function and generate some messages
• send messages to other vertices (that will be received at the next superstep)
• vote to halt
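The superstep loop can be illustrated without Spark. The sketch below is a plain-Python toy, not the GraphX API: the function names and the rule "sending no message counts as a vote to halt" are assumptions chosen to keep the example short. It uses the loop to label connected components with their smallest vertex id.

```python
def pregel(graph, initial_value, compute, max_supersteps=50):
    """Run a vertex program in synchronous supersteps until no message is in flight.

    graph:   dict vertex -> list of neighbour vertices
    compute: f(vertex, value, messages, superstep) -> (new_value, outgoing)
             where outgoing is a list of (target_vertex, message) pairs.
             Returning no outgoing messages is the vertex's vote to halt.
    """
    values = {v: initial_value(v) for v in graph}
    inbox = {v: [] for v in graph}
    for superstep in range(max_supersteps):
        # In superstep 0 every vertex is active; afterwards only message receivers are.
        active = [v for v in graph if superstep == 0 or inbox[v]]
        if not active:
            break
        outbox = {v: [] for v in graph}
        for v in active:
            values[v], outgoing = compute(v, values[v], inbox[v], superstep)
            for target, message in outgoing:
                outbox[target].append(message)   # delivered at the next superstep
        inbox = outbox
    return values


def min_label(graph):
    """Connected components: every vertex ends up with the smallest id it can reach."""
    def compute(v, value, messages, superstep):
        best = min([value] + messages)
        if superstep == 0 or best < value:
            return best, [(n, best) for n in graph[v]]
        return value, []          # nothing new learned: vote to halt
    return pregel(graph, initial_value=lambda v: v, compute=compute)


graph = {1: [2], 2: [1, 3], 3: [2], 7: [8], 8: [7]}
components = min_label(graph)
```

Each vertex only ever sees its own value and its inbox, which is what makes the model easy to distribute: a superstep can be computed in parallel, with a barrier before messages are delivered.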
TITAN: global architecture
Optimized for storing and querying billions of vertices and edges over a cluster
Supports thousands of concurrent users
Can execute local queries (OLTP) or distributed queries across a cluster (OLAP)
TITAN: GREMLIN
Gremlin is a graph traversal scripting language and is part of TinkerPop, a widely supported open-source graph computing framework (e.g., Neo4j, OrientDB, Sparksee…)
Blueprints is like a JDBC driver for graphs
Gremlin OLTP: for local graph traversal, queries run in a single process
Gremlin Hadoop (OLAP): for large graph analysis, queries are distributed across a cluster
Gremlin Hadoop implements BSP-based vertex-centric computing
[Figure: Titan stack: storage backend + graph database + TinkerPop API (graph processing)]
4. QUERYING TIME-VARYING GRAPHS
Traversing a time-varying graph
We have as inputs
• A period of observation [t1, t2]
• A particular vertex as a starting point
We know, for each vertex in the graph, when it was inactive
• For a piece of equipment: taken down, under maintenance, switched, …
We want to traverse the graph, following the paths as they vary according to the states of the vertices encountered
We devise a “Trickling” algorithm that can be easily implemented with either paradigm, Pregel or Gremlin
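The slides do not spell out the “Trickling” algorithm, so the following Python sketch is only one plausible reading of the description: availability intervals flow from the starting vertex to its neighbours, and each vertex removes its own inactive periods before passing them on. All function names, the graph and the intervals are hypothetical.

```python
def merge(intervals):
    """Union of intervals, returned sorted and non-overlapping."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(intervals, holes):
    """Remove every `holes` interval from `intervals`."""
    result = []
    for start, end in merge(intervals):
        cursor = start
        for h_start, h_end in merge(holes):
            if h_end <= cursor or h_start >= end:
                continue                      # hole does not touch what is left
            if h_start > cursor:
                result.append((cursor, h_start))
            cursor = max(cursor, h_end)
        if cursor < end:
            result.append((cursor, end))
    return merge(result)


def trickle(neighbors, inactive, source, t1, t2):
    """Powered intervals of every vertex reached from `source` within [t1, t2]."""
    powered = {source: subtract([(t1, t2)], inactive.get(source, []))}
    frontier = [source]
    while frontier:
        vertex = frontier.pop()
        for nxt in neighbors.get(vertex, []):
            # What flows on is what this vertex has, minus the neighbour's downtime.
            incoming = subtract(powered[vertex], inactive.get(nxt, []))
            updated = merge(powered.get(nxt, []) + incoming)
            if updated != powered.get(nxt, []):
                powered[nxt] = updated
                frontier.append(nxt)          # new information: keep trickling
    return powered


def share(intervals, t1, t2):
    """Fraction of [t1, t2] covered by the intervals."""
    return sum(end - start for start, end in intervals) / (t2 - t1)


neighbors = {"S": ["A"], "A": ["S", "B", "C"], "B": ["A"], "C": ["A"]}
inactive = {"B": [(25, 50)], "C": [(0, 100)]}
powered = trickle(neighbors, inactive, "S", 0, 100)
```

In this sketch a vertex that is never powered (here `C`) simply does not appear in the result. Checking whether an interval contains a given `t` answers a "what is powered at time t" question, and `share(powered["B"], 0, 100)` gives the fraction of the period during which `B` was powered.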
BC1 – Get a picture of all powered components
What vertices are reachable from substation S at time t?
[Figure: graph below S showing the actual flow between t and t+ε, with each vertex marked CONNECTED or DISCONNECTED]
BC2 – Track how long equipment is powered
For how long were vertices reachable from S actually powered between t1 and t2?
[Figure: actual flows below S between t1 and t2; per-vertex ∑ intervals of 100%, 100%, 75%, 75%, 55% and 0%]
BC3 – Manage energy balance
How much aggregated energy flowed from substation S to the leaves between t1 and t2?
[Figure: load curves (kW) at the leaves between t1 and t2; 1,132 MWh + 856 MWh + 0 MWh, ∑ loads = 1,988 MWh]
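As a rough illustration of the energy balance, the Python sketch below sums load-curve samples that fall inside the intervals during which each leaf was powered. The sampling step, the curve values and the `powered` intervals are made up for the example, and the fixed-step rectangle sum is an assumption about how the curves could be integrated.

```python
def energy_kwh(load_curve, powered, step_hours):
    """Sum the kW samples that fall inside a powered interval, converted to kWh."""
    return sum(
        kw * step_hours
        for t, kw in load_curve
        if any(start <= t < end for start, end in powered)
    )


# Hypothetical half-hourly curves (time in hours, power in kW) for two leaves.
curves = {
    "M1": [(0.0, 2.0), (0.5, 4.0), (1.0, 4.0), (1.5, 2.0)],
    "M2": [(0.0, 1.0), (0.5, 1.0), (1.0, 3.0), (1.5, 3.0)],
}
# Intervals during which each leaf was powered, e.g. the output of a traversal.
powered = {"M1": [(0.0, 2.0)], "M2": [(1.0, 2.0)]}

per_leaf = {m: energy_kwh(curves[m], powered[m], step_hours=0.5) for m in curves}
total = sum(per_leaf.values())
```

Here `M2` only contributes its samples from hour 1.0 onwards, because it was not powered before that.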
BC4 – Restore power supply after a failure
From what other substations S’ can vertices below S be reached? Pattern matching
[Figure: pattern linking S to S’ through vertices labeled SUBSTATION, LINE, SWITCH, (many things), SWITCH, SUBSTATION]
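The question can be approximated with a label-aware search. The Python sketch below uses a breadth-first search rather than Gremlin pattern matching, and it ignores switch states and time; the graph, labels and function name are hypothetical.

```python
from collections import deque


def backup_substations(neighbors, labels, source):
    """Substations other than `source` reachable without crossing another substation."""
    found, seen, queue = set(), {source}, deque([source])
    while queue:
        vertex = queue.popleft()
        for nxt in neighbors.get(vertex, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            if labels[nxt] == "SUBSTATION":
                found.add(nxt)        # candidate S': do not explore its own tree
            else:
                queue.append(nxt)
    return found


# S - L1 - SW1 - L2 - S2, plus a load M1 hanging off L1.
edges = [("S", "L1"), ("L1", "SW1"), ("SW1", "L2"), ("L2", "S2"), ("L1", "M1")]
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, []).append(b)   # undirected: register both directions
    neighbors.setdefault(b, []).append(a)
labels = {"S": "SUBSTATION", "S2": "SUBSTATION", "L1": "LINE",
          "L2": "LINE", "SW1": "SWITCH", "M1": "LOAD"}

candidates = backup_substations(neighbors, labels, "S")
```

A fuller version would also record the path to each candidate, since the switches along it are what would have to be operated to restore supply.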
5. SCALING UP ON BIG GRAPHS
Titan at scale
A set of connected trees adding up to ~100 million vertices and ~100 million edges, loaded on a 10-node HBase 1.1 cluster
• The quasi-tree below each substation S contains roughly 50,000 vertices
• The Titan table is made of 3 regions that are equally loaded
• The client machine is on the same network as the HBase servers
Unitary execution with Titan’s OLTP interface
Query | Approximate time
BC1 – Get powered components (Trickling) | 1 min
BC3 – Get aggregated load (Trickling) | 2 min
BC4 – Find backup substations (Pattern matching) | 1 min
Scaling across substations
The execution times above are for a single source substation S
To compute over several substations, we run the same query in parallel, each time with a different input
[Figure: a client node running Titan + Gremlin issues many queries in parallel against the HBase cluster]
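Running one query per substation in parallel could look like the Python sketch below. `run_query` is a placeholder invented for the example; a real client would submit a Gremlin traversal to Titan there. The pool size is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(substation_id):
    """Placeholder for one OLTP query; a real version would call the graph database."""
    return (substation_id, f"result for {substation_id}")


def run_for_all(substation_ids, max_workers=8):
    # Queries are independent of each other, so they can be mapped over a worker pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_query, substation_ids))


results = run_for_all(["S1", "S22", "S333"])
```

As a rough order of magnitude, assuming about 1 minute per substation, 2,200 substations queried one after another would take about 37 hours, which is why the queries need to run concurrently or be precomputed.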
What about Titan OLAP?
The execution times above are predictable but quickly become impractical for interactive querying
We may want to compute some KPIs in advance with OLAP backends, re-using the same Gremlin queries
• Titan 1.0.0 or 1.1.0-snapshot?
• TinkerPop 3.0.1 or 3.1.0?
• Titan + TinkerPop JARs or the other way around?
• Giraph or Spark as the backend?
Be patient!
Support for Hadoop 2 OLAP is still limited. Bits are being moved around between the two projects, which are undergoing a big refactoring…
Next step: GraphX
Spark GraphX is another natural candidate for OLAP workloads
We have yet to benchmark it against our big graph
But wait! GraphFrames for Spark is just being released by Databricks. What’s going on?
http://graphframes.github.io/
6. AS A CONCLUSION
A turning point for graph analytics
A lot is happening these days
• ThinkAurelius’s acquisition by DataStax (will they continue to support HBase?)
• Titan and TinkerPop cross-refactorings
• Improvements in the computation backends (Giraph, Spark)
• Better support for Hadoop 2 and YARN
• Introduction of GraphFrames by Databricks
Graph analytics frameworks are becoming a commodity, and this is a good thing!
Which framework to use?
A quick sketch of the typical usages of the frameworks:
Framework | Usages
Titan OLTP | Interactive querying, for a relatively small number of vertices
Titan OLAP | Batch computations, KPI (but wait for Hadoop 2 support if you’re using HBase)
Spark GraphX | Batch computations, KPI
GraphFrames | ? (it’s too early)
Be sure to test them thoroughly in your environment before making a final decision. Be prepared to go deep into Hadoop and Java/Scala internals!
Or maybe wait a little bit until the situation gets clearer?
Appendix
Appendix A: The ecosystem of graph-based solutions (Graph Databases)
Appendix B: The ecosystem of graph-based solutions (Graph Processing frameworks)
Appendix C: Graph traversal
Appendix D: Architecture with Spark GraphX
Appendix E: Architecture with Titan
Appendix A: The ecosystem of graph-based solutions
Graph Databases (OLTP)
Optimized for local graph exploration (traversal), with low latency
Optimized for handling multiple concurrent users
Data can be distributed across several machines
Queries themselves are not distributed: global graph analyses are inefficient
Appendix B: The ecosystem of graph-based solutions
Graph Processing Frameworks (OLAP)
Optimized for global graph processing (batch)
Queries and data are distributed across a cluster, and can handle very large graphs
Has a higher latency than OLTP solutions
Cannot handle a lot of concurrent users
Appendix D: Architecture with GraphX
[Figure: Spark pipeline. Inputs: edges, vertices, vertex states, load curves. Steps: add intervals as properties (Edge RDD, Vertex RDD); GraphX processors (use case 1, use case 2, use case 3, …); combine; final result]
Appendix E: Architecture with Titan
[Figure: JVM running a Groovy scripts library (use case 1, use case 2, …) on top of the Titan APIs (transactional Gremlin) and the HBase API. Inputs: edges, vertices, vertex states, load curves. Steps: add intervals as properties; store load curves as time series; combine load curves]