Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks

  1. Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
     Hadoop Summit 2016, Dublin, April 13th
     Thomas VIAL, OCTO Technology - tvial@octo.com
     Guillaume GERMAINE, EDF R&D - guillaume.germaine@edf.fr
     Marie-Luce Picard, Benoit Grossin, Martin Soppé, Michel Lutz
  2. Outline
     1. CONTEXT AND PROBLEM DESCRIPTION
     2. GRAPHS: KEY CONCEPTS AND TECHNICAL INSIGHTS
     3. TWO CHALLENGERS: SPARK GRAPHX AND TITAN
     4. QUERYING TIME-VARYING GRAPHS
     5. SCALING UP ON BIG GRAPHS
     6. AS A CONCLUSION
     Photo credits: Brice Richard - Flickr, KC Tan Photography - Flickr
  4. EDF: A GLOBAL LEADER IN ELECTRICITY
     All electricity-related activities: Generation, Transmission & Distribution, Trading and Sales & Marketing, Energy services
     Key figures (as of 2015):
     • €72.9 billion in sales
     • 38.5 million customers
     • 158,161 employees worldwide
     • 84.7% of generation does not emit CO2
     Electricity generation: 623.5 TWh
     2014 investments: €4.5 billion
  5. Description of the French electrical network
     - 2,200 distribution substations
     - 618,000 kilometers of medium voltage network
     - 697,000 kilometers of low voltage network
     - A great diversity of components: lines, transformers, switches, line breakers, meters…
     - More than 100 million components
     Diagram labels: high voltage, medium voltage, low voltage; transmission network, distribution network
  6. French electrical network: more than 100 million pieces of equipment to manage
     Diagram: the tree below one distribution substation, down to the meters, is 200+ nodes deep and holds ~50,000 nodes; × 2,200 substations, this gives 100 million+ elements
  7. Some business cases to tackle
     - BC1: Get a global picture of all powered components, at any given time
     - BC2: Track the physical evolution of the network over time (new meters, reconfiguration of the network…)
     - BC3: Compute the electrical energy balance for any subpart of the network, over any period of time
     - BC4: If a piece of equipment fails, figure out the best way to restore power supply as quickly as possible, given the state of the network at that specific time
  8. How to model the problem?
     We have to find a way to:
     - define relationships between pieces of technical equipment … it's all about graphs
     - associate some properties with each component
     - keep track of events across time
     - explore the network, with complex calculations at hand
  10. A property graph as a data model
      Diagram labels: Vertex, Edge, Label, Properties (key1:value, key2:value)
  11. A property graph as a data model
      - All components of the network are modeled as vertices:
        • Distribution substations
        • Switches
        • Line breakers
        • Transformers
        • Loads
        • Lines
        • …
      - Edges only represent symbolic, undirected links between components
      - Electrical lines are also modeled as vertices!
      - Load curves (metering data) are not stored as properties in the graph structure, but in a separate store (HBase in this study)
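The modeling choice above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the `PropertyGraph` class, the vertex identifiers and the property names (`length_km`, `default_state`, `meter_id`) are hypothetical and do not come from the slides, which used Titan and GraphX rather than an in-memory structure.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: str
    label: str  # e.g. "Substation", "Line", "Switch", "Load"
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph with undirected, symbolic links."""

    def __init__(self):
        self.vertices = {}
        self.adjacency = {}

    def add_vertex(self, vid, label, **properties):
        self.vertices[vid] = Vertex(vid, label, dict(properties))
        self.adjacency.setdefault(vid, set())

    def add_edge(self, a, b):
        # Undirected link: store it in both directions.
        self.adjacency[a].add(b)
        self.adjacency[b].add(a)

    def neighbors(self, vid):
        return sorted(self.adjacency[vid])


graph = PropertyGraph()
graph.add_vertex("S", "Substation")
graph.add_vertex("L1", "Line", length_km=1.2)  # the line is a vertex, not an edge
graph.add_vertex("SW1", "Switch", default_state="closed")
graph.add_vertex("LD1", "Load", meter_id="m-001")  # the load curve is stored elsewhere
graph.add_edge("S", "L1")
graph.add_edge("L1", "SW1")
graph.add_edge("SW1", "LD1")
```

Because the line `L1` is a vertex, it can carry its own properties and events, which a plain edge could not do as easily in this model.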
  12. Graphs and temporal data: not so simple!
      The concept: apply graph mutations at each new event, enriching the “maximal” graph
      - A single graph records every event that has ever occurred
      - All events are recorded as vertex properties, in the form of time intervals:
        • Validity: tracks the equipment life cycle (actual period as a working unit)
        • Failure: tracks equipment failures
        • SwitchState: tracks network reconfiguration through switch state changes (open/closed)
      - Each time an event occurs, a time interval is updated or created
  13. Graphs and temporal data: not so simple!
      The concept: apply graph mutations at each new event, enriching the “maximal” graph
      Diagram: a small network of Src, Line, Switch and Load vertices, shown before and after a series of events. Property sets shown on the vertices:
      • Validity: ]-∞, +∞[ ; Failure: [] (most vertices)
      • Validity: ]-∞, t2] ; Failure: []
      • Validity: ]-∞, +∞[ ; Failure: [t3, +∞[
      • Switch, before: Validity: [t1, +∞[ ; Failure: [] ; SwitchState: [] ; DefaultState: closed
      • Switch, after: Validity: [t1, +∞[ ; Failure: [] ; SwitchState: [t4, +∞[ ; DefaultState: closed
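A minimal sketch of how such interval properties could be mutated as events arrive, in plain Python. It makes several assumptions that the slides do not state: intervals are stored as `(start, end)` tuples, open and closed bounds are not distinguished, a `SwitchState` interval is read as a period during which the switch is away from its `DefaultState`, and the helper names are hypothetical.

```python
import math

INF = math.inf


def new_vertex(default_state=None):
    """Vertex properties as lists of (start, end) intervals."""
    props = {"Validity": [(-INF, INF)], "Failure": []}
    if default_state is not None:  # only switches carry a state
        props["SwitchState"] = []
        props["DefaultState"] = default_state
    return props


def open_interval(props, key, t):
    """An event starts at t: create an interval that is still open-ended."""
    props[key].append((t, INF))


def close_interval(props, key, t):
    """An event ends at t: update the last interval of that kind."""
    start, _ = props[key][-1]
    props[key][-1] = (start, t)


line = new_vertex()
open_interval(line, "Failure", 3.0)   # a failure starts at t3 = 3.0
close_interval(line, "Failure", 5.0)  # assumed repair time, not shown on the slide

switch = new_vertex(default_state="closed")
switch["Validity"] = [(1.0, INF)]          # installed at t1 = 1.0
open_interval(switch, "SwitchState", 4.0)  # leaves its default state at t4 = 4.0
```

Nothing is ever deleted: each event only adds or updates an interval, which is what lets a single "maximal" graph answer questions about any past date.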
  15. Spark GraphX: global architecture
      - Offers a graph API built on top of Spark
      - Extends RDD to represent property graphs: VertexRDD and EdgeRDD
      - Implements a collection of graph algorithms (e.g., PageRank, Triangle Counting, Connected Components…) and also some fundamental operations on graphs (e.g., mapVertices, mapEdges, subgraph, collectNeighbors…)
  16. Spark GraphX: Pregel
      - Implements a variant of Pregel, a very popular graph processing architecture developed at Google in 2010: “Pregel: a system for large-scale graph processing”, Malewicz et al.
      - Based on BSP (Bulk Synchronous Parallel)
      - Vertex-centric: edges don't carry any computation
      - Runs as a sequence of iterations (supersteps)
      - At each superstep, every vertex can:
        • read messages sent to it during the previous superstep
        • execute a user-defined function and generate some messages
        • send messages to other vertices (that will be received at the next superstep)
        • vote to halt
  17. Spark GraphX: Pregel
      Diagram: vertices exchanging messages across supersteps n-1, n and n+1. Messages sent by other vertices (sendMsg) are collected and merged (mergeMsg), the vertex program (vprog) runs, then the vertex either sends messages to other vertices (sendMsg) or halts (haltVertex).
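The superstep loop can be illustrated with a toy, single-process version in plain Python. This is not the GraphX implementation: it only borrows the names `vprog`, `send_msg` and `merge_msg` from the diagram, and the function signature, the example graph and the hop-count computation are assumptions made for the sketch.

```python
import math


def pregel(state, neighbors, initial_msg, vprog, send_msg, merge_msg, max_supersteps=50):
    """Toy single-process Pregel loop over a dict of vertex states."""
    # Superstep 0: every vertex receives the initial message.
    state = {v: vprog(v, s, initial_msg) for v, s in state.items()}
    active = set(state)
    for _ in range(max_supersteps):
        inbox = {}
        # Active vertices send messages, based on the states of the previous superstep.
        for src in active:
            for dst in neighbors[src]:
                msg = send_msg(src, state[src], dst, state[dst])
                if msg is not None:
                    inbox[dst] = msg if dst not in inbox else merge_msg(inbox[dst], msg)
        if not inbox:  # no message in flight: every vertex has halted
            break
        # Barrier: messages are delivered, then vprog runs on each receiver.
        for v, msg in inbox.items():
            state[v] = vprog(v, state[v], msg)
        active = set(inbox)
    return state


# Example: number of hops from "S" to every vertex of a small undirected graph.
neighbors = {"S": ["A"], "A": ["S", "B", "C"], "B": ["A"], "C": ["A"]}
initial = {v: (0 if v == "S" else math.inf) for v in neighbors}

hops = pregel(
    initial,
    neighbors,
    initial_msg=math.inf,
    vprog=lambda v, s, msg: min(s, msg),
    send_msg=lambda src, s_src, dst, s_dst: s_src + 1 if s_src + 1 < s_dst else None,
    merge_msg=min,
)
```

A vertex that receives no message stays inactive in the next superstep, which plays the role of the "vote to halt" step.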
  18. TITAN: global architecture
      - Optimized for storing and querying billions of vertices and edges over a cluster
      - Supports thousands of concurrent users
      - Can execute local queries (OLTP) or distributed queries across a cluster (OLAP)
  19. TITAN: GREMLIN
      - Gremlin OLTP: for local graph traversals, queries run in a single process
      - Gremlin Hadoop (OLAP): for large graph analysis, queries are distributed across a cluster
      - Gremlin Hadoop implements BSP-based vertex-centric computing
      - Gremlin is a graph traversal scripting language and is part of TinkerPop, an open-source graph computing framework supported by many graph databases (e.g., Neo4j, OrientDB, Sparksee…)
      - Blueprints is like a JDBC driver for graphs
      Diagram labels: Storage backend, Graph Database, TinkerPop API, Graph Processing
  21. Traversing a time-varying graph
      - We have as inputs:
        • A period of observation [t1, t2]
        • A particular vertex as a starting point
      - We know, for each vertex in the graph, when it was inactive
        • For a piece of equipment: taken down, under maintenance, switched, …
      - We want to traverse the graph, following the paths as they vary according to the states of the vertices encountered
      - We devise a “Trickling” algorithm that can be easily implemented with either paradigm, Pregel or Gremlin
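The slides' own Gremlin and GraphX code for the Trickling algorithm is not part of this transcript. The sketch below is one possible reading of it in plain Python, under assumptions: each vertex carries a list of `(start, end)` inactivity intervals, the flow reaching a neighbor is the flow at the current vertex intersected with the neighbor's active periods, and flows arriving by several paths are merged. All function names are hypothetical.

```python
from collections import deque


def normalize(intervals):
    """Sort (start, end) intervals and merge the ones that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if start >= end:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Periods covered by both interval lists."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return normalize(out)


def subtract(a, b):
    """Periods of a that are not covered by b."""
    result = normalize(a)
    for bs, be in normalize(b):
        remaining = []
        for s, e in result:
            if be <= s or bs >= e:
                remaining.append((s, e))
                continue
            if s < bs:
                remaining.append((s, bs))
            if be < e:
                remaining.append((be, e))
        result = remaining
    return result


def trickle(neighbors, inactive, source, t1, t2):
    """Return {vertex: intervals during which flow from source reaches it}."""
    observation = [(t1, t2)]
    flow = {source: subtract(observation, inactive.get(source, []))}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in neighbors[v]:
            active_w = subtract(observation, inactive.get(w, []))
            reached = intersect(flow[v], active_w)
            updated = normalize(flow.get(w, []) + reached)
            if updated != flow.get(w, []):  # new flow reached w: keep trickling
                flow[w] = updated
                queue.append(w)
    return flow


# S - L1 - SW - L2 - LD, observed over [0, 10].
neighbors = {"S": ["L1"], "L1": ["S", "SW"], "SW": ["L1", "L2"],
             "L2": ["SW", "LD"], "LD": ["L2"]}
inactive = {"SW": [(4, 6)], "L2": [(8, 10)]}  # switch open, then line failure
flow = trickle(neighbors, inactive, "S", 0, 10)
```

In this example the load `LD` ends up with the flow intervals `[(0, 4), (6, 8)]`: it loses power while the switch is open and again once the line fails.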
  22. Trickling down
      What paths from source S can we follow between t1 and t2?
      Diagram: for each vertex below S, VERTEX ACTIVITY & OBSERVATION [t1, t2] = ACTUAL FLOW
  23. BC1 – Get a picture of all powered components
      What vertices are reachable from substation S at time t?
      Diagram: with an observation period of [t, t+ε], the actual flow marks each vertex below S as CONNECTED or DISCONNECTED
  24. BC2 – Track how long equipment is powered
      For how long were the vertices reachable from S actually powered between t1 and t2?
      Diagram: the actual flows give, for each vertex, ∑ intervals as a share of [t1, t2]: 100%, 100%, 75%, 55%, 75%, 0%
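The percentage shown for each vertex is the total length of its flow intervals divided by the length of the observation period. A minimal sketch, assuming the flow intervals are disjoint `(start, end)` tuples; the function name and the sample values are hypothetical.

```python
def powered_share(flow_intervals, t1, t2):
    """Fraction of [t1, t2] covered by disjoint (start, end) flow intervals."""
    covered = 0
    for start, end in flow_intervals:
        # Clip each interval to the observation period before measuring it.
        clipped_start, clipped_end = max(start, t1), min(end, t2)
        if clipped_end > clipped_start:
            covered += clipped_end - clipped_start
    return covered / (t2 - t1)


flows = {
    "A": [(0, 100)],
    "B": [(0, 50), (75, 100)],
    "C": [(0, 55)],
    "D": [],
}
shares = {vertex: powered_share(intervals, 0, 100) for vertex, intervals in flows.items()}
```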
  25. BC3 – Manage energy balance
      How much aggregated power flowed from substation S to the leaves between t1 and t2?
      Diagram: the load curves (kW) of the leaves, taken between t1 and t2, amount to 1,132 MWh, 856 MWh and 0 MWh, so ∑ loads = 1,988 MWh at S
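One way to compute such a balance is to integrate each leaf's load curve over the periods during which the leaf is actually fed, then add up the leaves. The sketch below assumes load curves are lists of mean power values in kW at a fixed time step, and counts a step only when it lies entirely inside a flow interval. These conventions, the function names and the sample figures are assumptions, not taken from the slides.

```python
def energy_mwh(load_curve_kw, step_hours, flow_intervals):
    """Energy drawn by one leaf, counting only fully powered time steps.

    load_curve_kw[i] is the mean power during [i * step_hours, (i + 1) * step_hours).
    """
    total_kwh = 0.0
    for i, kw in enumerate(load_curve_kw):
        start = i * step_hours
        end = start + step_hours
        if any(s <= start and end <= e for s, e in flow_intervals):
            total_kwh += kw * step_hours
    return total_kwh / 1000.0


def aggregated_load(load_curves, flows, step_hours):
    """Sum of the energy of all leaves; unreachable leaves contribute 0."""
    return sum(
        energy_mwh(curve, step_hours, flows.get(leaf, []))
        for leaf, curve in load_curves.items()
    )


load_curves = {
    "LD1": [500, 500, 500, 500],      # kW, hourly values
    "LD2": [1000, 1000, 1000, 1000],
    "LD3": [800, 800, 800, 800],
}
flows = {"LD1": [(0, 4)], "LD2": [(0, 2)]}  # LD3 is never reached
total = aggregated_load(load_curves, flows, step_hours=1.0)
```

Here `LD1` contributes 2.0 MWh, `LD2` contributes 2.0 MWh (powered for two hours only) and `LD3` contributes nothing.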
  26. Trickling with Gremlin
  27. Trickling with GraphX
  28. BC4 – Restore power supply after a failure
      From what other substations S’ can vertices below S be reached?
      - Pattern matching
      Diagram: a path pattern between S (SUBSTATION) and S’ (SUBSTATION) that goes through LINE and SWITCH vertices, with “(many things)” in between
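The Gremlin query used for this pattern is not included in the transcript. As a rough illustration of the idea only, the plain Python sketch below searches outward from S and stops at the first substation found on each path; it ignores switch states and time intervals, and the function name, labels and sample network are hypothetical.

```python
from collections import deque


def backup_substations(neighbors, labels, source):
    """Other substations reachable from source without crossing a substation.

    Returns {substation: path from source}; switch states are ignored here.
    """
    paths = {source: [source]}
    queue = deque([source])
    found = {}
    while queue:
        v = queue.popleft()
        for w in neighbors[v]:
            if w in paths:
                continue
            paths[w] = paths[v] + [w]
            if labels[w] == "Substation":
                found[w] = paths[w]  # match: stop here, do not traverse S'
            else:
                queue.append(w)
    return found


neighbors = {
    "S": ["L1"], "L1": ["S", "SW1", "LD1"], "LD1": ["L1"],
    "SW1": ["L1", "L2"], "L2": ["SW1", "SW2"], "SW2": ["L2", "S2"], "S2": ["SW2"],
}
labels = {"S": "Substation", "S2": "Substation", "L1": "Line", "L2": "Line",
          "SW1": "Switch", "SW2": "Switch", "LD1": "Load"}

backups = backup_substations(neighbors, labels, "S")
switches_to_check = [v for v in backups["S2"] if labels[v] == "Switch"]
```

The switches found on the path to S’ are the candidates whose state would need to be checked or changed to restore supply.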
  29. Pattern matching with Gremlin
      ^[d]+.*$! // God created comments, and saw that it was good
  31. Titan at scale
      - A set of connected trees adding up to ~100 million vertices and ~100 million edges, loaded on a 10-node HBase 1.1 cluster
        • The quasi-tree below each substation S contains roughly 50,000 vertices
        • The Titan table is made of 3 regions that are equally solicited
        • The client machine is on the same network as the HBase servers
      - Unitary execution with Titan’s OLTP interface:

      Business case                 | Query            | Approximate time
      BC1 – Get powered components  | Trickling        | 1 min
      BC3 – Get aggregated load     | Trickling        | 2 min
      BC4 – Find backup substations | Pattern matching | 1 min
  32. Scaling across substations
      - The execution times above are for a single source substation S
      - To compute over several substations, we run the same query in parallel, with a different input for each run
      Diagram: a client node (Titan + Gremlin) sends many queries in parallel to the HBase cluster
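The fan-out itself is simple to sketch. The example below uses a thread pool from Python's standard library to run one query per substation from a single client. It is only an illustration: `run_query` is a placeholder that returns dummy data, and the slides do not say how the parallel runs were actually launched on the client node.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(substation_id):
    """Placeholder for one unitary query (e.g. trickling from one substation)."""
    # A real client would call the graph database here; this returns dummy data.
    return {"substation": substation_id, "status": "done"}


def run_for_all(substation_ids, max_workers=8):
    """Run the same query for every substation, several at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of the inputs.
        return list(pool.map(run_query, substation_ids))


results = run_for_all(["S1", "S2", "S3"], max_workers=2)
```

With this layout the total time depends on how many concurrent queries the client and the storage cluster can sustain, which is what the results slide reports.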
  33. Scaling across substations – Results
  34. What about Titan OLAP?
      - The execution times above are predictable but quickly become impractical for interactive querying
      - We may want to compute some KPIs in advance with OLAP backends, reusing the same Gremlin queries
        • Titan 1.0.0 or 1.1.0-snapshot?
        • TinkerPop 3.0.1 or 3.1.0?
        • Titan + TinkerPop JARs or the other way around?
        • Giraph or Spark as the backend?
      Be patient! Support for Hadoop 2 OLAP is still limited. Bits are being moved around between the two projects, which are undergoing a big refactoring…
  35. Next step: GraphX
      - Spark GraphX is another natural candidate for OLAP workloads
      - We have yet to benchmark it against our big graph
      But wait! GraphFrames for Spark has just been released by Databricks. What’s going on? http://graphframes.github.io/
  37. A turning point for graph analytics
      - A lot is happening these days:
        • ThinkAurelius’s acquisition by DataStax (will they continue to support HBase?)
        • Titan and TinkerPop cross-refactorings
        • Improvements in the computation backends (Giraph, Spark)
        • Better support for Hadoop 2 and YARN
        • Introduction of GraphFrames by Databricks
      - Graph analytics frameworks are becoming a commodity, and this is a good thing!
  38. Which framework to use?
      - A quick sketch of the typical usages of the frameworks:

      Framework    | Usages
      Titan OLTP   | Interactive querying, for a relatively small number of vertices
      Titan OLAP   | Batch computations, KPIs (but wait for Hadoop 2 support if you’re using HBase)
      Spark GraphX | Batch computations, KPIs
      GraphFrames  | ? (it’s too early)

      - Be sure to test them thoroughly in your environment before making a final decision. Be prepared to go deep into Hadoop and Java/Scala internals!
      - Or maybe wait a little bit until the situation gets clearer?
  39. Appendix
      - Appendix A: The ecosystem of graph-based solutions (Graph Databases)
      - Appendix B: The ecosystem of graph-based solutions (Graph Processing frameworks)
      - Appendix C: Graph traversal
      - Appendix D: Architecture with Spark GraphX
      - Appendix E: Architecture with Titan
  40. Appendix A: The ecosystem of graph-based solutions
      Graph Databases (OLTP)
      - Optimized for local graph exploration (traversal), with low latency
      - Optimized for handling multiple concurrent users
      - Data can be distributed across several machines
      - Queries themselves are not distributed: global graph analyses are inefficient
  41. Appendix B: The ecosystem of graph-based solutions
      Graph Processing Frameworks (OLAP)
      - Optimized for global graph processing (batch)
      - Queries and data are distributed across a cluster, and can handle very large graphs
      - Have a higher latency than OLTP solutions
      - Cannot handle a lot of concurrent users
  42. Appendix C: Graph traversal
      - Depth-first: hard to parallelize! Low latency
      - Breadth-first: high latency
      Diagram: two trees of vertices (Vtx) rooted at a source (Src); one of them is split into Jobs 1 to 4, spread over computation nodes
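The two traversal orders can be compared on a small tree in plain Python. The sketch is illustrative only: the tree and function names are hypothetical, and reading each breadth-first level as a unit of work that could run as a separate job is an assumption about the diagram.

```python
def depth_first(tree, root):
    """Visit order of an iterative depth-first traversal."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(tree.get(v, [])))
    return order


def breadth_first_levels(tree, root):
    """Vertices grouped by depth; each level only depends on the previous one."""
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        frontier = [child for v in frontier for child in tree.get(v, [])]
    return levels


tree = {"Src": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
dfs_order = depth_first(tree, "Src")
bfs_levels = breadth_first_levels(tree, "Src")
```

Depth-first follows one branch at a time, so each step depends on the previous one. Breadth-first handles a whole level at once, and the vertices of a level can be processed independently, at the cost of one synchronization per level.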
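  43. Appendix D: Architecture with GraphX
      Diagram, all within Spark:
      - Inputs: Edges, Vertices, Vertex States, Load curves
      - Edges and Vertices become an Edge RDD and a Vertex RDD; vertex states are added as interval properties (“Add intervals as properties”)
      - GraphX processors (Use case 1, Use case 2, Use case 3, …) run on the graph; their output is combined (“Combine”) into the final result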
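  44. Appendix E: Architecture with Titan
      Diagram, all within a JVM:
      - Inputs: Edges, Vertices, Vertex States, Load curves
      - Titan APIs: transactional Gremlin; vertex states are added as interval properties (“Add intervals as properties”)
      - HBase API: load curves are stored as time series (“Store load curves as time series”)
      - A Groovy scripts library implements the use cases (Use case 1, Use case 2, …) and combines the load curves (“Combine load curves”)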