Marcus Paradies

          Challenges in the Design of a Graph Database Benchmark
          FOSDEM'12 – Graph Processing DevRoom




© Prof. Dr.-Ing. Wolfgang Lehner |
> Outline


     Motivation
     Challenges
     Thoughts on Graph Data Generation
     Thoughts on Query Workload
     Summary and Outlook
     Discussion




   Marcus Paradies |                      FOSDEM 2012   |   1
> Motivation

  Graph databases are gaining momentum

  Enterprise corporations are getting interested

  How to compare the available graph database vendors?

  Main issue: Results from benchmarks are not comparable

  Lack of standardization in the data model and query language

  What are "typical" graph operations?




> Challenges

> Challenge #1: Application Domain

  Graph data is not homogeneous

  Graph data from different domains follows different patterns

  Examples:
      Social Network Analysis (SNA)
      Protein Interaction Analysis
      Recommendation Systems
      Supply Chain Management (Vehicle Routing, CRM)
      Fraud Detection in Financial Systems
      …

 Challenge: Find an application domain which represents a graph data pattern
                common in many different scenarios.

> Challenge #2: Graph Data Model




         What flavours of graph data models
                are commonly used?




> Challenge #2: Graph Data Model

  Directed Graph
  Undirected Graph
  Mixed Graph
  Multi Graph
  Hyper Graph
  (Plain) Property Graph
  (Structured Property Graph)

  Challenge: Find a graph data model suited for the majority of use cases
   from various domains.

> Challenge #3: Querying Graph Data




   Large variety in graph processing and manipulation languages
   Each graph database vendor implements its own query language/API
   Reason: No standardized graph query language available


  Challenge: Find a way to abstract from the zoo of available query languages.
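
One way to read this challenge is that the benchmark driver should be written against a small abstract interface, with one adapter per database. The following is a minimal sketch of that idea, not a proposal from the talk: the `GraphStore` interface, both toy backends, and the `two_hop` operation are hypothetical names introduced only for illustration.

```python
from typing import Iterable, Protocol


class GraphStore(Protocol):
    """Hypothetical minimal interface a benchmark driver could target."""

    def neighbors(self, node: int) -> Iterable[int]:
        ...


class AdjacencyListStore:
    """Toy backend: adjacency lists kept in a dict."""

    def __init__(self, edges):
        self._adj = {}
        for src, dst in edges:
            self._adj.setdefault(src, []).append(dst)

    def neighbors(self, node):
        return list(self._adj.get(node, []))


class EdgeTableStore:
    """Toy backend: a flat edge table that is scanned on every call."""

    def __init__(self, edges):
        self._edges = list(edges)

    def neighbors(self, node):
        return [dst for src, dst in self._edges if src == node]


def two_hop(store: GraphStore, start: int) -> set:
    """Benchmark operation written only against the abstract interface."""
    result = set()
    for first in store.neighbors(start):
        for second in store.neighbors(first):
            result.add(second)
    return result


EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]
backends = [AdjacencyListStore(EDGES), EdgeTableStore(EDGES)]
results = [two_hop(backend, 1) for backend in backends]
print(results)
```

Both backends answer the same abstract operation, so their results and timings can be compared even though their internal access paths differ. A real adapter would translate `neighbors` into the vendor's own query language or API.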

> Challenge #4: Defining the Workload

  The workload to be defined depends on the underlying
   query/manipulation language

  Should complex (algorithmic) operations be part of a database benchmark?

  Which algorithms to pick?
   Social Network Analysis → Find communities
   Supply Chain Management → Find maximal flow
   Web of Data → Find pattern matches

  How are concurrent users represented?

  What about transactionality?




> Thoughts on Graph Data Generation

> Graph Data Generation - Patterns


  Understanding graph patterns (characteristics) is crucial for a good graph
   data generator
  What are distinguishing characteristics of graphs?
  How can we identify graph patterns on large graphs?
  Three main patterns [1]:
     Power law distributed
     Small diameters
     Community Effects




> Pattern 1 – Power law distributed




  (Two figures omitted; source: [2])


  Most real-world graph data sets follow a power-law distribution (e.g. in their node degrees)
  Examples:
   Internet router graph
   Subsets of the WWW
   Citation Graphs
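
For a data generator, one well-known way to obtain a heavy-tailed degree distribution is preferential attachment, where new nodes connect to existing nodes with probability proportional to their degree. The slides do not prescribe a generator, so the sketch below is only an illustration under that assumption; the function names and the parameters `n`, `m`, and `seed` are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n, m, seed=0):
    """Return an undirected edge list with n nodes; each new node attaches
    to m distinct existing nodes chosen proportionally to their degree."""
    rng = random.Random(seed)
    edges = []
    # Start from a clique of m + 1 nodes so every node has degree >= m.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.append((u, v))
    # Each node appears in `endpoints` once per incident edge, so uniform
    # sampling from this list samples nodes proportionally to degree.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges


def degree_histogram(edges):
    """Map degree -> number of nodes having that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


edges = preferential_attachment(n=2000, m=2, seed=42)
hist = degree_histogram(edges)
max_degree = max(hist)
for d in sorted(hist)[:5]:
    print(d, hist[d])
print("max degree:", max_degree)
```

Plotting `hist` on log-log axes is the usual quick check for a roughly straight, downward-sloping line, which is the visual signature of a power law.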


> Pattern 2 – Small Diameters

   Effective Diameter (eccentricity): Minimum number of hops in which a
    fraction (e.g. 90%) of all connected pairs of nodes can reach each other
   Other measures exist as well, but are not applicable to disconnected graphs
   In most use cases, diameter is much smaller than the size of the graph
   Examples:
    97% eccentricity of around 16 for path lengths in the WWW
    Average path length around 6 for Epinions social network
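
The definition of the effective diameter above can be sketched directly with breadth-first search. This is a brute-force illustration for small graphs, assuming an unweighted graph stored as an adjacency dict; the function names are hypothetical, and large graphs would need sampling or approximation instead.

```python
from collections import Counter, deque


def hop_counts(adj):
    """Count ordered connected pairs (u, v), u != v, by shortest-path hops."""
    counts = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    counts[dist[v]] += 1
                    queue.append(v)
    return counts


def effective_diameter(adj, fraction=0.9):
    """Smallest hop count within which `fraction` of all connected
    pairs can reach each other. Unreachable pairs are ignored."""
    counts = hop_counts(adj)
    total = sum(counts.values())
    if total == 0:
        return 0
    needed = fraction * total
    reached = 0
    for hops in sorted(counts):
        reached += counts[hops]
        if reached >= needed:
            return hops
    return max(counts)


# A path 0-1-2-3-4-5 plus an isolated node 6 (a disconnected graph).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4], 6: []}
print(effective_diameter(adj))        # 90% of connected pairs
print(effective_diameter(adj, 1.0))   # all connected pairs
```

Because only connected pairs are counted, the measure stays well defined on the disconnected example, which is the property the slide highlights.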




                                                     source: [1]
> Pattern 3 – Community Effects


   Community: A set of nodes, where each node in the set is closer to all other
    nodes in the community than to nodes outside the community.
   Communities can be found in many real-world graphs, especially social
    networks and collaboration networks
   Clustering Coefficient C: A measure that quantifies the "clumpiness" of a
    graph
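
The slide does not say which variant of the clustering coefficient is meant. The sketch below assumes the common average local clustering coefficient for undirected graphs: for each node, the fraction of pairs of its neighbors that are themselves connected, averaged over all nodes (nodes with fewer than two neighbors count as 0). Function names are illustrative.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are directly connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, node) for node in adj) / len(adj)


# Triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(local_clustering(adj, "c"))
print(average_clustering(adj))
```

A generator that aims to reproduce community effects could compare this value between the generated graph and a real data set from the target domain.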




> Thoughts on Query Workload

> Query Workload - Operations

  Graph Manipulation Operations
     Add/Update/Remove Nodes from the Graph
     Add/Update/Remove Edges from the Graph
     Add/Update/Remove Edge attributes
     Add/Update/Remove Node attributes
  Graph Query Operations
   Retrieve a selection of nodes matching a given filter expression
   Retrieve the neighbors of a set of nodes (possibly with edge filter constraints)
  Graph Traversals
   Based on basic query operations
   Exploration of neighborhood from a given set of start nodes
   Terminated by the number of steps and/or edge/node filter constraints
  Graph Analytical Operations
   Aggregation operations such as sum, avg, min, max
   Aggregations on node-level and on edge-level
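
The four operation classes above can be made concrete on a toy in-memory property graph. This is a sketch for illustration only: the `PropertyGraph` class, its method names, and the sample data are assumptions, not an API from the talk or from any graph database.

```python
class PropertyGraph:
    """Toy directed property graph; edges are keyed by (src, dst, label)."""

    def __init__(self):
        self.nodes = {}   # node id -> attribute dict
        self.edges = {}   # (src, dst, label) -> attribute dict

    # Graph manipulation operations
    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = dict(attrs)

    def update_node(self, node_id, **attrs):
        self.nodes[node_id].update(attrs)

    def remove_node(self, node_id):
        del self.nodes[node_id]
        # Drop every edge that starts or ends at the removed node.
        self.edges = {key: attrs for key, attrs in self.edges.items()
                      if node_id not in key[:2]}

    def add_edge(self, src, dst, label, **attrs):
        self.edges[(src, dst, label)] = dict(attrs)

    # Graph query operations
    def select(self, predicate):
        return {n for n, attrs in self.nodes.items() if predicate(attrs)}

    def neighbors(self, sources, edge_filter=lambda label, attrs: True):
        return {dst for (src, dst, label), attrs in self.edges.items()
                if src in sources and edge_filter(label, attrs)}

    # Graph traversal, terminated by step count and/or edge filter
    def traverse(self, start, max_steps, edge_filter=lambda label, attrs: True):
        visited = set(start)
        frontier = set(start)
        for _ in range(max_steps):
            frontier = self.neighbors(frontier, edge_filter) - visited
            if not frontier:
                break
            visited |= frontier
        return visited

    # Graph analytical operation (node-level aggregation)
    def aggregate(self, node_ids, attr, func):
        return func(self.nodes[n][attr] for n in node_ids)


g = PropertyGraph()
for name, age in [("alice", 30), ("bob", 25), ("carol", 41), ("dave", 35)]:
    g.add_node(name, age=age)
g.add_edge("alice", "bob", "knows", since=2010)
g.add_edge("bob", "carol", "knows", since=2011)
g.add_edge("carol", "dave", "knows", since=2012)
g.add_edge("alice", "dave", "blocks")


def knows(label, attrs):
    return label == "knows"


older = g.select(lambda attrs: attrs["age"] > 28)
direct = g.neighbors({"alice"}, knows)
reach = g.traverse({"alice"}, max_steps=2, edge_filter=knows)
oldest = g.aggregate(reach, "age", max)
g.update_node("bob", age=26)
total_age = g.aggregate(reach, "age", sum)
g.remove_node("dave")
```

Note that the traversal is built purely from the basic `neighbors` query, which mirrors the slide's point that traversals are based on basic query operations.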



> Query Workload - Measures


  Closely related to benchmark capabilities

  Measures from relational benchmarks apply, such as
   Average query response time
   Transactions per second (throughput)

  Additional measures for graph traversals
   Traversals per second

  What about distributed scenarios?

  What about concurrent users?
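
A single-user measurement loop for these measures might look like the sketch below. It assumes that "traversals per second" means traversed edges per second, which the slides do not define; `run_benchmark`, `k_hop_edges`, and the ring graph are hypothetical and exist only to make the example self-contained. Concurrent users and distributed setups are left open here, as on the slide.

```python
import time


def k_hop_edges(adj, start, max_hops):
    """Breadth-first exploration up to max_hops; returns edges inspected."""
    seen = {start}
    frontier = [start]
    inspected = 0
    for _ in range(max_hops):
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                inspected += 1
                if v not in seen:
                    seen.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return inspected


def run_benchmark(operation, inputs):
    """Run `operation` once per input and derive simple timing measures.
    `operation` must return the number of edges it traversed."""
    latencies = []
    traversed = 0
    for item in inputs:
        started = time.perf_counter()
        traversed += operation(item)
        latencies.append(time.perf_counter() - started)
    total = sum(latencies)
    return {
        "queries": len(latencies),
        "edges_traversed": traversed,
        "avg_response_time_s": total / len(latencies),
        "queries_per_second": len(latencies) / total if total > 0 else float("inf"),
        "traversals_per_second": traversed / total if total > 0 else float("inf"),
    }


# Ring graph with 1000 nodes: every node links to its two ring neighbors.
N = 1000
ring = {i: [(i + 1) % N, (i - 1) % N] for i in range(N)}
report = run_benchmark(lambda s: k_hop_edges(ring, s, 3), range(100))
print(report)
```

Averages hide outliers, so a fuller harness would typically also report percentiles and discard warm-up runs.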




> Summary and Outlook

  Graph data distribution is highly important for a graph database benchmark

  Application domains do have very specific graph characteristics

  A graph database benchmark has to provide abstract and high-level graph
   operation descriptions



  Feel free to contact me if you want to contribute:

                            marcus.paradies@gmail.com




> Discussion

> Theses



  A benchmark based on social network data is nice, but might not be that
   representative of large enterprise applications

  Algorithms should NOT be part of a graph database benchmark

  Only support basic operations such as simple lookups and path traversals

  The underlying graph data model should be a simple property graph

  A graph database has to scale in terms of data size as well as number of
   concurrent users

  ....



> References



 [1] Graph Mining: Laws, Generators, and Algorithms (2006)

 [2] http://konect.uni-koblenz.de/

 [3] A Discussion on the Design of Graph Database Benchmarks (2010)




