Fishing Graphs in a Hadoop Data Lake
Max Neunhöffer
Munich, 6 April 2017
www.arangodb.com
What is a graph?
[Figure: an example graph with vertices A to F (labels p, q), next to a plot of sin(x) for x from 0 to 10 and y from -1 to 1]
Social networks (edges are friendship)
Dependency chains
Computer networks
Citations
Hierarchies
Indeed any relation
Sometimes directed, sometimes undirected.
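Usual approach: data in HDFS, use Spark/GraphFrames
# PySpark; assumes a SparkSession named spark and the graphframes package.
from graphframes import GraphFrame
# GraphFrames expects an "id" column in v and "src"/"dst" columns in e.
v = spark.read.option("header", True).csv("hdfs://...")
e = spark.read.option("header", True).csv("hdfs://...")
g = GraphFrame(v, e)
g.inDegrees.show()
g.outDegrees.groupBy("outDegree").count().sort("outDegree").show(1000)
g.vertices.groupBy("GYEAR").count().sort("GYEAR").show()
# Count two-step paths starting at vertex 6009536.
g.find("(a)-[e]->(b);(b)-[ee]->(c)").filter("a.id = 6009536").count()
results = g.pageRank(resetProbability=0.01, maxIter=3)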
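To make the degree and motif queries above concrete, here is a minimal plain-Python sketch of what they compute on a tiny edge list. The edge list and the helper name two_step_paths are hypothetical and only for illustration; this is not how GraphFrames works internally.

```python
from collections import Counter

# Hypothetical toy edge list; each pair is (src, dst).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

# Out-degree counts edges leaving a vertex, in-degree counts edges arriving.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)


def two_step_paths(edges, start):
    """Return all paths start -> mid -> end, the shape matched by the
    motif "(a)-[e]->(b);(b)-[ee]->(c)" for one fixed start vertex."""
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, []).append(dst)
    return [
        (start, mid, end)
        for mid in by_src.get(start, [])
        for end in by_src.get(mid, [])
    ]


paths = two_step_paths(edges, "a")
```

In Spark the same work is spread over the cluster as joins and aggregations on the edge DataFrame, which suits whole-graph analysis.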
Limitations/missed opportunities
Ad hoc queries
Often, one would like to run small ad hoc queries on graph data.
The goal is to bring latency down from minutes to seconds, or from seconds
to milliseconds. Usually, we want to run many such queries.
Examples:
friends of friends of one person
find all immediate dependencies of one item
find all direct and indirect citations of one article
find all descendants of one member of a hierarchy
IDEA: Use a Graph Database
Graph Databases
Graph Databases
Can store and persist graphs. However, the crucial ingredient of a graph
database is its ability to run graph queries.
Graph queries:
Find paths in graphs according to a pattern.
Find everything reachable from a vertex.
Find shortest paths between two given vertices.
⇒ Graph Traversals
Crucial: the number of steps is unknown a priori!
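A minimal sketch of "find everything reachable from a vertex", written as a breadth-first traversal in Python. The adjacency dictionary and the function name reachable are hypothetical. The loop runs until the queue is empty, so the number of steps depends on the data and is not fixed in advance.

```python
from collections import deque


def reachable(graph, start):
    """Return all vertices reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:  # runs as long as new vertices keep turning up
        vertex = queue.popleft()
        order.append(vertex)
        for neighbour in graph.get(vertex, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


# Hypothetical directed graph given as adjacency lists.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "F": ["A"]}
result = reachable(graph, "A")
```

A fixed number of joins cannot express this, because the depth at which the last vertex is found is only known once the traversal has finished.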
Graph Traversals
[Figure: a graph with vertices A, B, C, D, E, F, G, H, J; the animation visits them in the order A, B, C, E, D, J, F, G, H]
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and is at the same time a key/value store,
with a common query language for all three data models.
Important:
Is able to compete with specialised products on their turf.
Allows for polyglot persistence using a single database technology.
In a microservice architecture, there will be several different deployments.
Powerful query language
AQL
The built-in Arango Query Language (AQL) allows
complex, powerful and convenient queries,
with transaction semantics,
support for joins,
and graph queries.
AQL is independent of the driver used and
offers protection against injections by design.
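The slide does not say how the injection protection works. One relevant AQL feature is bind parameters: the query text and the user-supplied values travel separately. The sketch below only builds the JSON body for ArangoDB's HTTP cursor endpoint (POST /_api/cursor) and sends nothing; the function name and the hostile input string are hypothetical.

```python
import json

# The query text is fixed; @start is a bind parameter, not string splicing.
QUERY = 'FOR v IN 1..3 OUTBOUND @start GRAPH "G" RETURN v'


def build_cursor_body(start_vertex):
    """Build the JSON body for POST /_api/cursor with a bound start vertex."""
    return json.dumps({"query": QUERY, "bindVars": {"start": start_vertex}})


# A hostile-looking input stays an ordinary string value.
body = build_cursor_body('patents/6009503" REMOVE v IN patents //')
decoded = json.loads(body)
```

Because the value never becomes part of the query text, it cannot change the structure of the query, whichever driver sends the request.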
ArangoDB is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed applications can be deployed as easily as one installs a mobile
app on a phone.
Cluster resource management is automatic.
This leads to significantly better resource utilization.
Fault tolerance, self-healing and automatic failover are guaranteed.
ArangoDB runs on Apache Mesos and Mesosphere DC/OS clusters.
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deployment
Resource management (increased utilization)
Service discovery
This lets us plug things together!
Consequence: We can easily deploy multiple systems alongside each other.
Example: HDFS, Spark and ArangoDB
Import data into ArangoDB
hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/patents.csv
hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/citations.csv
dcos package install arangodb3
arangosh \
  --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
var g = require("@arangodb/general-graph");
var G = g._create("G",[g._relation("citations",["patents"],["patents"])]);
arangoimp --collection patents --file patents.csv --type csv \
  --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
arangoimp --collection citations --file citations.csv --type csv \
  --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
Run a graph traversal
This query finds patents cited by patents/6009503 (depth ≤ 3) recursively:
Recursive traversal, 500 results, 317 ms
FOR v IN 1..3 OUTBOUND "patents/6009503" GRAPH "G"
RETURN v
This one finds all patents that cite any of those cited by patents/6009503:
One step forward and one back, 35 results, 59 ms
FOR v IN 1..1 OUTBOUND "patents/6009503" GRAPH "G"
FOR w IN 1..1 INBOUND v._id GRAPH "G"
FILTER w._id != v._id
RETURN w
Run a graph traversal
This query finds all patents that cite patents/3541687 directly or in two steps:
Recursive traversal backwards, 22 results, 15 ms
FOR v IN 1..2 INBOUND "patents/3541687" GRAPH "G"
RETURN v._key
This one counts all patents that cite patents/3541687 recursively:
Deep recursion backwards, count 398, 311 ms
FOR v IN 1..10 INBOUND "patents/3541687" GRAPH "G"
COLLECT WITH COUNT INTO c
RETURN c
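The depth ranges and directions in these queries can be sketched in plain Python on a toy citation list. This is a simplification under stated assumptions: the edge list and the function traverse are hypothetical, and each vertex is reported once at its shortest distance, whereas an AQL traversal enumerates paths and may return a vertex more than once.

```python
from collections import defaultdict


def traverse(edges, start, min_depth, max_depth, direction="outbound"):
    """Return vertices whose distance from start lies in
    [min_depth, max_depth], following edges in the given direction."""
    neighbours = defaultdict(list)
    for src, dst in edges:
        if direction == "outbound":
            neighbours[src].append(dst)
        else:  # inbound: walk edges against their direction
            neighbours[dst].append(src)
    seen = {start}
    frontier = [start]
    result = []
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for vertex in frontier:
            for other in neighbours[vertex]:
                if other not in seen:
                    seen.add(other)
                    next_frontier.append(other)
        if depth >= min_depth:
            result.extend(next_frontier)
        frontier = next_frontier
    return result


# Hypothetical citations: (citing patent, cited patent).
citations = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p5", "p2")]
cited_by_p1 = traverse(citations, "p1", 1, 3, "outbound")
citing_p3 = traverse(citations, "p3", 1, 2, "inbound")
```

Here cited_by_p1 plays the role of a 1..3 OUTBOUND traversal and citing_p3 that of a 1..2 INBOUND traversal; counting the result corresponds to the COLLECT WITH COUNT step.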
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Graph database as primary data store
You can turn things around:
Keep and maintain the graph data in a graph database.
Regularly dump to HDFS and run larger analysis jobs there.
Or: Use ArangoDB’s Spark Connector:
https://github.com/arangodb/arangodb-spark-connector
Links
http://hadoop.apache.org/
http://spark.apache.org/
https://graphframes.github.io/
https://www.arangodb.com
https://github.com/arangodb/arangodb-spark-connector
https://docs.arangodb.com/cookbook/index.html
http://mesos.apache.org/
https://mesosphere.com/
https://github.com/dcos/demos

Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Fishing Graphs in a Hadoop Data Lake

  • 1. Fishing Graphs in a Hadoop Data Lake Max Neunhöffer Munich, 6 April 2017 www.arangodb.com
  • 5. What is a graph?
    [Figure: a drawing of a graph with vertices A to F, next to a plot of sin(x) for x from 0 to 10]
    Social networks (edges are friendship)
    Dependency chains
    Computer networks
    Citations
    Hierarchies
    Indeed any relation
    Sometimes directed, sometimes undirected.
  • 6. Usual approach: data in HDFS, use Spark/GraphFrames
      from graphframes import GraphFrame
      # `spark` is the SparkSession provided by the pyspark shell
      # GraphFrame expects an "id" column in v and "src"/"dst" columns in e
      v = spark.read.option("header", True).csv("hdfs://...")
      e = spark.read.option("header", True).csv("hdfs://...")
      g = GraphFrame(v, e)
      g.inDegrees.show()
      g.outDegrees.groupBy("outDegree").count().sort("outDegree").show(1000)
      g.vertices.groupBy("GYEAR").count().sort("GYEAR").show()
      g.find("(a)-[e]->(b);(b)-[ee]->(c)").filter("a.id = 6009536").count()
      results = g.pageRank(resetProbability=0.01, maxIter=3)
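For readers without a Spark cluster at hand, the degree statistics from the GraphFrames calls above can be sketched in plain Python. This is only an illustration of what in-degree, out-degree and the out-degree histogram mean; the edge list is made-up toy data, not the patent dataset.

```python
from collections import Counter

# Hypothetical toy edge list: (citing patent, cited patent)
edges = [("P2", "P1"), ("P3", "P1"), ("P4", "P2"), ("P4", "P1")]

# In-degree: how often each vertex appears as the target of an edge
in_degree = Counter(dst for _, dst in edges)

# Out-degree: how often each vertex appears as the source of an edge
out_degree = Counter(src for src, _ in edges)

# Histogram: how many vertices have a given out-degree
out_degree_histogram = Counter(out_degree.values())
```

Vertices with no incoming (or outgoing) edges simply do not appear in the respective counter.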
  • 11. Limitations/missed opportunities
    Ad hoc queries
    Often, one would like to perform smallish ad hoc queries on graph data.
    We want to bring down latency from minutes to seconds, or from seconds to milliseconds.
    Usually, we would like to run many of them.
    Examples:
      friends of friends of one person
      find all immediate dependencies of one item
      find all direct and indirect citations of one article
      find all descendants of one member of a hierarchy
    IDEA: Use a Graph Database
  • 16. Graph Databases
    Graph databases can store and persist graphs.
    However, the crucial ingredient of a graph database is its ability to do graph queries.
    Graph queries:
      Find paths in graphs according to a pattern.
      Find everything reachable from a vertex.
      Find shortest paths between two given vertices.
    ⇒ Graph Traversals
    Crucial: The number of steps is a priori unknown!
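To make the "number of steps a priori unknown" point concrete, here is a minimal sketch of two of the graph queries listed above, written as breadth-first searches over an in-memory adjacency dictionary. It is not how any particular graph database implements traversals; the function names and the toy graph are assumptions for illustration.

```python
from collections import deque

def reachable(graph, start):
    """Return every vertex other than start that can be reached from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbour in graph.get(vertex, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen

def shortest_path(graph, source, target):
    """Return one path with the fewest edges as a list of vertices, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if vertex == target:
            # Walk the parent links back to the source
            path = []
            while vertex is not None:
                path.append(vertex)
                vertex = parents[vertex]
            return path[::-1]
        for neighbour in graph.get(vertex, ()):
            if neighbour not in parents:
                parents[neighbour] = vertex
                queue.append(neighbour)
    return None

# Hypothetical toy graph: vertex -> list of vertices it points to
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "F": ["A"]}
```

Neither loop knows in advance how many hops it will take; it simply runs until the frontier is empty, which is what a fixed number of joins cannot express.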
  • 32. The Multi-Model Approach
    Multi-model database
    A multi-model database combines a document store with a graph database and is at the same time a key/value store, with a common query language for all three data models.
    Important: It is able to compete with specialised products on their turf.
    It allows for polyglot persistence using a single database technology.
    In a microservice architecture, there will be several different deployments.
  • 38. Powerful query language
    AQL
    The built-in Arango Query Language (AQL) allows complex, powerful and convenient queries, with transaction semantics, and supports joins and graph queries.
    AQL is independent of the driver used and offers protection against injections by design.
  • 44. ArangoDB is a Data Center Operating System App
    These days, computing clusters run Data Center Operating Systems.
    Idea: Distributed applications can be deployed as easily as one installs a mobile app on a phone.
    Cluster resource management is automatic.
    This leads to significantly better resource utilization.
    Fault tolerance, self-healing and automatic failover are guaranteed.
    ArangoDB runs on Apache Mesos and Mesosphere DC/OS clusters.
  • 48. Back to topic: DC/OS as infrastructure
    DC/OS is the perfect environment for our needs
    DC/OS manages for us:
      Software deployment
      Resource management (increased utilization)
      Service discovery
    It allows us to plug things together!
    Consequence: We can easily deploy multiple systems alongside each other.
    Example: HDFS, Spark and ArangoDB
  • 49. Import data into ArangoDB
      hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/patents.csv
      hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/citations.csv
      dcos package install arangodb3
      arangosh --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
        var g = require("@arangodb/general-graph");
        var G = g._create("G",[g._relation("citations",["patents"],["patents"])]);
      arangoimp --collection patents --file patents.csv --type csv \
        --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
      arangoimp --collection citations --file citations.csv --type csv \
        --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
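The slide does not show what citations.csv looks like. Edge documents in ArangoDB carry `_from` and `_to` attributes that reference vertices as `collection/key`, so a raw edge list may need a small conversion step before import. The sketch below shows one way to do that; the column names `CITING` and `CITED`, the helper names and the sample rows are assumptions, and it further assumes that the patent number is used as the `_key` of each document in `patents`.

```python
import csv
import io

def to_edge_rows(lines, collection="patents", from_col="CITING", to_col="CITED"):
    """Yield one dict per CSV row with _from/_to document handles."""
    for row in csv.DictReader(lines):
        yield {
            "_from": f"{collection}/{row[from_col]}",
            "_to": f"{collection}/{row[to_col]}",
        }

def write_edges(rows, out):
    """Write the converted rows as CSV with a _from,_to header line."""
    writer = csv.DictWriter(out, fieldnames=["_from", "_to"])
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical two-line sample standing in for the real citations file
sample = io.StringIO("CITING,CITED\n6009503,3541687\n6009536,3541687\n")
out = io.StringIO()
write_edges(to_edge_rows(sample), out)
converted = out.getvalue()
```

For a real file, the same two functions can be pointed at file objects opened with `newline=""` instead of the in-memory buffers.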
  • 51. Run a graph traversal
    This query finds patents cited by patents/6009503 (depth ≤ 3) recursively:
    Recursive traversal, 500 results, 317 ms
      FOR v IN 1..3 OUTBOUND "patents/6009503" GRAPH "G"
        RETURN v
    This one finds all patents that cite any of those cited by patents/6009503:
    One step forward and one back, 35 results, 59 ms
      FOR v IN 1..1 OUTBOUND "patents/6009503" GRAPH "G"
        FOR w IN 1..1 INBOUND v._id GRAPH "G"
          FILTER w._id != v._id
          RETURN w
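The "one step forward and one back" pattern can be modelled on a small in-memory citation map. This is a simplified sketch, not the AQL execution model: it returns a distinct set, whereas the nested AQL loops return one row per (v, w) combination. The function name and the toy data are assumptions.

```python
def co_citing(cites, start):
    """Return patents that cite at least one patent cited by start."""
    # Build the reverse index: cited patent -> set of patents citing it
    cited_by = {}
    for citing, cited_list in cites.items():
        for cited in cited_list:
            cited_by.setdefault(cited, set()).add(citing)

    result = set()
    for v in cites.get(start, ()):        # one step forward (OUTBOUND)
        for w in cited_by.get(v, ()):     # one step back (INBOUND)
            if w != v:                    # mirrors FILTER w._id != v._id
                result.add(w)
    return result

# Hypothetical toy data: patent -> patents it cites
toy_cites = {"P1": ["A", "B"], "P2": ["A"], "P3": ["B", "C"], "P4": ["C"]}
related = co_citing(toy_cites, "P1")
```

As in the query on the slide, the filter only excludes w equal to v, so the start patent itself is part of the result because it cites its own references.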
  • 53. Run a graph traversal
    This query finds all patents that cite patents/3541687 directly or in two steps:
    Recursive traversal backwards, 22 results, 15 ms
      FOR v IN 1..2 INBOUND "patents/3541687" GRAPH "G"
        RETURN v._key
    This one counts all patents that cite patents/3541687 recursively:
    Deep recursion backwards, count 398, 311 ms
      FOR v IN 1..10 INBOUND "patents/3541687" GRAPH "G"
        COLLECT WITH COUNT INTO c
        RETURN c
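The idea behind a depth range such as `1..2` or `1..10` combined with `INBOUND` can be sketched as a depth-bounded breadth-first search over reversed edges. This is a simplified model under stated assumptions: it collects distinct vertices by their shortest distance from the start, while an AQL traversal enumerates paths and may return a vertex more than once. The function name, parameters and toy edges are hypothetical.

```python
from collections import deque

def traverse(edges, start, min_depth, max_depth, direction="outbound"):
    """Return distinct vertices at BFS distance min_depth..max_depth from start."""
    adjacency = {}
    for source, target in edges:
        if direction == "inbound":
            # Follow edges against their direction
            source, target = target, source
        adjacency.setdefault(source, []).append(target)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if depth[vertex] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in adjacency.get(vertex, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[vertex] + 1
                queue.append(neighbour)
    return {v for v, d in depth.items() if min_depth <= d <= max_depth}

# Hypothetical toy edges: (citing patent, cited patent)
toy_edges = [("P2", "P1"), ("P3", "P1"), ("P4", "P2"), ("P5", "P4")]
direct_or_two_steps = traverse(toy_edges, "P1", 1, 2, direction="inbound")
all_citers_count = len(traverse(toy_edges, "P1", 1, 10, direction="inbound"))
```

Raising the upper bound only costs more work when deeper citers actually exist, which matches the pattern on the slide where the deeper query returns more results and takes longer.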
  • 57. Yet another approach
    If your graph data changes rapidly in a transactional fashion...
    Graph database as primary data store
    You can turn things around:
      Keep and maintain the graph data in a graph database.
      Regularly dump to HDFS and run larger analysis jobs there.
      Or: Use ArangoDB’s Spark Connector: https://github.com/arangodb/arangodb-spark-connector