SlideShare a Scribd company logo
1 of 58
Download to read offline
Fishing Graphs in a Hadoop Data
Lake
Max Neunhöffer
Munich, 6 April 2017
www.arangodb.com
What is a graph?
E
A
C
D
F
B
pq
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10
sin(x)
What is a graph?
E
A
C
D
F
B
pq
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10
sin(x)
Social networks (edges are friendship)
Dependency chains
Computer networks
Citations
Hierarchies
What is a graph?
E
A
C
D
F
B
pq
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10
sin(x)
Social networks (edges are friendship)
Dependency chains
Computer networks
Citations
Hierarchies
Indeed any relation
What is a graph?
E
A
C
D
F
B
pq
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10
sin(x)
Social networks (edges are friendship)
Dependency chains
Computer networks
Citations
Hierarchies
Indeed any relation
Sometimes directed, sometimes undirected.
Usual approach: data in HDFS, use Spark/GraphFrames
v = spark.read.option("header",true).csv("hdfs://...")
e = spark.read.option("header",true).csv("hdfs://...")
g = GraphFrame(v,e)
g.inDegrees.show()
g.outDegrees.groupBy("outDegree").count().sort("outDegree").show(1000)
g.vertices.groupBy("GYEAR").count().sort("GYEAR").show()
g.find("(a)-[e]->(b);(b)-[ee]->(c)").filter("a.id = 6009536").count()
results = g.pageRank(resetProbability=0.01, maxIter=3)
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Want to bring down latency from minutes to seconds or from seconds
to milliseconds.
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Want to bring down latency from minutes to seconds or from seconds
to milliseconds. Usually, we would like to run many of them.
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Want to bring down latency from minutes to seconds or from seconds
to milliseconds. Usually, we would like to run many of them.
Examples:
friends of friends of one person
find all immediate dependencies of one item
find all direct and indirect citations of one article
find all descendants of one member of a hierarchy
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Want to bring down latency from minutes to seconds or from seconds
to milliseconds. Usually, we would like to run many of them.
Examples:
friends of friends of one person
find all immediate dependencies of one item
find all direct and indirect citations of one article
find all descendants of one member of a hierarchy
IDEA: Use a Graph Database
Graph Databases
Graph Databases
Can store and persist graphs.
Graph Databases
Graph Databases
Can store and persist graphs. However, the crucial ingredient of a graph
database is their ability to do graph queries.
Graph Databases
Graph Databases
Can store and persist graphs. However, the crucial ingredient of a graph
database is their ability to do graph queries.
Graph queries:
Find paths in graphs according to a pattern.
Find everything reachable from a vertex.
Find shortest paths between two given vertices.
Graph Databases
Graph Databases
Can store and persist graphs. However, the crucial ingredient of a graph
database is their ability to do graph queries.
Graph queries:
Find paths in graphs according to a pattern.
Find everything reachable from a vertex.
Find shortest paths between two given vertices.
=⇒ Graph Traversals
Graph Databases
Graph Databases
Can store and persist graphs. However, the crucial ingredient of a graph
database is their ability to do graph queries.
Graph queries:
Find paths in graphs according to a pattern.
Find everything reachable from a vertex.
Find shortest paths between two given vertices.
=⇒ Graph Traversals Crucial: Number of steps a priori unknown!
Graph Traversals
A
B
C
D
J
E
H
F
G
A
Graph Traversals
A
B
C
D
J
E
H
F
G
AB
Graph Traversals
A
B
C
D
J
E
H
F
G
ABC
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCE
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCED
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJ
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJ
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJF
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJFG
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJFG
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJFGH
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and is at the same time a key/value store,
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and is at the same time a key/value store,
with a common query language for all three data models.
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and is at the same time a key/value store,
with a common query language for all three data models.
Important:
Is able to compete with specialised products on their turf.
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and is at the same time a key/value store,
with a common query language for all three data models.
Important:
Is able to compete with specialised products on their turf.
Allows for polyglot persistence using a single database technology.
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and is at the same time a key/value store,
with a common query language for all three data models.
Important:
Is able to compete with specialised products on their turf.
Allows for polyglot persistence using a single database technology.
In a microservice architecture, there will be several different deployments.
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
with transaction semantics,
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
with transaction semantics,
allowing to do joins,
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
with transaction semantics,
allowing to do joins,
and to do graph queries,
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
with transaction semantics,
allowing to do joins,
and to do graph queries,
AQL is independent of the driver used and
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
with transaction semantics,
allowing to do joins,
and to do graph queries,
AQL is independent of the driver used and
offers protection against injections by design.
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed applications can be deployed as easily as one installs a mobile
app on a phone.
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed applications can be deployed as easily as one installs a mobile
app on a phone.
Cluster resource management is automatic.
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed applications can be deployed as easily as one installs a mobile
app on a phone.
Cluster resource management is automatic.
This leads to significantly better resource utilization.
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed applications can be deployed as easily as one installs a mobile
app on a phone.
Cluster resource management is automatic.
This leads to significantly better resource utilization.
Fault tolerance, self-healing and automatic failover is guaranteed.
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed applications can be deployed as easily as one installs a mobile
app on a phone.
Cluster resource management is automatic.
This leads to significantly better resource utilization.
Fault tolerance, self-healing and automatic failover is guaranteed.
runs on Apache Mesos and Mesosphere DC/OS clusters.
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deployment
Resource management (increased utilization)
Service discovery
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deployment
Resource management (increased utilization)
Service discovery
Allows to plug things together!
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deployment
Resource management (increased utilization)
Service discovery
Allows to plug things together!
Consequence: We can easily deploy multiple systems alongside each other.
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deployment
Resource management (increased utilization)
Service discovery
Allows to plug things together!
Consequence: We can easily deploy multiple systems alongside each other.
Example: HDFS, Spark and ArangoDB
Import data into ArangoDB
hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/patents.csv
hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/citations.csv
dcos package install arangodb3
arangosh 
--server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
var g = require("@arangodb/general-graph");
var G = g._create("G",[g._relation("citations",["patents"],["patents"])]);
arangoimp --collection patents --file patents.csv --type csv 
--server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
arangoimp --collection citations --file citations.csv --type csv 
--server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
Run a graph traversal
This query finds patents cited by patents/6009503 (depth ≤ 3) recursively:
Recursive traversal, 500 results, 317 ms
FOR v IN 1..3 OUTBOUND "patents/6009503" GRAPH "G"
RETURN v
Run a graph traversal
This query finds patents cited by patents/6009503 (depth ≤ 3) recursively:
Recursive traversal, 500 results, 317 ms
FOR v IN 1..3 OUTBOUND "patents/6009503" GRAPH "G"
RETURN v
This one finds all patents that cite any of those cited by patents/6009503:
One step forward and one back, 35 results, 59 ms
FOR v IN 1..1 OUTBOUND "patents/6009503" GRAPH "G"
FOR w IN 1..1 INBOUND v._id GRAPH "G"
FILTER w._id != v._id
RETURN w
Run a graph traversal
This query finds all patents that cite patents/3541687 directly or in two steps:
Recursive traversal backwards, 22 results, 15 ms
FOR v IN 1..2 INBOUND "patents/3541687" GRAPH "G"
RETURN v._key
Run a graph traversal
This query finds all patents that cite patents/3541687 directly or in two steps:
Recursive traversal backwards, 22 results, 15 ms
FOR v IN 1..2 INBOUND "patents/3541687" GRAPH "G"
RETURN v._key
This one counts all patents that cite patents/3541687 recursively:
Deep recursion backwards, count 398, 311 ms
FOR v IN 1..10 INBOUND "patents/3541687" GRAPH "G"
COLLECT WITH COUNT INTO c
RETURN c
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Graph database as primary data store
You can turn things around:
Keep and maintain the graph data in a graph database.
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Graph database as primary data store
You can turn things around:
Keep and maintain the graph data in a graph database.
Regularly dump to HDFS and run larger analysis jobs there.
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Graph database as primary data store
You can turn things around:
Keep and maintain the graph data in a graph database.
Regularly dump to HDFS and run larger analysis jobs there.
Or: Use ArangoDB’s Spark Connector:
https://github.com/arangodb/arangodb-spark-connector
Links
http://hadoop.apache.org/
http://spark.apache.org/
https://graphframes.github.io/
https://www.arangodb.com
https://github.com/arangodb/arangodb-spark-connector
https://docs.arangodb.com/cookbook/index.html
http://mesos.apache.org/
https://mesosphere.com/
https://github.com/dcos/demos

More Related Content

What's hot

Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamDataWorks Summit
 
Cloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World ConsiderationsCloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World ConsiderationsDataWorks Summit/Hadoop Summit
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDataWorks Summit
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit/Hadoop Summit
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Databricks
 
HAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged DataHAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged DataDataWorks Summit
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jDataWorks Summit
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...DataWorks Summit
 
Big Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsBig Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSDataWorks Summit
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsDataWorks Summit/Hadoop Summit
 

What's hot (20)

Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
 
Deep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profitDeep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profit
 
Cloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World ConsiderationsCloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World Considerations
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
HAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged DataHAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged Data
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4j
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
Big Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsBig Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the Experts
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the experts
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Spark + HBase
Spark + HBase Spark + HBase
Spark + HBase
 

Similar to Fishing Graphs in a Hadoop Data Lake

Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Big Data Spain
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeArangoDB Database
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBArangoDB Database
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?Samet KILICTAS
 
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentationrenjan131
 
GraphTech Ecosystem - part 1: Graph Databases
GraphTech Ecosystem - part 1: Graph DatabasesGraphTech Ecosystem - part 1: Graph Databases
GraphTech Ecosystem - part 1: Graph DatabasesLinkurious
 
aRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RaRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RGraphRM
 
The Recent Pronouncement Of The World Wide Web (Www) Had
The Recent Pronouncement Of The World Wide Web (Www) HadThe Recent Pronouncement Of The World Wide Web (Www) Had
The Recent Pronouncement Of The World Wide Web (Www) HadDeborah Gastineau
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling TechniqueCarmen Sanborn
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital.AI
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneDoug Needham
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and HadoopJosh Patterson
 
Is multi-model the future of NoSQL?
Is multi-model the future of NoSQL?Is multi-model the future of NoSQL?
Is multi-model the future of NoSQL?Max Neunhöffer
 
Processing large-scale graphs with Google Pregel
Processing large-scale graphs with Google PregelProcessing large-scale graphs with Google Pregel
Processing large-scale graphs with Google PregelMax Neunhöffer
 
Ssas dmx ile kurum içi verilerin i̇şlenmesi
Ssas dmx ile kurum içi verilerin i̇şlenmesiSsas dmx ile kurum içi verilerin i̇şlenmesi
Ssas dmx ile kurum içi verilerin i̇şlenmesiKoray Kocabas
 

Similar to Fishing Graphs in a Hadoop Data Lake (20)

Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDB
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
Mr bi
Mr biMr bi
Mr bi
 
Oslo bekk2014
Oslo bekk2014Oslo bekk2014
Oslo bekk2014
 
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentation
 
GraphTech Ecosystem - part 1: Graph Databases
GraphTech Ecosystem - part 1: Graph DatabasesGraphTech Ecosystem - part 1: Graph Databases
GraphTech Ecosystem - part 1: Graph Databases
 
aRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RaRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con R
 
DataHub
DataHubDataHub
DataHub
 
Mr bi amrp
Mr bi amrpMr bi amrp
Mr bi amrp
 
The Recent Pronouncement Of The World Wide Web (Www) Had
The Recent Pronouncement Of The World Wide Web (Www) HadThe Recent Pronouncement Of The World Wide Web (Www) Had
The Recent Pronouncement Of The World Wide Web (Www) Had
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling Technique
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Is multi-model the future of NoSQL?
Is multi-model the future of NoSQL?Is multi-model the future of NoSQL?
Is multi-model the future of NoSQL?
 
Processing large-scale graphs with Google Pregel
Processing large-scale graphs with Google PregelProcessing large-scale graphs with Google Pregel
Processing large-scale graphs with Google Pregel
 
Ssas dmx ile kurum içi verilerin i̇şlenmesi
Ssas dmx ile kurum içi verilerin i̇şlenmesiSsas dmx ile kurum içi verilerin i̇şlenmesi
Ssas dmx ile kurum içi verilerin i̇şlenmesi
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Recently uploaded

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Recently uploaded (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Fishing Graphs in a Hadoop Data Lake

  • 1. Fishing Graphs in a Hadoop Data Lake Max Neunhöffer Munich, 6 April 2017 www.arangodb.com
  • 2. What is a graph? E A C D F B pq -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10 sin(x)
  • 3. What is a graph? E A C D F B pq -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10 sin(x) Social networks (edges are friendship) Dependency chains Computer networks Citations Hierarchies
  • 4. What is a graph? E A C D F B pq -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10 sin(x) Social networks (edges are friendship) Dependency chains Computer networks Citations Hierarchies Indeed any relation
  • 5. What is a graph? E A C D F B pq -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10 sin(x) Social networks (edges are friendship) Dependency chains Computer networks Citations Hierarchies Indeed any relation Sometimes directed, sometimes undirected.
  • 6. Usual approach: data in HDFS, use Spark/GraphFrames v = spark.read.option("header",true).csv("hdfs://...") e = spark.read.option("header",true).csv("hdfs://...") g = GraphFrame(v,e) g.inDegrees.show() g.outDegrees.groupBy("outDegree").count().sort("outDegree").show(1000) g.vertices.groupBy("GYEAR").count().sort("GYEAR").show() g.find("(a)-[e]->(b);(b)-[ee]->(c)").filter("a.id = 6009536").count() results = g.pageRank(resetProbability=0.01, maxIter=3)
  • 7. Limitations/missed opportunities Ad hoc queries Often, one would like to perform smallish ad hoc queries on graph data.
  • 8. Limitations/missed opportunities Ad hoc queries Often, one would like to perform smallish ad hoc queries on graph data. Want to bring down latency from minutes to seconds or from seconds to milliseconds.
  • 9. Limitations/missed opportunities Ad hoc queries Often, one would like to perform smallish ad hoc queries on graph data. Want to bring down latency from minutes to seconds or from seconds to milliseconds. Usually, we would like to run many of them.
  • 10. Limitations/missed opportunities Ad hoc queries Often, one would like to perform smallish ad hoc queries on graph data. Want to bring down latency from minutes to seconds or from seconds to milliseconds. Usually, we would like to run many of them. Examples: friends of friends of one person find all immediate dependencies of one item find all direct and indirect citations of one article find all descendants of one member of a hierarchy
  • 11. Limitations/missed opportunities Ad hoc queries Often, one would like to perform smallish ad hoc queries on graph data. Want to bring down latency from minutes to seconds or from seconds to milliseconds. Usually, we would like to run many of them. Examples: friends of friends of one person find all immediate dependencies of one item find all direct and indirect citations of one article find all descendants of one member of a hierarchy IDEA: Use a Graph Database
  • 12. Graph Databases Graph Databases Can store and persist graphs.
  • 13. Graph Databases Graph Databases Can store and persist graphs. However, the crucial ingredient of a graph database is their ability to do graph queries.
  • 14. Graph Databases Graph Databases Can store and persist graphs. However, the crucial ingredient of a graph database is their ability to do graph queries. Graph queries: Find paths in graphs according to a pattern. Find everything reachable from a vertex. Find shortest paths between two given vertices.
  • 15. Graph Databases Graph Databases Can store and persist graphs. However, the crucial ingredient of a graph database is their ability to do graph queries. Graph queries: Find paths in graphs according to a pattern. Find everything reachable from a vertex. Find shortest paths between two given vertices. =⇒ Graph Traversals
  • 16. Graph Databases Graph Databases Can store and persist graphs. However, the crucial ingredient of a graph database is their ability to do graph queries. Graph queries: Find paths in graphs according to a pattern. Find everything reachable from a vertex. Find shortest paths between two given vertices. =⇒ Graph Traversals Crucial: Number of steps a priori unknown!
  • 28. The Multi-Model Approach Multi-model database A multi-model database combines a document store with a graph database and is at the same time a key/value store,
  • 29. The Multi-Model Approach Multi-model database A multi-model database combines a document store with a graph database and is at the same time a key/value store, with a common query language for all three data models.
  • 30. The Multi-Model Approach Multi-model database A multi-model database combines a document store with a graph database and is at the same time a key/value store, with a common query language for all three data models. Important: Is able to compete with specialised products on their turf.
  • 31. The Multi-Model Approach Multi-model database A multi-model database combines a document store with a graph database and is at the same time a key/value store, with a common query language for all three data models. Important: Is able to compete with specialised products on their turf. Allows for polyglot persistence using a single database technology.
  • 32. The Multi-Model Approach Multi-model database A multi-model database combines a document store with a graph database and is at the same time a key/value store, with a common query language for all three data models. Important: Is able to compete with specialised products on their turf. Allows for polyglot persistence using a single database technology. In a microservice architecture, there will be several different deployments.
  • 33. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries,
  • 34. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries, with transaction semantics,
  • 35. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries, with transaction semantics, allowing to do joins,
  • 36. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries, with transaction semantics, allowing to do joins, and to do graph queries,
  • 37. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries, with transaction semantics, allowing to do joins, and to do graph queries, AQL is independent of the driver used and
  • 38. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries, with transaction semantics, allowing to do joins, and to do graph queries, AQL is independent of the driver used and offers protection against injections by design.
  • 39. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems.
  • 40. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems. Idea Distributed applications can be deployed as easily as one installs a mobile app on a phone.
  • 41. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems. Idea Distributed applications can be deployed as easily as one installs a mobile app on a phone. Cluster resource management is automatic.
  • 42. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems. Idea Distributed applications can be deployed as easily as one installs a mobile app on a phone. Cluster resource management is automatic. This leads to significantly better resource utilization.
  • 43. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems. Idea Distributed applications can be deployed as easily as one installs a mobile app on a phone. Cluster resource management is automatic. This leads to significantly better resource utilization. Fault tolerance, self-healing and automatic failover is guaranteed.
  • 44. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems. Idea Distributed applications can be deployed as easily as one installs a mobile app on a phone. Cluster resource management is automatic. This leads to significantly better resource utilization. Fault tolerance, self-healing and automatic failover is guaranteed. runs on Apache Mesos and Mesosphere DC/OS clusters.
  • 45. Back to topic: DC/OS as infrastructure DC/OS is the perfect environment for our needs DC/OS manages for us: Software deployment Resource management (increased utilization) Service discovery
  • 46. Back to topic: DC/OS as infrastructure DC/OS is the perfect environment for our needs DC/OS manages for us: Software deployment Resource management (increased utilization) Service discovery Allows to plug things together!
  • 47. Back to topic: DC/OS as infrastructure DC/OS is the perfect environment for our needs DC/OS manages for us: Software deployment Resource management (increased utilization) Service discovery Allows to plug things together! Consequence: We can easily deploy multiple systems alongside each other.
  • 48. Back to topic: DC/OS as infrastructure DC/OS is the perfect environment for our needs DC/OS manages for us: Software deployment Resource management (increased utilization) Service discovery Allows to plug things together! Consequence: We can easily deploy multiple systems alongside each other. Example: HDFS, Spark and ArangoDB
  • 49. Import data into ArangoDB hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/patents.csv hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/citations.csv dcos package install arangodb3 arangosh --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos var g = require("@arangodb/general-graph"); var G = g._create("G",[g._relation("citations",["patents"],["patents"])]); arangoimp --collection patents --file patents.csv --type csv --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos arangoimp --collection citations --file citations.csv --type csv --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
  • 50. Run a graph traversal This query finds patents cited by patents/6009503 (depth ≤ 3) recursively: Recursive traversal, 500 results, 317 ms FOR v IN 1..3 OUTBOUND "patents/6009503" GRAPH "G" RETURN v
  • 51. Run a graph traversal This query finds patents cited by patents/6009503 (depth ≤ 3) recursively: Recursive traversal, 500 results, 317 ms FOR v IN 1..3 OUTBOUND "patents/6009503" GRAPH "G" RETURN v This one finds all patents that cite any of those cited by patents/6009503: One step forward and one back, 35 results, 59 ms FOR v IN 1..1 OUTBOUND "patents/6009503" GRAPH "G" FOR w IN 1..1 INBOUND v._id GRAPH "G" FILTER w._id != v._id RETURN w
  • 52. Run a graph traversal This query finds all patents that cite patents/3541687 directly or in two steps: Recursive traversal backwards, 22 results, 15 ms FOR v IN 1..2 INBOUND "patents/3541687" GRAPH "G" RETURN v._key
  • 53. Run a graph traversal This query finds all patents that cite patents/3541687 directly or in two steps: Recursive traversal backwards, 22 results, 15 ms FOR v IN 1..2 INBOUND "patents/3541687" GRAPH "G" RETURN v._key This one counts all patents that cite patents/3541687 recursively: Deep recursion backwards, count 398, 311 ms FOR v IN 1..10 INBOUND "patents/3541687" GRAPH "G" COLLECT WITH COUNT INTO c RETURN c
  • 54. Yet another approach If your graph data changes rapidly in a transactional fashion...
  • 55. Yet another approach If your graph data changes rapidly in a transactional fashion... Graph database as primary data store You can turn things around: Keep and maintain the graph data in a graph database.
  • 56. Yet another approach If your graph data changes rapidly in a transactional fashion... Graph database as primary data store You can turn things around: Keep and maintain the graph data in a graph database. Regularly dump to HDFS and run larger analysis jobs there.
  • 57. Yet another approach If your graph data changes rapidly in a transactional fashion... Graph database as primary data store You can turn things around: Keep and maintain the graph data in a graph database. Regularly dump to HDFS and run larger analysis jobs there. Or: Use ArangoDB’s Spark Connector: https://github.com/arangodb/arangodb-spark-connector