Talk language: Italian.
Description:
In this talk we discuss how to integrate and use ArangoDB, a multi-model database with native graph support, with R. We then present aRangodb, the package we developed to interface with the database more simply and intuitively. During the talk we show how the package can be used in data science through some concrete case studies.
Speaker:
Gabriele Galatolo - Data Scientist - Kode srl
7. Overview
● Relational Model vs Graph Model
● Data Science with R
● ArangoDB, a multi-model database
● aRangodb, an R package to interact with ArangoDB
● Usage of aRangodb in a real context: Tweetmood
11. Relational Model vs Graph Model
Why we 💗 the Relational Model:
○ The most popular way to structure and store data
○ Maps reality with a well-known paradigm (Entity-Relationship)
○ Supported by mature applications in almost every field
■ DBMS ↔ SQL
( E.F. Codd (1970), A Relational Model of Data for Large Shared Data Banks @ https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf )
12. Relational Model vs Graph Model
Drawbacks of the Relational Model:
○ DBMSs are complex tools
■ They require deep knowledge to be used well
○ In many contexts only a small subset of features is used
■ Triggers? Stored Procedures? Functions? Full-text Search?
○ Used to model everything without critical thinking
■ A database cannot be denied to anyone
■ Used to model graphs too
13. Relational Model vs Graph Model
Sensor
Id: string batteryDuration: int productionDate: date
s001 30 2019-04-02
s002 21 2019-04-02
14. Relational Model vs Graph Model
Sensor (next version)
Id: string | batteryDuration: int | productionDate: date | xy: string | type: string | desc: string
s001 | 30 | 2019-04-02 | lat=30.11, long=12.49 | accelerometer | prod=11123
s002 | 21 | 2019-04-02 | (30,134; 14.333) | temperature | max temp, 50°
( source https://www.red-gate.com/simple-talk/sql/database-administration/ten-common-database-design-mistakes/ )
15. Relational Model vs Graph Model
“ People (myself included) do a lot
of really stupid things, at times, in
the name of ‘getting it done.’ ”
( source https://www.red-gate.com/simple-talk/sql/database-administration/ten-common-database-design-mistakes/ )
[Louis Davidson]
16. Relational Model vs Graph Model
Will graph databases/graph-based modelling save the
world?
17. Relational Model vs Graph Model
NO
18. Relational Model vs Graph Model
Anyway, we 💗💗 graph-based modelling.
Many problems can be modelled better with graphs, for several reasons:
● relationships are everywhere
● it is a more intuitive way of modelling than the relational one
● less structure means more flexibility and sometimes better performance
○ with the OBVIOUS flip side
22. Data Science with R
Why we use R for data science projects:
● Data Science is closely related to statistics…
● … and R is a statistics-focused programming language
○ Multi-paradigm
○ Interpreted
● Well supported and open source
○ programmers, statisticians, data scientists and more
○ lots of packages available for almost everything
23.
● assignment to the right
● indexing from 1
● round to even
● identifiers with periods
( more @ https://ironholds.org/projects/rbitrary/#why-doesnt-round-work-like-you-think-it-should )
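The four quirks listed on this slide can be reproduced in a few lines of base R. This is only an illustrative sketch; the variable names are made up for the example.

```r
# Rightward assignment: the value on the left is bound to the name on the right.
9 -> a_number

# Vectors are indexed from 1, not from 0.
letters_vec <- c("a", "b", "c")
first <- letters_vec[1]

# round() rounds exactly representable halves to the nearest even number.
rounded <- round(c(0.5, 1.5, 2.5))

# A period is an ordinary character inside an identifier.
my.value <- 42

print(a_number)
print(first)
print(rounded)
print(my.value)
```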
25. R in a nutshell
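ASSIGNMENT
aNumeric <- 9
aString <- "hello world"
aBoolean <- TRUE
aVector <- c(1, 2, 3, 4)
DATA FRAME
# start from an empty data frame with two numeric columns, then add rows
df <- data.frame(A = numeric(0), B = numeric(0))
df[1, ] <- c(1, 2)
df[2, ] <- c(3, 4)
print(df[df$B == 4, ])  # rows where B equals 4
#   A B
# 2 3 4
LIST
l <- list(name = "anna", age = 31)
print(l$name)
print(l[["age"]])
# [1] "anna"
# [1] 31
LOOP
sqr <- function(x) x * x
res <- apply(df, 2, sqr)  # apply sqr to each column
print(res)
# column A becomes 1, 9; column B becomes 4, 16
R6 CLASSES
Cl <- R6::R6Class("Cl",
  public = list(
    f = function() NULL  # ... further public methods
  ),
  private = list(a = NULL)  # ... further private fields
)
PIPE
library(dplyr)  # provides %>%, filter() and select()
res <- df %>%
  filter(A == 3) %>%
  select(B)
print(res)
#   B
# 1 4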
27. ArangoDB, a multi-model database
“ArangoDB is a multi-model, open-source database with
flexible data models for documents, graphs, and key-values.
Build high performance applications using a convenient
SQL-like query language or JavaScript extensions.”
( more @ https://github.com/arangodb/arangodb )
28. ArangoDB, a multi-model database
Main concepts and internal data organization
COLLECTIONS
{ _key: a, k1: v1, k2: v2 }
{ _key: b, k3: v3, k4: v4 }
GRAPHS
[figure]
29. ArangoDB, a multi-model database
Main concepts and internal data organization
COLLECTIONS
{ _key: a, k1: v1, k2: v2 }
{ _key: b, k3: v3, k4: v4 }
EDGES COLLECTIONS
{ _key: ..., _from: a, _to: b, ke: ve }
{ _key: ..., _from: a, _to: b }
{ _key: ..., _from: a, _to: b }
GRAPHS
[figure]
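To make the document/edge layout concrete, here is a minimal base R sketch that mirrors the slide: vertex documents keyed by `_key`, edge documents with a source and a target, and a one-step outbound lookup. The field values are the placeholders from the slide, and `outbound` is a hypothetical helper, not part of any package. In a real ArangoDB instance `_from` and `_to` hold full document handles of the form `collection/key`; bare keys are used here as on the slide.

```r
# Vertex documents, keyed by their _key (placeholder attributes from the slide).
vertices <- list(
  a = list(k1 = "v1", k2 = "v2"),
  b = list(k3 = "v3", k4 = "v4")
)

# Edge documents: each row links a source vertex to a target vertex.
edges <- data.frame(
  from = c("a", "a", "a"),
  to   = c("b", "b", "b"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keys of the vertices reachable in one outbound step.
outbound <- function(key) unique(edges$to[edges$from == key])

print(outbound("a"))
print(vertices[outbound("a")])
```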
30. ArangoDB, alternatives?
OrientDB
● Multi-Model Database
○ Graph
○ Document
○ Key-Value
● First release: 2007
● Written in:
○ Java
● Server Side Scripts:
○ Java, JavaScript
● DQL:
○ SQL-Like
ArangoDB
● Multi-Model Database
○ Graph
○ Document
○ Key-Value
● First release: 2012
● Written in:
○ C, C++, JavaScript
● Server Side Scripts:
○ JavaScript
● DQL:
○ AQL
Neo4j
● Graph Database
● First release: 2010
● Written in:
○ Java, Scala
● Server Side Scripts:
○ Java, Cypher
● DQL:
○ Cypher
( more @ https://db-engines.com/en/system/ArangoDB%3BNeo4j%3BOrientDB )
34. aRangodb, an R package for ArangoDB
An R package to interact with ArangoDB instances:
● collections management
○ CRUD of collections
● graphs management
○ CRUD of graphs
○ Traversal
● graph visualization
● (only for the brave) transform an AQL query into an R function
42. aRangodb, graphs management
Adjacency tensor of a graph:
[figure: three-dimensional tensor with axes nodes × nodes × edges]
( X. Ouvrard, J.M. Le Goff, S. Marchand-Maillet, Adjacency and Tensor Representation in General Hypergraphs Part 1: e-adjacency Tensor Uniformisation Using Homogeneous Polynomials @ https://arxiv.org/pdf/1712.08189.pdf )
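One way to picture such a tensor is as a three-dimensional R array with one slice per kind of edge. The sketch below assumes that the third axis enumerates edge types; the node names and edge kinds are made up for the example, and this is not necessarily how aRangodb represents graphs internally.

```r
nodes <- c("a", "b", "c")

# Toy edge list: who points to whom, and through which kind of edge.
edges <- data.frame(
  from = c("a", "a", "b"),
  to   = c("b", "c", "c"),
  kind = c("follows", "mentions", "follows"),
  stringsAsFactors = FALSE
)
kinds <- unique(edges$kind)

# nodes x nodes x edge-kinds array, initially without any connection.
tensor <- array(
  0,
  dim = c(length(nodes), length(nodes), length(kinds)),
  dimnames = list(from = nodes, to = nodes, kind = kinds)
)

# A three-column index matrix sets one cell per edge.
idx <- cbind(
  match(edges$from, nodes),
  match(edges$to, nodes),
  match(edges$kind, kinds)
)
tensor[idx] <- 1

# Summing over the third axis gives the ordinary adjacency matrix.
adjacency <- apply(tensor, c(1, 2), sum)
print(adjacency)
```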
47. aRangodb, (bonus) AQL and R functions
Compile an AQL query into an R function where binding
parameters become function parameters:
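The idea can be sketched in base R as follows. This is not the aRangodb implementation: `aql_to_function` is a hypothetical helper written for this example, the `users` collection and its fields are invented, and the generated function only returns the query with its bound values instead of contacting a server.

```r
# Hypothetical helper: turn an AQL string into an R function whose
# arguments are the query's bind parameters (@name).
aql_to_function <- function(query) {
  # Find @name tokens; @@name (collection binds) are skipped in this sketch.
  pattern <- "(?<!@)@[A-Za-z_][A-Za-z0-9_]*"
  found <- regmatches(query, gregexpr(pattern, query, perl = TRUE))[[1]]
  params <- unique(sub("^@", "", found))

  # Build a function with one argument per bind parameter.
  f <- eval(parse(text = sprintf("function(%s) NULL",
                                 paste(params, collapse = ", "))))
  # Its body collects the argument values into a bindVars list.
  body(f) <- quote(list(query = query, bindVars = mget(params)))
  f
}

find_users <- aql_to_function(
  "FOR u IN users FILTER u.age >= @min_age AND u.city == @city RETURN u"
)
request <- find_users(min_age = 18, city = "Pisa")
print(names(formals(find_users)))
print(request$bindVars)
```

A real driver would send `request$query` and `request$bindVars` to the server; keeping that step out makes the sketch runnable without a database.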
48. aRangodb, current status
● 0.0.1-beta version available
○ first stable release will be available by the end of
June (full test coverage, major bug fixes)
● https://gitlab.com/krpack/arango-driver
● Feel free to...
○ … share your suggestions
○ … give us your feedback
49. aRangodb, current status
● Don’t feel free to
○ ...open issues (I’m joking 😬)
55. TweetMood, secondary information
We added a secondary-source collection and two
additional relations derived from user and tweet data:
● Hashtags
○ “Hashtags” (contained_in)→ “Tweets”
● Mentions
○ “Users” (mentioned_in)→ “Tweets”
56. TweetMood, association rules on hashtags
Tweets may contain a list of hashtags:
(Tutorial on association rules @ https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html )
57. TweetMood, association rules on hashtags
Are there hidden patterns connecting tweet hashtags?
{#facebookdown} => {#instagramdown, #whatsappdown}
58. TweetMood, association rules (quick note)
(Tutorial on association rules @ https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html )
“Association rules analysis is a technique to uncover how items are
associated to each other. For example: if there is a pair of items, X and
Y, that are frequently bought together:
● both X and Y can be placed on the same shelf;
● advertisements on X could be targeted at buyers who purchase Y;
● ...”
[figure: how transactions must be represented]
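As a worked example, the support and confidence of the rule {#facebookdown} => {#instagramdown, #whatsappdown} can be computed by hand in base R. The five tweets below are invented toy data, and the one-row-per-tweet logical matrix is just one common way to encode transactions; a dedicated package would normally do this work.

```r
# Toy data: the hashtags found in five made-up tweets.
tweets <- list(
  c("#facebookdown", "#instagramdown", "#whatsappdown"),
  c("#facebookdown", "#instagramdown", "#whatsappdown"),
  c("#facebookdown", "#whatsappdown"),
  c("#instagramdown"),
  c("#rstats")
)

# Transactions as a logical matrix: one row per tweet, one column per hashtag.
hashtags <- sort(unique(unlist(tweets)))
transactions <- t(sapply(tweets, function(tags) hashtags %in% tags))
colnames(transactions) <- hashtags

# Support: share of tweets that contain every hashtag in the item set.
support <- function(items) {
  mean(rowSums(transactions[, items, drop = FALSE]) == length(items))
}

lhs <- "#facebookdown"
rhs <- c("#instagramdown", "#whatsappdown")

rule_support <- support(c(lhs, rhs))             # 2 of 5 tweets
rule_confidence <- rule_support / support(lhs)   # 2 of the 3 tweets with lhs

print(rule_support)
print(rule_confidence)
```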
61. TweetMood, association rules on hashtags
this may be the feeling when
producing elegant packages that
work efficiently...
62. Before closing...
What does the future hold for us?
● Add support for bulk insert of data from CSV/JSON
● Native support for well-known graph algorithms
● Improve graph representation
○ Introduction of the Laplacian matrix
○ Improve internal representation
● Presentation of aRangodb package @ useR!2019
○ See you there!
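For reference, the Laplacian matrix mentioned in the roadmap is the degree matrix minus the adjacency matrix, L = D - A. A minimal base R sketch on a made-up undirected graph with three nodes:

```r
node_names <- c("a", "b", "c")

# Adjacency matrix of a toy undirected graph: a-b and a-c are connected.
adjacency <- matrix(
  c(0, 1, 1,
    1, 0, 0,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(node_names, node_names)
)

# Degree matrix: node degrees on the diagonal.
degree <- diag(rowSums(adjacency))

# Graph Laplacian: L = D - A.
laplacian <- degree - adjacency
print(laplacian)
```

Each row of the result sums to zero, which is a quick sanity check when building the matrix from an edge list.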