A comparison of relational and graph model theories, with an eye towards DataStax's implementation of Graph. Note: I'm working on a concise, formal mathematical definition of relational, based on Codd's 1970 paper. (Thanks to Artem Chebotko for suggesting this.)
27. "relational" - DseGraphFrames
● DseGraphFrames are an extension of Spark GraphFrames
○ GraphFrames are built on DataFrames. (DataFrames add a bit of
relational ease to Spark.)
● Some simple examples (from our docs and databricks docs):
g.E().groupCount().by(T.label)
g.V().has("age", P.gt(30)).show
g.V().hasLabel("person").drop().iterate()
● DseGraphFrames support GraphX (and other) Spark libraries
○ GraphX has functions for PageRank, connected components,
shortest path, etc
29. My vision - all-in-one
● Seamlessly conceptualize and access your graph in many ways
○ declarative, imperative, graphy, "relational" - avoid FUD!
● We're going in that direction!
● Use the best tools for the job at hand.
30. How does DSE integrate all this and help?
● DSE Graph implements TinkerPop, Gremlin, DseGraphFrames, and more
(e.g., Solr integration).
● gremlin gives you powerful imperative and declarative methods, for
traversals and graph analyses.
● SparkSQL/DseGraphFrames do too, with the power of Spark.
● gremlin language variants (GLVs) allow you to program in your favorite
(supported) language.
● Solr can be integrated, which adds a declarative element to narrow
queries.
● gremlin OLAP uses SparkSQL to speed complex queries.
● DSE Studio offers graphical views and gremlin and SparkSQL access.
interest in theory and history of data models
do we need to throw out the baby with the bath water?
how many have some experience with, or knowledge of tinkerpop (especially gremlin)?
It's a bit philosophical
want to convince you that graph and relational are not as different as you think they are - they are different, but not *as* different
I'm not experienced in semantic or knowledge graphs; they may be less amenable to the coming ideas
In mathematics, a multiset (aka bag or mset) is a modification of the concept of a set that, unlike a set, allows for multiple instances for each of its elements.
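A quick way to see the set vs multiset difference is Python's `collections.Counter`, which behaves like a bag. This is only an illustrative sketch; the label values are made up for the example.

```python
from collections import Counter

# Hypothetical vertex labels, with duplicates on purpose.
labels = ["person", "person", "software", "person"]

as_set = set(labels)      # a set keeps one copy of each element
as_bag = Counter(labels)  # a multiset (bag) keeps a count per element

distinct_elements = len(as_set)
person_multiplicity = as_bag["person"]
total_instances = sum(as_bag.values())

print(distinct_elements, person_multiplicity, total_instances)
```

The set forgets how many times "person" occurred; the bag remembers, which is what a query result with duplicate rows does too.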
also, I think: relational algebra would be more about access methods than the data model
rule 6: All views that are theoretically updatable are also updatable by the system.
Note: this means that I'm calling Cassandra w/CQL "relational". It's a very loose use of the word.
Now, I want to clear the air a bit about "relational vs graph". If we're going to consider graph DBs, we need to honestly look at their features and uses (and advantages), not just say "relational bad, graph good".
the quote is just an example I found on a first-try Google search. Not the worst!
native vs non-native, and Marko's paper - "native storage of a graph" - how do you serialize a graph (my problem for this talk - how to serialize all these related ideas!)
also (more importantly?), both can expand exponentially, as in Denise's talk, as you repeat.
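To make the "expand exponentially as you repeat" point concrete, here is a minimal sketch in plain Python. The toy graph (ten vertices, three outgoing edges each) and the helper `count_walks` are hypothetical, made up for this illustration. The same growth would show up if you repeated a self-join on an edge table the same number of times.

```python
def count_walks(adjacency, start, hops):
    """Count the walks of exactly `hops` edges that begin at `start`."""
    frontier = {start: 1}  # vertex -> number of walks currently ending there
    for _ in range(hops):
        next_frontier = {}
        for vertex, ways in frontier.items():
            for neighbor in adjacency.get(vertex, []):
                next_frontier[neighbor] = next_frontier.get(neighbor, 0) + ways
        frontier = next_frontier
    return sum(frontier.values())


# Hypothetical toy graph: every vertex has exactly 3 outgoing edges.
adjacency = {v: [(v + 1) % 10, (v + 2) % 10, (v + 3) % 10] for v in range(10)}

# One more hop (or one more join) multiplies the intermediate results by 3.
walk_counts = [count_walks(adjacency, 0, hops) for hops in range(1, 6)]
print(walk_counts)
```

With out-degree 3 the counts are the powers of 3, regardless of whether the repetition is written as a traversal or as a chain of joins.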
in reality, look at each implementation, and its performance on the kinds of accesses you'll need to do.
schema changes also depend on the implementation - graphs can have schemas, and this was a question after Denise's talk.
In particular, for property graphs, that have some type of schema.
note - this is not the way we, or SparkSQL, map the graph to tables! The idea is, they are not separate universes.
FKs - otoh, there's no system-enforced "on delete cascade"
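For reference, this is what a system-enforced "on delete cascade" looks like on the relational side. It is a sketch using Python's built-in `sqlite3` module, not DSE or CQL; the `person` and `knows` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """CREATE TABLE knows (
           src INTEGER REFERENCES person(id) ON DELETE CASCADE,
           dst INTEGER REFERENCES person(id) ON DELETE CASCADE
       )"""
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?)", [(1, "alice"), (2, "bob"), (3, "carol")]
)
conn.executemany("INSERT INTO knows VALUES (?, ?)", [(1, 2), (2, 3), (1, 3)])

# Deleting the "vertex" row for bob also removes every "edge" row that refers to it.
conn.execute("DELETE FROM person WHERE id = 2")
remaining_edges = conn.execute(
    "SELECT src, dst FROM knows ORDER BY src, dst"
).fetchall()
print(remaining_edges)
```

Here the database, not the application, cleans up the rows that pointed at the deleted one.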
Where do these similarities lead us?...
what kind of languages should we use? Is there a "best"?
one note: declarative are nice for optimizers, imperative are nice if you know best; gremlin > SQL, but relational languages were designed to be declarative in order to enforce and protect data integrity, not by lack of imagination.
CQL, Cassandra(NoSQL), relational - "just when I thought I was out, they pull me back in!"
the point is not that it's truly imperative, but that imperative traversals and declarative ones, in themselves, may not be so different, as far as "imperative vs declarative" go.
next, gremlin
this is a bit like the "imperative" SQL traversal query mentioned - both do the same thing, may well take the same steps.
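A rough sketch of the "both do the same thing" point, in Python. The first function spells out each hop, much like chaining `out("knows").out("knows")` in gremlin; the second states the wanted result as a SQL self-join and leaves the steps to the engine. The data, the `knows` table and the function names are hypothetical, and SQLite stands in for whatever SQL engine you use.

```python
import sqlite3

# Hypothetical "knows" edges as (source, destination) pairs.
edges = [("alice", "bob"), ("bob", "carol"), ("bob", "dave"), ("carol", "erin")]


def two_hops_imperative(edges, start):
    """Walk the edges step by step: first hop, then second hop."""
    first_hop = [dst for src, dst in edges if src == start]
    return sorted({dst for src, dst in edges if src in first_hop})


def two_hops_declarative(edges, start):
    """Describe the result as a self-join and let the engine plan the steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO knows VALUES (?, ?)", edges)
    rows = conn.execute(
        "SELECT DISTINCT e2.dst FROM knows e1 "
        "JOIN knows e2 ON e2.src = e1.dst "
        "WHERE e1.src = ? ORDER BY e2.dst",
        (start,),
    ).fetchall()
    conn.close()
    return [dst for (dst,) in rows]


imperative_result = two_hops_imperative(edges, "alice")
declarative_result = two_hops_declarative(edges, "alice")
print(imperative_result, declarative_result)
```

Both return the same friends-of-friends, and an engine executing the join may well take the same steps as the hand-written loop.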
next SparkSQL
couldn't do gremlin2sql w/all of gremlin
site uses T-SQL, so possibly you could do gremlin2TSQL?
The point is to compare, and also, if you're used to relational, this can help you get started with gremlin queries. (Or, if you are, also see https://academy.datastax.com/content/gremlin-traversals - google "gremlin recipes datastax")
tradeoffs - knowledge/semantic graph example (no schema) - BUT, you get the power of Spark for analyzing pieces of the graph
Studio is a great and easy tool to do a lot of these things … discuss its options, schema views (hidden tables in CQL view), etc
I say "relational" in that this is a declarative element in Spark, as if you're dealing with a vertices table and an edges table.
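To illustrate the "vertices table and edges table" view, here is a small sketch with Python's built-in `sqlite3`. It is not DseGraphFrames or SparkSQL, and it is not how DSE stores the graph; the table layout and the sample rows are assumptions for the example. The two queries are rough SQL analogues of the `groupCount().by(T.label)` and `has("age", P.gt(30))` examples from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per vertex, one row per edge.
conn.execute("CREATE TABLE vertices (id INTEGER PRIMARY KEY, label TEXT, age INTEGER)")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO vertices VALUES (?, ?, ?)",
    [(1, "person", 29), (2, "person", 35), (3, "software", None)],
)
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [(1, 2, "knows"), (1, 3, "created"), (2, 3, "created")],
)

# Rough analogue of g.E().groupCount().by(T.label): count edges per label.
edge_counts = dict(
    conn.execute("SELECT label, COUNT(*) FROM edges GROUP BY label").fetchall()
)

# Rough analogue of g.V().has("age", P.gt(30)): filter vertices on a property.
older_than_30 = conn.execute(
    "SELECT id FROM vertices WHERE age > 30 ORDER BY id"
).fetchall()

print(edge_counts, older_than_30)
```

Once the graph is seen as two tables, grouping and filtering are ordinary declarative queries.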
Can also combine graph and non-graph data using SparkSQL, Spark, DseGraphFrames, DataFrames
useful for loading data into a graph
nodes get _p tables, edges get _e tables. Note, this is like x$ tables in Oracle - not documented, subject to change, etc
bicycle and car example - one day, trying to adjust seat height (wrong spot), then also disconnecting battery (engine light)