How do You Graph

© DataStax, All Rights Reserved. Confidential
How Do *You* Do Graph?
Ben Krug
Technical Support Engineer, DataStax

Who am I?
© 2016 DataStax, All Rights Reserved.
A Technical Support Engineer at DataStax.
Previously, Support Engineer at MySQL, then Sun, then Oracle.
Before that, a DBA / sysadmin for banks, utilities, startups, medical and
insurance companies, etc.
Over 25 years in DBMSs, from ISAM and hierarchical to relational,
NoSQL, and graph.
Blogs: formerly oracle2mysql.wordpress.com, now
intertubes.wordpress.com
Disclaimer: Any opinions given are my own!

My topic:
How to look at graphs (the best way?)
● This will be an opinionated discussion!
● Is there a best way?
● We've probably all done a lot of relational - does that help?

Goals:
● Discuss DM theory (compare and contrast) and some FUD
● Give an overview of some tools, in the context of the discussion (focused
on Tinkerpop, Spark, etc, relating to the DSE Graph implementations)

1st, what's a (property) graph?
● A collection of labeled nodes and (directed) edges
● Formally, one example of a definition is:
G = (V, E, λ), where V is a set of vertices, E ⊆ (V × V) is a multi-set of directed binary edges, and
λ : ((V ∪ E) × Σ∗) → (U \ (V ∪ E)) is a partial function that maps an element/string pair to an object in the
universal set U (excluding vertices and edges as allowed property values).*
* The Gremlin Graph Traversal Machine and Language, Marko A. Rodriguez, 2015 Proceedings of the ACM Database
Programming Languages Conference
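To make the definition concrete, here is a minimal Python sketch of a property graph. It is only an illustration: the class and method names (PropertyGraph, add_vertex, add_edge, set_property, get_property) are assumptions made up for this example, not part of Tinkerpop or DSE Graph, and it does not enforce the rule that property values cannot themselves be vertices or edges.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Toy model of G = (V, E, lambda); all names here are illustrative."""
    vertices: set = field(default_factory=set)      # V
    edges: list = field(default_factory=list)       # E, a multi-set of directed edges
    properties: dict = field(default_factory=dict)  # lambda: (element, key) -> value

    def add_vertex(self, v):
        self.vertices.add(v)
        return v

    def add_edge(self, out_v, in_v):
        # A directed binary edge; both endpoints must already be in V.
        if out_v not in self.vertices or in_v not in self.vertices:
            raise ValueError("edge endpoints must be vertices")
        edge = (len(self.edges), out_v, in_v)  # the index keeps parallel edges distinct
        self.edges.append(edge)
        return edge

    def set_property(self, element, key, value):
        # lambda maps an (element, string) pair to a value.
        self.properties[(element, key)] = value

    def get_property(self, element, key):
        # lambda is a partial function: undefined pairs return None.
        return self.properties.get((element, key))


g = PropertyGraph()
a = g.add_vertex("a")
b = g.add_vertex("b")
e1 = g.add_edge(a, b)
e2 = g.add_edge(a, b)  # a second, parallel edge is allowed in a multi-set
g.set_property(a, "name", "gremlin")
g.set_property(e1, "label", "knows")
```

Labels are stored here as ordinary properties, which is one possible reading of the definition; real implementations usually treat them specially.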

By contrast, what's a relational database?
● A collection of rows and columns, organized into tables?
● Wikipedia: a digital database based on the relational model of data, as proposed by E. F. Codd in 1970.
● Google dictionary: a database structured to recognize relations among stored items of information.
● Formally, one example of a definition is: ?
● We can base it on the relational algebra, but it's all very wordy and
difficult to pin down concisely.
● Or, we can say an RDBMS is one that adheres to "Codd's 12 rules",
which might mean that none truly exist (see, e.g., rule 6, the "view updating rule")!
Let's pretend that we know what we mean: basically, tables of rows and
columns, normalized to some degree, with integrity constraints, "etc".
Importantly, it separates the logical view from physical storage.

Things we might hear (that I disagree with)
● Graph is an entirely new world, wholly distinct and separate from
relational. Relational is just a ball and chain.
"The data explosion demands new solutions, yet the hoary old RDBMS still rules." (InfoWorld)

● Relational is about a static view of bits of data, not relations. (!)
"Joins are bad, mkay?" (https://oracle2mysql.wordpress.com/2016/02/18/joins-are-bad-mkay/)

● "Native" graph databases must be better than "non-native".
(for a good - and fun - rebuttal of this, see
https://www.datastax.com/dev/blog/a-letter-regarding-native-graph-databases)
● Joins are inherently slower than graph traversals.
this one calls for its own slide...

Are joins slower than traversals? It depends...
Is O(k) faster than O(log(n)) when k << n? (k is constant) Not always!
E.g., Friends of Friends: say k = average number of friends, n = number of people.
(This is a common example given.)
O(k) = O(1), so it must be fast, right? Not necessarily… For example,
"read an entry from a list of length n":
O(1) algorithm:
read the entire disk, and
return the entry asked for.
O(log(n)) algorithm:
use an index to find the entry,
and return it.
This is not entirely facetious! The clock time depends on the
algorithms that store, read, and deserialize the data, and what data
they need to process in order to find the results.
(e.g., see An Evaluation of Alternative Physical Graph Data Designs for Processing Interactive Social Networking Actions, Ghandeharizadeh,
Boghrati, and Barahmand, Database Laboratory Technical Report, Computer Science Department, USC, 2014)
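The two algorithms above can be sketched in Python by counting work instead of measuring time. This is a toy under stated assumptions: DISK_BLOCKS is a hypothetical fixed disk size, and the function names are made up for the example. The point is only that a cost that is constant in n can still dwarf a logarithmic one.

```python
DISK_BLOCKS = 100_000  # hypothetical fixed disk size; it does not grow with n


def full_disk_read(disk, position):
    """'O(1)' in n: always reads every block on the disk, whatever n is."""
    blocks_read = 0
    wanted = None
    for i, block in enumerate(disk):
        blocks_read += 1
        if i == position:
            wanted = block
    return wanted, blocks_read


def indexed_read(sorted_keys, key):
    """O(log n): binary search over a sorted index, counting probes."""
    lo, hi, probes = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == key
    return found, probes


n = 1_000
entries = list(range(n))
disk = entries + [None] * (DISK_BLOCKS - n)  # the list occupies the first n blocks

value, blocks_read = full_disk_read(disk, 500)
found, probes = indexed_read(entries, 500)
print(f"O(1) scan touched {blocks_read} blocks; O(log n) index used {probes} probes")
```

Growing n (up to the disk size) leaves the first count unchanged and adds only a probe or so per doubling to the second, yet the "constant" algorithm stays far slower.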

Examples from a graph company's site
This all comes from the first page that came up (paid) when I googled "index free adjacency". Or, it's the first link if you google "graph
databases future".
"Some graph databases use native graph storage that is specifically designed to store and manage
graphs – from bare metal on up. Other graph technologies use relational, columnar or
object-oriented databases as their storage layer. Non-native storage is often slower than a native approach
because all of the graph connections have to be translated into a different data model."
This, we've already discussed - what is "native" graph storage? What guarantees that it's faster?
"Native graph processing (a.k.a. index-free adjacency) is the most efficient means of processing data
in a graph because connected nodes physically point to each other in the database. Non-native
graph processing engines use other means to process Create, Read, Update or Delete (CRUD)
operations that aren’t optimized for handling connected data."
As you scale to multiple nodes, "physically pointing" becomes meaningless. And again, what guarantees that
it's faster? And the last sentence is just FUD: any mature database, RDBMSs included, is heavily optimized
for CRUD. (For a nice rebuttal, focusing on a single-server implementation, see
https://www.arangodb.com/2016/04/index-free-adjacency-hybrid-indexes-graph-databases/ .)

Examples from a graph company's site (cont)
"With traditional databases, relationship queries come to a grinding halt as the number and depth of
relationships increase. In contrast, graph database performance stays constant even as your data
grows year over year."
- Will traversals stay constant as edges increase? What about queries that span the graph?
- For one example of numbers, see
https://intertubes.wordpress.com/2017/11/28/benchmarketing-neo4j-and-mysql/
"With graph databases, your IT and data architecture teams move at the speed of business because
the structure and schema of a graph data model flex as your solutions and industry change. Your
team doesn’t have to exhaustively model your domain ahead of time (and then exhaustively remodel
and migrate the DB after some exec asks for a change)."
This overstates the impacts of a schema change in a well-designed system.
"your application doesn’t have to infer data connections using things like foreign keys or out-of-band
processing, like MapReduce."
Edges are not so different from foreign keys; is it really so difficult to "infer"? What does "out-of-band" mean?

Now, on to the reality (as I see it)

● For data models, there's often a mapping between graph DBs and
relational DBs
It's hard to formalize, because relational is hard to formalize.

[Figure: a picture, with a common example of a graph DB]

● How about integrity constraints?
○ attributes must be of the types specified in the schema
○ FKs (which may be nullable) must reference a key in the parent table; an edge
must point to a vertex.
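The parallel between "an FK must reference an existing row" and "an edge must point to a vertex" can be sketched with Python's built-in sqlite3 module. The table names (person, knows) and the helper functions are assumptions for illustration only; this is not how DSE Graph stores data.

```python
import sqlite3


def build_graph_tables():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    # Each row of 'knows' is an edge; both endpoints are FKs to the vertex table.
    conn.execute("""
        CREATE TABLE knows (
            src INTEGER NOT NULL REFERENCES person(id),
            dst INTEGER NOT NULL REFERENCES person(id)
        )""")
    return conn


def try_add_edge(conn, src, dst):
    """Return True if the edge was stored, False if it would dangle."""
    try:
        conn.execute("INSERT INTO knows (src, dst) VALUES (?, ?)", (src, dst))
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_graph_tables()
conn.executemany("INSERT INTO person (id, name) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
ok_edge = try_add_edge(conn, 1, 2)         # both endpoints exist
dangling_edge = try_add_edge(conn, 1, 99)  # vertex 99 does not exist
```

The second insert is rejected by the foreign key constraint, which is the relational counterpart of a graph refusing an edge to a missing vertex.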

As a result, many ways to view a graph
● A graph view (gremlin)
● A relational view (SparkSQL, dseGraphFrames, Studio (JDBC/ODBC))
● A graphical graph view (Studio)
● sometimes, the underlying storage view (in DSE, CQL (for 7eet h4x0rs))

How about data access methods?
● It's often said that SQL/CQL are declarative
● gremlin offers an imperative access method
● Is declarative for relational, and imperative for graphs?
● It's not so simple...

CQL is declarative; is SQL declarative?
● A lot of SQL is declarative. Declarative is "relational", especially in early SQL.
● SQL '99 offers CTEs, which have an imperative element.
● Many people use UDFs with SQL/CQL, or embed SQL/CQL in imperative
(procedural) code.
● Most SQL vendors offer extensions, like PL/SQL or TSQL, with
imperative (procedural) elements.

Even declarative SQL can be imperative-ish
● Consider an equality lookup, joined to a child, joined to a child. (Eg,
Friends of Friends - a "graphy" query.)
● We know this will be implemented in one way - look up the key, go look
up the child, then look at its child (like a traversal).
● Does that make declarative SQL imperative?
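As a rough illustration of the question above, here is the same Friends of Friends lookup written twice in Python: once as a declarative SQL join (run through the built-in sqlite3 module) and once as an explicit walk. The schema, sample rows, and function name are hypothetical, invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friend (src INTEGER REFERENCES person(id),
                         dst INTEGER REFERENCES person(id));
    CREATE INDEX friend_src ON friend (src);
    INSERT INTO person VALUES (1, 'alice'), (2, 'bob'), (3, 'carol'), (4, 'dave');
    INSERT INTO friend VALUES (1, 2), (2, 3), (2, 4);
""")

# Declarative: an equality lookup, joined to a child, joined to a child.
FOF_SQL = """
    SELECT DISTINCT fof.name
    FROM person p
    JOIN friend f1 ON f1.src = p.id
    JOIN friend f2 ON f2.src = f1.dst
    JOIN person fof ON fof.id = f2.dst
    WHERE p.name = ?
    ORDER BY fof.name
"""
sql_result = [row[0] for row in conn.execute(FOF_SQL, ("alice",))]

# Imperative: the same steps spelled out, much like a traversal.
names = {1: "alice", 2: "bob", 3: "carol", 4: "dave"}
friends = {1: [2], 2: [3, 4]}


def friends_of_friends(start):
    found = set()
    for friend in friends.get(start, []):      # look up the key's children
        for fof in friends.get(friend, []):    # then each child's children
            found.add(names[fof])
    return sorted(found)


walk_result = friends_of_friends(1)
```

Both forms return the same names; the SQL just leaves the order of the steps to the planner.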

gremlin is declarative and imperative
● An example: managers of "gremlin's" collaborators:
declarative:
  g.V().match(
      as("a").has("name","gremlin"),
      as("a").out("created").as("b"),
      as("b").in("created").as("c"),
      as("c").in("manages").as("d"),
      where("a",neq("c"))).
    select("d").
    groupCount().by("name")
imperative:
  g.V().has("name","gremlin").as("a").
    out("created").in("created").
    where(neq("a")).
    in("manages").
    groupCount().by("name")
taken from https://tinkerpop.apache.org/gremlin.html
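For readers without a gremlin console handy, the two styles can be mimicked in plain Python over a toy edge list. Everything here is an assumption for illustration: the edge data, the helper names (out, in_), and the use of vertex names as identifiers. It is not gremlin, only a sketch of "state the pattern" versus "walk the steps".

```python
from collections import Counter

# Hypothetical (source, label, destination) edges.
EDGES = [
    ("gremlin", "created", "projectX"),
    ("alice", "created", "projectX"),
    ("bob", "created", "projectX"),
    ("carol", "manages", "alice"),
    ("carol", "manages", "bob"),
    ("erin", "manages", "gremlin"),
]


def out(v, label):
    return [d for s, l, d in EDGES if s == v and l == label]


def in_(v, label):
    return [s for s, l, d in EDGES if d == v and l == label]


def imperative(start):
    # Walk step by step: out("created"), in("created"), filter, in("manages").
    counts = Counter()
    for project in out(start, "created"):
        for collaborator in in_(project, "created"):
            if collaborator == start:
                continue
            for manager in in_(collaborator, "manages"):
                counts[manager] += 1
    return counts


def declarative(start):
    # State the pattern over bindings a, b, c, d; enumeration order is unspecified.
    created = {(s, d) for s, l, d in EDGES if l == "created"}
    manages = [(s, d) for s, l, d in EDGES if l == "manages"]
    return Counter(
        d
        for (a, b) in created
        for (c, b2) in created
        for (d, c2) in manages
        if a == start and b == b2 and c == c2 and a != c
    )


imperative_counts = imperative("gremlin")
declarative_counts = declarative("gremlin")
```

Both functions produce the same grouped counts for the toy data, which mirrors how the two gremlin forms above answer the same question.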

comparing: SQL and gremlin
There's a nice site comparing SQL and gremlin for a graph database with a
schema: sql2gremlin.com. Some examples:
SQL:
  SELECT DISTINCT LEN(CategoryName)
  FROM Categories
gremlin:
  g.V().hasLabel("category").values("name").
    map {it.get().length()}.dedup()
SQL:
  SELECT ProductName, UnitPrice
  FROM Products
  WHERE UnitPrice >= 5 AND UnitPrice < 10
gremlin:
  g.V().has("product", "unitPrice",
    between(5f,10f)).valueMap("name", "unitPrice")
SQL:
  SELECT Products.ProductName
  FROM Products
  INNER JOIN Categories
  ON Categories.CategoryID = Products.CategoryID
  WHERE Categories.CategoryName = 'Beverages'
gremlin:
  g.V().has("name","Beverages").in("inCategory").
    values("name")
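One detail worth spelling out in the second pair: gremlin's between(5f,10f) includes the lower bound and excludes the upper one, which is why it lines up with UnitPrice >= 5 AND UnitPrice < 10. A small Python sketch of that predicate, using made-up product rows:

```python
# Hypothetical sample rows; only the boundary prices matter here.
products = [
    {"name": "A", "unitPrice": 4.99},
    {"name": "B", "unitPrice": 5.0},
    {"name": "C", "unitPrice": 9.99},
    {"name": "D", "unitPrice": 10.0},
]


def between(low, high):
    """Half-open range test: low <= x < high."""
    return lambda x: low <= x < high


in_range = between(5.0, 10.0)
selected = [(p["name"], p["unitPrice"]) for p in products if in_range(p["unitPrice"])]
```

Product B (exactly 5) is selected and product D (exactly 10) is not, matching the SQL WHERE clause.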

SparkSQL is declarative and imperative
● The data model currently exposed is a bit hard to work with and very unnormalized:
one vertices table and one edges table
● eg (using Studio), to find killrvideo movies Daniel Day-Lewis acted in:
select movie.title
from killrvideo_vertices actor
join killrvideo_edges acts_in
join killrvideo_vertices movie
on actor.name = 'Daniel Day-Lewis'
and acts_in.dst = actor.id
and acts_in.src = movie.id;
● imperative, because you can use spark functionality
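To show what "one vertices table and one edges table" looks like in practice, here is a small stand-in using Python's sqlite3 module rather than Spark. The table layout, ids, and the 'actor' edge label are assumptions for this sketch; they imitate, but are not, the killrvideo_vertices and killrvideo_edges tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Every vertex type shares one wide table, so many columns are NULL.
    CREATE TABLE vertices (id TEXT PRIMARY KEY, label TEXT, name TEXT, title TEXT);
    CREATE TABLE edges (src TEXT, dst TEXT, label TEXT);
    INSERT INTO vertices VALUES
        ('p1', 'person', 'Daniel Day-Lewis', NULL),
        ('p2', 'person', 'Sally Field', NULL),
        ('m1', 'movie', NULL, 'Lincoln'),
        ('m2', 'movie', NULL, 'There Will Be Blood');
    -- Edges run from the movie (src) to the person (dst), as in the slide's query.
    INSERT INTO edges VALUES
        ('m1', 'p1', 'actor'),
        ('m1', 'p2', 'actor'),
        ('m2', 'p1', 'actor');
""")

QUERY = """
    SELECT movie.title
    FROM vertices AS actor
    JOIN edges AS acts_in ON acts_in.dst = actor.id
    JOIN vertices AS movie ON acts_in.src = movie.id
    WHERE actor.name = ?
    ORDER BY movie.title
"""
titles = [row[0] for row in conn.execute(QUERY, ("Daniel Day-Lewis",))]
```

The self-join on the vertices table is what makes this model feel awkward compared with per-label tables.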

"relational" - DseGraphFrames
● DseGraphFrames are an extension of Spark GraphFrames
○ GraphFrames are built on DataFrames. (DataFrames add a bit of
relational ease to Spark.)
● Some simple examples (from our docs and databricks docs):
g.E().groupCount().by(T.label)
g.V().has("age", P.gt(30)).show
g.V().hasLabel("person").drop().iterate()
● DseGraphFrames support GraphX (and other) Spark libraries
○ GraphX has functions for PageRank, connected components,
shortest path, etc

You can also use CQL to peek under the hood
● However, this is generally not so useful
cassandra@cqlsh:killrvideo> select * from person_p where name = 'Daniel Day-Lewis' allow filtering;
 community_id | member_id | ~~property_key_id | ~~property_id                        | name             | personId | ~~vertex_exists
--------------+-----------+-------------------+--------------------------------------+------------------+----------+-----------------
   1916035712 |        76 |             32771 | 00000000-0000-8003-0000-000000000000 | Daniel Day-Lewis |     null |            null
…
cassandra@cqlsh:killrvideo> select * from person_e where community_id = 1916035712 and member_id = 76;
 community_id | member_id | ~~edge_label_id | ~~adjacent_vertex_id       | ~~adjacent_label_id | ~~edge_id                            | ~~edge_exists | ~~simple_edge_id
--------------+-----------+-----------------+----------------------------+---------------------+--------------------------------------+---------------+------------------
   1916035712 |        76 |           65571 | 0x1121e2800000000000000205 |                   4 | a018fe3c-b87f-11e8-9882-f533cf6b1c0d |          True |             null
...
This is an implementation (storage) detail, not a logical view!

My vision - all-in-one
● Seamlessly conceptualize and access your graph in many ways
○ declarative, imperative, graphy, "relational" - avoid FUD!
● We're going in that direction!
● Use the best tools for the job at hand.

How does DSE integrate all this and help?
● DSE Graph implements Tinkerpop, gremlin, DseGraphFrames, and more
(eg, Solr integration).
● gremlin gives you powerful imperative and declarative methods, for
traversals and graph analyses.
● SparkSQL/DseGraphFrames do too, with the power of Spark.
● gremlin language variants (GLVs) allow you to program in your favorite
(supported) language.
● Solr can be integrated, which adds a declarative element to narrow
queries.
● gremlin OLAP uses SparkSQL to speed complex queries.
● DSE Studio offers graphical views and gremlin and SparkSQL access.
