Cypher to SQL online mapper
1. GraHPEr: Graph queries on relational data
Luis Vaquero, Marco Lotz, James Brook, Joan Varvenne,
Suksant Sae Lor, David Subiros, Herry Herry, Brian Monahan
March 2016
6. CC by Ole Rinnan.
http://www.vg.no/forbruker/bil-baat-og-motor/bil-og-trafikk/post-it-feberen-brer-seg/a/165769/
7. Outline
1. Problem
2. Our solution
3. Underlying magic/technology
4. Competition
5. Status and timeline
6. Summary and call to action
8. The Case for Graph Analytics on Relational Databases
• Lots of data sitting in relational databases (accumulated over the last few decades)
• Some data are simply too bulky to move around
• Consistency / Cascading issues slow down write throughput (key in big data apps)
• Simple graph syntax and semantics to build our queries
9. Relational Data as Graphs: Problems
1. Raw SQL or stored procedures on relational DBs (“monster SQL queries”)
2. Copy data from its original source to construct a new graph (duplication)
from Pixabay under CC
by Chris Downer under CC
from Pixabay under CC
13. Relational GraHPEr
Two related (but independent) main functionalities:
Graph Schema Extractor: Database Schema -> Set of Graph Topologies
Query Processor: Graph Topology + Cypher Query -> Equivalent SQL Query
15. Relational Tables
Movie
Id | Title | Released | Tagline
01 | Matrix | 1999 | Enter the Matrix
Person
Id | Name | Born
01 | Keanu Reeves | 1964
Acted In
person_id | movie_id | role
01 | 01 | Neo
Directed, Produced (two tables with the same columns)
person_id | movie_id
01 | 02
01 | 03
16. The Equivalent Graph Topology (Gtop)
Nodes:
Movie. Properties: ID, Title, Released, Tagline
Person. Properties: ID, Name, Born
Edges:
Acted in. Attributes: Role
Produced. Attributes: None
Directed. Attributes: None
By default, the topology mirrors the entity-relationship diagram
Advanced ML enables finding different graphs in the data
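A minimal sketch of how a default topology could be read off the example schema. It assumes a simple heuristic that the slides do not spell out: a table with exactly two foreign keys becomes an edge type, and every other table becomes a node type. The `Table` class and `extract_topology` function are hypothetical names, not GraHPEr's API.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical, simplified description of a relational table."""
    name: str
    columns: list
    # Maps a foreign-key column to the table it references.
    foreign_keys: dict = field(default_factory=dict)


def extract_topology(tables):
    """Split tables into node types and edge types (assumed heuristic)."""
    nodes, edges = {}, {}
    for t in tables:
        if len(t.foreign_keys) == 2:
            # A join table linking two tables becomes an edge type.
            (_, src), (_, dst) = t.foreign_keys.items()
            attrs = [c for c in t.columns if c not in t.foreign_keys]
            edges[t.name] = {"from": src, "to": dst, "attributes": attrs}
        else:
            # Any other table is a node type; its columns are the properties.
            nodes[t.name] = list(t.columns)
    return nodes, edges


person_movie = {"person_id": "person", "movie_id": "movie"}
schema = [
    Table("movie", ["id", "title", "released", "tagline"]),
    Table("person", ["id", "name", "born"]),
    Table("acted_in", ["person_id", "movie_id", "role"], dict(person_movie)),
    Table("directed", ["person_id", "movie_id"], dict(person_movie)),
    Table("produced", ["person_id", "movie_id"], dict(person_movie)),
]

nodes, edges = extract_topology(schema)
print(nodes)
print(edges)
```

Only `acted_in` ends up with an attribute (`role`), which matches the edge attributes listed on the slide.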
34. Query Processor
Pipeline: Parser -> Visitor -> Query Builder
Input: MATCH (m:Movie) RETURN m.title
Output: SELECT m.title FROM movie AS m
Intermediate representation:
Pattern
Match - false
NodePattern – m - movie
Return
ReturnItem – m - title
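The parse-then-build flow can be illustrated with a small sketch. It is not GraHPEr's parser: it assumes a single regular expression is enough, handles only the `MATCH (var:Label) RETURN var.prop` shape, and uses hypothetical class names borrowed from the labels on the slide. The generated SQL declares the table alias explicitly so that `m.title` resolves.

```python
import re
from dataclasses import dataclass


@dataclass
class NodePattern:
    variable: str
    label: str


@dataclass
class ReturnItem:
    variable: str
    prop: str


# Only handles the single-node shape: MATCH (var:Label) RETURN var.prop[, ...]
_QUERY = re.compile(
    r"^\s*MATCH\s*\(\s*(\w+)\s*:\s*(\w+)\s*\)\s*RETURN\s+(.+?)\s*$",
    re.IGNORECASE,
)


def parse(cypher):
    """Turn a supported Cypher string into a node pattern and return items."""
    m = _QUERY.match(cypher)
    if m is None:
        raise ValueError("unsupported query shape")
    node = NodePattern(m.group(1), m.group(2).lower())
    items = []
    for part in m.group(3).split(","):
        var, prop = part.strip().split(".")
        items.append(ReturnItem(var, prop))
    return node, items


def build_sql(node, items):
    """Emit a SELECT over the table named after the node label."""
    cols = ", ".join(f"{i.variable}.{i.prop}" for i in items)
    return f"SELECT {cols} FROM {node.label} AS {node.variable}"


node, items = parse("MATCH (m:Movie) RETURN m.title")
sql = build_sql(node, items)
print(sql)
```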
35. Query Processor SQL Builder
Pattern
Match - false
NodePattern – m - movie
Return
ReturnItem – m - title
Template Matcher SQL Templates
@*
This is a template for the Return clause of the Cypher language.
*@
@args List returnItems
@args boolean distinct
@for(Map properties: returnItems) {@properties.get("property")
@if(!properties_isLast){,}}
HPE Confidential
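For readers unfamiliar with the template syntax, the same idea can be sketched in Python: take the "property" entry of each return item and join the entries with commas. The `render_return` function is a hypothetical stand-in, and the way `distinct` is applied here is an assumption, since the template fragment declares that argument but does not show its use.

```python
def render_return(return_items, distinct=False):
    """Render the SELECT list from a list of {"property": ...} maps."""
    # Comma-separate the properties, with no trailing comma after the last one.
    columns = ", ".join(item["property"] for item in return_items)
    # Assumption: a distinct flag maps onto SELECT DISTINCT.
    keyword = "SELECT DISTINCT" if distinct else "SELECT"
    return f"{keyword} {columns}"


single = render_return([{"property": "m.title"}])
pair = render_return([{"property": "p.name"}, {"property": "m.title"}])
print(single)
print(pair)
```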
40. Query Processor SQL Builder
Pattern
Match - false
NodePattern – m - movie
Template Matcher SQL Templates
SELECT m.title
51. I/O
Input Cypher Query:
MATCH (p: Person)-[:person_id_acted_in_person_id]->(m: Movie) RETURN p.name, m.title
Expected output SQL:
SELECT p.name, m.title
FROM person AS p
JOIN acted_in ON (acted_in.person_id = p.id)
JOIN movie AS m ON (m.id = acted_in.movie_id)
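A sketch of how a typed relationship pattern could be expanded into this two-join shape. It assumes a naming convention that the example tables happen to follow (primary key `id`, foreign keys named `<table>_id`); the function name `typed_match_sql` is hypothetical.

```python
def typed_match_sql(src_var, src_table, edge_table, dst_var, dst_table, columns):
    """Build SELECT ... FROM src JOIN edge JOIN dst for one relationship type."""
    return "\n".join([
        f"SELECT {', '.join(columns)}",
        f"FROM {src_table} AS {src_var}",
        # Assumed convention: the edge table references nodes via <table>_id.
        f"JOIN {edge_table} ON ({edge_table}.{src_table}_id = {src_var}.id)",
        f"JOIN {dst_table} AS {dst_var} "
        f"ON ({dst_var}.id = {edge_table}.{dst_table}_id)",
    ])


sql = typed_match_sql("p", "person", "acted_in", "m", "movie",
                      ["p.name", "m.title"])
print(sql)
```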
52. It doesn’t stop there!
Input Cypher Query:
MATCH (p: Person) --> (m) RETURN m
Expected output SQL: see the next slide
53. Hidden SQL Monsters
SELECT 'movie['||m.id||']' AS m
FROM person AS p
JOIN directed ON (directed.person_id = p.id)
JOIN movie AS m ON (m.id = directed.movie_id)
UNION ALL
SELECT 'movie['||m.id||']' AS m
FROM person AS p
JOIN acted_in ON (acted_in.person_id = p.id)
JOIN movie AS m ON (m.id = acted_in.movie_id)
UNION ALL
SELECT 'movie['||m.id||']' AS m
FROM person AS p
JOIN produced ON (produced.person_id = p.id)
JOIN movie AS m ON (m.id = produced.movie_id)
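The query grows because an untyped `-->` has to be tried against every edge table that connects the two node types. A minimal sketch of that expansion, assuming the list of candidate edge tables is already known from the graph topology and that keys follow the `id` / `<table>_id` convention of the example (all names here are hypothetical):

```python
EDGE_TABLES = ["directed", "acted_in", "produced"]


def untyped_match_sql(src_var, src_table, dst_var, dst_table, edge_tables):
    """Emit one SELECT per candidate edge table and glue them with UNION ALL."""
    branches = []
    for edge in edge_tables:
        branches.append("\n".join([
            # The node is returned as a string such as movie[<id>].
            f"SELECT '{dst_table}['||{dst_var}.id||']' AS {dst_var}",
            f"FROM {src_table} AS {src_var}",
            f"JOIN {edge} ON ({edge}.{src_table}_id = {src_var}.id)",
            f"JOIN {dst_table} AS {dst_var} "
            f"ON ({dst_var}.id = {edge}.{dst_table}_id)",
        ]))
    return "\nUNION ALL\n".join(branches)


union_sql = untyped_match_sql("p", "person", "m", "movie", EDGE_TABLES)
print(union_sql)
```

The number of branches equals the number of candidate edge tables, so the SQL grows linearly with the relationship types the pattern could match.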
54. It doesn’t stop there!
Input Cypher Query:
MATCH (keanu: Person { name: 'Keanu Reeves' }) --> (m: Movie {released: '1999'}) RETURN m
Expected output SQL: see the next slide
55. Hidden SQL Monster
SELECT 'movie['||m.id||']' AS m
FROM person AS keanu
JOIN directed ON (directed.person_id = keanu.id)
JOIN movie AS m ON (m.id = directed.movie_id)
WHERE keanu.name = 'Keanu Reeves' AND m.released = '1999'
UNION ALL
SELECT 'movie['||m.id||']' AS m
FROM person AS keanu
JOIN acted_in ON (acted_in.person_id = keanu.id)
JOIN movie AS m ON (m.id = acted_in.movie_id)
WHERE keanu.name = 'Keanu Reeves' AND m.released = '1999'
UNION ALL
SELECT 'movie['||m.id||']' AS m
FROM person AS keanu
JOIN produced ON (produced.person_id = keanu.id)
JOIN movie AS m ON (m.id = produced.movie_id)
WHERE keanu.name = 'Keanu Reeves' AND m.released = '1999'
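The property maps in the Cypher pattern become equality predicates in each branch. A sketch of that step, assuming all values are strings and are inlined as SQL string literals (single-quoted, with embedded quotes doubled); a production implementation would more likely use bound parameters. `sql_literal` and `where_clause` are hypothetical helpers.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def where_clause(filters):
    """Build a WHERE clause from {variable: {property: value}} filters."""
    conditions = [
        f"{var}.{prop} = {sql_literal(val)}"
        for var, props in filters.items()
        for prop, val in props.items()
    ]
    return "WHERE " + " AND ".join(conditions) if conditions else ""


clause = where_clause({
    "keanu": {"name": "Keanu Reeves"},
    "m": {"released": "1999"},
})
print(clause)
```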
57. Vendor Landscape
Market largely dominated by Neo4j [3] -> this is why we chose Cypher as our query language
Me too: JSON databases are jumping into the space by enabling links, with associated properties, between documents (e.g. ArangoDB)
Trend towards multimodal DBs (OrientDB, DataStax)
Consolidation: Experian acquired 4Store (now for internal use only), and DataStax acquired Aurelius (Titan graph database).
1. https://en.wikipedia.org/wiki/Oracle_Spatial_and_Graph
2. http://www.teradata.com/SQL-GR-Engine
3. http://zion-city.blogspot.co.uk/2012/05/graphdb-market-share.html
* Find a more detailed comparison in the two last backup slides below
60. Cypher Language Coverage: General Clauses
Cypher clause | Supported | Details
Return | Y |
Order by | Y |
Limit | Y |
With | Y | http://wes.skeweredrook.com/the-mythical-with-neo4js-cypher-query-language/
Skip | N | Can be implemented as a post-processing stage
Union | N | Use the current GraHPEr syntactic parser to split the query in two
Unwind | N | No support for in-query collection/function handling yet
Using | N | Hints Neo4j to use the “right” index
50% of general clauses implemented
25% are easy to implement with minimum effort based on our current code base
12.5% require us to invest time in in-query collection/function processing
12.5% are Neo4j-specific
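As an illustration of the "post-processing stage" idea for Skip, the rows returned by the generated SQL could simply be trimmed on the client side. This is a sketch under that assumption, with a hypothetical `apply_skip` helper and made-up rows; it is not how GraHPEr is stated to implement it.

```python
from itertools import islice


def apply_skip(rows, skip):
    """Drop the first `skip` rows of a result set without materialising it."""
    return islice(rows, skip, None)


# Hypothetical rows standing in for the result of the generated SQL.
rows = iter([("Matrix",), ("Speed",), ("Point Break",)])
remaining = list(apply_skip(rows, 1))
print(remaining)
```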
61. Cypher Language Coverage: Reading Clauses
68% of read clauses implemented
20% are easy to implement with minimum effort based on our current code base
8% require us to invest time in in-query collection/function processing or build a REPL
4% are for use with legacy indices in Neo4j
Cypher clause | Supported | Details
Match by id | Y |
Match by type | Y |
Match by relationship pattern | Y |
Match by multiple types | Y |
Match multiple relationships | Y |
Match variable length relationships | Y |
Match anonymous edges and nodes | Y |
Match zero-path length | Y |
Where | Y |
Where on property | Y |
Where on label | Y |
Where patterns | Y | MATCH (n) WHERE (n)-[:KNOWS]-({ name:'Tobias' }) RETURN n
Where range | Y |
Count | Y |
Distinct | Y |
Sum, avg, max, min | Y |
Case | Y |
Optional match | N | The Cypher equivalent of the outer join in SQL
Match rels with uncommon chars | N |
Where with string matching | N |
Where with regexes | N |
Percentile, std | N | Can be implemented as a post-processing stage
Where on dynamic property | N | Requires a REPL-like utility
Where collection patterns | N (partial) | MATCH (tobias { name: 'Tobias' }), (others) WHERE others.name IN ['Andres', 'Peter'] AND (tobias)<--(others) RETURN others
Start | N | Deprecated/legacy usage. No plans to support.
62. Cypher Language Coverage: Functions
Cypher clause Supported Details
ALL/ANY/NONE/SINGLE/EXISTS
SIZE on collection
SIZE on pattern
LENGTH on collection
LENGTH on pattern
TYPE
Id
COALESCE
HEAD/LAST
Timestamp
Startnode / Endnode
Toint / Tofloat
Nodes
Relationships
Labels
Keys
Extract (map)
Filter
Tail
Range
Reduce
Math functions
String functions
65. Summary
For: customers with large RDBMS deployments
Who: would like to do some graph analytics (multimodal) without migrating massive amounts of data to another platform
GraHPEr
Provides: a read-only, easy-to-install library
• discovery of graphs in relational data
• querying relational data in a graphy way
• no data duplication or cascade effects
• single-system administration
Unlike: solutions that need data to be copied and adapted to a graph format, or that expose complex / verbose graph functions as stored procedures
66. GraHPEr: Unique Selling Points
• Storage and transfer time savings
o No data duplication
o No separate system to manage
• Easy to install / minimally intrusive -> multimodal DBs made easy
o Just a read-only library on top of existing DB deployments
• Declarative graph query language (compatibility with the market leader, Neo4j [1])
o Tap into large existing communities / reuse current code
1. http://neo4j.com/top-ten-reasons/
67. Thank you
Luis M. Vaquero
Hewlett Packard Enterprise
Contact: luis.vaquero@hpe.com
68. Graph Analytics
Not just startups in Silicon Valley: the Lufthansas, the Walmarts, the UBSs, and the AT&Ts too
ScriptHop: a motion picture is a graph of interconnected stakeholders, including producers, directors, casting agents, cinematographers, actors, and so on.
Determine scripts with characters whose particular attributes (such as minorities) make them likely to require lots of screen time, which might be excessively costly and time-consuming to produce
ORiGAMI – Oak Ridge Graph Analytics for Medical Innovation
http://www.forbes.com/sites/danwoods/2015/12/29/why-graph-technology-is-ready-for-its-close-up-in-2016
69. Quick Figures
• 1% market penetration today
• Forrester Research: it will reach over 25 percent of all enterprises by 2017
• Popular tools:
o GraphConnect (Neo4J, SF’15):
more than 1000 developers
more than 350 organisations
o 1000000+ downloads
o 124 contributors
o 36500 commits
70. Performance, Really?
“Relational DBs have 40 years of success behind them”
http://istc-bigdata.org/index.php/benchmarking-graph-databases/
72. Graph to SQL
• Plenty of tools converting from SQL to graph languages.
• We want the opposite: Graph to SQL
Feature | SQLGraph (IBM) | GraphiQL (MIT) | GraHPEr (HPE)
Language | Non-side-effecting Gremlin to SQL compilation | Pig-Latin inspired new declarative language compiled into SQL | OpenCypher (with time extensions) compiled to SQL
SQL | Exploits recursive/iterative queries | Exploits recursive/iterative queries | ANSI92 with Vertica-friendly optimisations
Additional tables | Created relational tables (to represent edges and nodes) | Separate GraphTables | Maximise reuse of existing tables
Integration with pre-existing installations | Requires migration | Requires migration | No migration
Type of analysis | Bulk | Bulk | Time-based
Benchmarks | Large-scale | Mid-scale (SNAP data) | TBD (goal is large-scale, but time constraints are key)