Outrageous Ideas
Data Day Texas - January 28, 2023
For Graph Databases
Welcome to this talk on Outrageous Ideas for Graph Databases.
@maxdemarzi
maxdemarzi.com
GitHub.com/maxdemarzi
Max De Marzi
My name is Max De Marzi. Follow me on Twitter at maxdemarzi, check out my blog at maxdemarzi.com, or read my bad code on github.com slash maxdemarzi. I’ve spent the last part of my career teaching people about graph databases.
In fact, if you go to the earliest blog post and check out the date, you’ll see… January 2012.
Ten Years
In the Graph game.
That’s 10 years, in the graph game telling people about graphs.
But I am not an ivory tower academic, writing about stuff I don’t actually have first-hand experience with.
I work the field. I write code. I get my hands dirty and I am the one getting yelled at when the stuff doesn’t work.
But I’m not here to talk about me, I’m here to talk about Graphs… and that, ladies, gentlemen and everyone else, is called a chart, not a graph. That chart tells us that Graph Databases have grown in popularity more than the other categories.
1.8%
However, even after 10 years, we haven’t broken past 2%. Look at Document Stores, which broke into double digits, while we can’t break 2 percent.
But why?
Because pretty much everyone sucks.
2018
Back in 2018, Peter Boncz gave a talk on why graph databases suck.
Ideas are Wrong
• Too Many Back-ends (aka Tinkerpop is wrong)
• No lessons applied from Relational Databases
• API is incomplete (bulk)
• Query Languages are Incompetent
- Peter Boncz 2018
He started off saying that the ideas were wrong. Tinkerpop as a front end has too many back end systems (we’ll get to that). That we learned nothing from relational databases. That we provided an incomplete API, specifically APIs to do bulk operations. And then he said the query languages were incompetent.
Implementation is Wrong
• Nodes as Objects sucks
• No internal algebras
• Incompetent Query Optimizers
• Incompetent Query Executors
• Incompetent Engineering
• Incompetent Engineers (allegedly)
- Peter Boncz 2018
It wasn’t just the ideas he was criticizing, it was the implementations too. Representing nodes as full on Java Objects takes way too much memory. No compression, no
way to do fast scans, no internal algebras, incompetent query optimizers, incompetent query executors, incompetent engineering. The word of that day was
incompetence….and yes I added that last one, because he might as well have said that at this point. He couldn’t hurt anyone’s feelings any harder.
Last year, Peter Boncz was back to talk about: The Sorry State of Graph Database Systems. Had we learned nothing in 4 years?
https://homepages.cwi.nl/~boncz/edbt2022.pdf
Peter outlines 6 problems…
1. Data that should be accessed together is all over the place. You see it in Triple Stores and in the way Schema-less Graphs store property chains of Nodes and
Relationships.
2. Too many joins. You have to go chase this data down, which means your query planner has to work very hard instead of just scanning a record.
3. Triple stores have no concept of Objects, so the query optimizer treats each property independently.
4. Graph Databases should stop being special little snowflakes and be more like Relational Databases.
5. Graph Databases built on Key Value stores can’t do bulk operations and the API overhead will kill you.
6. The query languages are a trap. If the optimizer can’t do it, you’ll have to forget the query language. This is how Cypher betrays you.
Peter Suggests:
Stop Sucking
https://homepages.cwi.nl/~boncz/edbt2022.pdf
To wrap up… Peter suggests we stop sucking.
It is at this point that we stand at a fork in the road ahead of us. There are two directions in which we could go. We could explore some of Peter’s suggestions. But then
this talk would be called something like….
Completely Sensible and
Utterly Boring Ideas
Data Day Texas - January 28, 2023
For Graph Databases
Completely Sensible and Utterly Boring Ideas for Graph Databases. But it’s not.
Outrageous Ideas
Data Day Texas - January 28, 2023
For Graph Databases
It’s called outrageous ideas, so let’s get on with it.
Let’s go backwards, let’s go the wrong direction.
For this we need a time machine. But not that one.
Not that one either.
We’re going to 1969. October 1969. So we’ll catch a ride with Bill and Ted instead.
We are going back to our roots. The Codasyl Model. The Network Model… and in keeping with our theme of 1969 we are dropping acid.
Drop ACID
Idea One
The first idea is to drop ACID, because in almost all use cases we are NOT the primary database.
We are the Robin to the Batman. We are a sidekick.
We are the Emotional Support Database. We help keep it together, but we are not the primary database of record.
We are the Mini-Me to the Dr. Evil. We complete them, and as much as we may try to look like them, we aren’t.
Vendor: I bet they are thinking about buying a Graph Database
Customer: Why did someone take a photo of us trying to sleep?
No Customer lies in bed at night thinking about buying a graph database. Let’s face it. They already have a database. But it can’t satisfy all their needs. They already tried
some kinky solutions like denormalizing data, adding materialized views, and clustered indexes, but it didn’t do the trick and now they need something new to spice
things up. But we’re there to help, not take over.
Drop The “D”
So let’s start by dropping the Durability.
The hardware vendors already did.
Many graph database vendors have built or are building distributed systems.
I only know 2 things about distributed systems.
1. They introduce a lot of overhead. Frank McSherry showed how terrible some of these so-called scalable distributed systems really are.
..and the second thing I know is: They are Hard. Hard to build, Hard to maintain, Hard to reason about. Hard.
But what is Harder than hard?
Distributed Graphs: NP-Hard
Distributing a graph is. It’s NP-Hard. It doesn’t matter if P equals NP or not. Splitting up a graph is still going to be NP-Hard.
Let’s talk about A1, a 2020 paper about a distributed in-memory graph database from Microsoft. I’ll skip the details and jump right into the performance testing, for which they went all out. They built a cluster of 245 machines with Intel E5-2673 processors. I had to look that one up.
12 Cores x 245 Servers = 2940 Cores
It’s a 12-core Haswell. They have 245 servers, so a total of 2940 cores. Oh wait a second.
Two… They had 2 of these per server.
12 Cores x 2 x 245 Servers = 5880 Cores
So 5880 cores. Almost 6000 cores on this cluster. This is the ultimate dream for a lot of people: a massively distributed in-memory graph database. Can you imagine what kind of performance they got? Well, you don’t have to imagine, because the paper tells us.
2-hop Query
They performed a two-hop query. Start with Steven Spielberg, go to the movies he directed, then to the actors who were in those movies, and get a count. They managed 20,000 queries per second.
5880 Cores. 20,000 Queries per Second.
They managed 20,000 queries per second with almost 6000 cores. 20k queries per second. Almost 6k cores.
But why?
They distributed the nodes randomly across the cluster. Can you imagine? Every single time they traverse a relationship they have to take a network hit. My mind is blown; I hope yours is too.
Distribute On Cores
not on Servers
Idea Two
So idea number two. Distribute on Cores, and not on Servers.
Why are we here? It’s the Big Question. We aren’t here to have an existential crisis. I’m talking about why you are here at this tech conference. I’ll tell you why.
To prepare for the future. To do that, we have to answer one simple question:
Before the future comes the present and today Intel Xeon processors have up to 60 cores.
Who knows how many cores they will have in the future?
The internet knows. Late this year we get 64 cores, in 2024 we’re getting 128 cores, and soon thereafter at least 344 cores, with a potential for 512 or 528 cores according to internal leaks at Intel. https://www.youtube.com/watch?v=h20inMLeDnE
Today AMD processors have up to 64 Cores, but by the middle of this year…
They’ll be cranking 128 cores, and who knows how many in the future.
256 Cores
In 2024 AMD will release Zen 5 with up to 256 cores to select customers.
384 !!!
Then somewhere between 2024 and 2025 we will start to see 384-core chips!
64 cores in the cloud.
Hold on, you say. You need big RAM to feed all these cores?
Take a look at this beauty. Oh, not this kind of RAM?
4TB
Computer RAM, ok. How about 4TB today on a single socket? Is your graph bigger than 4TB?
Tomorrow that will be 8TB and before you know it 32 and 64TB.
24TB
11TB
11TB
What about the cloud? It’s raining terabytes and the forecast is for more.
…and that’s not all. Much like SANs today let you use a scalable shared pool of hard drive space across a network, CXL technology will let you use a scalable shared pool of memory across a network.
If you want to learn more, watch this presentation from Gustavo Alonso.
https://www.youtube.com/watch?v=KekKAKI0Aho
At Google, 90% of all analytics workloads operate on less than 1 TB of data.
Dr. Hannes Mühleisen, creator of DuckDB, reminding us that at Google, 90% of all analytics workloads operate on less than 1 terabyte of data.
Does your data fit in a single server today? Will it fit in a single server tomorrow?
Let’s talk about Query Languages.
You don’t have a single Gremlin, you have many of them. The Groovy one, the Python one, the Ruby one, the Scala one, the Rust one; they all look similar but they aren’t the same.
These back-ends are implemented by a bunch of different Vendors.
Tinkerpop Standard?
Around 100 vendor-dependent features
Do they allow Lambdas?
What kind of Indexing?
But is it the Standard? No way. Each vendor sets which combination of 100 features they support, along with a bunch of other differences amongst them, like allowing lambdas and the indexing behind the scenes. This is what Peter was complaining about earlier. What I know is that Gremlin is good at two things:
One is giving developers impostor syndrome because it is so hard to learn it turns many people away from graphs.
The second thing Gremlin is good at is allowing those that do make it through the learning curve to start thinking in paths. Start thinking “depth first”, which is an important concept to understand when it comes to graph queries. So it’s not all bad.
Then we have Cypher. Here he is eating the juicy steak in the matrix. It tastes so good, but you know it’s not real.
Customer Workloads
• Between a Dozen and a Hundred Trivial Queries
• Between 0 and a Dozen Non-Trivial Queries
• A lucky few have All Trivial Queries
• Most have 1 Non-Trivial Query and small variations
Cypher can handle the trivial queries just fine. Some customers have all trivial queries and are blissfully happy. But most have at least 1 big non-trivial query. That recommendation engine, that shortest-path-finding query, that multi-source bi-directional weighted traversal, etc. This is where Cypher dies. Literally. He gets electrocuted by Tank.
So when that happens, we have APOC! Awesome Procedures on Cypher. A library of 450 plus Java Stored Procedures that actually make Cypher usable out of the
matrix and in the real world.
Wait… what about GSQL?
use graph ldbc
drop query i_short_2
create query i_short_2(INT vid) for graph ldbc {
SetAccum<INT> @@postSet;
SetAccum<INT> @@commentsSet;
SetAccum<INT> @@creatorSet;
SetAccum<INT> @@messageSet;
SetAccum<INT> @@replySet;
SetAccum<INT> @@postFromReplySet;
SetAccum<INT> @@replyToPostSet;
SumAccum<INT> @@current;
SetAccum<INT> @@resultID;
SetAccum<INT> @@visitedSet;
SumAccum<INT> @postID;
SumAccum<INT> @creatorID;
SumAccum<STRING> @creatorFirst;
SumAccum<STRING> @creatorLast;
INT tempMessageID;
INT tempCreator;
STRING tempFirst;
STRING tempLast;
INT postID;
INT tempPostID;
INT length;
INT size;
INT cur;
Person = {person.*};
Creator = {person.*};
Message = {post.*, comments.*};
Prev = {comments.*};
Post = {post.*};
Comments ={comments.*};
Reply = {comments.*};
Reply1 ={comments.*};
ReplyToPost = {comments.*};
Result = {post.*, comments.*};
CurrentReply = {comments.*};
length = Comments.size();
//get person from vid
Person = SELECT s
FROM Person:s
WHERE s.id == vid;
//get latest message
Message = SELECT s
FROM Message:s-((post_hasCreator_person|comments_hasCreator_person):e)->person:t
WHERE t.id == vid
ORDER BY s.creationDate DESC
LIMIT 10;
Message = SELECT s FROM Message:s
ACCUM @@messageSet += s.id,
@@visitedSet += s.id;
PostSet = SELECT s FROM Message:s-(post_hasCreator_person)->:t
ACCUM @@postSet += s.id;
// PRINT PostTest;
//get comment in message
Reply = SELECT s
FROM Message:s-(comments_hasCreator_person)->:t
WHERE t.id == vid
ACCUM @@replySet += s.id,
@@visitedSet += s.id;
Reply1 = SELECT s FROM Comments:s WHERE s.id IN @@replySet;
// PRINT Reply1, @@replySet;
ReplyToPost = SELECT s FROM Reply1:s-(comments_replyOf_post)->:t
ACCUM @@replyToPostSet += s.id,
@@visitedSet += s.id;
// PRINT @@replyToPostSet;
// PRINT ReplyToPost, @@replyToPostSet;
// //for each comment in message, get 1 hop comment to post
FOREACH item IN @@replySet DO
IF item != -1 THEN
CurrentReply = SELECT s FROM Reply1:s WHERE s.id == item;
//PRINT CurrentReply;
size = CurrentReply.size();
WHILE size != 0 LIMIT 100 DO
Prev = SELECT s FROM CurrentReply:s ACCUM cur = s.id;
CurrentReply = SELECT t
FROM Comments:s-(comments_replyOf_comments)->:t
WHERE s.id == cur
ACCUM @@visitedSet += t.id;
size = CurrentReply.size();
IF size == 0 THEN BREAK; END;
END;
CurrentReply = SELECT s
FROM Prev:s
ACCUM @@replyToPostSet += s.id;
//PRINT CurrentReply;
END;
END;
// PRINT @@replyToPostSet;
//
//get post from 1 hop comment
Post = SELECT s
FROM Post:s-(comments_replyOf_post_reverse)->:t
WHERE t.id IN @@replyToPostSet
ACCUM @@postFromReplySet += s.id;
// PRINT Post;
//get post creator info
Post = SELECT s
FROM Post:s-(post_hasCreator_person)->:t
ACCUM s.@creatorID = t.id,
s.@creatorFirst = t.firstName,
s.@creatorLast = t.lastName;
// PRINT Post;
//pass person info and postID to 1 hop comment
ReplyToPost = SELECT t
FROM Post:s-(comments_replyOf_post_reverse)->:t
ACCUM t.@postID = s.id,
t.@creatorID = s.@creatorID,
t.@creatorFirst = s.@creatorFirst,
t.@creatorLast = s.@creatorLast,
@@replyToPostSet += t.id;
// PRINT ReplyToPost;
// //the foreach block pass person info and postID to visited comments in post
FOREACH item IN @@replyToPostSet DO
IF item != 0 THEN
Temp = SELECT s FROM ReplyToPost:s WHERE s.id == item
ACCUM tempMessageID = s.id,
tempCreator = s.@creatorID,
tempFirst = s.@creatorFirst,
tempLast = s.@creatorLast,
tempPostID = s.@postID;
//
//// //save person info and PostID from 1 kop comments to message set
Result = SELECT s
FROM Result:s
WHERE s.id IN @@visitedSet
ACCUM CASE WHEN s.id == item THEN
s.@creatorID = tempCreator,
s.@creatorFirst = tempFirst,
s.@creatorLast = tempLast,
s.@postID = tempPostID,
@@resultID += s.id
END;
size = Temp.size();
//filter result set by visited comments
Result = SELECT s FROM Result:s WHERE s.id IN @@visitedSet;
// PRINT tempCreator;
// PRINT "-----------------debug--------------------";
//
// PRINT Result;
//
// PRINT "------debug-----";
//pass post creator info to all visited comment
WHILE size != 0 LIMIT 100 DO
TempReplyTemp= SELECT t
FROM Temp:s-(comments_replyOf_comments_reverse)->:t
ACCUM tempMessageID = s.@creatorID,
tempFirst = s.@creatorFirst,
tempLast = s.@creatorLast,
tempPostID = s.@postID;
IF TempReplyTemp.size() == 1 THEN
Result = SELECT s
FROM Result:s
ACCUM CASE WHEN s.id == tempMessageID THEN
s.@creatorID = tempCreator,
s.@creatorFirst = tempFirst,
s.@creatorLast = tempLast,
s.@postID = postID
END;
size = TempReplyTemp.size();
END;
END;
END;
END;
//
//
//
// PRINT "---------------Result-------------------------";
//pass post creator to post in message set
FOREACH item IN @@postSet DO
IF item != -1 THEN
TempPost = SELECT s
FROM Result:s-(post_hasCreator_person)->:t
WHERE s.id == item
ACCUM tempCreator = t.id,
tempFirst = t.firstName,
tempLast = t.lastName;
Result = SELECT s FROM Result:s
ACCUM CASE WHEN s.id IN @@postSet THEN
s.@postID = s.id,
s.@creatorID = tempCreator,
s.@creatorFirst = tempFirst,
s.@creatorLast = tempLast
END;
END;
END;
Result = SELECT s FROM Result:s
WHERE s.id IN @@messageSet
Order by s.creationDate DESC, s.id DESC;
PRINT Result.id, Result.content, Result.imageFile, Result.creationDate, Result.@postID,
Result.@creatorID, Result.@creatorFirst, Result.@creatorLast;
}
install query i_short_2
GSQL can’t decide if it’s a query language or a programming language, so it just kind of accumulates a lot of lines of code, and it’s a pain to work with for all but the people who get paid by the hour to write this stuff.
So what about GQL? That’s the new SQL-like standard the vendors have been building. The problem here is that it will still need APOC. Or APOG, I guess. And then you can kiss your standard goodbye.
Programming Languages
instead of Query Languages
Idea Three
Idea Three is to use actual programming languages instead of query languages.
There is a blog post from Ted Neward called “The Vietnam of Computer Science”, about the war between ORMs and Relational Databases. This is my spin on the subject, applied to declarative query languages.
The Lie
• In Declarative Query Languages (like SQL, Cypher, GQL, etc) developers are supposed to:
• specify what is to be done
• instead of how to do it.
Let’s start off with the L I E. Can you spot it? It’s subtle. It says “In Declarative Query Languages developers are supposed to specify what is to be done instead of how to do it”.
The Problem
• Find the customers who decreased their purchase amounts on their most recent order
• A contest for who could beat Joe Celko performance-wise on 10k rows of data
A “simple” query
Let’s look at an example. The problem is a simple query: find the customers who ordered less on their most recent order than on the one before that. This was the subject of a contest Joe Celko ran back in the day to see who could write a faster query on 10k rows of data. Look at that horrible mess; that was Joe’s query. https://www.red-gate.com/simple-talk/databases/sql-server/t-sql-programming-sql-server/celkos-sql-stumper-the-data-warehouse-problem/
44 Different Queries
• There are at least 44 different ways to write: “Find the customers who decreased their purchase amounts on their most recent order”
• 30 Unique Timings
• At least 30 ways for the Query Planner and Optimizer to execute
I remember this challenge because I entered two queries. There were 44 in total. 44 different ways to write that sentence in SQL, and 30 unique timings to go with them. So at least 30 ways for the query planner and query optimizer to execute those queries. The queries range in performance from 46ms to 10 seconds. Just on 10 thousand rows of data. Can you imagine the timing range on 10 million rows of data? The fastest queries are 10x faster than the middle of the pack and 20x faster than all but the worst, which we will ignore because Ramesh was probably trolling.
You end up not only having to be an expert in the query language, but also in how to manipulate the query planner and query optimizer to take full advantage of the mechanical sympathy of the database engine to run your queries optimally. This is worse than just telling the database how to execute the query.
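To make “telling the database how” concrete, here is a minimal sketch of the same question answered as a plain program over in-memory data. The table shape and the data are made up for illustration; this is just Lua, not any vendor’s API, and it is one pass with no query planner in sight.

-- Hypothetical in-memory data: each customer's orders, oldest first.
local orders_by_customer = {
  alice = { { date = "2023-01-05", amount = 100 }, { date = "2023-02-01", amount = 80 } },
  bob   = { { date = "2023-01-10", amount = 50 },  { date = "2023-02-12", amount = 70 } }
}

-- Find the customers whose most recent order is smaller than the one before it.
local decreased = {}
for customer, orders in pairs(orders_by_customer) do
  local n = #orders
  if n >= 2 and orders[n].amount < orders[n - 1].amount then
    decreased[#decreased + 1] = customer
  end
end
-- decreased now holds { "alice" }: we told the machine exactly how, once.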
It’s not the fossil fuel industry killing the planet, it’s all those inefficient database queries running on ever-growing data that will doom us all.
Lightning Round
Lightning Round.
Idea Four
No More Database Drivers
Idea Four is No More Database Drivers. It’s just one more thing to get in the way. You’ll spend your time answering “oh sorry we don’t have a Go Driver or Rust Driver or
Zig Driver or Julia Driver or whatever the cool kids are using this month”…and you’ll have to hire a bunch of people to build and maintain these things. It’s going to cost a
lot of money and be a royal pain. Trust me on this one.
Some of Peter’s Ideas
Schema, Vectorization, JIT, SIMD
A sprinkle of Peter’s ideas: actual Schema, vectorized query execution where possible, Just-In-Time query compilation, taking advantage of SIMD where possible. I mean, sure, why not; these aren’t bad ideas.
Never trust vendor Benchmarks
Before I say anything more, please remember to never trust vendor benchmarks. Never ever.
Anyway, one day I got really mad at the performance I was getting. And I do mean really mad. Mad enough to write a few thousand lines of C code.
8.3m vs 330m r/s/c*
3m vs 175m r/s/c*
*Relationships Traversed Per Second Per Core
40-60x Faster
So I wrote the bare in-memory data structures needed to duplicate what Neo4j was doing in C and compared a couple of traversals. The top one goes through 50 million relationships per query; the second does the same, but checks a property on those relationships before traversing. From 8 million to 330 million. From 3 million to 175 million. That’s 40 to 60 times faster.
But I’m comparing apples and oranges. One is a database meant to handle any workload. The other is handcrafted code meant to handle two queries that we have
complete control over.
So does that mean everyone should just do a couple of shots and build their own handmade graph services? Not really. What it means is that there is plenty of room to
make the current databases better and build new and faster databases.
I got no patience and I hate waiting
Just like Jay-Z, I have no patience and I hate waiting.
We need to code today for a better tomorrow.
Rage DB
@rage_database
ragedb.com
GitHub.com/ragedb
hub.docker.com/u/ragedb
An outrageous graph database
So I started working on RageDB. Taking some of these outrageous ideas and implementing them.
Shellfish
Because I am Shellfish.
Sorry, I meant Selfish. Graphs are the only thing I know, and if the current vendors don’t fix their offerings then I might be in the same sinking ship as the Hadoop Experts.
I want to build 4 me
• Better performance
• A lot faster (hopefully)
• Can handle diverse workloads
• Properties in Traversals
• An easy interface
• HTTP + JSON
• A programming language
• For complex queries
A graph db that has:
I want to build for my needs. A graph database that is Faster, Better, Easier, and more Flexible by following some of the hardware trends we talked about in this
presentation.
“You can have a second computer once you’ve shown you know how to use the first one.” —Paul Barham
And planning for a scale-up system using lots of RAM and lots of cores on a single server. Replicated (eventually) but not Distributed.
Seastar
• Shared Nothing Multicore
• “Server per core”
• Message Passing
• Futures and Promises
• High Performance Networking
Framework
Using the Seastar framework with its “server per core” model, futures and promises, and high-performance networking.
We avoid shared memory and locking; think of each core as a server, message-passing events within the physical box instead of via the network. No ACID needed (maybe).
On 4 Cores
190k Requests / Second
Stupid fast, with latencies low enough for AdTech use cases.
On 4 Cores with DPDK
280k Requests / Second
We can use DPDK (Data Plane Development Kit) to go even faster, skipping the network driver and talking to the network card directly… even on the Cloud. Yes, I’m only getting an empty node, but the other graph databases can’t even say hello that fast.
Schema
• Nodes have a single Type
• No multiple labels
• Properties have a Type
• Bool, Int, Double, String, List
• Nodes of the same Type have the same properties
• Like any sane database
Not Optional
With a Schema, because in the real world, data has schema. A single type for Nodes and Relationships, because multiple labels were a terrible mistake. Let’s make
things sane again.
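If it helps to picture what that schema looks like in practice, here is a hedged sketch of declaring a type up front. The function names are illustrative placeholders, not the documented API.

-- Hypothetical schema calls: declare a node type and its typed properties.
NodeTypeInsert("User")
NodePropertyTypeAdd("User", "name", "string")
NodePropertyTypeAdd("User", "age", "integer")
-- From here on, every User node has exactly these properties. Like any sane database.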
HTTP + JSON
• You can talk to it from your browser
• You can talk to it from any programming language
• No drivers needed, no custom protocol
Universal
Let’s talk via HTTP and JSON, from any language, no drivers needed, no custom binary protocols, you can even talk to it from your browser window.
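As a sketch of how universal that is, here is plain Lua talking to the server over HTTP with LuaSocket. No driver, just a request. The host, port, and paths below are hypothetical placeholders, not documented endpoints.

-- A minimal HTTP client in plain Lua using LuaSocket (luarocks install luasocket).
local http = require("socket.http")

-- GET a node as JSON; any language with an HTTP client can do the same.
local body, code = http.request("http://localhost:7243/db/rage/node/User/max")
print(code, body)

-- POST a Lua query as the request body; the response comes back as JSON.
local result = http.request("http://localhost:7243/db/rage/lua", 'NodeGet("User", "max")')
print(result)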
Lua
• Proven
• Used in embedded systems and games
• Fast
• Fastest scripting language I know of, and using LuaJIT
• Powerful, small and free (MIT)
“Moon” in Portuguese
Using Lua as the Query Language because it’s proven in the field and used in embedded systems and games where performance matters. Using LuaJIT, the fastest scripting language I know of.
Lua
• Simple Queries
As a Query Language
We’ll take whatever the last line of the query is and turn it into JSON. For example, getting a node, as in the sketch below.
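Something like this minimal sketch, assuming a NodeGet helper along these lines (treat the exact name and signature as illustrative):

-- The whole query. The last line of the script becomes the JSON response.
NodeGet("User", "max")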
Lua
• Simple Queries
• Pipelined Queries
As a Query Language
Or doing a bunch of stuff, related or unrelated, in a pipeline or batch.
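A hedged sketch of what that could look like, several operations sharing one request (function names illustrative):

-- Several operations, related or not, in a single batch.
NodeAdd("User", "ray", [[{"name":"Ray"}]])
RelationshipAdd("FOLLOWS", "User", "max", "User", "ray")
-- Whatever the last line evaluates to is turned into JSON and returned.
NodeGetRelationships("User", "max")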
Lua
• Simple Queries
• Pipelined Queries
• Complex Queries
As a Query Language
You have a real programming language to do complex queries, plus helper functions for accessing the database and soon-to-come vectorized procedures for faster data processing.
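For instance, the two-hop Spielberg count from earlier could be written as plain loops, along these lines. This is a sketch: I’m assuming a neighbor-ids helper exists in some form, and the names and overloads here are illustrative.

-- Two hops: director -> movies directed -> actors in those movies, counted once each.
local seen, count = {}, 0
for _, movie in ipairs(NodeGetNeighborIds("Director", "Steven Spielberg", "DIRECTED")) do
  for _, actor in ipairs(NodeGetNeighborIds(movie, "ACTED_IN")) do
    if not seen[actor] then
      seen[actor] = true
      count = count + 1
    end
  end
end
-- The last line of the script is the JSON response, per the convention above.
count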
Look at that pretty UI. I built that myself. Let’s traverse 50M relationships in 10 seconds. Too Slow?
Remember about 100 slides ago when Peter Boncz was complaining about graph databases not having bulk APIs? Turns out he was right. Here we can go about 5x faster by traversing in bulk instead of one at a time. It makes the query simpler too.
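A sketch of the same count in bulk form: hand the whole frontier of ids to one call instead of looping node by node. The plural bulk helper is my own illustrative name for whatever shape the bulk API takes.

-- Get the whole frontier in one call, then all of its neighbors in one more call.
local frontier = NodeGetNeighborIds("Director", "Steven Spielberg", "DIRECTED")
-- Hypothetical bulk helper: takes a list of ids, returns a table of id -> neighbor ids.
local neighbors = NodesGetNeighborIds(frontier, "ACTED_IN")

local seen, count = {}, 0
for _, ids in pairs(neighbors) do
  for _, actor in ipairs(ids) do
    if not seen[actor] then
      seen[actor] = true
      count = count + 1
    end
  end
end
count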
Oh hey I forgot to talk about Dgraph and GraphQL. Do we really need it here? We are already returning JSON and can return it in any way we want. A single request can
be one query or one hundred, related or not.
SIMD
• Already in Find with Predicate
• Will be added to Math and Data Manipulation Functions
• Sprinkled in wherever it can to speed things up
For Vectorized Execution
Borrowing the EVE library for SIMD vectorized execution. It is already making finding nodes and relationships with a predicate faster; math and data manipulation functions will be added as well, sprinkling it in wherever we can.
4 Layer Design
HTTP
Lua (in Thread)
Peered
Shard
A very simple 4-layer design. HTTP in the front, Lua (if needed) in Thread, a Peered method to coordinate multi-shard requests, and a Shard layer to actually work with the data.
Blog Posts
maxdemarzi.com
I’ve been writing up my progress on my blog at maxdemarzi.com so you don’t walk blind into a 20,000-line C++ codebase. It’s a little behind where the code base is, but it will catch up soon.
Bookmark the website today, it’s RageDB.com
Apache License 2.0. Pinch the person sitting beside you; they aren’t dreaming. My employer allowed me to release this software as Open Source.
“Todo” in Spanish means “all of it”.
So there is still a ton of things to do.
Of course I’m looking for help.
Todos
• C++ Dev: ragedb
• Java Dev: rage-assured
• Scala Dev: benchmarks
• JavaScript Dev: UI
• DevRel: Home Page
• DevOps: Docker + Packaging
• Anyone: Use it, report bugs, request features
Means all of us
Just remember that “todos” in Spanish means all of us, whatever your skill set is, I have something you can help with.
Rage DB
@rage_database
ragedb.com
GitHub.com/ragedb
hub.docker.com/u/ragedb
An outrageous graph database
So we can build an outrageous database together. Thank you.
Más contenido relacionado

La actualidad más candente

Improve monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss toolsImprove monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss toolsNilesh Gule
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
Personalized search
Personalized searchPersonalized search
Personalized searchToine Bogers
 
Platform engineering 101
Platform engineering 101Platform engineering 101
Platform engineering 101Sander Knape
 
Monitoring real-life Azure applications: When to use what and why
Monitoring real-life Azure applications: When to use what and whyMonitoring real-life Azure applications: When to use what and why
Monitoring real-life Azure applications: When to use what and whyKarl Ots
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsEllen Friedman
 
DataOps introduction : DataOps is not only DevOps applied to data!
DataOps introduction : DataOps is not only DevOps applied to data!DataOps introduction : DataOps is not only DevOps applied to data!
DataOps introduction : DataOps is not only DevOps applied to data!Adrien Blind
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudHow to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudDenodo
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringDurga Gadiraju
 
Media Processing Workflows at High Velocity and Scale using AI and ML - AWS O...
Media Processing Workflows at High Velocity and Scale using AI and ML - AWS O...Media Processing Workflows at High Velocity and Scale using AI and ML - AWS O...
Media Processing Workflows at High Velocity and Scale using AI and ML - AWS O...Amazon Web Services
 
How to apply machine learning into your CI/CD pipeline
How to apply machine learning into your CI/CD pipelineHow to apply machine learning into your CI/CD pipeline
How to apply machine learning into your CI/CD pipelineAlon Weiss
 
Business Intelligence tools comparison
Business Intelligence tools comparisonBusiness Intelligence tools comparison
Business Intelligence tools comparisonStratebi
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...Databricks
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 

La actualidad más candente (20)

Observability & Datadog
Observability & DatadogObservability & Datadog
Observability & Datadog
 
Data Engineering Basics
Data Engineering BasicsData Engineering Basics
Data Engineering Basics
 
DevOps Architecture Design
DevOps Architecture DesignDevOps Architecture Design
DevOps Architecture Design
 
Improve monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss toolsImprove monitoring and observability for kubernetes with oss tools
Improve monitoring and observability for kubernetes with oss tools
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Personalized search
Personalized searchPersonalized search
Personalized search
 
Platform engineering 101
Platform engineering 101Platform engineering 101
Platform engineering 101
 
Monitoring real-life Azure applications: When to use what and why
Monitoring real-life Azure applications: When to use what and whyMonitoring real-life Azure applications: When to use what and why
Monitoring real-life Azure applications: When to use what and why
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
 
DataOps introduction : DataOps is not only DevOps applied to data!
DataOps introduction : DataOps is not only DevOps applied to data!DataOps introduction : DataOps is not only DevOps applied to data!
DataOps introduction : DataOps is not only DevOps applied to data!
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudHow to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Media Processing Workflows at High Velocity and Scale using AI and ML - AWS O...
Media Processing Workflows at High Velocity and Scale using AI and ML - AWS O...Media Processing Workflows at High Velocity and Scale using AI and ML - AWS O...
Media Processing Workflows at High Velocity and Scale using AI and ML - AWS O...
 
How to apply machine learning into your CI/CD pipeline
How to apply machine learning into your CI/CD pipelineHow to apply machine learning into your CI/CD pipeline
How to apply machine learning into your CI/CD pipeline
 
Platform engineering
Platform engineeringPlatform engineering
Platform engineering
 
Business Intelligence tools comparison
Business Intelligence tools comparisonBusiness Intelligence tools comparison
Business Intelligence tools comparison
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 

Similar a DataDay 2023 Presentation - Notes

Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
Breaking first-normal form with Hive
Breaking first-normal form with HiveBreaking first-normal form with Hive
Breaking first-normal form with HiveEdward Capriolo
 
The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)Pavlo Baron
 
What does OOP stand for?
What does OOP stand for?What does OOP stand for?
What does OOP stand for?Colin Riley
 
Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2ice799
 
Raspberry pi education_manual
Raspberry pi education_manualRaspberry pi education_manual
Raspberry pi education_manualTry Fajarman
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesDatabricks
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++Mike Acton
 
Raspberry pi: Aprende raspberry pi con Linux por peter membrey y david hows.
Raspberry pi: Aprende raspberry pi con Linux por peter membrey y david hows.Raspberry pi: Aprende raspberry pi con Linux por peter membrey y david hows.
Raspberry pi: Aprende raspberry pi con Linux por peter membrey y david hows.SANTIAGO PABLO ALBERTO
 
Linked Data: The Real Web 2.0 (from 2008)
Linked Data: The Real Web 2.0 (from 2008)Linked Data: The Real Web 2.0 (from 2008)
Linked Data: The Real Web 2.0 (from 2008)Uche Ogbuji
 
Neurodb Engr245 2021 Lessons Learned
Neurodb Engr245 2021 Lessons LearnedNeurodb Engr245 2021 Lessons Learned
Neurodb Engr245 2021 Lessons LearnedStanford University
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10benoitg
 
Bender kuszmaul tutorial-xldb12
Bender kuszmaul tutorial-xldb12Bender kuszmaul tutorial-xldb12
Bender kuszmaul tutorial-xldb12Atner Yegorov
 
Data Structures and Algorithms for Big Databases
Data Structures and Algorithms for Big DatabasesData Structures and Algorithms for Big Databases
Data Structures and Algorithms for Big Databasesomnidba
 
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsPyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsUwe Korn
 
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...PyData
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformGeekNightHyderabad
 

Similar a DataDay 2023 Presentation - Notes (20)

Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Breaking first-normal form with Hive
Breaking first-normal form with HiveBreaking first-normal form with Hive
Breaking first-normal form with Hive
 
The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)
 
What does OOP stand for?
What does OOP stand for?What does OOP stand for?
What does OOP stand for?
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2
 
Raspberry pi education_manual
Raspberry pi education_manualRaspberry pi education_manual
Raspberry pi education_manual
 
Raspberry pi education_manual
Raspberry pi education_manualRaspberry pi education_manual
Raspberry pi education_manual
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Raspberry pi: Aprende raspberry pi con Linux por peter membrey y david hows.
Raspberry pi: Aprende raspberry pi con Linux por peter membrey y david hows.Raspberry pi: Aprende raspberry pi con Linux por peter membrey y david hows.
Raspberry pi: Aprende raspberry pi con Linux por peter membrey y david hows.
 
Aug 2012 HUG: Hug BigTop
Aug 2012 HUG: Hug BigTopAug 2012 HUG: Hug BigTop
Aug 2012 HUG: Hug BigTop
 
Linked Data: The Real Web 2.0 (from 2008)
Linked Data: The Real Web 2.0 (from 2008)Linked Data: The Real Web 2.0 (from 2008)
Linked Data: The Real Web 2.0 (from 2008)
 
Neurodb Engr245 2021 Lessons Learned
Neurodb Engr245 2021 Lessons LearnedNeurodb Engr245 2021 Lessons Learned
Neurodb Engr245 2021 Lessons Learned
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10
 
Bender kuszmaul tutorial-xldb12
Bender kuszmaul tutorial-xldb12Bender kuszmaul tutorial-xldb12
Bender kuszmaul tutorial-xldb12
 
Data Structures and Algorithms for Big Databases
Data Structures and Algorithms for Big DatabasesData Structures and Algorithms for Big Databases
Data Structures and Algorithms for Big Databases
 
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsPyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
 
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 

Más de Max De Marzi

DataDay 2023 Presentation
DataDay 2023 PresentationDataDay 2023 Presentation
DataDay 2023 PresentationMax De Marzi
 
Developer Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker NotesDeveloper Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker NotesMax De Marzi
 
Outrageous Ideas for Graph Databases
Outrageous Ideas for Graph DatabasesOutrageous Ideas for Graph Databases
Outrageous Ideas for Graph DatabasesMax De Marzi
 
Neo4j Training Cypher
Neo4j Training CypherNeo4j Training Cypher
Neo4j Training CypherMax De Marzi
 
Neo4j Training Modeling
Neo4j Training ModelingNeo4j Training Modeling
Neo4j Training ModelingMax De Marzi
 
Neo4j Training Introduction
Neo4j Training IntroductionNeo4j Training Introduction
Neo4j Training IntroductionMax De Marzi
 
Detenga el fraude complejo con Neo4j
Detenga el fraude complejo con Neo4jDetenga el fraude complejo con Neo4j
Detenga el fraude complejo con Neo4jMax De Marzi
 
Data Modeling Tricks for Neo4j
Data Modeling Tricks for Neo4jData Modeling Tricks for Neo4j
Data Modeling Tricks for Neo4jMax De Marzi
 
Fraud Detection and Neo4j
Fraud Detection and Neo4j Fraud Detection and Neo4j
Fraud Detection and Neo4j Max De Marzi
 
Detecion de Fraude con Neo4j
Detecion de Fraude con Neo4jDetecion de Fraude con Neo4j
Detecion de Fraude con Neo4jMax De Marzi
 
Neo4j Data Science Presentation
Neo4j Data Science PresentationNeo4j Data Science Presentation
Neo4j Data Science PresentationMax De Marzi
 
Neo4j Stored Procedure Training Part 2
Neo4j Stored Procedure Training Part 2Neo4j Stored Procedure Training Part 2
Neo4j Stored Procedure Training Part 2Max De Marzi
 
Neo4j Stored Procedure Training Part 1
Neo4j Stored Procedure Training Part 1Neo4j Stored Procedure Training Part 1
Neo4j Stored Procedure Training Part 1Max De Marzi
 
Decision Trees in Neo4j
Decision Trees in Neo4jDecision Trees in Neo4j
Decision Trees in Neo4jMax De Marzi
 
Neo4j y Fraude Spanish
Neo4j y Fraude SpanishNeo4j y Fraude Spanish
Neo4j y Fraude SpanishMax De Marzi
 
Data modeling with neo4j tutorial
Data modeling with neo4j tutorialData modeling with neo4j tutorial
Data modeling with neo4j tutorialMax De Marzi
 
Neo4j Fundamentals
Neo4j FundamentalsNeo4j Fundamentals
Neo4j FundamentalsMax De Marzi
 
Neo4j Presentation
Neo4j PresentationNeo4j Presentation
Neo4j PresentationMax De Marzi
 
Fraud Detection Class Slides
Fraud Detection Class SlidesFraud Detection Class Slides
Fraud Detection Class SlidesMax De Marzi
 

Más de Max De Marzi (20)

DataDay 2023 Presentation
DataDay 2023 PresentationDataDay 2023 Presentation
DataDay 2023 Presentation
 
Developer Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker NotesDeveloper Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker Notes
 
Outrageous Ideas for Graph Databases
Outrageous Ideas for Graph DatabasesOutrageous Ideas for Graph Databases
Outrageous Ideas for Graph Databases
 
Neo4j Training Cypher
Neo4j Training CypherNeo4j Training Cypher
Neo4j Training Cypher
 
Neo4j Training Modeling
Neo4j Training ModelingNeo4j Training Modeling
Neo4j Training Modeling
 
Neo4j Training Introduction
Neo4j Training IntroductionNeo4j Training Introduction
Neo4j Training Introduction
 
Detenga el fraude complejo con Neo4j
Detenga el fraude complejo con Neo4jDetenga el fraude complejo con Neo4j
Detenga el fraude complejo con Neo4j
 
Data Modeling Tricks for Neo4j
Data Modeling Tricks for Neo4jData Modeling Tricks for Neo4j
Data Modeling Tricks for Neo4j
 
Fraud Detection and Neo4j
Fraud Detection and Neo4j Fraud Detection and Neo4j
Fraud Detection and Neo4j
 
Detecion de Fraude con Neo4j
Detecion de Fraude con Neo4jDetecion de Fraude con Neo4j
Detecion de Fraude con Neo4j
 
Neo4j Data Science Presentation
Neo4j Data Science PresentationNeo4j Data Science Presentation
Neo4j Data Science Presentation
 
Neo4j Stored Procedure Training Part 2
Neo4j Stored Procedure Training Part 2Neo4j Stored Procedure Training Part 2
Neo4j Stored Procedure Training Part 2
 
Neo4j Stored Procedure Training Part 1
Neo4j Stored Procedure Training Part 1Neo4j Stored Procedure Training Part 1
Neo4j Stored Procedure Training Part 1
 
Decision Trees in Neo4j
Decision Trees in Neo4jDecision Trees in Neo4j
Decision Trees in Neo4j
 
Neo4j y Fraude Spanish
Neo4j y Fraude SpanishNeo4j y Fraude Spanish
Neo4j y Fraude Spanish
 
Data modeling with neo4j tutorial
Data modeling with neo4j tutorialData modeling with neo4j tutorial
Data modeling with neo4j tutorial
 
Neo4j Fundamentals
Neo4j FundamentalsNeo4j Fundamentals
Neo4j Fundamentals
 
Neo4j Presentation
Neo4j PresentationNeo4j Presentation
Neo4j Presentation
 
Fraud Detection Class Slides
Fraud Detection Class SlidesFraud Detection Class Slides
Fraud Detection Class Slides
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 

Último

Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 

Último (20)

Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 

DataDay 2023 Presentation - Notes

  • 25. It is at this point that we stand at a fork in the road ahead of us. There are two directions in which we could go. We could explore some of Peter’s suggestions. But then this talk would be called something like….
  • 26. Completely Sensible and Utterly Boring Ideas Data Day Texas - January 28, 2023 For Graph Databases Completely Sensible and Utterly Boring Ideas for Graph Databases. But it’s not.
  • 27. Outrageous Ideas Data Day Texas - January 28, 2023 For Graph Databases It’s called outrageous ideas, so let’s get on with it.
  • 28. Let’s go backwards, let’s go the wrong direction.
  • 29. For this we need a time machine. But not that one.
  • 30. Not that one either.
  • 31. We’re going to 1969. October 1969. So we’ll catch a ride with Bill and Ted instead.
  • 32. We are going back to our roots. The Codasyl Model. The Network Model… and in keeping with our theme of 1969 we are dropping acid.
  • 33. Drop ACID Idea One The first idea is to drop ACID, because in almost all use cases we are NOT the primary database.
  • 34. We are the Robin to the Batman. We are a sidekick.
  • 35. We are the Emotional Support Database. We help keep it together, but we are not the primary database of record.
  • 36. We are the Mini-Me to the Dr. Evil. We complete them, and as much as we may try to look like them, we aren’t.
  • 37. Vendor: I bet they are thinking about buying a Graph Database Customer: Why did someone take a photo of us trying to sleep? No Customer lies in bed at night thinking about buying a graph database. Let’s face it. They already have a database. But it can’t satisfy all their needs. They already tried some kinky solutions like denormalizing data, adding materialized views, and clustered indexes, but it didn’t do the trick and now they need something new to spice things up. But we’re there to help, not take over.
  • 38. Drop The “D” So let’s start by dropping the Durability.
  • 39. The hardware vendors already did.
  • 40. Many graph database vendors have built or are building distributed systems.
  • 41. I only know 2 things about distributed systems.
  • 42. 1. They introduce a lot of overhead. Frank McSherry, in his "Scalability! But at what COST?" paper, showed how terrible some of these so-called scalable distributed systems really are.
  • 43. ..and the second thing I know is: They are Hard. Hard to build, Hard to maintain, Hard to reason about. Hard.
  • 44. But what is Harder than hard?
  • 45. Distributed Graphs NP-Hard Distributing a graph is NP-Hard. It doesn't matter if P equals NP or not. Splitting up a graph is still going to be NP-Hard.
  • 46. Let's talk about A1, a 2020 paper about a distributed in-memory graph database from Microsoft. I'll skip the details and jump right into the performance testing, for which they went all out. They built a cluster of 245 machines with Intel E5-2673 processors. I had to look that one up.
  • 47. 12 Cores x 245 Servers = 2,940 Cores It's a 12-core Haswell. They have 245 servers, so a total of 2,940 cores. Oh wait a second.
  • 48. Two… They had 2 of these per server.
  • 49. 12 Cores x 2 Sockets x 245 Servers = 5,880 Cores So 5,880 cores. Almost 6,000 cores on this cluster. This is the ultimate dream for a lot of people: a massively distributed in-memory graph database. Can you imagine what kind of performance they got? Well you don't have to imagine, because the paper tells us.
  • 50. 2 hop Query They performed a two-hop query. Start with Steven Spielberg, go to the movies he directed and then to the actors who were in those movies and get a count. They managed 20,000 queries per second.
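For a sense of scale, here is the shape of that two-hop count as a traversal, sketched in Lua (the style of query script RageDB, which shows up later in this talk, uses). The function names are illustrative assumptions, not any vendor's actual API:

      -- two hops out from one node, then a deduplicated count (illustrative names)
      local spielberg = NodeGetId("Person", "Steven Spielberg")
      local actors = {}
      for _, movie in ipairs(NodeGetNeighborIds(spielberg, "DIRECTED")) do    -- hop 1
        for _, actor in ipairs(NodeGetNeighborIds(movie, "ACTED_IN")) do      -- hop 2
          actors[actor] = true                                                -- dedupe
        end
      end
      local count = 0
      for _ in pairs(actors) do count = count + 1 end
      count    -- the last line of the script is the result

Two loops and a set. Keep that in mind when you see the numbers on the next slide.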
  • 51. 20,000 Queries per Second on 5,880 Cores They managed 20,000 queries per second with almost 6,000 cores. That works out to roughly 3.4 queries per second per core, for a two-hop count.
  • 53. They distributed the nodes randomly across the cluster. Can you imagine? Every single time they traverse a relationship they have to take a network hit. My mind is blown, hope yours is too.
  • 54. Distribute On Cores not on Servers Idea Two So idea number two. Distribute on Cores, and not on Servers.
  • 55. Why are we here? It’s the Big Question. We aren’t here to have an existential crisis. I’m talking about why are you here at this tech conference? I’ll tell you why.
  • 56. To prepare for the future. To do that, we have to answer one simple question:
  • 57. Before the future comes the present, and today Intel Xeon processors have up to 60 cores.
  • 58. Who knows how many cores they will have in the future?
  • 59. The internet knows. Late this year we get 64 cores, in 2024 we're getting 128 cores, and soon thereafter at least 344 cores, with a potential for 512 or 528 cores according to internal leaks at Intel. https://www.youtube.com/watch?v=h20inMLeDnE
  • 60. Today AMD processors have up to 64 Cores, but by the middle of this year…
  • 61. They’ll be cranking 128 cores, and who knows how many in the future.
  • 62. 256 Cores In 2024 AMD will release Zen 5 with up to 256 cores to select customers.
  • 63. 384 !!! Then somewhere between 2024 and 2025 we will start to see 384-core chips!
  • 64. 64 cores in the cloud. Hold on, you say. You need big RAM to feed all these cores?
  • 65. Take a look at this beauty. Oh not this kind of RAM?
  • 66. 4TB Computer RAM, ok. How about 4TB today on a single socket? Is your graph bigger than 4TB? Tomorrow that will be 8TB and before you know it 32 and 64TB.
  • 67. 24TB 11TB 11TB What about the cloud? It’s raining terabytes and the forecast is for more.
  • 68. …and that’s not all. Much like SANs today can let you use a scalable shared pool of hard drive space across a network, CXL technology will let you use a scalable shared pool of memory across a network.
  • 69. If you want to learn more, watch this presentation from Gustavo Alonso. https://www.youtube.com/watch?v=KekKAKI0Aho
  • 70. At Google, 90% of all analytics workloads operate on less than 1 TB of data. Dr. Hannes Mühleisen, creator of DuckDB, reminding us that at Google, 90% of all analytics workloads operate on less than 1 terabyte of data.
  • 71. Does your data fit in a single server today? Will it fit in a single server tomorrow?
  • 72. Let’s talk about Query Languages.
  • 73. You don’t have a single gremlin, you have many of them. The Groovy one, the Python one, the Ruby one, the Scala one, the Rust one, they all look similar but they aren’t the same.
  • 74. These back-ends are implemented by a bunch of different Vendors.
  • 75. Tinkerpop Standard? Around 100 vendor-dependent features Do they allow Lambdas? What kind of Indexing? But is it the Standard? No way. Each vendor picks which combination of those 100 features they support, along with a bunch of other differences amongst them, like whether lambdas are allowed and what indexing runs behind the scenes. This is what Peter was complaining about earlier. What I know is that Gremlin is good at two things:
  • 76. One is giving developers impostor syndrome: it is so hard to learn that it turns many people away from graphs.
  • 77. The second thing Gremlin is good at is allowing those that do make it through the learning curve to start thinking in paths. Start thinking "depth-first", which is an important concept to understand when it comes to graph queries. So it's not all bad.
  • 78. Then we have Cypher. Here he is eating the juicy steak in the matrix. It tastes so good, but you know it’s not real.
  • 79. Customer Workloads • Between a Dozen and a Hundred Trivial Queries • Between 0 and a Dozen Non-Trivial Queries • A lucky few have All Trivial Queries • Most have 1 Non-Trivial Query and small variations Cypher can handle the trivial queries just fine. Some customers have all trivial queries and are blissfully happy. But most have at least 1 big non-trivial query. That recommendation engine, that shortest-path-finding query, that multi-source bi-directional weighted traversal, etc. This is where Cypher dies. Literally. He gets electrocuted by Tank.
  • 80. So when that happens, we have APOC! Awesome Procedures On Cypher. A library of 450-plus Java stored procedures that actually make Cypher usable out of the matrix and in the real world.
  • 81. What about GSQL? This is LDBC short query 2 (i_short_2), reflowed so you can at least read it:

      use graph ldbc
      drop query i_short_2
      create query i_short_2(INT vid) for graph ldbc {
        SetAccum<INT> @@postSet;
        SetAccum<INT> @@commentsSet;
        SetAccum<INT> @@creatorSet;
        SetAccum<INT> @@messageSet;
        SetAccum<INT> @@replySet;
        SetAccum<INT> @@postFromReplySet;
        SetAccum<INT> @@replyToPostSet;
        SumAccum<INT> @@current;
        SetAccum<INT> @@resultID;
        SetAccum<INT> @@visitedSet;
        SumAccum<INT> @postID;
        SumAccum<INT> @creatorID;
        SumAccum<STRING> @creatorFirst;
        SumAccum<STRING> @creatorLast;
        INT tempMessageID; INT tempCreator; STRING tempFirst; STRING tempLast;
        INT postID; INT tempPostID; INT length; INT size; INT cur;

        Person = {person.*}; Creator = {person.*};
        Message = {post.*, comments.*}; Prev = {comments.*}; Post = {post.*};
        Comments = {comments.*}; Reply = {comments.*}; Reply1 = {comments.*};
        ReplyToPost = {comments.*}; Result = {post.*, comments.*}; CurrentReply = {comments.*};
        length = Comments.size();

        //get person from vid
        Person = SELECT s FROM Person:s WHERE s.id == vid;

        //get latest message
        Message = SELECT s FROM Message:s-((post_hasCreator_person|comments_hasCreator_person):e)->person:t
                  WHERE t.id == vid ORDER BY s.creationDate DESC LIMIT 10;
        Message = SELECT s FROM Message:s ACCUM @@messageSet += s.id, @@visitedSet += s.id;
        PostSet = SELECT s FROM Message:s-(post_hasCreator_person)->:t ACCUM @@postSet += s.id;

        //get comment in message
        Reply = SELECT s FROM Message:s-(comments_hasCreator_person)->:t WHERE t.id == vid
                ACCUM @@replySet += s.id, @@visitedSet += s.id;
        Reply1 = SELECT s FROM Comments:s WHERE s.id IN @@replySet;
        ReplyToPost = SELECT s FROM Reply1:s-(comments_replyOf_post)->:t
                      ACCUM @@replyToPostSet += s.id, @@visitedSet += s.id;

        //for each comment in message, get 1 hop comment to post
        FOREACH item IN @@replySet DO
          IF item != -1 THEN
            CurrentReply = SELECT s FROM Reply1:s WHERE s.id == item;
            size = CurrentReply.size();
            WHILE size != 0 LIMIT 100 DO
              Prev = SELECT s FROM CurrentReply:s ACCUM cur = s.id;
              CurrentReply = SELECT t FROM Comments:s-(comments_replyOf_comments)->:t WHERE s.id == cur
                             ACCUM @@visitedSet += t.id;
              size = CurrentReply.size();
              IF size == 0 THEN BREAK; END;
            END;
            CurrentReply = SELECT s FROM Prev:s ACCUM @@replyToPostSet += s.id;
          END;
        END;

        //get post from 1 hop comment
        Post = SELECT s FROM Post:s-(comments_replyOf_post_reverse)->:t WHERE t.id IN @@replyToPostSet
               ACCUM @@postFromReplySet += s.id;

        //get post creator info
        Post = SELECT s FROM Post:s-(post_hasCreator_person)->:t
               ACCUM s.@creatorID = t.id, s.@creatorFirst = t.firstName, s.@creatorLast = t.lastName;

        //pass person info and postID to 1 hop comment
        ReplyToPost = SELECT t FROM Post:s-(comments_replyOf_post_reverse)->:t
                      ACCUM t.@postID = s.id, t.@creatorID = s.@creatorID,
                            t.@creatorFirst = s.@creatorFirst, t.@creatorLast = s.@creatorLast,
                            @@replyToPostSet += t.id;

        //the foreach block passes person info and postID to visited comments in post
        FOREACH item IN @@replyToPostSet DO
          IF item != 0 THEN
            Temp = SELECT s FROM ReplyToPost:s WHERE s.id == item
                   ACCUM tempMessageID = s.id, tempCreator = s.@creatorID,
                         tempFirst = s.@creatorFirst, tempLast = s.@creatorLast, tempPostID = s.@postID;
            //save person info and PostID from 1 hop comments to message set
            Result = SELECT s FROM Result:s WHERE s.id IN @@visitedSet
                     ACCUM CASE WHEN s.id == item THEN
                       s.@creatorID = tempCreator, s.@creatorFirst = tempFirst,
                       s.@creatorLast = tempLast, s.@postID = tempPostID, @@resultID += s.id
                     END;
            size = Temp.size();
            //filter result set by visited comments
            Result = SELECT s FROM Result:s WHERE s.id IN @@visitedSet;
            //pass post creator info to all visited comments
            WHILE size != 0 LIMIT 100 DO
              TempReplyTemp = SELECT t FROM Temp:s-(comments_replyOf_comments_reverse)->:t
                              ACCUM tempMessageID = s.@creatorID, tempFirst = s.@creatorFirst,
                                    tempLast = s.@creatorLast, tempPostID = s.@postID;
              IF TempReplyTemp.size() == 1 THEN
                Result = SELECT s FROM Result:s
                         ACCUM CASE WHEN s.id == tempMessageID THEN
                           s.@creatorID = tempCreator, s.@creatorFirst = tempFirst,
                           s.@creatorLast = tempLast, s.@postID = postID
                         END;
                size = TempReplyTemp.size();
              END;
            END;
          END;
        END;

        //pass post creator to post in message set
        FOREACH item IN @@postSet DO
          IF item != -1 THEN
            TempPost = SELECT s FROM Result:s-(post_hasCreator_person)->:t WHERE s.id == item
                       ACCUM tempCreator = t.id, tempFirst = t.firstName, tempLast = t.lastName;
            Result = SELECT s FROM Result:s
                     ACCUM CASE WHEN s.id IN @@postSet THEN
                       s.@postID = s.id, s.@creatorID = tempCreator,
                       s.@creatorFirst = tempFirst, s.@creatorLast = tempLast
                     END;
          END;
        END;

        Result = SELECT s FROM Result:s WHERE s.id IN @@messageSet ORDER BY s.creationDate DESC, s.id DESC;
        PRINT Result.id, Result.content, Result.imageFile, Result.creationDate,
              Result.@postID, Result.@creatorID, Result.@creatorFirst, Result.@creatorLast;
      }
      install query i_short_2

GSQL can't decide if it's a query language or a programming language, so it just kind of accumulates a lot of lines of code, and it's a pain to work with for all but the people who get paid by the hour to write this stuff.
  • 82. So GQL? That's the new standard, like SQL, that the vendors have been building? The problem here is that it will still need APOC. Or APOG, I guess, and then you can kiss your standard goodbye.
  • 83. Programming Languages instead of Query Languages Idea Three Idea Three is to use actual programming languages instead of query languages.
  • 84. There is a blog post from Ted Neward called "The Vietnam of Computer Science", about the quagmire of ORMs and Relational Databases. This is my spin on the subject, applied to Declarative Query Languages.
  • 85. The Lie • In Declarative Query Languages (like SQL, Cypher, GQL, etc) developers are supposed to: • specify what is to be done • instead of how to do it. Let's start off with the L I E. Can you spot it? It's subtle. It says "In Declarative Query Languages developers are supposed to specify what is to be done instead of how to do it".
  • 86. The Problem • Find the customers who decreased their purchase amounts on their most recent order • A contest for who could beat Joe Celko performance-wise on 10k rows of data A "simple" query Let's look at an example. The problem is a simple query: find the customers who ordered less on their most recent order than on the one before it. This was the subject of a contest Joe Celko ran back in the day to see who could write a faster query on 10k rows of data. Look at that horrible mess; that was Joe's query. https://www.red-gate.com/simple-talk/databases/sql-server/t-sql-programming-sql-server/celkos-sql-stumper-the-data-warehouse-problem/
  • 87. 44 Different • There are at least 44 different ways to write: "Find the customers who decreased their purchase amounts on their most recent order" • 30 Unique Timings • At least 30 ways for the Query Planner and Optimizer to execute Queries I remember this challenge because I entered two queries. There were 44 in total: 44 different ways to write that sentence in SQL, and 30 unique timings to go with them. So at least 30 ways for the query planner and query optimizer to execute those queries. The queries range in performance from 46ms to 10 seconds, just on 10 thousand rows of data. Can you imagine the timing range on 10 million rows? The fastest queries are 10x faster than the middle of the pack and 20x faster than all but the worst, which we will ignore because Ramesh was probably trolling.
  • 88. You end up not only having to be an expert in the query language, but also how to manipulate the query planner and query optimizer to take full advantage of the mechanical sympathy of the database engine to run your queries optimally. This is worse than just telling the database how to execute the query.
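To make that concrete, here is what "telling the database how" looks like for the Celko problem, sketched in Lua. AllNodes and GetOrdersNewestFirst are hypothetical helpers I am assuming for illustration, not a real API:

      -- one explicit plan: for each customer, compare the two most recent orders
      local shrinking = {}
      for _, customer in ipairs(AllNodes("Customer")) do        -- hypothetical scan
        local orders = GetOrdersNewestFirst(customer)           -- hypothetical accessor
        if #orders >= 2 and orders[1].amount < orders[2].amount then
          table.insert(shrinking, customer)
        end
      end
      shrinking    -- the result

There is exactly one way this executes. No 44 variants, no optimizer roulette.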
  • 89. It's not the fossil fuel industry killing the planet, it's all those inefficient database queries running on ever-growing data that will doom us all.
  • 91. Idea Four No More Database Drivers Idea Four is No More Database Drivers. It’s just one more thing to get in the way. You’ll spend your time answering “oh sorry we don’t have a Go Driver or Rust Driver or Zig Driver or Julia Driver or whatever the cool kids are using this month”…and you’ll have to hire a bunch of people to build and maintain these things. It’s going to cost a lot of money and be a royal pain. Trust me on this one.
  • 92. Some of Peter's Ideas Schema, Vectorization, JIT, SIMD A sprinkle of Peter's ideas: actual Schema, vectorized query execution where possible, Just-In-Time query compilation, and taking advantage of SIMD where possible. I mean sure, why not, these aren't bad ideas.
  • 93. Never trust vendor Benchmarks Before I say anything more, please remember to never trust vendor benchmarks. Never ever.
  • 94. Anyway, one day I got really mad at the performance I was getting. And I do mean really mad. Mad enough to write a few thousand lines of C code.
  • 95. 8.3m vs 330m r/s/c* 3m vs 175m r/s/c* *Relationships Traversed Per Second Per Core 40-60x Faster So I wrote the bare in-memory data structures needed to duplicate what Neo4j was doing in C and compared a couple of traversals. The top one goes through 50 million relationships per query; the second does the same, but checks a property on those relationships before traversing. From 8.3 million to 330 million. From 3 million to 175 million. That's roughly 40 to 60 times faster.
  • 96. But I’m comparing apples and oranges. One is a database meant to handle any workload. The other is handcrafted code meant to handle two queries that we have complete control over.
  • 97. So does that mean everyone should just do a couple of shots and build their own handmade graph services? Not really. What it means is that there is plenty of room to make the current databases better and build new and faster databases.
  • 98. I got no patience and I hate waiting Just like Jay-Z. I have no patience and I hate waiting.
  • 99. We need to code today for a better tomorrow.
  • 100. Rage DB @rage_database ragedb.com GitHub.com/ragedb hub.docker.com/u/ragedb An outrageous graph database So I started working on RageDB. Taking some of these outrageous ideas and implementing them.
  • 101. Shellfish Because I am shellfish.
  • 102. Sorry, I meant selfish. Graphs are the only thing I know and if the current vendors don't fix their offerings then I might be in the same sinking ship as the Hadoop Experts.
  • 103. I want to build 4 me A graph db that has: • Better performance • A lot faster (hopefully) • Can handle diverse workloads • Properties in Traversals • An easy interface • HTTP + JSON • A programming language • For complex queries I want to build for my needs. A graph database that is Faster, Better, Easier, and more Flexible by following some of the hardware trends we talked about in this presentation.
  • 104. "You can have a second computer once you've shown you know how to use the first one." (Paul Barham) And planning for a Scale Up System using Lots of RAM, Lots of Cores, on a Single Server. Replicated (eventually) but not Distributed.
  • 105. Seastar • Shared Nothing Multicore • "Server per core" • Message Passing • Futures and Promises • High Performance Networking Framework Using the Seastar framework with its "server per core" model, futures and promises, and high performance networking.
  • 106. We avoid shared memory and locking: think of each core as a server, message-passing events within the physical box instead of over the network. No ACID needed (maybe).
  • 107. On 4 Cores 190k Requests / Second Stupid fast, with latencies low enough for AdTech use cases.
  • 108. On 4 Cores with DPDK 280k Requests / Second We can use DPDK (Data Plane Development Kit) to go even faster, skipping the network driver and talking to the network card directly… even on the Cloud. Yes, I'm only getting an empty node, but the other graph databases can't even say hello that fast.
  • 109. Schema • Nodes have a single Type • No multiple labels • Properties have a Type • Bool, Int, Double, String, List • Nodes of the same Type have the same properties • Like any sane database Not Optional With a Schema, because in the real world, data has schema. A single type for Nodes and Relationships, because multiple labels were a terrible mistake. Let’s make things sane again.
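A minimal sketch of what schema-first could look like from the Lua side. The function names are my assumption of the shape of the API, not a documented reference:

      -- declare the type and its typed properties up front (illustrative names)
      NodeTypeInsert("User")
      NodePropertyTypeAdd("User", "name", "string")
      NodePropertyTypeAdd("User", "age", "integer")
      -- every User now has exactly these properties, with exactly these types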
  • 110. HTTP + JSON • You can talk to it from your browser • You can talk to it from any programming language • No drivers needed, no custom protocol Universal Let’s talk via HTTP and JSON, from any language, no drivers needed, no custom binary protocols, you can even talk to it from your browser window.
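What the no-driver story could look like on the wire: a minimal sketch of one request and response, assuming a hypothetical script endpoint (the path and port are illustrative, not documented):

      POST /db/rage/lua HTTP/1.1
      Host: localhost:7243
      Content-Type: text/plain

      "Hello" .. " Graphs"

      HTTP/1.1 200 OK
      Content-Type: application/json

      "Hello Graphs"

Any language with an HTTP client, which is to say every language, already has a driver.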
  • 111. Lua • Proven • Used in embedded systems and games • Fast • Fastest scripting language I know of, and using LuaJIT • Powerful, small and free (MIT) "Moon" in Portuguese Using Lua as the Query Language because it's proven in the field and used in embedded systems and games where performance matters. Using LuaJIT, the fastest scripting language I know of.
  • 112. Lua • Simple Queries As a Query Language We'll take whatever the last line of the query is and turn it into JSON. For example, getting a node, sketched below.
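A sketch of the simplest case, assuming a NodeGet-style accessor (the exact name is my assumption):

      -- the last line of the script is what gets turned into JSON
      NodeGet("User", "max")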
  • 113. Lua • Simple Queries • Pipelined Queries As a Query Language Or doing a bunch of stuff, related or unrelated, in a pipeline or batch.
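A pipelined sketch, again with assumed function names: several writes in one round trip, and the whole batch back as one JSON response:

      -- a small batch in one round trip (illustrative names)
      local max = NodeAdd("User", "max", '{"name":"Max"}')
      local helene = NodeAdd("User", "helene", '{"name":"Helene"}')
      local follows = RelationshipAdd("FOLLOWS", "User", "max", "User", "helene")
      {max, helene, follows}    -- three operations in, one JSON array out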
  • 114. Lua • Simple Queries • Pipelined Queries • Complex Queries As a Query Language You have a real programming language to do complex queries, plus helper functions for accessing the database and soon-to-come vectorized procedures for faster data processing.
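And a complex one: plain Lua control flow doing a small recommendation, no query-language extension mechanism required (traversal function names still illustrative):

      -- recommend who to follow: friends-of-friends ranked by mutual count
      local start = NodeGetId("User", "max")
      local scores = {}
      for _, friend in ipairs(NodeGetNeighborIds(start, "FOLLOWS")) do
        for _, fof in ipairs(NodeGetNeighborIds(friend, "FOLLOWS")) do
          if fof ~= start then scores[fof] = (scores[fof] or 0) + 1 end
        end
      end
      local ranked = {}
      for id, score in pairs(scores) do table.insert(ranked, {id = id, score = score}) end
      table.sort(ranked, function(a, b) return a.score > b.score end)
      local top = {}
      for i = 1, math.min(10, #ranked) do top[i] = ranked[i] end
      top    -- top 10 out as JSON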
  • 115. Look at that pretty UI. I built that myself. Let’s traverse 50M relationships in 10 seconds. Too Slow?
  • 116. Remember about 100 slides ago when Peter Boncz was complaining about graph databases not having Bulk APIs? Turns out he was right. Here we can go about 5x faster by traversing in bulk instead of one at a time. Makes the query simpler too.
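The difference, in sketch form. The bulk function name is hypothetical; the point is one call per frontier instead of one call per node:

      local start = NodeGetId("User", "max")
      -- one at a time: a call per node in the frontier
      local slow = {}
      for _, id in ipairs(NodeGetNeighborIds(start, "FOLLOWS")) do
        slow[#slow + 1] = NodeGetNeighborIds(id, "FOLLOWS")
      end
      -- in bulk: hand the engine the whole frontier in one call
      local fast = NodeIdsGetNeighborIds(NodeGetNeighborIds(start, "FOLLOWS"), "FOLLOWS")
      fast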
  • 117. Oh hey I forgot to talk about Dgraph and GraphQL. Do we really need it here? We are already returning JSON and can return it in any way we want. A single request can be one query or one hundred, related or not.
  • 118. SIMD • Already in Find with Predicate • Will be added to Math and Data Manipulation Functions • Sprinkled in wherever it can to speed things up For Vectorized Execution Borrowing the EVE library for SIMD vectorized execution. It's already making finding nodes and relationships with a predicate faster; we will be adding it to math and data manipulation functions as well, sprinkling it in wherever we can.
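From the query side, that can surface as a find-with-predicate call. A sketch with an assumed signature:

      -- scan a typed property column with a predicate;
      -- the engine can vectorize the scan with SIMD (hypothetical signature)
      FindNodes("User", "age", "gt", 30)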
  • 119. 4 Layer Design HTTP Lua (in Thread) Peered Shard A very simple 4-layer design: HTTP in the front, Lua (if needed) in a thread, a Peered layer to coordinate multi-shard requests, and a Shard layer to actually work with the data.
  • 120. Blog Posts maxdemarzi.com I've been writing up my progress on my blog at maxdemarzi.com so you don't walk blind into a 20,000-line C++ codebase. The posts are a little behind where the code base is, but I will catch up soon.
  • 121. Bookmark the website today, it’s RageDB.com
  • 122. Apache License 2.0. Pinch the person sitting beside you, they aren't dreaming. My employer allowed me to release this software as Open Source.
  • 123. Todo in Spanish means “All of It”.
  • 124. So there is still a ton of things to do.
  • 125. Of course I’m looking for help.
  • 126. Todos • C++ Dev: ragedb • Java Dev: rage-assured • Scala Dev: benchmarks • JavaScript Dev: UI • DevRel: Home Page • DevOps: Docker + Packaging • Anyone: Use it, report bugs, request features Means all of us Just remember that “todos” in Spanish means all of us, whatever your skill set is, I have something you can help with.
  • 127. Rage DB @rage_database ragedb.com GitHub.com/ragedb hub.docker.com/u/ragedb An outrageous graph database So we can build an outrageous database together. Thank you.