2. Using MongoDB as a high performance graph database
MongoDB UK, 20th June 2012
Chris Clarke
CTO, Talis Education Limited
Thursday, 21 June 12
Who is Talis?
Using Mongo for about 8 months (since 2.0)
5 months in production
3. What this talk is not about
A blueprint for what you should do
A pitch to encourage you to take our approach
Providing or proving performance benchmarks
Evangelism for the semantic web or linked data
Encouraging you to contribute/download/use an open source project
Optimised for your use case
Although we can talk to you about any of the above (see me after)
4. So, what is this talk about?
Our journey of using MongoDB as a high performance graph database
Specifically, the software wrapper we implemented on top of Mongo to give us a leg up in terms of scalability and performance
To give you some ideas for how to work with graph data models if you’d like to use document databases
5. GRAPHS 101
Apologies
Nodes and edges
or
Resources and properties
Really easy to represent facts
6. John knows Jane
John knows Jane
Ball and stick diagrams
This is an undirected graph. It implies that John knows Jane and
Jane knows John. The property has no directional significance.
7. John knows Jane
Jane knows John
John knows Jane
8. John knows Jane
Jane ? John
John knows Jane
This is a directed graph. The relationship is one way. To add Jane
knows John we need a second property.
We will only use directed graphs from here on as they are more specific
9. John knows Jane
Jane knows John
[Diagram: John and Jane joined by two directed "knows" edges, one in each direction]
11. Subject Property Object
John knows Jane
This is a triple
Property = predicate
12. Subject Property Object
John knows Jane
Jane knows John
This is a second triple
The same resource can be a subject or an object
13. Subject Property Object
http://example.com/John http://xmlns.com/foaf/0.1/knows http://example.com/Jane
RDF
Resources and properties as URIs
URIs can be dereferenced
Can share common property descriptions (RDF Schemas)
Here using FOAF - billions if not trillions of triples are defined using FOAF
14. Subject Property Object
http://example.com/John foaf:knows http://example.com/Jane
http://example.com/John foaf:name “John”
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
Namespaces for readability
In RDF subjects are always URIs
But objects can be literals, i.e. plain text
Many RDF/graph databases allow you to further type literals as dates, numbers, etc.
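The prefix mechanism can be sketched in a few lines of plain JavaScript; `PREFIXES` and `expand` are hypothetical names for illustration, not part of any RDF library:

```javascript
// Hypothetical sketch: expand a prefixed name like 'foaf:knows' into its
// full URI using a prefix map, mirroring a PREFIX declaration.
const PREFIXES = {
  foaf: 'http://xmlns.com/foaf/0.1/',
  rdf: 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
};

function expand(name) {
  const i = name.indexOf(':');
  const prefix = name.slice(0, i);
  // An unrecognised prefix (e.g. 'http') means it is already a full URI.
  if (!(prefix in PREFIXES)) return name;
  return PREFIXES[prefix] + name.slice(i + 1);
}
```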
15. Subject Property Object
http://example.com/John rdf:type foaf:Person
http://example.com/John foaf:name “John”
http://example.com/John foaf:knows http://example.com/Jane
http://example.com/Jane rdf:type foaf:Person
http://example.com/Jane foaf:name “Jane”
http://example.com/Jane foaf:knows http://example.com/John
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
Here we type John and Jane as foaf:Person using rdf:type
Note both John and Jane appear as subjects and objects
This RDF graph represents six facts
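Those six facts are just six subject/predicate/object triples; a minimal sketch, with prefixed names kept as plain strings:

```javascript
// The six facts from this slide as an array of [subject, predicate, object]
// triples. Literals are marked with surrounding double quotes.
const triples = [
  ['example:John', 'rdf:type',   'foaf:Person'],
  ['example:John', 'foaf:name',  '"John"'],
  ['example:John', 'foaf:knows', 'example:Jane'],
  ['example:Jane', 'rdf:type',   'foaf:Person'],
  ['example:Jane', 'foaf:name',  '"Jane"'],
  ['example:Jane', 'foaf:knows', 'example:John']
];

// The same resource can appear as both a subject and an object:
const subjects = new Set(triples.map(([s]) => s));
const objects  = new Set(triples.map(([, , o]) => o));
```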
16. [Diagram: example:John and example:Jane both have rdf:type foaf:Person, are linked to each other by foaf:knows in both directions, and have foaf:name literals "John" and "Jane"]
Here it is in ball and stick
17. FFS! I can do that in two minutes in BSON
18. > db.people.find()
{
  _id: ObjectId('123'),
  name: 'John',
  knows: [ObjectId('456')]
},
{
  _id: ObjectId('456'),
  name: 'Jane',
  knows: [ObjectId('123')]
}
Yes, you can!
Data only makes sense inside your db though
20. Some useful stuff, using RDF
Let’s look at some reasons why we think RDF is good
21. [Image: the Linked Open Data cloud diagram]
This is the Linked Open Data cloud
Linked Data is a way of publishing RDF on the open web
Search for the "linked data" TED talk to hear why Tim Berners-Lee cares about this
Each blob on this diagram represents an open, interlinked dataset. The lines between them represent the interlinking between data sets
Billions of public “facts”, growing exponentially, from sites such as the BBC, governments, Last.fm and Wikipedia
22. Merging data from different sources is really easy
Because the format is subject, predicate, object, the shape of RDF is always the same.
Because schemas are public and widely shared, the same properties are used all over the place.
Really easy to use this data in your own app and remix it
23. [Diagram: Dataset A contains example:John rdf:type foaf:Person; Dataset B contains example:John foaf:name "John"]
24. [Diagram: merged Dataset A+B: example:John has rdf:type foaf:Person and foaf:name "John"]
Really easy to merge graphs
“Designed in” to the data format
Lots of existing tooling to do this
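Because a graph is just a set of triples, merging is set union; a minimal sketch (hypothetical code, not from any RDF toolkit):

```javascript
// Merge two RDF graphs (arrays of [s, p, o] triples) by set union,
// deduplicating triples that appear in both datasets.
function mergeGraphs(a, b) {
  const seen = new Set();
  const merged = [];
  for (const triple of [...a, ...b]) {
    const key = triple.join(' ');
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(triple);
    }
  }
  return merged;
}

const datasetA = [['example:John', 'rdf:type', 'foaf:Person']];
const datasetB = [
  ['example:John', 'foaf:name', '"John"'],
  ['example:John', 'rdf:type', 'foaf:Person'] // also present in A
];
// Merging yields 2 triples: the shared rdf:type triple appears only once.
```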
26. PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
?person a foaf:Person.
?person foaf:name ?name.
?person foaf:mbox ?email.
}
ORDER BY ?name
LIMIT 50
SPARQL is mega flexible. Lots of functions for grouping, walking
graphs, pattern matching, inference, UNIONS, Geo extensions
etc. etc. - all that shit.
Most if not all of those datasets will have a SPARQL endpoint you
can query
27. SELECT Tabular
DESCRIBE Graph
ASK Boolean
CONSTRUCT Graph
4 main query types
28. FFS! That looks like SQL!
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
?person a foaf:Person.
?person foaf:name ?name.
?person foaf:mbox ?email.
}
ORDER BY ?name
LIMIT 50
Yes it does. The WHERE clause is basically doing a shit load of
joins. I’ll come back to that.
29. [Diagram: Application backed by a DB (SQL or other), with an offline conversion process feeding a triple store + SPARQL endpoint]
Most datasets on the LOD diagram don’t exist natively as Linked Data and RDF. They are post-produced.
Data not held natively - so a conversion script needs to be maintained and updated every time the app schema changes
Data not up to date (1 hour, 1 day, 1 month behind?)
30. Our innovation: Native Linked Data Applications
We started working on these applications back in 2008
They are natively linked data, so they solve the conversion + currency issue
There is no other “format” or schema the data is stored in, it’s native RDF
When you have no schema, and you can integrate data from elsewhere on the web, it’s addictive
31. Our problem: FFS! For applications, we need humongous scale and performance
Those applications are becoming rather popular with our users...
sub-50ms query time
Modern web apps need speed and data scale
Outgrown triple store and SPARQL
SPARQL is very flexible and expressive. It’s also expensive
SPARQL is great for data sets where the questions you can ask are limitless, but our applications need a data layer where speed is measured in single-digit ms.
Complex caching (w/Memcache) to achieve performance and scalability
90:10 read:write
32. Tripod
It’s a pod for our triples
A triple store designed for applications and scalability
Based on Mongo
33. Functional requirements:
• Order-of-magnitude increase in perf/scale
• Graph-orientated interface
Non-functional requirements:
• Strong community
Existing code very graph orientated
34. Core data format
Tripod API
Dealing with complex queries
TripodTables
Free text search
Walk through Tripod looking at 5 areas
35. {
  'http://example.com/John' : {
    'http://xmlns.com/foaf/0.1/name' : [
      { value: 'John', type: 'literal' }
    ],
    'http://xmlns.com/foaf/0.1/knows' : [
      { value: 'http://example.com/Jane', type: 'uri' }
    ]
  },
  'http://example.com/Jane' : {
    'http://xmlns.com/foaf/0.1/name' : [
      { value: 'Jane', type: 'literal' }
    ],
    'http://xmlns.com/foaf/0.1/knows' : [
      { value: 'http://example.com/John', type: 'uri' },
      { value: 'http://example.com/James', type: 'uri' }
    ]
  }
}
RDF/JSON - a serialisation of RDF in JSON
Neither disk-space efficient nor readable
Fully-formed property URIs are not compatible with Mongo (dot notation)
Even single values sit inside an array (problems for compound indexing)
36. > db.CBD_people.find()
{
  _id: 'http://example.com/John',
  'foaf:name': {l: 'John'},
  'foaf:knows': {u: 'http://example.com/Jane'}
},
{
  _id: 'http://example.com/Jane',
  'foaf:name': {l: 'Jane'},
  'foaf:knows': [
    {u: 'http://example.com/John'},
    {u: 'http://example.com/James'}
  ]
}
Same semantics
2 documents here
Concise Bounded Descriptions - all the data known about a subject, one relationship deep
One document per subject per collection, keyed (and thus enforced) by subject URI
Property names are namespaced
CBD collections are deemed read/write in Tripod
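A sketch (hypothetical code, not Tripod’s implementation) of folding triples about one subject into that document shape: literals become `{l: …}`, URIs become `{u: …}`, and a predicate is only promoted to an array once it has more than one value:

```javascript
// Fold [s, p, o] triples about one subject into a CBD-style document.
// Single-valued predicates stay plain objects; multi-valued ones become arrays.
function toCbdDoc(subject, triples) {
  const doc = { _id: subject };
  for (const [s, p, o] of triples) {
    if (s !== subject) continue;
    const value = o.startsWith('"')
      ? { l: o.slice(1, -1) } // literal
      : { u: o };             // URI
    if (!(p in doc)) doc[p] = value;
    else if (Array.isArray(doc[p])) doc[p].push(value);
    else doc[p] = [doc[p], value]; // second value: promote to array
  }
  return doc;
}

const jane = toCbdDoc('http://example.com/Jane', [
  ['http://example.com/Jane', 'foaf:name', '"Jane"'],
  ['http://example.com/Jane', 'foaf:knows', 'http://example.com/John'],
  ['http://example.com/Jane', 'foaf:knows', 'http://example.com/James']
]);
```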
37. class MongoGraph extends SimpleGraph {
function add_tripod_array($tarray)
function to_tripod_array($docId)
}
All of our app already uses SimpleGraph from a library called Moriarty (Google Code)
A simple extension which can ingest/output the data format on the previous slide
38. Core data format
Tripod API
Dealing with complex queries
TripodTables
Free text search
Walk through Tripod looking at 5 areas
39. interface ITripod
{
public function select($query,$fields,$sortBy=null,$limit=null);
public function describeResource($resource);
public function describeResources(Array $resources);
public function saveChanges($oldGraph, $newGraph);
public function search($query);
}
Almost the same as our existing data access API onto a generic triple store
All of these methods return graphs; all are mega-simple queries on the CBD collections
None of these methods support joins (the WHERE clause in SPARQL)
40. public function describeResource($resource)
{
    $query = array("_id" => $resource);
    $bson = $this->getCollection()->findOne($query);
    $graph = new MongoGraph();
    $graph->add_tripod_data($bson);
    return $graph;
}
These methods are mega simple to implement as they translate to really simple Mongo queries on the CBD collections, returning single objects
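describeResources() (the plural form) is equally simple: because `_id` is the subject URI, a multi-describe is a single `$in` query on the CBD collection. A plain-JavaScript simulation (no real MongoDB) of that lookup:

```javascript
// In-memory stand-in for a CBD collection, keyed by subject URI.
const cbdPeople = [
  { _id: 'http://example.com/John', 'foaf:name': { l: 'John' } },
  { _id: 'http://example.com/Jane', 'foaf:name': { l: 'Jane' } }
];

// Mimics db.CBD_people.find({ _id: { $in: resources } })
function describeResources(collection, resources) {
  const wanted = new Set(resources);
  return collection.filter(doc => wanted.has(doc._id));
}
```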
41. interface ITripod
{
public function select($query,$fields,$sortBy=null,$limit=null);
public function describeResource($resource);
public function describeResources(Array $resources);
public function saveChanges($oldGraph, $newGraph);
public function search($query);
public function getViewForResource($resource,$viewType);
public function getViewForResources(Array $resources,$viewType);
public function getViews(Array $filter,$viewType);
}
Some extra methods to deal with complex queries involving joins
42. Core data format
Tripod API
Dealing with complex queries
TripodTables
Free text search
2 things we realised when looking at our applications
44. DESCRIBE <http://example.com/foo> ?sectionOrItem ?resource ?document ?authorList ?author ?usedBy ?creator ?libraryNote ?publisher
WHERE
{
  OPTIONAL
  {
    <http://example.com/foo> resource:contains ?sectionOrItem .
    OPTIONAL
    {
      ?sectionOrItem resource:resource ?resource .
      OPTIONAL { ?resource dcterms:isPartOf ?document . }
      OPTIONAL
      {
        ?resource bibo:authorList ?authorList .
        OPTIONAL { ?authorList ?p ?author . }
      }
      OPTIONAL { ?resource dcterms:publisher ?publisher . }
    }
    OPTIONAL { ?libraryNote bibo:annotates ?sectionOrItem }
  } .
  OPTIONAL { <http://example.com/foo> resource:usedBy ?usedBy } .
  OPTIONAL { <http://example.com/foo> sioc:has_creator ?creator }
}
The only thing that changes at run time in this query is this URI
The flexibility of SPARQL is great for the developer but terrible here for system performance
The query engine needs to join 9 times! Flexibility costs us every time we run this query!
This is why we hid it behind a cache
45. join
count
follow sequences (n times)
join across databases
All the above with a condition
include certain properties
include all properties
2nd thing
We only make use of minimal SPARQL
And some of these aren’t even well supported in SPARQL (sequences + joins across databases)
46. Materialised views, generated infrequently, read often
Remember 90:10 read:update
View specifications based on a subset of SPARQL
Views are for DESCRIBE-like queries where all the data is brought back in one hit (not tabular data)
47. {
_id: "v_resource_brief",
from: "CBD_harvest",
type: "http://talisaspire.com/schema#Resource",
include: ["rdf:type", "dct:subject", "dct:isVersionOf",
"searchterms:usedAt", "dc:identifier"],
joins: {
"acorn:preferredMetadata": [],
"acorn:listReferences": {
include: ["acorn:list"]
},
"acorn:bookmarkReferences": {
include: ["acorn:bookmark"]
},
"dcterms:isPartOf": [],
"acorn:partReferences": {
include: ["dct:hasPart"],
joins: {
"dct:hasPart": {
joins: {
"acorn:preferredMetadata": []
}
}
}
}
}
}
A view specification - itself a document that can be stored in Mongo
8 keywords: type, from, include, joins, ttl, followSequence, maxJoins, counts
48. Generated by incremental MapReduce when:
1) Data is changed
2) TTL expires
Tripod can take these specifications and manage views in a special collection within the DB.
They expire and are regenerated automatically (and incrementally)
Incremental map reduce inside the DB
Fast, interleaves with reads
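The two regeneration triggers can be sketched as a simple staleness check (hypothetical field names, timestamps as plain numbers; not Tripod code):

```javascript
// A view is stale if its source data changed after it was generated,
// or if its TTL (from the view specification) has expired.
function isStale(view, now, lastDataChange) {
  if (lastDataChange > view.generatedAt) return true; // trigger 1: data changed
  return now - view.generatedAt > view.ttl;           // trigger 2: TTL expired
}
```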
49. > db.views.findOne()
{
  "_id" : {
    "rdf:resource" : "http://talisaspire.com/examples/1",
    "type" : "v_resource_full"
  },
  "value" : {
    "graphs" : [
      {
        "_id" : "http://talisaspire.com/examples/1",
        "rdf:type" : {
          "type" : "uri",
          "value" : "http://talisaspire.com/schema#Resource"
        }
      }
    ],
    "impactIndex" : [
      { "rdf:resource" : "http://talisaspire.com/examples/1" }
    ]
  }
}
This is what a view looks like
The ID is a composite key of the view type and root resource
Graphs is a collection of CBDs
The MongoGraph we showed earlier can take this and represent it as a unified graph to the application
Impact index - a watch list of resources. When resources are saved the impact index is queried to find views that need invalidating
TTL is an alternative. If set in the viewspec, a timestamp is stored in the view to determine when it can be invalidated
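The invalidation lookup can be simulated in plain JavaScript (impact-index entries simplified to bare URIs here; the real documents wrap them in `rdf:resource` objects):

```javascript
// In-memory stand-in for the views collection: each view carries a watch
// list (impactIndex) of the subjects it was built from.
const views = [
  { _id: { 'rdf:resource': 'http://example.com/r1', type: 'v_resource_full' },
    impactIndex: ['http://example.com/r1', 'http://example.com/r2'] },
  { _id: { 'rdf:resource': 'http://example.com/r3', type: 'v_resource_full' },
    impactIndex: ['http://example.com/r3'] }
];

// Mimics db.views.find({ 'value.impactIndex': subject }): when a subject
// is saved, find every view whose watch list contains it.
function viewsImpactedBy(subject) {
  return views.filter(v => v.impactIndex.includes(subject));
}
```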
50. [Diagram: four views, numbered 1-4, regenerated at different rates]
Match views to data update rate
51. Core data format
Tripod API
Dealing with complex queries
TripodTables
Free text search
Tripod Tables are for larger datasets which cannot be brought back in one hit
They can be paged, or have individual columns indexed for fast sort capability
52. SELECT ?listName ?listUri
WHERE
{
  {
    { ?resource bibo:isbn10 "$isbn" }
    UNION
    { ?resource bibo:isbn10 "$isbnLowerCase" }
    ?item resource:resource ?resource .
  }
  UNION
  {
    { ?resourcePartOf bibo:isbn10 "$isbn" }
    UNION
    { ?resourcePartOf bibo:isbn10 "$isbnLowerCase" }
    ?resourcePartOf dct:hasPart ?resource .
    ?item resource:resource ?resource .
  }
  ?listUri resource:contains ?item .
  ?listUri sioc:name ?listName .
  ?listUri rdf:type resource:List
}
LIMIT 10
OFFSET 40
This is a select query that brings back a two-column result
OFFSET
LIMIT
54. > db.t_resource.findOne()
{
  "_id" : "http://talisaspire.com/resources/3SplCtWGPqEyXcDiyhHQpA-2",
  "value" : {
    "type" : [
      "http://purl.org/ontology/bibo/Book",
      "http://talisaspire.com/schema#Resource"
    ],
    "isbn" : "9780393929690",
    "isbn13" : [
      "9780393929691",
      "9780393929691-2",
      "9780393929691-3"
    ],
    "impactIndex" : [
      "http://talisaspire.com/works/4d101f63c10a6"
    ]
  }
}
This time our map reduce doesn’t create one doc as with materialised views
We get one doc per row
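One document per row is what makes paging cheap: SPARQL’s LIMIT/OFFSET become Mongo’s limit/skip over the table collection. A toy in-memory simulation:

```javascript
// 100 fake row documents, as the map/reduce would emit them.
const rows = Array.from({ length: 100 }, (_, i) => ({ _id: 'row' + i }));

// Mimics db.t_resource.find().skip(offset).limit(limit)
function page(collection, offset, limit) {
  return collection.slice(offset, offset + limit);
}
```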
55. Core data format
Tripod API
Dealing with complex queries
TripodTables
Free text search
Our triple store included free text search
We wanted to stream updates into ElasticSearch or A N Other search solution
When documents are saved, the same specification language is used to build Search Document Format docs and submit them to an endpoint
We like ElasticSearch but you could use Amazon CloudSearch
56. Limitations
Map reduce as a non-blocking db.eval() and also to work around the sync PHP programming model
PHP only for now - our web apps were PHP
To get a SPARQL endpoint we are exporting data out to Fuseki - this solves the mapping but not the currency (for SPARQL)
57. Future
Node JS port
Use as a server, not a library
Eliminate dependency on map reduce
Specification version control
Tap into the oplog for a streaming approach into Fuseki and other locations
Named graph support
Further optimisation of the data model
Maybe open source