GraphTO Meetup: Intro to Graph Database and Pacer 1.0

GraphTO
October 2012, Mozilla Toronto

David Colebatch & Darrick Wiebe us@xnlogic.com

Agenda
• Intros and Welcome

• Pacer 1.0

• pacer-examples-*

• Enron

• Graph in the REPL

• Questions

• Discussion...

Why?

• Data Set Size
• Connectivity of Data
• Semi-structure
• Evolution of SOA and REST

The Zone of SQL Adequacy

[Figure, shown in three builds: SQL database performance and the requirement of the application, plotted on axes labelled Performance and Data complexity. Workloads marked on the chart: Salary List, ERP, CRM, MDM, Network / Cloud Management, Social, Geo.]

Graph Space
New Vendors

The last 12 months have seen a number of new graph technologies emerge
• AffinityDB (VMW), YarcData uRiKA (Cray), Giraph (YHOO), Cassovary (Twitter), StigDB (Tagged), NuvolaBase, Pegasus, Titan (Aurelius), etc.

How?
• Nodes / Vertices

• Relationships / Edges

Relational Model vs. Graph

Each of these models
expresses the same thing

[Figure: the same friendship data drawn two ways. Relational model: Person, Person-Friend, and Friend tables. Graph model: person vertices connected by "knows" edges.]

Graph db performance
• a sample social graph
  • with ~1,000 persons
• average 50 friends per person
• pathExists(a,b) limited to depth 4
• caches warmed up to eliminate disk I/O

Database   # persons   Query time
MySQL      1,000       2,000 ms
Neo4j      1,000       2 ms
Neo4j      1,000,000   2 ms
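
To make the benchmark query concrete, here is a minimal plain-Ruby sketch of what a depth-limited pathExists(a,b) computes. It is an assumption-laden illustration, not the code used in the benchmark: the FRIENDS hash and the path_exists? helper are hypothetical names, and the graph is a toy in-memory adjacency list rather than a database.

```ruby
require 'set'

# Hypothetical in-memory friendship graph: each person maps to the people they know.
FRIENDS = {
  'a' => %w[b c],
  'b' => %w[a d],
  'c' => %w[a],
  'd' => %w[b e],
  'e' => %w[d f],
  'f' => %w[e]
}.freeze

# Breadth-first search that stops after max_depth hops.
def path_exists?(graph, from, to, max_depth = 4)
  return true if from == to

  visited = Set[from]
  frontier = [from]

  max_depth.times do
    next_frontier = []
    frontier.each do |person|
      graph.fetch(person, []).each do |friend|
        return true if friend == to
        # Set#add? returns nil when the friend was already visited.
        next_frontier << friend if visited.add?(friend)
      end
    end
    return false if next_frontier.empty?
    frontier = next_frontier
  end

  false
end

puts path_exists?(FRIENDS, 'a', 'f')     # path a -> b -> d -> e -> f is 4 hops
puts path_exists?(FRIENDS, 'a', 'f', 3)  # same pair, but limited to 3 hops
```

Each hop only touches the neighbours of the current frontier, so the work depends on how many friends are reachable within the depth limit rather than on the total number of persons stored.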

Query Languages

• SPARQL - if you grok RDF already
• Cypher
• Gremlin
• Pacer - gem install pacer*

* requires JRuby 1.7.0

Pacer::Examples::Enron

• Sample data provided by Chris Diehl
http://www.cpdiehl.org

• A selection of the Enron email database, in
GraphML format

$ git clone http://github.com/xnlogic/pacer-examples-enron.git

$ cd pacer-examples-enron
$ bundle
$ ./script/download_enron_data.sh
$ ./script/start_jirb.sh

> enron = Pacer::Examples::Enron.load_data

> enron.summary_of_data_types
=> {
"Message"=>255636,
"Email Address"=>87474,
"Person"=>156
}

So What?

• Heavy Emailers
• Communication paths
• ...add meta-data vertices to

Clustering Email by Topic
• Analyze each email body
• Detect 150 topics
• Associate each email to its top-5 topics
• Create meta-data vertices representing this data
• Go nuts!
• Trading emails with high BCC rates.
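
The "top-5 topics" step above can be sketched in plain Ruby. Everything here is a hypothetical illustration: the topic ids, the weights, the top_topics helper, and the HAS_TOPIC label are assumptions, and the topic detection itself is out of scope.

```ruby
# Pick the n highest-weighted topics for one email.
# scores is a hypothetical hash of topic id => weight.
def top_topics(scores, n = 5)
  scores.sort_by { |_topic, weight| -weight }.first(n).map(&:first)
end

# Made-up weights for a single email body.
scores = {
  't1' => 0.05, 't2' => 0.30, 't3' => 0.10, 't4' => 0.02,
  't5' => 0.20, 't6' => 0.15, 't7' => 0.18
}

email_id = 131_323
# One (email, label, topic) triple per association; HAS_TOPIC is an assumed label.
topic_edges = top_topics(scores).map { |topic| [email_id, 'HAS_TOPIC', topic] }

topic_edges.each { |edge| puts edge.inspect }
```

Each triple would become an edge from the email vertex to a meta-data vertex for the topic.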

001> Pacer.in_the_REPL

REPL driven development is what I do!

It was too hard to work with query languages and traversal algorithms in an interactive way.

- Even simple things, like discovering the labels of edges for the current vertex, were not easy.

At the same time I was feeling that pain, I discovered an interesting XPath-inspired language called Gremlin.

gremlin> outE/inV/inE/outV/back(3)/outV/etc
==> V[1]
==> V[2]

Why is this its own language?

- Solid underpinnings, great architecture

- Ruby syntax in a couple of hours

- Pacer was born by the end of the weekend

- 2 years ago...
Ecosystem has evolved a lot since then

Time for some Pacer!

g = Pacer.neo4j 'db/enron.graph'
=> #<PacerGraph neo4jgraph[db/enron.graph]>

Discovering your data

>> g.v.frequencies :type
=> {
"Message"=>255636,
"Email Address"=>87474,
"Person"=>156
}
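
For readers without a graph loaded, the idea behind a frequencies call can be sketched in plain Ruby: count how often each value of a property occurs. This is not Pacer's implementation; the vertices array and the standalone frequencies helper below are hypothetical stand-ins.

```ruby
# Hypothetical stand-in for a few vertices, each a hash of properties.
vertices = [
  { type: 'Message' },
  { type: 'Message' },
  { type: 'Email Address' },
  { type: 'Person' },
  { type: 'Message' },
  { type: 'Email Address' }
]

# Count occurrences of each value found under the given property key.
def frequencies(elements, key)
  elements.each_with_object(Hash.new(0)) do |element, counts|
    counts[element[key]] += 1
  end
end

counts = frequencies(vertices, :type)
puts counts.inspect
```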

Keeping Pacer friendly in the REPL:
reuse routes and extend them

>> emails = g.v(Email, type: 'email')
=> #<V-Lucene(type:email) ~ 87474>

intelligent output

>> emails.limit 5
#<V[191310]> #<V[252457]> #<V[210184]>
#<V[237290]> #<V[252460]>

=> #<V-Lucene(type:email) ~ 87474 -> V -> V-Range(-1...5)>

we can do better
g.vertex_name = proc do |v|
  case v[:type]
  when 'email'   then v[:address]
  when 'Message' then v[:subject]
  end
end

>> emails.limit 5
#<V[191310] susan_bittick@may-co.com>
#<V[252457] b.todes@worldnet.att.net>
#<V[210184] usatoday5918@mailandnews.com>
#<V[237290] lacker@llgm.com>
#<V[252460] cesherman@hahnlaw.com>
Total: 5

Beyond the REPL
>> emails.map.help
Works much like Ruby's built-in map method but has some extra
options and, like all routes, does not evaluate immediately (see
the :routes help topic).

Example:
...
...

>> emails.map.help :options
...
...
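
The help text's point that routes do not evaluate immediately has a close analogue in plain Ruby's lazy enumerators. The sketch below uses only the standard library and is an analogy, not Pacer code: building the chain does no work, and elements are pulled through only when results are requested.

```ruby
# Records which inputs were actually processed.
evaluated = []

# Build a lazy chain over an infinite range; nothing runs yet.
doubled_multiples_of_three = (1..Float::INFINITY).lazy
  .map { |n| evaluated << n; n * 2 }
  .select { |n| (n % 3).zero? }

evaluated_before = evaluated.dup

# Asking for results pulls just enough elements through the chain.
first_three = doubled_multiples_of_three.first(3)

puts "before: #{evaluated_before.inspect}"
puts "results: #{first_three.inspect}"
puts "inputs touched: #{evaluated.inspect}"
```

Defining a chain is cheap in the same way, which fits a REPL workflow of building a route step by step and only then asking for results.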

So how can we use it?

In the Enron data, we can tell when people were BCC'd:

>> g.e('RECEIVED_BY').frequencies :type
=> {
"to"=>1159970,
"bcc"=>243627,
"cc"=>243627
}

Apparently it was a common occurrence

>> few_bccs = emails.lookahead(max:10) do |e|
     e.sent.received_by(type: 'bcc')
   end
=> #<V-Lucene(type:email) ~ 87474 -> V -> V-Future(#<V -> out(:SENT) -> V ->
outE(:RECEIVED_BY) -> E-Property(type=="bcc") -> inV -> V -> HasCount(<= 10) -> is(true)>)>

>> rare_bccs = few_bccs.lookahead(min:1000) do |e|
     e.sent.received_by(type: 'to')
   end
...
=> #<V-Lucene(type:email) ~ 87474 -> V -> V-Future(#<V -> out(:SENT) -> V ->...>
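
The two chained lookahead filters can be read as "keep addresses with at most 10 BCC deliveries, then keep those with at least 1000 To deliveries". A plain-Ruby sketch of that reading follows. The addresses and counts are made up, and treating min:1000 as ">= 1000" is an assumption based on the HasCount(<= 10) shown for max:10.

```ruby
# Hypothetical per-address tallies of RECEIVED_BY edges reached via sent messages.
SENT_COUNTS = {
  'alice@example.com' => { 'to' => 1500, 'bcc' => 3 },
  'bob@example.com'   => { 'to' => 40,   'bcc' => 0 },
  'carol@example.com' => { 'to' => 2200, 'bcc' => 250 }
}.freeze

# First filter: at most 10 BCC deliveries.
few_bccs = SENT_COUNTS.select { |_address, counts| counts['bcc'] <= 10 }

# Second filter, applied to the first result: at least 1000 To deliveries.
rare_bccs = few_bccs.select { |_address, counts| counts['to'] >= 1000 }

puts few_bccs.keys.inspect
puts rare_bccs.keys.inspect
```

As in the REPL session, the second filter reuses and narrows the first result instead of starting over from all addresses.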

Takes a couple of seconds.
Let's speed it up

>> interesting = rare_bccs.result
=> #<Obj 32 ids -> lookup -> is_not(nil)>

Let's see if anyone had any concerns

>> interesting.sent.
     lookahead(&:bcc).
     filter { |m| m[:body] =~ /concerns/i }
#<V[131323] Enron Mentions - 01/30/2001>
#<V[194186] Western Storage Initiatives>
Total: 2
=> #<Obj 32 ids -> ...>

Resources

https://github.com/xnlogic/pacer-examples-enron
https://github.com/xnlogic/pacer-xml
https://github.com/pangloss/pacer
http://tinkerpop.com/
http://neo4j.org/
