Slides presented at http://www.meetup.com/GraphTO/events/84573962 covering an introduction to graph database concepts, ecosystem and use cases, followed by Darrick Wiebe's "Pacer in the REPL" talk on the basics of Pacer, a JRuby graph traversal and data processing library.
GraphTO Meetup: Intro to Graph Database and Pacer 1.0
1. GraphTO
October 2012, Mozilla Toronto
David Colebatch & Darrick Wiebe us@xnlogic.com
2. Agenda
• Intros and Welcome
• Pacer 1.0
• pacer-examples-*
• Enron
• Graph in the REPL
• Questions
• Discussion...
Sponsored By:
3. Why?
• Data Set Size
• Connectivity of Data
• Semi-structure
• Evolution of SOA and REST
4. The Zone of SQL Adequacy
[Chart: axes are Performance and Data complexity; curves are "SQL database" and "Requirement of application"]
6. The Zone of SQL Adequacy
[Chart: axes are Performance and Data complexity; curves are "SQL database" and "Requirement of application"; applications marked: Salary List, ERP, CRM]
7. The Zone of SQL Adequacy
[Chart: axes are Performance and Data complexity; curves are "SQL database" and "Requirement of application"; applications marked: Salary List, ERP, CRM, MDM, Network / Cloud Management, Geo, Social]
8. Graph Space
New Vendors
The last 12 months have seen a number of new graph technologies emerge
• AffinityDB (VMW), YarcData uRiKA (Cray), Giraph (YHOO), Cassovary (Twitter), StigDB (Tagged), NuvolaBase, Pegasus, Titan (Aurelius), etc.
10. Relational Model vs. Graph
Each of these models expresses the same thing
[Figure: the same friendship data drawn two ways. Relational: tables Person, Person-Friend and Friend. Graph: person vertices (Emil, Johan, Allison, Tobias, Lars, Andreas, Andrés, Peter, Ian, Delia, Michael, among others) connected by "knows" edges.]
11. Graph db performance
๏ a sample social graph
• with ~1,000 persons
๏ average 50 friends per person
๏ pathExists(a,b) limited to depth 4
๏ caches warmed up to eliminate disk I/O
Database | # persons | Query time
MySQL | 1,000 | 2,000 ms
Neo4j | 1,000 | 2 ms
Neo4j | 1,000,000 | 2 ms
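The pathExists(a,b) query in this benchmark can be pictured as a depth-limited breadth-first search. The following is only a minimal plain-Ruby sketch of that idea, not the benchmark code and not how MySQL or Neo4j run it. The FRIENDS adjacency list and the path_exists? helper are hypothetical names invented for illustration.

```ruby
require 'set'

# Hypothetical toy friendship graph (adjacency list), not the benchmark data.
FRIENDS = {
  'a' => %w[b c],
  'b' => %w[a d],
  'c' => %w[a],
  'd' => %w[b e],
  'e' => %w[d f],
  'f' => %w[e g],
  'g' => %w[f]
}.freeze

# Breadth-first search that gives up after max_depth hops.
def path_exists?(graph, from, to, max_depth: 4)
  return true if from == to

  visited = Set[from]
  frontier = [from]
  max_depth.times do
    next_frontier = []
    frontier.each do |person|
      graph.fetch(person, []).each do |friend|
        return true if friend == to
        # Set#add? returns nil when the friend was already visited.
        next_frontier << friend if visited.add?(friend)
      end
    end
    return false if next_frontier.empty?
    frontier = next_frontier
  end
  false
end

puts path_exists?(FRIENDS, 'a', 'e') # a -> b -> d -> e is 3 hops
puts path_exists?(FRIENDS, 'a', 'g') # needs 6 hops, beyond the depth limit
```

With about 50 friends per person, each extra hop multiplies the number of candidates, which is why the depth limit matters for this kind of query.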
12. Query Languages
• SPARQL - if you grok RDF already
• Cypher
• Gremlin
• Pacer - gem install pacer*
* requires JRuby 1.7.0
14. Pacer::Examples::Enron
• Sample data provided by Chris Diehl
http://www.cpdiehl.org
• A selection of the Enron email database, in
GraphML format
17. So What?
• Heavy Emailers
• Communication paths
• ...add meta-data vertices to
18. Clustering Email by Topic
• Analyze each email body
• Detect 150 topics
• Associate each email to its top-5 topics
• Create meta-data vertices representing this data
• Go nuts!
• Trading emails with high BCC rates.
20. Too hard to work with query languages and traversal algorithms in an interactive way.
- Simple things like discovering labels of edges for the current vertex were not easy.
21. At the same time I was feeling that pain, I discovered an interesting XPath-inspired language called Gremlin
gremlin> outE/inV/inE/outV/back(3)/outV/etc
==> V[1]
==> V[2]
22. Why is this its own language?
- Solid underpinnings, great architecture
- Ruby syntax in a couple of hours
- Pacer was born by the end of the weekend
- 2 years ago...
Ecosystem has evolved a lot since then
23. Time for some Pacer!
g = Pacer.neo4j 'db/enron.graph'
=> #<PacerGraph neo4jgraph[db/enron.graph]>
24. Discovering your data
>> g.v.frequencies :type
=> {
"Message"=>255636,
"Email Address"=>87474,
"Person"=>156
}
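As a rough picture of what a frequencies-style summary computes, here is a plain-Ruby sketch that counts values of one property over a list of hashes. It is an illustration only and says nothing about how Pacer implements it. The VERTICES sample data and the standalone frequencies helper are hypothetical.

```ruby
# Hypothetical stand-ins for vertices: plain hashes with a :type property.
VERTICES = [
  { type: 'Message', subject: 'Q3 numbers' },
  { type: 'Message', subject: 'Lunch?' },
  { type: 'Email Address', address: 'ann@example.com' },
  { type: 'Person', name: 'Ann' }
].freeze

# Count how often each value of the given property occurs.
def frequencies(elements, key)
  elements.each_with_object(Hash.new(0)) { |el, counts| counts[el[key]] += 1 }
end

p frequencies(VERTICES, :type)
```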
25. Keeping Pacer friendly in the REPL
Reuse routes and extend them
>> emails = g.v(Email, type: 'email')
=> #<V-Lucene(type:email) ~ 87474>
27. we can do better
# Label each vertex in REPL output by its address or subject, depending on type
g.vertex_name = proc do |v|
  case v[:type]
  when 'email'   then v[:address]
  when 'Message' then v[:subject]
  end
end
>> emails.limit 5
#<V[191310] susan_bittick@may-co.com>
#<V[252457] b.todes@worldnet.att.net>
#<V[210184] usatoday5918@mailandnews.com>
#<V[237290] lacker@llgm.com>
#<V[252460] cesherman@hahnlaw.com>
Total: 5
28. Beyond the REPL
>> emails.map.help
Works much like Ruby's built-in map method but has some extra
options and, like all routes, does not evaluate immediately (see
the :routes help topic).
Example:
...
...
>> emails.map.help :options
...
...
29. So how can we use it?
In the Enron data, we can tell when people
were BCC'd:
>> g.e('RECEIVED_BY').frequencies :type
=> {
"to"=>1159970,
"bcc"=>243627,
"cc"=>243627
}
30. Apparently it was a common
occurrence
>> few_bccs = emails.lookahead(max: 10) do |e|
     e.sent.received_by(type: 'bcc')
   end
=> #<V-Lucene(type:email) ~ 87474 -> V -> V-Future(#<V -> out(:SENT) -> V -> outE(:RECEIVED_BY) -> E-Property(type=="bcc") -> inV -> V -> HasCount(<= 10) -> is(true)>)>
>> rare_bccs = few_bccs.lookahead(min: 1000) do |e|
     e.sent.received_by(type: 'to')
   end
...
=> #<V-Lucene(type:email) ~ 87474 -> V -> V-Future(#<V -> out(:SENT) -> V ->...>
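Judging by the HasCount(<= 10) step in the route printed above, a lookahead with min or max keeps an element only if its sub-traversal yields a result count within those bounds. The plain-Ruby sketch below illustrates that filtering idea eagerly over an in-memory hash. It is not Pacer's implementation, which builds a lazy route. RECEIVED_BY, recipients and this standalone lookahead are hypothetical names.

```ruby
# Hypothetical data: sender => list of [recipient, type] pairs.
RECEIVED_BY = {
  'ann@example.com' => [['bob@example.com', 'to'], ['cat@example.com', 'bcc']],
  'bob@example.com' => [['ann@example.com', 'to']],
  'cat@example.com' => [['ann@example.com', 'bcc'], ['bob@example.com', 'bcc'],
                        ['dan@example.com', 'to']]
}.freeze

# Recipients a sender reached through edges of the given type.
def recipients(sender, type)
  RECEIVED_BY.fetch(sender, []).select { |_, t| t == type }.map(&:first)
end

# Keep only items whose sub-traversal (the block) yields a result count
# inside the optional min/max bounds.
def lookahead(items, min: nil, max: nil)
  items.select do |item|
    count = yield(item).count
    (min.nil? || count >= min) && (max.nil? || count <= max)
  end
end

few_bccs = lookahead(RECEIVED_BY.keys, max: 1) { |s| recipients(s, 'bcc') }
busy = lookahead(few_bccs, min: 1) { |s| recipients(s, 'to') }
p few_bccs
p busy
```

Note that a max bound alone also keeps elements with zero matches, so chaining a second lookahead with a min bound is one way to narrow the set further.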
31. Takes a couple of seconds.
Let's speed it up
>> interesting = rare_bccs.result
=> #<Obj 32 ids -> lookup -> is_not(nil)>
32. Let's see if anyone had
any concerns
>> interesting.sent.
lookahead(&:bcc).
filter { |m| m[:body] =~ /concerns/i }
#<V[131323] Enron Mentions - 01/30/2001>
#<V[194186] Western Storage Initiatives>
Total: 2
=> #<Obj 32 ids -> ...>
Editor's notes
There are four trends underpinning the NoSQL, and specifically the GraphDB, movements:
1) The size of data that we are managing is more than doubling every two years, with around 2.4 zettabytes expected by the end of this year (or 250 million years of the TV show “24”).
2) Data is more highly connected than ever before: FOAF on social networks; Configuration Management for a datacenter.
3) Schema-less data persistence: add a field to just one record, no problem. Sparkes on Toyota
4) Application architecture changed from flat files and batch processing to shared RDBMS, SOA + Web services.
*This is a somewhat contrived example, as “person” & “friend” would normally be one table with a self join.
A slide borrowed from Neo Technology
A few options exist for graph query languages, some of which you may have heard of.
SPARQL is a recursive acronym for “SPARQL Protocol and RDF Query Language”, where RDF is the Resource Description Framework.
Cypher and Gremlin are modern graph query languages with strong ties to the Neo4j community.
Pacer is a Ruby gem that you can include in your projects and get jamming on embedded graph databases straight away.
Chris compared Traffic-based and Content-based message ranking approaches to discover Ego Networks. We don’t need to worry about the details here though. Chris has left us with a nice property graph which identifies official reporting relationships by an edge labelled “Directly_Reported_To”.
Add organizational groups
Cluster messages together into X (new vertex for X)

Naughty emails.
NONE came *from* Enron email addresses
Here we show how to:
• Start IRB with 3GB of heap space and require the enron examples lib
• Load the sample GraphML data into an in-memory TinkerGraph
• Use a helper method to get a high-level summary of the data
A few quick query types that are best suited to the graph are things like:
calculating the heaviest emailers with fast edge counting, and discovering communication paths through graph algorithms like Dijkstra’s shortest-path.

Adding meta-data vertices to a dataset like this enables even more power-of-the-graph type of analysis. For example, coupling the ‘influencer’ type analysis with sentiment analysis on the email message bodies would allow you to determine which groups of staff were being negatively (or positively) affected by a given influencer.
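The two query types named in this note, edge counting and Dijkstra's shortest path, can be sketched in plain Ruby over a small weighted adjacency hash. This is an illustration under assumptions, not the talk's Pacer code. The EMAILED data, its edge costs and both helper names are hypothetical, and the Dijkstra loop uses a simple linear scan instead of a priority queue to stay short.

```ruby
# Hypothetical weighted "emailed" edges: sender => { recipient => cost }.
EMAILED = {
  'ann' => { 'bob' => 1.0, 'cat' => 4.0, 'dan' => 9.0 },
  'bob' => { 'cat' => 1.0, 'dan' => 5.0 },
  'cat' => { 'dan' => 1.0 },
  'dan' => {}
}.freeze

# Heaviest emailers: rank senders by their number of outgoing edges.
def heaviest_emailers(graph, top = 3)
  graph.sort_by { |_, out| -out.size }.first(top).map(&:first)
end

# Dijkstra's shortest path for non-negative costs.
# Returns [total_cost, path], or nil when the target is unreachable.
def shortest_path(graph, from, to)
  dist = { from => 0.0 }
  prev = {}
  pending = graph.keys
  until pending.empty?
    # Pick the closest reachable node that has not been settled yet.
    node = pending.select { |n| dist.key?(n) }.min_by { |n| dist[n] }
    break if node.nil? || node == to

    pending.delete(node)
    graph[node].each do |neighbour, cost|
      candidate = dist[node] + cost
      if !dist.key?(neighbour) || candidate < dist[neighbour]
        dist[neighbour] = candidate
        prev[neighbour] = node
      end
    end
  end
  return nil unless dist.key?(to)

  # Walk the predecessor links back from the target to rebuild the path.
  path = [to]
  path.unshift(prev[path.first]) while prev.key?(path.first)
  [dist[to], path]
end

p heaviest_emailers(EMAILED, 2)
p shortest_path(EMAILED, 'ann', 'dan')
```

In a real dataset the edge cost would need a definition of its own, for example something derived from message counts; that choice is left open here.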