
# Memoirs of a Graph Addict: Despair to Redemption


### Memoirs of a Graph Addict: Despair to Redemption

1. 1. Memoirs of a Graph Addict: Despair to Redemption Marko A. Rodriguez Graph Systems Architect http://markorodriguez.com http://twitter.com/twarko Winter Whirlwind Tour – Chicago to Malmö – January 10-14, 2011 January 8, 2011
2. 2. Abstract A graph database provides a means of linking together objects using direct references. In other words, in order to determine if one object is adjacent to another, no index lookup is required. In contrast to relational databases, in a graph database, there is no notion of a join operation as the graph is already an explicitly joined structure. Given a graph, problems are solved using graph traversals–that is, directed walks over the objects and relations that compose the graph. This lecture has three primary points of discussion. The ﬁrst is a description of graph database technology. The second, a memoir of the speaker’s applied and theoretical work with graphs. The third and ﬁnal point, a review of an open source graph processing stack currently being developed by AT&T Interactive and its collaborators.
3. 3. For 10 years now, I’ve dealt with a painful graph addiction... Let me share my story with you.
4. 4. Outline • Graph Structures • Graph Databases • Graph Applications • TinkerPop Product Suite
5. 5. Outline • Graph Structures • Graph Databases • Graph Applications • TinkerPop Product Suite
6. 6. Graph Data Structure Pieces: Part 1 id vertex (thing, object, dot) } element edge (relation, join, line)
7. 7. Single-Relational Graph marko peter neotech tinkerpop neo4j gremlin blueprints In single-relational graphs, things are related. Unfortunately, not a very useful structure for most domain modeling situations. Relatedness is too generic—all edges have the same meaning.
8. 8. Graph Data Structure Pieces: Part 2 id vertex (thing, object, dot) } element label edge (relation, join, line)
9. 9. Multi-Relational Graph [Figure: vertices marko, peter, neotech, tinkerpop, neo4j, gremlin, and blueprints connected by edges labeled knows, member, created, and imports.] By adding labels to the edges, it's possible to denote the type of relation that exists between any two vertices. Now it's possible to denote different types of things and the different ways in which they relate to one another.
10. 10. Graph Data Structure Pieces: Part 3 id vertex (thing, object, dot) } element label edge (relation, join, line) key=value property (key/value, attribute) key1=value1 key2=value2 property map
11. 11. Property Graph knows marko knows peter member neotech member member created tinkerpop date=2009 date=2009 neo4j created created imports lang=java use=graphdb gremlin imports blueprints lang=java lang=java use=api use=traverse Allow elements to have key/value properties. In particular, very useful for further specifying the meaning of an edge. “When did TinkerPop create Gremlin?”
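As a rough illustration of the property graph pieces (id, label, key/value property map), here is a minimal Python sketch. The class names, the `add_edge` and `created_when` helpers, and the exact placement of the `date`, `lang`, and `use` properties are assumptions made for illustration; this is not the Blueprints API.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Vertex:
    id: str
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # edges leaving this vertex
    in_edges: list = field(default_factory=list)   # edges arriving at this vertex


@dataclass(eq=False)
class Edge:
    id: int
    label: str
    out_vertex: Vertex  # tail of the edge
    in_vertex: Vertex   # head of the edge
    properties: dict = field(default_factory=dict)


def add_edge(edge_id, out_v, label, in_v, **properties):
    """Create a labeled edge and register it on both endpoints."""
    edge = Edge(edge_id, label, out_v, in_v, dict(properties))
    out_v.out_edges.append(edge)
    in_v.in_edges.append(edge)
    return edge


# Assumed reading of the slide's figure (property placement is a guess).
tinkerpop = Vertex("tinkerpop")
gremlin = Vertex("gremlin", {"lang": "java", "use": "traverse"})
blueprints = Vertex("blueprints", {"lang": "java", "use": "api"})
add_edge(1, tinkerpop, "created", gremlin, date=2009)
add_edge(2, tinkerpop, "created", blueprints, date=2009)
add_edge(3, gremlin, "imports", blueprints)


def created_when(creator, thing_id):
    """Answer 'When did <creator> create <thing>?' from the edge property."""
    for edge in creator.out_edges:
        if edge.label == "created" and edge.in_vertex.id == thing_id:
            return edge.properties.get("date")
    return None
```

Under these assumptions, `created_when(tinkerpop, "gremlin")` reads the answer to the slide's question off the `created` edge rather than off either vertex.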
12. 12. Numerous Graph Types [Figure: graph types, including vertex-labeled, vertex-attributed, edge-labeled, edge-attributed, weighted, directed, undirected, multi, hyper, simple, pseudo, half-edge, semantic, and resource description framework graphs.] Rodriguez, M.A., Neubauer, P., “Constructions from Dots and Lines,” Bulletin of the American Society for Information Science and Technology, 36(6), pp. 35-41, 2010. [http://arxiv.org/abs/1006.2361]
13. 13. Property Graph as a Rich Structure weighted graph add weight attribute property graph remove attributes remove attributes no op labeled graph no op semantic graph no op directed graph remove edge labels remove edge labels make labels URIs no op rdf graph multi-graph remove directionality remove loops, directionality, and multiple edges simple graph no op undirected graph A fun related thought: Rodriguez, M.A., “Mapping Semantic Networks to Undirected Networks,” International Journal of Applied Mathematics and Computer Sciences, 4(1), pp. 39–42, 2009. [http://arxiv.org/abs/0804.0277]
14. 14. Graph Algorithms in Single-Relational Graphs • Most graph algorithms are designed for single-relational graphs.1 Geodesic: shortest path, eccentricity, diameter, closeness centrality, betweenness centrality, etc. Eigenvector: spreading activation, pagerank, eigenvector centrality, etc. Assortative: scalar, assortative, etc. 1 Excellent book reviewing numerous graph algorithms: Brandes U., Erlebach, T., “Network Analysis: Methodological Foundations,” Springer, 2005.
15. 15. Graph Algorithms in Multi-Relational+ Graphs • Most real-world software systems require multi-relational+ graphs. E.g.: Who are the most central coauthors when all I know is wrote? coauthor coauthor wrote wrote wrote wrote wrote wrote • A key concept when evaluating graph algorithms over multi-relational+ graphs is implicit adjacency/path descriptions/virtual edges/etc.2 2 Rodriguez M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), pp. 29–41, 2009. [http://arxiv.org/abs/0806.2274]
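The idea of implicit adjacency can be sketched in a few lines of Python: derive a single-relational coauthor graph from nothing but wrote edges, then run an ordinary single-relational measure over it. The author and article names are hypothetical, and degree is used only as a stand-in for "most central".

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical (author, article) pairs; the only explicit relation is "wrote".
wrote = [("ann", "p1"), ("bob", "p1"), ("bob", "p2"), ("cat", "p2"), ("dan", "p2")]


def coauthor_edges(wrote_edges):
    """Derive implicit coauthor edges: author -wrote-> article <-wrote- author."""
    authors_of = defaultdict(set)
    for author, article in wrote_edges:
        authors_of[article].add(author)
    edges = set()
    for authors in authors_of.values():
        # Every ordered pair of distinct authors of the same article.
        edges.update(permutations(sorted(authors), 2))
    return edges


def out_degree(edges):
    """A simple single-relational measure applied to the derived graph."""
    degree = defaultdict(int)
    for tail, _ in edges:
        degree[tail] += 1
    return dict(degree)


coauthors = coauthor_edges(wrote)
centrality = out_degree(coauthors)
```

Any single-relational algorithm (shortest path, PageRank, and so on) could replace `out_degree` once the virtual coauthor edges exist.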
16. 16. Outline • Graph Structures • Graph Databases • Graph Applications • TinkerPop Product Suite
17. 17. The Simplicity of a Graph • A graph is a simple data structure. • A graph states that something is related to something else (the foundation of any other data structure).3 • It is possible to model a graph in various types of databases.4 Relational database: MySQL, Oracle, PostgreSQL JSON document database: MongoDB, CouchDB XML document database: MarkLogic, eXist-db etc. 3 A graph can be used to represent other data structures. This point becomes convenient when looking beyond using graphs for typical, real-world domain models (e.g. friends, favorites, etc.), and seeing their applicability in other areas such as modeling code (e.g. http://arxiv.org/abs/0802.3492), indices, etc. 4 For the sake of diagram clarity, the examples to follow are with respect to a single-relational, directed graph. Note that it is possible to model multi-relational graphs in these types of database as well.
18. 18. Representing a Graph in a Relational Database

    outV | inV
    -----|----
    A    | B
    A    | C
    C    | D
    D    | A
19. 19. Representing a Graph in a JSON Database { "A" : { "outE" : ["B", "C"] }, "B" : { "outE" : [] }, "C" : { "outE" : ["D"] }, "D" : { "outE" : ["A"] } }
20. 20. Representing a Graph in an XML Database <graphml> <graph> <node id="A" /> <node id="B" /> <node id="C" /> <node id="D" /> <edge source="A" target="B" /> <edge source="A" target="C" /> <edge source="C" target="D" /> <edge source="D" target="A" /> </graph> </graphml>
21. 21. Deﬁning a Graph Database “If any database can represent a graph, then what is a graph database?”
22. 22. Deﬁning a Graph Database A graph database is any storage system that provides index-free adjacency.
23. 23. Deﬁning a Graph Database by Example Toy Graph Gremlin (stuntman) B E A C D
24. 24. Graph Databases and Index-Free Adjacency [Figure: toy graph with vertices A, B, C, D, and E.] • Our gremlin is at vertex A. • In a graph database, vertex A has direct references to its adjacent vertices. • Moving from A to B and C has a constant time cost with respect to the size of the graph. The cost depends only on the number of edges emanating from vertex A (local).
25. 25. Graph Databases and Index-Free Adjacency B E A C D The Graph (explicit)
26. 26. Graph Databases and Index-Free Adjacency B E A C D The Graph (explicit)
27. 27. Non-Graph Databases and Index-Based Adjacency B E A B C A B,C E D,E D E C D • Our gremlin is at vertex A.
28. 28. Non-Graph Databases and Index-Based Adjacency [Figure: toy graph with vertices A, B, C, D, and E next to an adjacency index.] • In a non-graph database, the gremlin needs to look at an index to determine what is adjacent to A. • Moving to B and C has a log(n) time cost. The cost depends on the total number of vertices and edges in the database (global).
29. 29. Non-Graph Databases and Index-Based Adjacency B E A B C A B,C E D,E D E C D The Index (explicit) The Graph (implicit)
30. 30. Non-Graph Databases and Index-Based Adjacency B E A B C A B,C E D,E D E C D The Index (explicit) The Graph (implicit)
31. 31. Index-Free Adjacency • While any database can implicitly represent a graph, only a graph database makes the graph structure explicit.5 • In a graph database, each vertex serves as a “mini index” of its adjacent elements.6 • Thus, as the graph grows in size, the cost of a local step remains the same.7 5 Please see http://markorodriguez.com/Blarko/Entries/2010/3/29_MySQL_vs._Neo4j_on_a_Large-Scale_Graph_Traversal.html for some performance characteristics of graph traversals in a relational database (MySQL) and a graph database (Neo4j). 6 Each vertex can be interpreted as a “parent node” in an index with its children being its adjacent elements. In this sense, traversing a graph is analogous in many ways to traversing an index, although the graph is not an acyclic connected graph (tree). (A vision espoused by Craig Taverner.) 7 A graph, in many ways, is like a distributed index.
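The contrast between the two adjacency styles can be sketched in Python. This is an illustration only, not how any particular database is implemented: the `Vertex` class and both helper functions are hypothetical, and the edge list reuses the four-vertex example from the relational slide. Note that the dictionary in `build_index_free` is only used while constructing the graph; once built, a step follows object references.

```python
from bisect import bisect_left

edges = [("A", "B"), ("A", "C"), ("C", "D"), ("D", "A")]


class Vertex:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to adjacent Vertex objects


def build_index_free(edge_list):
    """Build vertices that hold direct references to their neighbors."""
    vertices = {}
    for tail, head in edge_list:
        for name in (tail, head):
            vertices.setdefault(name, Vertex(name))
        vertices[tail].out.append(vertices[head])
    return vertices


def neighbors_index_free(vertex):
    """Local step: cost depends only on this vertex's own edges."""
    return [v.name for v in vertex.out]


def neighbors_index_based(sorted_edges, name):
    """Global step: binary search a sorted edge table, O(log n) to find the start."""
    i = bisect_left(sorted_edges, (name,))
    result = []
    while i < len(sorted_edges) and sorted_edges[i][0] == name:
        result.append(sorted_edges[i][1])
        i += 1
    return result


vertices = build_index_free(edges)
edge_table = sorted(edges)
```

Both functions return the same neighbors; the difference is that the index-based lookup grows with the size of the whole edge table, while the index-free step does not.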
32. 32. Graph Query = Graph Traversal • Graph databases are optimized for graph-theoretic operations (e.g. graph traversals). • Graph databases are not optimized for set-theoretic operations (e.g. union, intersection, theta-join). • The graph traversal pattern:8 Given some root set of elements, traverse in X fashion to yield some side-eﬀect and/or destination. 8 Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” Graph Data Management: Techniques and Applications, eds. S. Sakr, E. Pardede, IGI Global, 2011. http://arxiv.org/abs/1004.1001
33. 33. Outline • Graph Structures • Graph Databases • Graph Applications • TinkerPop Product Suite
34. 34. Adventures in Graphlandia My graph disease ﬁrst started in 2001 and it’s only progressed since... • Collective decision making: graph-based voting. • Eudaemonic engine: graph-based recommendation. • Universal computer: graph-based computing.
35. 35. Collective Decision Making: Fall of the Modern World The year is 2014.
36. 36. Oil production has dropped signiﬁcantly. Any reserves that are left are too expensive to purchase. Nations can not transport food.9 Regions with poor agriculture yield famine. 9 Peak oil available at http://en.wikipedia.org/wiki/Peak_oil.
37. 37. People are in shock, fear, and panic over the fall of the modern world. The world sees a 75% drop in human population.
38. 38. The technology and knowledge of the modern world still exists. The social infrastructure doesn’t....A few rise to create a new world order.10 10 Watkins, J.H., M.A. Rodriguez, “A Survey of Web-Based Collective Decision Making Systems,” Studies in Computational Intelligence: Evolution of the Web in Artiﬁcial Intelligence Environments, eds. R. Nayak, N. Ichalkaranje, and L.C. Jain, pp. 245-279, 2008. [http://escholarship.org/uc/item/04h3h1cr]
39. 39. Collective Decision Making: Rise of the Machines Four strong, brave men begin the journey to stability. Decisions marko peter need to be made regarding how to determine and execute social goals. The distributed collective of TinkerPop is created. josh • Marko Rodriguez (former USA) • Peter Neubauer (former Sweden) pavel • Josh Shinavier (former China) • Pavel Yaskevich (former Belarus)
40. 40. Collective Decision Making: Rise of the Machines marko peter josh pavel Dynamically Distribute Direct Democracy Democracy Two examples will be presented for the same decision making scenario. One using direct democracy as the aggregation algorithm and one using dynamically distributed democracy as the aggregation algorithm.11 11 Rodriguez, M.A., Watkins, J.H., “Revisiting the Age of Enlightenment from a Collective Decision Making Systems Perspective,” First Monday, 14(8), 2009. [http://arxiv.org/abs/0901.3929]
41. 41. Collective Decision Making: Direct Democracy • “What percentage of our crop yield should we store as reserves?” • The outcome is represented as a real value in [0, 1]. • Each individual has their opinion of the situation. Marko: 0.8 (80% should be stored). Peter: 0.5 (50% should be stored). Josh: 0.8 (80% should be stored). Pavel: 0.9 (90% should be stored).
42. 42. Collective Decision Making: Direct Democracy • In a direct democracy, everyone voices their opinion. • The average of all voiced opinions is the final decision (even in binary decisions). • For our society of 4 (marko 0.8, peter 0.5, josh 0.8, pavel 0.9), a pure direct democracy would yield (0.8 + 0.5 + 0.8 + 0.9)/4 = 0.75.
43. 43. Collective Decision Making: Direct Democracy • If an individual abstains from participation, then their opinion is not considered. • Assume only Peter and Pavel are there to participate. Marko and Josh are out hunting. • For our society of 4 (with 2 voters), a pure direct democracy would yield (0.5 + 0.9)/2 = 0.7. Compared with full participation, this is an error of |0.75 − 0.7| = 0.05.
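The direct democracy arithmetic above can be reproduced with a short Python sketch. The opinions come from the slides; the function name `direct_democracy` is an illustrative assumption.

```python
opinions = {"marko": 0.8, "peter": 0.5, "josh": 0.8, "pavel": 0.9}


def direct_democracy(opinions, participants):
    """Average the opinions of those who actually participate."""
    votes = [opinions[name] for name in participants]
    return sum(votes) / len(votes)


full = direct_democracy(opinions, opinions.keys())        # everyone votes
partial = direct_democracy(opinions, ["peter", "pavel"])  # Marko and Josh abstain
error = abs(full - partial)
```

With everyone voting the result is about 0.75; with only Peter and Pavel it is about 0.7, which gives the 0.05 error from the slide (up to floating-point rounding).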
44. 44. Collective Decision Making: Representative Democracy • Thomas Paine stated that when populations are small “some convenient tree will aﬀord them a State house”, but as the population increases it becomes a necessity for representatives to “act in the same manner as the whole body would act were they present.”12 13 12 Paine, T., “Common Sense,” 1776. 13 The role of the representative as an expert vs. a model is argued at length in Pitkin, H.F., “The Concept of Representation,” University of California Press, 1972.
45. 45. Collective Decision Making: DDD • Dynamically distributed democracy (DDD) strikes a balance between direct and representative democracy. • An individual is at least a representative of themselves. • An individual can also wield the power of those that abstain from participation. • Dynamically distributing representative power is the purpose of the algorithm.
46. 46. Collective Decision Making: DDD • Peter believes that Josh and Marko are good decision makers. • When Peter abstains, Marko and Josh wield his social power in equal parts (0.5 each). • Like a friendship graph, but the edges denote “trust.” “I believe that X has identical values to me and will behave as I do.” “I believe that X is more expert than I and should make decisions.” [Diagram: trust edges peter → marko (0.5) and peter → josh (0.5).]
47. 47. Collective Decision Making: DDD • Marko believes Josh is the key to humanity. • Josh prefers people closer to his eastern home of former China. • Pavel is of the former Soviet Union, and simply has no faith in anyone. [Diagram: trust edges peter → marko (0.5), peter → josh (0.5), marko → josh (1.0), josh → peter (0.25), josh → pavel (0.75); pavel has no outgoing edges.]
48. 48. Collective Decision Making: DDD [Diagram: the trust graph over marko, peter, josh, and pavel with edge weights 0.5, 0.5, 1.0, 0.25, and 0.75.] This is the trust-based social graph. Individuals can add/remove outgoing edges from their vertex as they please. When decisions are required, the current snapshot of the graph is used to compute the collective decision.
49. 49. Collective Decision Making: DDD • In a dynamically distributed democracy, everyone can voice their opinion. • The weighted average of all voiced opinions is the final decision. • For our society of 4, a pure direct democracy would yield (0.8 + 0.5 + 0.8 + 0.9)/4 = 0.75. • When everyone participates, it's a direct democracy.
50. 50. Collective Decision Making: DDD • Assume Marko and Josh go hunting, again. By abstaining, they diffuse their vote power over their outgoing edges. • By participating, Peter and Pavel aggregate vote power through their incoming edges. • This diffusion process continues until all power has aggregated at participating individuals. [Diagram: every individual starts with a vote power of 1.0.]
51. 51. Collective Decision Making: DDD • Note that Marko fully trusts Josh's decision making abilities. • However, given that Josh is not participating, Marko is implicitly stating that he trusts Josh's decision in choosing decision makers. • Thus, Josh serves to route Marko's power. [Diagram: at this step Peter holds 1.25 vote power and Pavel holds 1.75.]
52. 52. Collective Decision Making: DDD • In the end, Peter and Pavel have aggregated all the energy in the graph (albeit, to different degrees): Peter holds 1.5 and Pavel holds 2.5. • Now a weighted direct democracy is used to calculate the collective decision. • The collective vote is ((1.5·0.5)+(2.5·0.9))/4 = 0.75, an error of |0.75 − 0.75| = 0.0.
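The power diffusion walked through on the DDD slides can be sketched in Python as follows. The trust weights are read off the slide text and figures; the function name `ddd`, the `max_steps` cutoff (a guard against power circulating forever among abstainers), and the choice to normalize by the power that actually reached participants are assumptions of this sketch, not the published algorithm.

```python
opinions = {"marko": 0.8, "peter": 0.5, "josh": 0.8, "pavel": 0.9}

# Outgoing trust edges; each individual's weights sum to 1.0 (or they trust no one).
trust = {
    "peter": {"marko": 0.5, "josh": 0.5},
    "marko": {"josh": 1.0},
    "josh": {"peter": 0.25, "pavel": 0.75},
    "pavel": {},
}


def ddd(opinions, trust, participants, max_steps=100):
    """Diffuse vote power from abstainers to participants, then take a weighted average."""
    participants = set(participants)
    held = {name: 0.0 for name in opinions}    # power that has come to rest
    moving = {name: 1.0 for name in opinions}  # everyone starts with one unit
    for _ in range(max_steps):
        next_moving = {name: 0.0 for name in opinions}
        for name, power in moving.items():
            if power == 0.0:
                continue
            if name in participants or not trust[name]:
                held[name] += power  # rests here (lost if the holder abstains)
            else:
                for trusted, weight in trust[name].items():
                    next_moving[trusted] += power * weight
        moving = next_moving
        if not any(moving.values()):
            break
    power = {name: held[name] for name in participants}
    total = sum(power.values())
    decision = sum(power[name] * opinions[name] for name in participants) / total
    return decision, power


decision, power = ddd(opinions, trust, ["peter", "pavel"])
```

With Marko and Josh abstaining, Peter ends with 1.5 and Pavel with 2.5 units of power, and the weighted vote comes out at about 0.75, matching the full-participation result.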
53. 53. Collective Decision Making: DDD [Figure: two plots against the percentage of active citizens (100 down to 0) comparing direct democracy with dynamically distributed democracy, one showing the proportion of error and one the proportion of correct decisions. Caption from the source paper: Fig. 5. The relationship between k and e^vote_k for direct democracy (gray line) and dynamically distributed democracy (black line). The plot provides the proportion of identical, correct decisions over a simulation that was run with 1000 artificially generated networks composed of 100 citizens each. In the simulation, the weight of an edge is A_{i,j} = 1 − |x_i − x_j| if a link exists and 0 otherwise, where x_i ∈ [0, 1] is the political tendency of citizen i.] • As participation wanes, dynamically distributed democracy is able to simulate direct democracy.14 14 Rodriguez, M.A., Steinbock, D.J., “A Social Network for Societal-Scale Decision-Making Systems,” Proceedings of the Computational Social and Organizational Science Conference, 2004. [http://arxiv.org/abs/cs/0412047]
54. 54. Collective Decision Making: Techno-Government • In this model of decision making, there is no governmental body. • Power is determined when a decision is needed. • How are bills created? Wikilegislature?15 • What about diﬀerent types of trust (e.g. “Marko trusts Josh in engineering decisions only.”) — Hint: Multi-relational+ graphs. Tagging legislature and tagging trust.16 15 Turoﬀ, M., Roxanne-Hiltz, S., Bieber, M., Rana, A., “Collaborative Discourse Structures in Computer Mediated Group Communications”, Hawaii International Conference on Systems Science (HICSS), 1998. [http://web.njit.edu/~turoff/Papers/CDSCMC/CDSCMC.htm] 16 Rodriguez, M.A., “Social Decision Making with Multi-Relational Networks and Grammar-Based Particle Swarms,” Hawaii International Conference on Systems Science (HICSS), pp. 39–49, 2007. [http://arxiv.org/abs/cs/0609034]
55. 55. “The founders of modern democracies provided a moral heritage that remains highly regarded in societies today. However, it should be remembered that it is the ideals that are valuable, not the speciﬁc implementation of the systems that protect and support them. If there is another implementation of government that better realizes these ideals, then, by the rights of man, it must be enacted.”17 – Michael Scott 17 Rodriguez, M.A., Watkins, J.H., “Revisiting the Age of Enlightenment from a Collective Decision Making Systems Perspective,” First Monday, 14(8), University of Illinois at Chicago Library, 2009. [http://arxiv.org/abs/0901.3929]
56. 56. Eudaemonic Engine: Seeking Virtue through Circuitry The year is 2018.
57. 57. Human life on earth has stabilized.
58. 58. Humans no longer struggle to survive. They struggle for eudaemonia. They seek the “good daemon” within...
59. 59. Eudaemonic Engine: Aristotle • Being virtuous is repeatedly choosing correctly. • Habitual correct behavior leads to eudaemonia – complete engagement in the world (a complete sense of engagement/acceptance).18 19 • Can systems aid individuals in choosing correctly – in all aspects of life? [Pictured: Aristotle and David L. Norton.] 18 Aristotle, “Nicomachean Ethics”, 350 B.C. 19 Mihaly Csikszentmihalyi, “Flow: The Psychology of Optimal Experience”, Harper Perennial, 1990.
60. 60. Eudaemonic Engine: Resource Modeling But if the development of character is the moral objective, it is obvious that [...] the choices of vocation and avocations to pursue, of friends to cultivate, of books to read are moral for they clearly influence such development.20 • Web services are continuing to build richer models of humans, resources, and the relationships between them. • There exists an increasing reliance on such services to aid in decision making: correct books (Amazon.com), correct movies (NetFlix.com), correct music (Pandora), correct occupation (Monster.com), correct friends (PointsCommuns.com), correct life partner (Match.com), etc.21 20 David L. Norton, “Democracy and Moral Development: A Politics of Virtue”, University of California Press, 1991. 21 Rodriguez, M.A., Watkins, J., “Faith in the Algorithm, Part 2: Computational Eudaemonics,” Proceedings of the International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, 5712, pp. 813–820, 2009. [http://arxiv.org/abs/0904.0027]
61. 61. Eudaemonic Engine: Mapping Person to Resource movie watch article read time person listen music meet friend eat food Map an individual to actions on resources. However, how do we model/expose the resources of the world?
62. 62. Model
63. 63. Eudaemonic Engine: The Web of Data [Figure: a diagram of interlinked Web of Data data sets, for example dbpedia, freebase, geonames, musicbrainz, uniprot, pubmed, and projectgutenberg.]
64. 64. Eudaemonic Engine: URIs of the Web of Data http://dbpedia.org/resource/The Fountainhead FLICKR http://www4.wiwiss.fu-berlin.de/ﬂickrwrappr/photos/Ayn_Rand foaf:depiction ﬂickr:Ayn_Rand dbpprop:hasPhotoCollection dbpedia:Ayn_Rand DBPEDIA dbpedia:Book dbpedia:author dbpedia:Fountain_Head rdf:type
65. 65. Eudaemonic Engine: Datasets on the Web of Data Data set (domain): audioscrobbler (music), bbclatertotp (music), bbcplaycountdata (music), bbcprogrammes (media), budapestbme (computer), chebi (biology), crunchbase (business), dailymed (medical), dblpberlin (computer), dblphannover (computer), dblprkbexplorer (computer), dbpedia (general), doapspace (social), drugbank (medical), eurecom (computer), eurostat (government), flickrexporter (images), flickrwrappr (images), foafprofiles (social), freebase (general), geneid (biology), geneontology (biology), geonames (geographic), govtrack (government), homologene (biology), ibm (computer), ieee (computer), interpro (biology), jamendo (music), laascnrs (computer), libris (books), lingvoj (reference), linkedct (medical), linkedmdb (movie), magnatune (music), musicbrainz (music), myspacewrapper (social), opencalais (reference), opencyc (general), openguides (reference), pdb (biology), pfam (biology), pisa (computer), prodom (biology), projectgutenberg (books), prosite (biology), pubguide (books), qdos (social), rae2001 (computer), rdfbookmashup (books), rdfohloh (social), resex (computer), riese (government), semanticweborg (computer), semwebcentral (social), siocsites (social), surgeradio (music), swconferencecorpus (computer), taxonomy (reference), umbel (general), uniref (biology), unists (biology), uscensusdata (government), virtuososponger (reference), w3cwordnet (reference), wikicompany (business), worldfactbook (government), yago (general), ...
66. 66. Eudaemonic Engine: Transforms Development A new application development paradigm emerges. No longer do data and application providers need to be the same entity (left). With the Web of Data, it's possible for developers to write applications that utilize data that they do not maintain (right).22 [Figure: on the left, Applications 1-3 each run processes over their own structures on 127.0.0.1, 127.0.0.2, and 127.0.0.3; on the right, the same applications run processes over structures shared through the Web of Data.] 22 Rodriguez, M.A., “A Reflection on the Structure and Process of the Web of Data,” Bulletin of the American Society for Information Science and Technology, 35(6), pp. 38–43, 2009. [http://arxiv.org/abs/0908.0373]
67. 67. Now that there is a rich structure, what is the process?
68. 68. Process
69. 69. Eudaemonic Engine: Diﬀusion Processes on Graphs A graph diﬀusion process will be used to determine the solution to one’s problems. • Graph traversing can be seen as a diﬀusion process over a graph. • “Energy” moves over a graph and reverberates in regions where there is recurrence (i.e. cycles). • At some t in the future, the vertices with the greatest ﬂow are the solution to the problem.
70. 70. Eudaemonic Engine: Diﬀusion Processes on Graphs
71. 71. Eudaemonic Engine: Diﬀusion Processes on Graphs
72. 72. Eudaemonic Engine: Diﬀusion Processes on Graphs
73. 73. Eudaemonic Engine: Diﬀusion Processes on Graphs
74. 74. Eudaemonic Engine: Diﬀusion Processes on Graphs
75. 75. Implementing a diffusion process is easy when the edges of the graph are unlabeled. flow = new HashMap<Vertex,Integer>(); current = Arrays.asList(startVertex); steps = 10; for(int i=0; i<steps; i++) { /* step to all adjacent vertices and flatten the per-vertex lists */ current = current.collect{ it.getAdjacentVertices() }.flatten(); /* count each visit; a vertex not seen before starts at 0 */ current.each{ flow[it] = (flow[it] ?: 0) + 1 } }
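For readers who want to run the loop above, here is a rough Python equivalent. The adjacency dictionary stands in for the slide's `getAdjacentVertices()` and reuses the four-vertex example from the earlier slides; both are assumptions made for illustration.

```python
from collections import Counter


def diffuse(adjacency, start, steps=10):
    """Count how often each vertex is reached while stepping outward from start."""
    flow = Counter()
    current = [start]
    for _ in range(steps):
        # Move every walker to all of its adjacent vertices.
        current = [n for v in current for n in adjacency.get(v, [])]
        flow.update(current)
    return flow


adjacency = {"A": ["B", "C"], "B": [], "C": ["D"], "D": ["A"]}
flow = diffuse(adjacency, "A", steps=6)
```

The list of walkers can grow quickly on graphs with high branching, so a real implementation would likely carry counts per vertex instead of one list entry per path.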
76. 76. Eudaemonic Engine: Diffusion on a Property Graph? [Figure: toy property graph of people (emil, jen, marko, peter) and things (24, True Blood, The Wire, linked process, gremlin, intelligence, graphs) connected by edges labeled knows, likes, wrote, occupation, and tagged.] With different types of things being related by different types of relations, you need to specify legal paths for the energy to flow over.
77. 77. Eudaemonic Engine: Diﬀusion on a Property Graph • Problem statement = Start vertices + path expression. • Problem solution = Highest energy vertices at t.23 24 25 23 Examples presented next are basic due to the simplicity of the toy graph example used. In such cases, queries as opposed to energy diﬀusions are best. In general, the purpose of an energy diﬀusion is to expose recurrence/feedback in the graph. For the more technically inclined, think of it as determining the eigenvector of the graph deﬁned by the path expression. 24 Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks,” Knowledge-Based Systems, 21(7), pp. 727–739, 2008. [http://arxiv.org/abs/0803.4355] 25 Rodriguez, M.A., Neubauer, P., “A Path Algebra for Multi-Relational Graphs,” 2nd International Workshop on Graph Data Management (GDM11), 2010. [http://arxiv.org/abs/1011.0390]
78. 78. Eudaemonic Engine: Friend Recommendation [Figure: toy property graph with knows, likes, wrote, occupation, and tagged edges.] “Who are my friends’ friends that are not me or my friends?”26 26 marko.outE[[label:'knows']].inV.aggregate(x).outE.inV{!x.contains(it)}
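The intent of the friend recommendation question can be sketched in plain Python. The `knows` edges below are hypothetical (the slide's figure does not survive in this transcript), and the function is an illustration of the question rather than a translation of the Gremlin expression.

```python
# Hypothetical "knows" edges: person -> people they know.
knows = {
    "marko": ["peter", "jen"],
    "peter": ["emil", "marko"],
    "jen": ["peter"],
    "emil": [],
}


def friends_of_friends(knows, me):
    """Friends' friends, excluding me and the friends I already have."""
    friends = set(knows.get(me, []))
    result = set()
    for friend in friends:
        for candidate in knows.get(friend, []):
            if candidate != me and candidate not in friends:
                result.add(candidate)
    return result


suggested = friends_of_friends(knows, "marko")
```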
79. 79. Eudaemonic Engine: Product Recommendation [Figure: toy property graph with knows, likes, wrote, occupation, and tagged edges.] “Who likes what I like? Of those things they like, what else do they like that I don’t already like?”27 27 marko.outE[[label:'likes']].inV.aggregate(x).inE[[label:'likes']].outV.outE[[label:'likes']].inV{!x.contains(it)}
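A small Python sketch of the same likes-based traversal, scoring each candidate by the number of paths that reach it. The `likes` data is hypothetical and only loosely modeled on the names in the slide's figure; `recommend` is an illustrative helper, not part of Gremlin.

```python
from collections import Counter

# Hypothetical "likes" edges: person -> set of things they like.
likes = {
    "marko": {"The Wire", "True Blood"},
    "jen": {"The Wire", "24"},
    "emil": {"True Blood", "24", "graphs"},
    "peter": {"graphs"},
}


def recommend(likes, me):
    """Score things liked by people who share my likes, minus what I already like."""
    mine = likes[me]
    scores = Counter()
    for person, theirs in likes.items():
        if person == me:
            continue
        shared = len(mine & theirs)  # number of paths from me to this person
        if shared == 0:
            continue
        for item in theirs - mine:
            scores[item] += shared   # one path per shared like
    return scores


scores = recommend(likes, "marko")
```

Ranking by path count is one simple choice; the energy diffusion described on the earlier slides generalizes this by letting flow accumulate over longer or repeated paths.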
80. 80. Eudaemonic Engine: Product Recommendation 2 [Figure: toy property graph with knows, likes, wrote, occupation, and tagged edges.] “Who likes what I like and what do they like? What do the people I know like? Of those things liked, what do I not already like?”
81. 81. Eudaemonic Engine: Recommendation • Diﬀerent paths through a domain model expose diﬀerent types of recommendations. • Individual path preferences allow for an ecosystem of traversals (diﬀerent problems can be solved over the same domain model).28 29 30 28 Rodriguez, M.A., Allen, D.W., Shinavier, J., Ebersole, G., “A Recommender System to Support the Scholarly Communication Process,” 2009. [http://arxiv.org/abs/0905.1594] 29 Rodriguez, M.A., “Problem-Solving using Graph Traversals: Searching, Scoring, Ranking, and Recommendation,” Technical Talk Seminar, AT&T Interactive, 2010. [http://slidesha.re/bOCy4Q] 30 Traversal Patterns with Gremlin available at https://github.com/tinkerpop/gremlin/wiki/ Traversal-Patterns.
82. 82. Universal Computer: A Single Computational Substrate The year is 2023.
83. 83. Life is good. Humans ﬂourish. Virtuous men’s minds are ﬁlled with wonderfully creative ideas. Inventions proliferate.
84. 84. Advances in computer network technology yield a new model of computing. Computer networks are no longer the bottleneck for speed. Accessing local and remote data is no longer considered “diﬀerent.” The distinction between RAM, disk drive, and Web disappears.
85. 85. Universal Computer: A Computational Substrate On the Web... • Represent data. • Represent code. • Represent virtual machines.
86. 86. Universal Computer: Represent Data • URIs form an infinite universal address space. • A URI can denote a datum. http://markorodriguez.com#self (Marko) http://sws.geonames.org/4887398/about.rdf (Chicago) http://data.nytimes.com/N38395718310308503251 (Malmö) • RDF (Resource Description Framework) is a data model for linking URIs into a multi-relational graph.
87. 87. Universal Computer: Represent Data 127.0.0.2 127.0.0.1 atti:marko atti:bestFriend nm:puppy atti:hasFur atti:hasFur atti:numberOfLegs atti:numberOfLegs "2"^^xsd:integer "false"^^xsd:boolean "4"^^xsd:integer "true"^^xsd:boolean • The concept of atti:marko and the properties atti:numberOfLegs, atti:hasFur, and atti:bestFriend is maintained by AT&Ti graph server. • The concept of nm:puppy is maintained by a New Mexico graph server. • The data types of xsd:integer and xsd:boolean are maintained by XML standards organization.
88. 88. Universal Computer: Represent Code • Computing is a series of instructions: add, write, branch, goto... • The URI address space and RDF glue can be seen as a computational medium.31 [Figure: the instruction _:123 has rdf:type atti:Add, atti:left-op "3"^^xsd:int, and atti:right-op "7"^^xsd:int; atti:Add is an rdfs:subClassOf atti:Instruction.] 31 Rodriguez, M.A., “General-Purpose Computing on a Semantic Network Substrate,” Emergent Web Intelligence: Advanced Semantic Technologies, eds. R. Chbeir, A. Hassanien, A. Abraham, and Y. Badr, pp. 57–104, 2010. [http://arxiv.org/abs/0704.3395]
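To make the "instructions as graph data" idea concrete, here is a small sketch in plain Java that stores the add instruction from the slide as statements and then evaluates it by reading those statements back. The class InstructionSketch and its evaluate method are hypothetical; the literals are stored as plain strings rather than typed RDF literals, and only atti:Add is handled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an instruction is a node whose operands are properties in the graph.
class InstructionSketch {
    // subject -> (predicate -> object); one object per predicate is enough for this example
    private final Map<String, Map<String, String>> graph = new HashMap<>();

    void add(String subject, String predicate, String object) {
        graph.computeIfAbsent(subject, k -> new HashMap<>()).put(predicate, object);
    }

    // Evaluate the instruction stored at the given node by reading its type and operands.
    int evaluate(String node) {
        Map<String, String> props = graph.get(node);
        if (props == null || !"atti:Add".equals(props.get("rdf:type"))) {
            throw new IllegalArgumentException("Not an atti:Add instruction: " + node);
        }
        int left = Integer.parseInt(props.get("atti:left-op"));
        int right = Integer.parseInt(props.get("atti:right-op"));
        return left + right;
    }

    // The instruction shown on the slide: _:123 adds 3 and 7.
    static InstructionSketch example() {
        InstructionSketch g = new InstructionSketch();
        g.add("_:123", "rdf:type", "atti:Add");
        g.add("_:123", "atti:left-op", "3");
        g.add("_:123", "atti:right-op", "7");
        return g;
    }

    public static void main(String[] args) {
        System.out.println(example().evaluate("_:123")); // prints 10
    }
}
```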
89. 89. Universal Computer: Represent Code atti:marko atti:bestFriend nm:puppy atti:hasMethod atti:isHappy Method atti:pet "false"^^xsd:boolean atti:args atti:block _:1234 _:2345 atti:inst rdf:1 _:3456 "animal"^^xsd:string // make animal happy Represent methods and their instructions attached to objects/classes.
90. 90. Universal Computer: Represent Virtual Machines Virtual Machine atti:VM atti:marko atti:bestFriend nm:puppy atti:hasMethod atti:isHappy rdf:type _:6789 atti:pc _:3456 atti:pet "false"^^xsd:boolean atti:block atti:inst _:2345 write "true"^^xsd:boolean Represent not only code, but the machines that execute it.
91. 91. Universal Computer: Represent Virtual Machines xsd:boolean RVM xsd:boolean [1] [1] methodReuse halt programLocation Fhat operandTop hasFrame returnTop [0..1] [0..1] [0..1] currentFrame [0..1] Operand [0..1] Instruction ReturnStack Stack rdf:rest rdf:rest blockTop rdf:ﬁrst [0..1] [0..*] rdf:ﬁrst [0..1] [0..1] forFrame Frame [1] rdfs:Resource Instruction rdf:li [0..*] [0..1] [0..1] Frame Block Variable Stack rdf:rest hasSymbol hasValue fromBlock rdf:ﬁrst [0..1] [1] [0..*] [1] Block xsd:string rdfs:Resource Block NenoFhat Project (circa 2006): http://neno.lanl.gov.
92. 92. Global Data Structure Data Machine Architecture API Program Virtual Machine State read/write read/write Virtual Machine Processes ... 127.0.0.1 Physical Machines 127.0.0.4 127.0.0.2 127.0.0.3 Physics My Belief in Reality
93. 93. Universal Computer: A Ramification • Data, APIs, code, machine architectures, and virtual machines are within the same global URI address space. Code can be physically distributed across computers. For example, an add instruction on 127.0.0.1 references a branch instruction on 127.0.0.2. Hardware machines can be added or removed without altering the state of computation, only the speed. No developer concept of RAM-based memory addresses: the only address space is the space of all URIs.
94. 94. Universal Computer: Another Ramification • Reflection down to the machine level.32 Most languages support the manipulation of code at runtime. In this model, the virtual machine can be altered at runtime. Code can rewrite the virtual machine that is evaluating the code. (i.e. create lots of bugs.) 32 Rodriguez, M.A., The RDF Virtual Machine, LA-UR-08-03925, in review, 2009. [http://arxiv.org/abs/0802.3492]
95. 95. The year is 2030.
96. 96. Humans learn to encode themselves into the URI address space...33 34 33 Egan, G., “Permutation City,” Eos Publisher, 1995. 34 Rodriguez, M.A., “From the Signal to the Symbol: Structure and Process in Artificial Intelligence,” Center for Nonlinear Studies Post Doctorate Seminar, Los Alamos National Laboratory, Los Alamos, New Mexico, 2008. [http://slidesha.re/hdqRn2]
97. 97. Outline • Graph Structures • Graph Databases • Graph Applications • TinkerPop Product Suite
98. 98. This is the TinkerPop...
99. 99. TinkerPop Productions • Blueprints: Data Models and their Implementations [http://blueprints.tinkerpop.com] • Pipes: A Data Flow Framework using Process Graphs [http://pipes.tinkerpop.com] • Gremlin: A Graph-Based Programming Language [http://gremlin.tinkerpop.com] • Rexster: A RESTful Graph Shell [http://rexster.tinkerpop.com]35 35 Please see http://engineering.attinteractive.com/2010/12/a-graph-processing-stack/ for a short review of these products. Also TinkerPop’s homepage at: http://tinkerpop.com
100. 100. Blueprints: A Property Graph Model Interface Blueprints • Blueprints is like the JDBC of the graph database community. • Provides a Java-based interface API for the property graph data model: Graph, Vertex, Edge, Index. • Connectors to TinkerGraph, Neo4j, OrientDB, Sails (e.g. AllegroGraph, HyperSail, etc.), and soon InfiniteGraph. In the future, we hope to support InfoGrid, Sones, DEX, and HyperGraphDB.36 36 HyperGraphDB makes use of an n-ary graph structure known as a hypergraph. Blueprints, in its current form, only supports the more common binary graph.
101. 101. Creating a Neo4jGraph in Blueprints
    // create a graph
    Graph graph = new Neo4jGraph("/tmp/neo4j");
    // add two vertices
    Vertex a = graph.addVertex(null);
    a.setProperty("name","marko");
    Vertex b = graph.addVertex(null);
    b.setProperty("name","peter");
    // join the two vertices by a knows relation
    Edge e = graph.addEdge(null,a,b,"knows");
    e.setProperty("since","2007");
[Figure: vertex 0 (name=marko) is joined to vertex 1 (name=peter) by a knows edge with since=2007.]
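For readers without a Blueprints installation, the same two-vertex graph can be sketched with a few plain Java classes. Everything below (PropertyGraphSketch, Element, Vertex, Edge, and their methods) is a hypothetical stand-in written for this note and is not the Blueprints API. It mainly illustrates that each vertex holds direct references to its edges, so adjacency needs no index lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a property graph; not the Blueprints interfaces.
class PropertyGraphSketch {
    // Vertices and edges are both elements: an id plus a key/value property map.
    static class Element {
        final int id;
        final Map<String, Object> properties = new HashMap<>();

        Element(int id) { this.id = id; }

        void setProperty(String key, Object value) { properties.put(key, value); }

        Object getProperty(String key) { return properties.get(key); }
    }

    static final class Vertex extends Element {
        // Direct references to incident edges (index-free adjacency).
        final List<Edge> outEdges = new ArrayList<>();
        final List<Edge> inEdges = new ArrayList<>();

        Vertex(int id) { super(id); }
    }

    static final class Edge extends Element {
        final Vertex outVertex; // tail of the edge
        final Vertex inVertex;  // head of the edge
        final String label;

        Edge(int id, Vertex outVertex, Vertex inVertex, String label) {
            super(id);
            this.outVertex = outVertex;
            this.inVertex = inVertex;
            this.label = label;
        }
    }

    private int nextId = 0;

    Vertex addVertex() { return new Vertex(nextId++); }

    Edge addEdge(Vertex out, Vertex in, String label) {
        Edge e = new Edge(nextId++, out, in, label);
        out.outEdges.add(e);
        in.inEdges.add(e);
        return e;
    }

    // Build the marko-knows-peter graph from the slide and return the marko vertex.
    static Vertex example() {
        PropertyGraphSketch graph = new PropertyGraphSketch();
        Vertex a = graph.addVertex();
        a.setProperty("name", "marko");
        Vertex b = graph.addVertex();
        b.setProperty("name", "peter");
        Edge e = graph.addEdge(a, b, "knows");
        e.setProperty("since", "2007");
        return a;
    }

    public static void main(String[] args) {
        Vertex a = example();
        for (Edge e : a.outEdges) {
            System.out.println(a.getProperty("name") + " " + e.label + " "
                    + e.inVertex.getProperty("name")); // prints "marko knows peter"
        }
    }
}
```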
102. 102. Handy Features of Blueprints • Supports automatic transactions graph.setTransactionMode(AUTOMATIC -or- MANUAL) In automatic mode, every manipulation of the graph is wrapped in a transaction and committed. • Supports automatic indices graph.createIndex(AUTOMATIC -or- MANUAL) In automatic mode, elements are added or removed from an index as their properties are manipulated. • Utility Suite Blueprints Sail makes a graphdb into a traversal-based RDF store. GraphML Reader/Writer library.
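The automatic index mode described above keeps an index in step with property changes. A rough sketch of that behavior in plain Java is shown here. The class AutoIndexSketch and its methods are invented for illustration and say nothing about how Blueprints implements its indices.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an "automatic" index: setProperty keeps the index current.
class AutoIndexSketch {
    // "key=value" -> ids of the elements that currently hold that property value
    private final Map<String, Set<Integer>> index = new HashMap<>();
    // element id -> its properties
    private final Map<Integer, Map<String, String>> properties = new HashMap<>();

    void setProperty(int id, String key, String value) {
        Map<String, String> props = properties.computeIfAbsent(id, k -> new HashMap<>());
        String old = props.put(key, value);
        if (old != null) {
            // Remove the stale entry so lookups on the old value no longer find this element.
            Set<Integer> stale = index.get(key + "=" + old);
            if (stale != null) {
                stale.remove(id);
            }
        }
        index.computeIfAbsent(key + "=" + value, k -> new HashSet<>()).add(id);
    }

    // Look up element ids by property key and value.
    Set<Integer> get(String key, String value) {
        Set<Integer> hits = index.get(key + "=" + value);
        return hits == null ? new HashSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        AutoIndexSketch g = new AutoIndexSketch();
        g.setProperty(0, "name", "marko");
        g.setProperty(1, "name", "peter");
        g.setProperty(0, "name", "marko a.");
        System.out.println(g.get("name", "marko"));    // prints []
        System.out.println(g.get("name", "marko a.")); // prints [0]
    }
}
```

In a manual mode, by contrast, the caller would be responsible for the put and remove calls that setProperty performs here.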
103. 103. Pipes: A Data Flow Framework using Process Graphs Pipes • Lazy data ﬂow with support for Blueprints-based graph processing. • Provides a collection of “pipes” (implement Iterable and Iterator) that are connected together to form processing pipelines. Filters: ComparisonFilterPipe, RandomFilterPipe, etc. Traversal: VertexEdgePipe, EdgeVertexPipe, PropertyPipe, etc. Splitting/Merging: CopySplitPipe, RobinMergePipe, etc. Logic: OrFilterPipe, AndFilterPipe, etc.
104. 104. Pipes: Chained Iterators This pipeline takes objects of type A and turns them into objects of type D through a sequence of processing pipes...37 D D A A A Pipe1 B Pipe2 C Pipe3 D D A D A Pipeline Pipe<A,D> pipeline = new Pipeline<A,D>(Pipe1<A,B>, Pipe2<B,C>, Pipe3<C,D>) 37 Though not discussed, splitting and merging is allowed as well (branching pipelines).
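The chained-iterator idea above can be sketched with nothing but java.util.Iterator. The helper below is hypothetical (LazyPipeSketch and pipe are names chosen for this note, not Pipes classes). It shows three stages turning objects of one type into another, with each element pulled through the whole chain only when the consumer asks for it.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

// Hypothetical sketch of lazy, chained iterators; not the Pipes library.
class LazyPipeSketch {
    // Wrap an upstream iterator and apply fn to each element only when it is requested.
    static <S, E> Iterator<E> pipe(final Iterator<S> starts, final Function<S, E> fn) {
        return new Iterator<E>() {
            public boolean hasNext() {
                return starts.hasNext();
            }

            public E next() {
                return fn.apply(starts.next());
            }
        };
    }

    // Three stages chained together: Integer -> Integer -> String -> String.
    static Iterator<String> example(Iterator<Integer> starts) {
        Iterator<Integer> doubled = pipe(starts, x -> x * 2);
        Iterator<String> named = pipe(doubled, x -> "n" + x);
        return pipe(named, String::toUpperCase);
    }

    public static void main(String[] args) {
        Iterator<String> pipeline = example(Arrays.asList(1, 2, 3).iterator());
        while (pipeline.hasNext()) {
            System.out.println(pipeline.next()); // prints N2, N4, N6 on separate lines
        }
    }
}
```

Because nothing is computed until next() is called, a consumer that stops early never pays for the elements it did not read.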
105. 105. Pipes: A Simple Example “What are the names of the people that marko knows?” B name=peter knows A knows C name=pavel name=marko created created D name=gremlin
106. 106. Pipes: A Simple Example
    Pipe<Vertex,Edge> pipe1 = new VertexEdgePipe(Step.OUT_EDGES);
    Pipe<Edge,Edge> pipe2 = new LabelFilterPipe("knows",Filter.NOT_EQUAL);
    Pipe<Edge,Vertex> pipe3 = new EdgeVertexPipe(Step.IN_VERTEX);
    Pipe<Vertex,String> pipe4 = new PropertyPipe<String>("name");
    Pipe<Vertex,String> pipeline = new Pipeline(pipe1,pipe2,pipe3,pipe4);
    pipeline.setStarts(new SingleIterator<Vertex>(graph.getVertex("A")));
[Figure: A (name=marko) has knows edges to B (name=peter) and C (name=pavel); the graph also contains two created edges and D (name=gremlin).]
107. 107. Pipes: A Simple Example for(String name : pipeline) { System.out.println(name); } B name=peter knows A knows C name=pavel name=marko created created D name=gremlin peter pavel
108. 108. Pipes: A Simple Example EdgeVertexPipe(IN_VERTEX) VertexEdgePipe(OUT_EDGES) PropertyPipe("name") B name=peter knows A knows C name=pavel name=marko created created D name=gremlin LabelFilterPipe("knows")
112. 112. Pipes: Library of Generally Useful Pipes [ MERGES ] [ SIDEEFFECTS ] [ FILTERS ] ExhaustiveMergePipe AggregatorPipe AndFilterPipe RobinMergePipe CountCombinePipe CollectionFilterPipe CountPipe ComparisonFilterPipe [ GRAPHS ] KeyCombinePipe DuplicateFilterPipe EdgeVertexPipe SideEffectCapPipe FutureFilterPipe IdFilterPipe ObjectFilterPipe IdPipe [ UTILITIES ] OrFilterPipe LabelFilterPipe DynamicStartsPipe RandomFilterPipe LabelPipe GatherPipe RangeFilterPipe PropertyFilterPipe PathPipe PropertyPipe PrintStreamPipe VertexEdgePipe ProductPipe [ SPLITS ] ScatterPipe CopySplitPipe TypeCastPipe RobinSplitPipe Pipeline ...
113. 113. Pipes: Easy to Create New Pipes public class NumCharsPipe extends AbstractPipe<String,Integer> { public Integer processNextStart() { String word = this.starts.next(); return word.length(); } } When extending the base class AbstractPipe<S,E> all that is required is an implementation of processNextStart().
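To show why a single processNextStart() method is enough, here is a self-contained sketch of the pattern. AbstractPipeSketch below is written for this note and only approximates what the real AbstractPipe does; LongWordPipe and its length threshold are likewise hypothetical. The base class turns "return the next result or throw NoSuchElementException" into a normal Iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical re-creation of the extend-one-method pattern; not the Pipes source.
class PipeSketch {
    static abstract class AbstractPipeSketch<S, E> implements Iterator<E> {
        protected Iterator<S> starts;
        private E nextEnd;
        private boolean available = false;

        void setStarts(Iterator<S> starts) { this.starts = starts; }

        // Return the next result, or throw NoSuchElementException when the input is exhausted.
        protected abstract E processNextStart();

        public boolean hasNext() {
            if (available) {
                return true;
            }
            try {
                nextEnd = processNextStart();
                available = true;
                return true;
            } catch (NoSuchElementException e) {
                return false;
            }
        }

        public E next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            available = false;
            return nextEnd;
        }
    }

    // The pipe from the slide: emits the length of each incoming word.
    static final class NumCharsPipe extends AbstractPipeSketch<String, Integer> {
        protected Integer processNextStart() {
            return starts.next().length();
        }
    }

    // A filter written the same way: keep pulling until a word longer than 4 characters appears.
    static final class LongWordPipe extends AbstractPipeSketch<String, String> {
        protected String processNextStart() {
            while (true) {
                String word = starts.next();
                if (word.length() > 4) {
                    return word;
                }
            }
        }
    }

    // Chain the filter into the transformer and collect the results.
    static List<Integer> run(List<String> words) {
        LongWordPipe filter = new LongWordPipe();
        filter.setStarts(words.iterator());
        NumCharsPipe count = new NumCharsPipe();
        count.setStarts(filter);
        List<Integer> result = new ArrayList<>();
        while (count.hasNext()) {
            result.add(count.next());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("graph", "db", "gremlin"))); // prints [5, 7]
    }
}
```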
114. 114. Pipes: Easy to Create New Pipes Most of my projects are composed of lots of application-specific Pipes, that is, Pipes that are specific to my domain model and yield useful jumps in the graph. For example, the domain-specific SameLikesPipe<Vertex,Vertex>. From these domain-specific Pipes, complex traversal algorithms are created by piecing those Pipes together. For example, RecommenderPipe<Vertex,Map>. [Figure label: com.tinkerpop.pipes]
115. 115. Gremlin: A Graph-Based Programming Language Gremlin G = (V, E) • A graph traversal language that uses Groovy as its host language. • Compiles Gremlin syntax down to Pipes (implements JSR 223).38 38 At the time of this presentation, Gremlin’s most recent stable release is 0.6, which is a standalone language. To increase the flexibility of the language, 0.7-SNAPSHOT+ uses Groovy as the host language.
116. 116. Gremlin: Easily Compose Graph Related Pipes Pipes is verbose...
    Pipe<Vertex,Edge> pipe1 = new VertexEdgePipe(Step.OUT_EDGES);
    Pipe<Edge,Edge> pipe2 = new LabelFilterPipe("knows",Filter.NOT_EQUAL);
    Pipe<Edge,Vertex> pipe3 = new EdgeVertexPipe(Step.IN_VERTEX);
    Pipe<Vertex,String> pipe4 = new PropertyPipe<String>("name");
    Pipe<Vertex,String> pipeline = new Pipeline(pipe1,pipe2,pipe3,pipe4);
    pipeline.setStarts(new SingleIterator<Vertex>(graph.getVertex("A")));
...relative to Gremlin.
    g.v('A').outE[[label:'knows']].inV.name
117. 117. Gremlin: The Simple Example inV outE name B name=peter knows g.v('A') A knows C name=pavel name=marko created created D name=gremlin [[label:'knows']]
118. 118. Gremlin: Defining a Step “Who likes the same things that I like?” Vertex.metaClass.same_likes = { _().outE[[label:'likes']].inV.inE[[label:'likes']].outV } B likes E likes likes A C likes F likes likes D likes G
119. 119. Gremlin: Defining a Step gremlin> g.v('A').same_likes ==>v[E] ==>v[F] ==>v[F] ==>v[G] B likes E likes likes A C likes F likes likes D likes G
120. 120. Gremlin: Defining a Step gremlin> m = g.v('A').same_likes.group_count >> 1 gremlin> m ==>v[E]=1 ==>v[F]=2 ==>v[G]=1 v[F] is most similar, in terms of likes, to v[A].39 39 For a thorough review of such traversal patterns, please see: Rodriguez, M.A., “Problem-Solving using Graph Traversals: Searching, Scoring, Ranking, and Recommendation,” July 2010. [http://slidesha.re/bOCy4Q]
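The same_likes traversal and the group count can be traced by hand with a short plain Java sketch. The edge list below is an assumption: the slide's diagram does not survive in this transcript, so the graph was chosen to reproduce the listed results (E, F, F, G). The sketch also drops the start vertex from its own results, which is a choice made here to match that listing. SameLikesSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "who likes what I like", followed by a group count.
class SameLikesSketch {
    // Assumed likes edges as {person, thing} pairs.
    static final String[][] LIKES = {
        {"A", "B"}, {"A", "C"}, {"A", "D"},
        {"E", "B"}, {"F", "C"}, {"F", "D"}, {"G", "D"}
    };

    // Follow likes edges out from start, then follow likes edges back in from each liked thing.
    static List<String> sameLikes(String start) {
        List<String> result = new ArrayList<>();
        for (String[] mine : LIKES) {
            if (!mine[0].equals(start)) {
                continue; // only edges leaving the start vertex
            }
            for (String[] theirs : LIKES) {
                // edges arriving at the same thing, skipping the start vertex itself
                if (theirs[1].equals(mine[1]) && !theirs[0].equals(start)) {
                    result.add(theirs[0]);
                }
            }
        }
        return result;
    }

    // Count how many times each vertex was reached; more paths means more shared likes.
    static Map<String, Integer> groupCount(List<String> items) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> reached = sameLikes("A");
        System.out.println(reached);             // prints [E, F, F, G]
        System.out.println(groupCount(reached)); // prints {E=1, F=2, G=1}
    }
}
```

F is reached along two distinct paths, which is why it ranks as most similar to A.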
121. 121. Rexster: A RESTful Graph Shell reXster • Allows Blueprints graphs to be exposed through a RESTful API (HTTP). • All communication is via JSON. • Supports stored traversals written in raw Pipes or Gremlin. • Supports ad hoc traversals represented in Gremlin. • Provides “helper classes” for performing search-, score-, and rank-based traversal algorithms which, in concert, support recommendation.
122. 122. Rexster: URI Patterns • http://localhost/graph/vertices: all the vertices in the graph • http://localhost/graph/vertices/1: vertex with id 1 in the graph. • http://localhost/graph/vertices/1/outE: outgoing edges of vertex with id 1. { "results": { "_type":"vertex", "_id":"1", "name":"aaron", "type":"person" }, "query_time":0.1537 }
123. 123. Typical TinkerPop Graph Stack GET http://{host}/{resource} Neo4j NativeStore TinkerGraph
124. 124. Conclusion • Property graphs are convenient structures for modeling the real-world. • Graph databases provide index-free adjacency to ensure speedy traversal over graphs. • The graph is such a general data structure that it can be used for numerous applications. • TinkerPop provides a database agnostic stack of technologies for working with property graphs.
125. 125. Acknowledgements • Research collaborators: Daniel Steinbock (Stanford), Jennifer H. Watkins (LANL), Alberto Pepe (Harvard), Joshua Shinavier (RPI), Johan Bollen (LANL), Herbert Van de Sompel (LANL). • TinkerPop contributors: Pavel Yaskevich (Riptano), Stephen Mallette (Independent), Darrick Wiebe (Independent), Alex Averbuch (Swedish Institute of CS), Peter Neubauer (Neo4j). • Others: Emil Eifrem (Neo4j), Luca Garulli (Orient Technologies), Aaron Patterson (AT&Ti).