Dan McCreary gave a talk on AI, knowledge representation, and graph databases. He discussed how knowledge representation is a key focus for AI and many feel over half their work is understanding knowledge structures. In 2012, Google published a paper that consolidated many forms of knowledge representation into a single structure called a labeled property graph. This talk described how labeled property graphs became popular and how they are being used to build knowledge graphs that can be used by AI systems and researchers. The talk also discussed challenges of transferring siloed project knowledge into reusable knowledge structures.
AI, Knowledge Representation and Graph Databases: Key Trends in Data Science
1. AI, Knowledge Representation
and Graph Databases -
Key Trends in Data Science
Dan McCreary
Social Data Science Meetup
March 2nd, 2019
2. Talk Description
Knowledge Representation is a key focus for most modern AI texts. Many AI
experts feel that over half of their work is understanding how to find the
right knowledge structures to build intelligent agents that can continuously
learn and respond to changing events in their world. In 2012, a paper
published by Google started a consolidation of the many diverse forms of
knowledge representation into a single general-purpose structure called a
labeled property graph.
This talk will describe the key events behind this movement and show how a
new generation of data scientists will be needed to build and maintain
corporate knowledge graphs that contain uniform, normalized and highly
connected data sets for use by researchers and intelligent agents. We will
also discuss the challenges of transferring siloed project knowledge to
reusable structures.
3. Hello, my name is
dan.mccreary@gmail.com
• Distinguished Engineer in AI and Graph Technologies at
Optum’s Advanced Technology Collaborative
• Co-founder of "NoSQL Now!" conference (now part of
Dataversity)
• Author of "Making Sense of NoSQL" (w. Ann Kelly)
• 15+ years of working with non-tabular knowledge
representations
• Background in solution architecture, metadata management,
NLP, semantics, text analytics and knowledge
representation for AI
• Disclaimer: All opinions are my own and may not reflect the
views of my employer
4. Graph: a "NoSQL" Data Architecture
Six database architecture patterns: Relational, Analytical (OLAP),
Key-Value, Column-Family, Document, and Graph.
See Chapter 1: https://www.manning.com/books/making-sense-of-nosql
6. Relational vs. Graph
Relational (row store):
1. The atomic unit of storage is a row of a table
2. Data is appended one row at a time
3. All columns within a table must have the same structure; no variations
within a table are allowed
4. Table structures are fixed after design
5. The query language is SQL
6. Joins are log(N) searches against other tables
Graph:
1. The atomic units of storage are nodes and edges
2. Each node and edge may have independent properties that are determined
at run time (schema agnostic)
3. Joins between nodes and edges are computed at load time and are stored
as memory pointers
4. Each core hops through 2M edges/second (1,000x faster than joins)
5. Query languages vary, although there are some standards,
e.g. SPARQL/Cypher/Gremlin
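The join-versus-hop difference above can be sketched in plain Python (a toy illustration, not a model of database internals; all names are hypothetical):

```python
import bisect

# Relational style: a sorted index of (key, row) pairs, searched at query time.
index = sorted((f"person{i}", i) for i in range(1_000_000))
keys = [k for k, _ in index]

def relational_join(key):
    pos = bisect.bisect_left(keys, key)   # log(N) binary search per lookup
    return index[pos][1]

# Graph style: the "join" was computed at load time and stored as a pointer.
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbor = None              # direct memory reference

dan, ann = Node("Dan"), Node("Ann")
dan.neighbor = ann                        # pre-computed at load time

print(relational_join("person42"))        # 42
print(dan.neighbor.name)                  # Ann
```

The graph lookup is a single pointer dereference regardless of how many nodes exist, while the index search cost grows with the table size.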
7. Gartner on Graph Analytics
Key Analytics Trends for 2019
1. Augmented Analytics
2. Augmented Data Management
3. Continuous Intelligence
4. Explainable AI
5. Graph
6. Data Fabric
7. NLP/Conversational Analytics
8. Commercial AI/ML
9. Blockchain
10. Persistent Memory Servers
…Graph processing to continuously accelerate data
preparation and enable more complex and adaptive
data science…to efficiently model, explore and query
data with complex interrelationships across data
silos…the need to ask complex questions across data
silos which is not practical or even possible at scale
using SQL queries.
Graphs are also related to 4, 6 and 7
8. Which of the following organizations use graph databases?
• Every major airline uses a graph database to calculate fares in real-time
• Over half of retailers use graphs for product recommendations
Answer: They all do!
9. Amazon Product Graph Job Posting
As a leader in e-commerce, Amazon is building an authoritative knowledge base for every product in the world. With hundreds of
millions of customers and billions of products, Amazon will offer a challenging but fun journey to turn this big and rapidly changing data
into high-quality knowledge to impact customer experiences across Amazon from Alexa to Search to Shopping. As a member of the
Product Graph team, based in Seattle, you will play a key role in the establishment of a new platform, with opportunities to create
enormous benefit for our customers and Amazon.
10. How do we store
knowledge?
• What exactly is “knowledge”?
• How is knowledge different from raw data and
information?
• What is knowledge engineering?
• What is knowledge architecture?
• How does this relate to data science?
12. The Data Science Lifecycle
What do we mean by "Understanding"?
13. 2018: Graphs Join with Deep Learning
How did we get here?
https://arxiv.org/pdf/1806.01261.pdf
14. Why Metaphors Matter
"Metaphors drive design decisions"
We make decisions not by having a deep understanding of how technology
works, but through being exposed to the right metaphors.
Hillary Mason, Data Scientist
GraphConnect Keynote 2018
https://neo4j.com/graphconnect-2018/ around 51 minutes in
15. Four Graph Metaphors
Neighborhood Walk
(explains index-free adjacency, performance)
Knowledge Triangle
(explains data, information and knowledge)
The Open World Assumption
(explains graph integration, agility)
The Jenga Tower
(explains resilience of graphs to change)
16. How do you get to your neighbor’s house?
• You walk out your door and over to the house
• Your houses are “Adjacent” so getting there is a direct “hop”
• You don’t need to consult anyone else about how to get to a
neighbor’s house because you have the right pointer
17. Your Logical Graph Model
Dan -LIVES_NEXT_TO-> Ann
• If two physical items are related, they have a relationship arc between
them, and we model it as shown above
• We build a "logical" data model that has this link
• In a native graph, the vertices are loaded into memory and the physical
memory address of each link is also reflected in each of the nodes
18. Relational Databases Use Indexes
● You must walk to a central index system
● The index system does a "search" given the house's address
● The index system will tell you how to get to your neighbor's house
Central Index
Search: 123 Main St.
19. The Knowledge Triangle Metaphor
https://en.wikipedia.org/wiki/DIKW_pyramid
• A diagram for representing the relationships between data, information,
knowledge, and wisdom
• Too often we focus on "Big Data" and not enough on connected and
transferable knowledge
• Graphs are connected information concepts
• Wisdom is reusable across multiple contexts
• Can we capture knowledge in a form that can be reused across multiple
domains?
Pyramid layers: Data (binary, codes, data lakes); Information (concepts);
Knowledge (patterns, relationships); Wisdom (& AI)
20. From Raw Data to Wisdom with Continuous Enrichment
• Data Lake: raw data dumped from a database, log files or documents
• Information: tagged text, definitions, validity, searchable
• Knowledge Graph: connections; consistent, de-duplicated; semantics,
concepts
• Wisdom: knowledge that can be reused across multiple contexts and new
problems (transfer learning)
• Continuous enrichment: discoveries at higher levels feed back to the
lower levels
21. Structure
Definition: The arrangement of and relations between the parts or elements
of something complex
• The real world has lots of structure
• How is structure captured in a machine learning feature?
• If we take simple “features” out of the real world do we lose structure?
• Do our brains “extract” features? (answer: no)
22. The Adversarial Turtle
https://www.theverge.com/2017/11/2/16597276/google-ai-image-attacks-adversarial-turtle-rifle-3d-printed
• Use a 3D printer to print a turtle
• Place different "textures" on the shell
• Most image recognizers fail, labeling the turtle a "rifle"
• 99% of modern image recognition is just simple (but precise) texture
matching
• CNNs: "On a very fundamental level, our work highlights how far current
CNNs are from learning the 'true' structure of the world"
23. Our Brains are Graphs
100B Neurons
10K Connections per Neuron (degree)
24. Three Eras of Computing
1) Procedural Era: procedural code (rules) + data → answers and
explanations (why)
2) Machine Learning Era: data + answers → rules (10M weights)
3) Graph Era: data + knowledge + machine learning → answers and
explanations (why)
25. Graph Timeline
• 1736: Euler solves the Seven Bridges of Königsberg
• 2001: Sir Tim's vision, RDF
• 2005: On Intelligence
• 2008: W3C SPARQL
• 2010: Labeled Property Graph, Neo4j 1.0
• May 2012: Google's Knowledge Graph, "Things Not Strings"
• Sept 2012: AlexNet
• 2018: LPG in AI; graphs rise
26. The Birth of the Semantic Web
• May 2001
• Resource Description Framework (RDF)
• Keep it simple: triples all the way down (Subject -Property-> Object)
• Universal identifiers: URIs
• Ideal for data interchange
• The problem: reification. Adding a simple attribute to a relationship
can cause 10,000 SPARQL queries to become obsolete
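The reification problem can be sketched with plain Python tuples standing in for RDF triples (an illustration, not a real RDF/SPARQL stack; the names are hypothetical):

```python
# Facts are bare triples, so an edge cannot carry properties of its own.
triples = {("Dan", "LIVES_NEXT_TO", "Ann")}

def lives_next_to(store, person):
    # Every query is written against the simple triple pattern.
    return {o for (s, p, o) in store if s == person and p == "LIVES_NEXT_TO"}

print(lives_next_to(triples, "Dan"))  # {'Ann'}

# To attach "since: 2010" to that relationship, RDF reifies the statement:
# the one triple becomes four triples about a statement node...
reified = {
    ("stmt1", "rdf:subject", "Dan"),
    ("stmt1", "rdf:predicate", "LIVES_NEXT_TO"),
    ("stmt1", "rdf:object", "Ann"),
    ("stmt1", "since", "2010"),
}

# ...and the original query now finds nothing, so it must be rewritten.
print(lives_next_to(reified, "Dan"))  # set()
```

Every query written against the plain triple pattern breaks the moment the relationship is reified, which is exactly the maintenance problem described above.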
28. 2010: Neo4j 1.0 Released
• Neo4j used a new graph data model called a Labeled Property Graph (LPG)
• Each vertex and edge has its own set of properties (key-value pairs)
• Each edge must have a single type
• Vertices can have 0-N types (labels)
• Adding a new property to a relationship does not require you to rewrite
your queries! Neo4j solved the reification problem but kept the flexibility
of graphs!
• Developers LOVE Neo4j!
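A minimal sketch of the LPG model (hypothetical helper classes in Python, not Neo4j's actual API) shows why edge properties do not break existing queries:

```python
class Vertex:
    def __init__(self, labels, **props):
        self.labels = set(labels)   # 0..N labels
        self.props = props          # independent key-value properties
        self.edges = []             # direct references: index-free adjacency

class Edge:
    def __init__(self, etype, source, target, **props):
        self.type = etype           # exactly one type per edge
        self.source, self.target = source, target
        self.props = props
        source.edges.append(self)

dan = Vertex(["Person"], name="Dan")
ann = Vertex(["Person", "Author"], name="Ann")
e = Edge("LIVES_NEXT_TO", dan, ann)

def neighbors(v, etype):
    return [edge.target for edge in v.edges if edge.type == etype]

print([n.props["name"] for n in neighbors(dan, "LIVES_NEXT_TO")])  # ['Ann']

# Adding a property to the relationship does not break the query above.
e.props["since"] = 2010
print([n.props["name"] for n in neighbors(dan, "LIVES_NEXT_TO")])  # ['Ann']
```

Contrast this with the RDF reification sketch: here the edge is a first-class object, so decorating it leaves every existing traversal untouched.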
31. Google Knowledge Graph
Results: better search
• 110 billion concepts and an API to get vertices, but no edges
• Relationships are too proprietary!
33. Graphs and the "Open World"
Past: Closed World
• Everything is prohibited until it is permitted
• You can only add data that you model
Graph: Open World
• Everything is permitted until it is prohibited
• Anyone can easily add any data at any time without disruption of services
Also known as "schemaless" or "schema agnostic"
34. The Jenga Tower Metaphor
• What happens to your existing queries when you make a change to your data model?
• LPGs allow anyone to add properties to vertices or relationships without
disrupting other queries
With a fixed schema: add a property to your model and 1,000 queries need
to be rewritten
35. TigerGraph
• First commercial distributed native graph
product to fully support the LPG data
model
• Scales to 100B vertices on commodity hardware
• Support for subgraphs (lightweight
security)
• Supports distributed ACID transactions
using multi-version concurrency control
• Large library of graph algorithms
36. Sample Graph Algorithms
• Dependencies: failure chains; order of operations
• Clustering: finding related items (friends, fraud networks)
• Similarity: similar paths and patterns
• Matching/Categorizing: look for and tag specific patterns
• Flow/Cost: optimize costs based on routing; path optimization
• Centrality/Search: which nodes are the most connected or relevant?
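One family above, clustering, can be illustrated with a connected-components search over a toy fraud network (the account names and links are hypothetical):

```python
from collections import defaultdict

# Toy data: edges linking accounts that share an attribute.
edges = [("acct1", "acct2"), ("acct2", "acct3"), ("acct9", "acct8")]
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def components(adj):
    # Depth-first flood fill: each component is one cluster of related items.
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        result.append(comp)
    return result

print(sorted(sorted(c) for c in components(adj)))
# [['acct1', 'acct2', 'acct3'], ['acct8', 'acct9']]
```

The same traversal skeleton underlies many of the families listed: dependency chains, similarity neighborhoods, and centrality all start from walking edges outward from a node.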
38. Graphs Store Concept Distance
• How similar are concepts?
• How can the distance in a graph help us find the underlying intent of a
question?
• How can this help us build automated chatbots?
Concept nodes: Baby, Infant, Child, Maternity
Chat Question: We are planning to have a new [baby, infant, child], what are my benefits?
Chatbot: Here is a link to your maternity benefits.
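The concept-distance idea can be sketched with a breadth-first search over a toy concept graph (the graph and its terms are illustrative, not the talk's actual data):

```python
from collections import deque

concept_graph = {
    "baby": ["infant", "child", "maternity"],
    "infant": ["baby", "maternity"],
    "child": ["baby"],
    "maternity": ["baby", "infant"],
}

def hops(graph, start, goal):
    # Breadth-first search returns the shortest hop count, or None.
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

print(hops(concept_graph, "child", "maternity"))   # 2 (child -> baby -> maternity)
print(hops(concept_graph, "infant", "maternity"))  # 1
```

A chatbot can rank candidate intents by hop count: whichever benefit concept sits fewest hops from the user's word is the most likely intent.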
39. Normalized Google Distance (NGD)
A semantic similarity measure derived from the number of hits returned by
the Google search engine for a given set of keywords.
1. "Shakespeare" returns 130M pages
2. "Macbeth" returns 26M pages
3. "Shakespeare Macbeth" returns 20.8M pages
N is the total number of web pages searched by Google multiplied by the
average number of singleton search terms occurring on pages; f(x) and f(y)
are the number of hits for search terms x and y, respectively; and f(x, y)
is the number of web pages on which both x and y occur.
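The standard NGD formula (as published by Cilibrasi and Vitanyi) is NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}). Applied to the hit counts above; note that N here is an assumed illustrative value, not a figure from the talk:

```python
from math import log

def ngd(fx, fy, fxy, n):
    # Ratio of log differences, so the log base cancels out.
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

shakespeare, macbeth, both = 130e6, 26e6, 20.8e6
print(round(ngd(shakespeare, macbeth, both, n=50e9), 3))  # ≈ 0.242 with this assumed N
```

A small NGD means the terms co-occur almost as often as the rarer term appears at all, i.e. they are semantically close.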
40. Data Lakes Today
• Very little reuse of code to “understand” and link data
• Few people using rules engines and ML to link data
Data Lake
(Data Swamp?)
100s of Data Scientists
100s of R and Python Libraries
80% of effort is “Data Engineering”
20% is Data Science
Data Access Code
41. From Data Scientist to Knowledge Scientist
Data → Information → Knowledge Graph → Wisdom → Insights
• Data engineering (the slow path) takes 70% of your time
• The knowledge graph provides a fast path with feedback
42. Causality: The Apocryphal Cancer Gene Story
• The Smoking Gene Theory (1950): a hypothesized smoking gene drives both
the urge to smoke and lung cancer
• The Correct Causal Relationship (1969): Smoking → Tar → Lung Cancer
20 million preventable deaths. From "The Book of Why" by Judea Pearl
43. Explainable AI Models are Graphs
https://www.darpa.mil/program/explainable-artificial-intelligence
Today: Training Data → Machine Learning Process → Learned Function →
Decision or Recommendation
• Why did you do that?
• Why not something else?
• When do you succeed?
• When do you fail?
• When can I trust you?
• How do I correct an error?
Explainable AI: Training Data → New Machine Learning Process →
Explainable Model → Explanation Interface
• I understand why
• I understand why not
• I know when you succeed
• I know when you fail
• I know when to trust you
• I know why you erred
44. Model-based Machine Learning
• Model-based machine learning requires the user to create a model of the world in the form of a graph. This
model encodes the assumptions you make, such as how a change in one variable changes another variable
(causal relationships).
• Model-based machine learning allows fewer algorithms and universal inference
Traditional: many algorithms, flat data, hidden assumptions, specific
solutions
Model-based: graph data, assumptions in the graph structure, one universal
inference algorithm, general solutions
http://www.mbmlbook.com
45. 2005: On Intelligence
The key to Artificial Intelligence has
always been the representation.
Jeff Hawkins
Sensory Data
Abstract Concepts
Hierarchical Temporal Memory (HTM)
Edge Computing Converts
Sensors into Concepts
46. General CPU Hardware vs. Graph Hardware
• Most graph traversal algorithms only use simple pointer hopping
• How efficient are CPUs and GPUs at running graph algorithms?
• No need for floating point
• No need for matrix multiplication
CPU: 1,000+ instructions available (1,503 defined x86 instructions), but
only ~100 instructions used
47. GPUs are optimized for array processing
• GPUs were designed to support video games
• GPUs are optimized for highly parallel matrix multiplication
• Inefficient for graph “hop” calculations
NVIDIA "Pascal" GP100
48. What’s Coming: Graph Hardware!
Graphcore – graph in hardware - $350M in funding
49. AlexNet
• A convolutional neural network (CNN), designed by Alex Krizhevsky
• First image recognition program to use GPUs
• Beat other teams by a huge margin (10%) in 2012
50. AlexNet in Graphcore
• Each layer in the algorithm maps to a
region of the graph
• Initial layers are convolutional layers
• The colors represent connection density
51. Different Algorithms Look Different
ResNet50:
“AI Brain Scan”
https://www.wired.co.uk/gallery/machine-learning-graphcore-pictures-inside-ai
53. Dan's Predictions
• Graph technology will continue to gain relevance in analytics
• LPGs will be the dominant graph data model
• RDF/SPARQL might only be used in niche areas
• Once a standard query language is adopted, the ability to reuse
algorithms will grow quickly
• The number of graph databases and middle-tier algorithm products
will continue to grow
54. Recommendations
• Become a knowledge scientist!
• Learn a bit about graph modeling and graph algorithms
• Think about structure when doing feature design
• Understand the causal relationships between your data (Bayesian Graphs)
• Use graph databases when you have lots of relationships or want real-time
analytics (rules engines, recommendations)
• Build knowledge APIs, not just more libraries for data lakes
• Learn a few graph algorithms
• Similarity
• Clustering
• Recommendation
55. Resources
• Dan’s Medium Blog:
https://medium.com/@dmccreary
• Machine Learning with Python
• "A Comprehensive Guide to Graph Algorithms in Neo4j" by Mark Needham
& Amy E. Hodler
• Wikipedia Articles
• Graph Databases
• Similarity (Network Science)
• Google Normalized Distance
• Explainable AI
56. Thank You!
Please send e-mail to Dan.McCreary@gmail.com if you want a copy of
the slides.
Editor's notes
My background is as a solution architect. I have spent most of the last 20 years understanding how to objectively match business problems to the appropriate technologies. My focus has been on the fast-evolving area of NoSQL databases. I have also had a strong interest in AI, semantics, natural language processing and search.
OK, now let's take a step back and take a more structured look at where these tools fit into our business processes at Optum.
There are six main database architecture patterns we use when we think of a business problem.
Relational or row-stores
Analytical or OLAP
Key-Value stores – one of the simplest but most extensible data architectures
Column-family stores
Graph stores
and Document stores
Graph is just one of these six. Graph stores are often most closely related to document stores. Both Graph and document stores have the ability for new data to be added to structures without needing to remodel the data. We call these systems schema-free or schema agnostic. They are a key driver for highly agile systems.
Your systems may often draw on two or more of these systems. Databases that support multiple data models are called multi-model databases. They prevent us from having to store the same data in multiple systems for transactions, search and analytics. Multi-model systems that integrate graph technologies are also an emerging trend.
https://db-engines.com/en/ranking_categories
Now lets do a side-by-side comparison of both the traditional Relational row-store and compare some key facts. With a row store, the atomic unit of work is adding a single row at a time to a table. The key is that all the datatypes in each column must be the same. If there are dates in the third column and decimals in the fourth column all your data must conform to this standard. The table column structures and datatypes are fixed when you design your database. Once you have a million rows loaded into each table and 10,000 reports created it becomes challenging to modify the database.
Relational databases also use the SQL language, and they use join operations to dynamically calculate relationships each time the query is run. These calculations are based on binary search algorithms that run in log(N) time, where N is the number of rows in each table. As a table grows, the searches scale as the log of the number of rows.
Graph databases, on the other hand, allow you to add any number of nodes and relationships into your graph database. Each node and relationship has its own properties, but there are usually few overarching rules about what datatypes these structures can contain. Graph databases also use fixed memory pointers to store relationships between nodes. As a result, queries over relationships are fast, and there is no slow-down as the number of vertices gets bigger.
https://neo4j.com/graphconnect-2018/ around 51 minutes in
Sometimes the best way to understand how graph databases are different is using a metaphor. We call this metaphor the “Neighbor Walk” metaphor. It has proven very helpful for people that are trying to understand the performance differences between relational joins and a graph traversal.
Let's say you want to walk out your front door and over to your neighbor's house. You open the door, point your body toward the neighbor's house and walk over there. Pretty simple. Since your houses are adjacent, this is the logical way to do it.
Here is the graph “logical model” for this. You might have a vertex for your house, a relationships called LIVES_NEXT_TO as a pre-calculated relationship to your neighbor’s house.
Here are the steps that are reflective of how a relational databases does joins
Walk out the door
Walk downtown to the DMV where they have a service that tells you how to get to your neighbor’s house
When you get to the DMV you take a number
When your number comes up you get called by a search agent
You give your neighbors name to the DMV search agent
The search agent at the DMV has a list of all the people in your town sorted by their address (called the index). They search the list (using a binary search algorithm) and they finally give you the GPS coordinates of your neighbor’s house.
You take these GPS coordinates, enter them into your GPS tracker and follow the directions to Ann's house.
Now granted, the metaphor is not perfect. The speedup really depends on how many rows each table has. However, this metaphor helps you remember that directly addressing a memory location that was pre-calculated at load time can sometimes be three orders of magnitude faster than searching for something every time you need to access it. RDBMS systems try to minimize this search time by doing clever things like caching. However, the more data you have, the longer the searches take. Direct memory access will always be faster than doing a search!
Now let's also look at some architectures beyond a single-node graph.
Now let's take a look at some of the reasons that organizations are moving toward knowledge graphs as ways to connect and reuse data. The structure we use to describe this system is called the DIKW pyramid. It is a triangle with Data at the bottom, Information one level up, Knowledge (in the form of a graph) at the third level, and Wisdom at the top. Wisdom is the layer most strongly associated with AI. When we think of going to the top of the mountain and asking the gurus for advice, we are asking them to apply their knowledge to our specific problem. We are asking them to transfer knowledge to our context.
This picture follows this same DIKW pattern. However, we want to invoke the idea of raw binary data at the bottom with binary symbols that have little meaning without context.
The second layer is where we identify the nouns in our documents for data. We look for people, places and things in the byte stream. This is where we can understand isolated data – know the types, the definitions of the types and be able to validate if the 1s and 0s make sense within a narrow context.
The third layer is where we start to tie our Nouns together in a graph. It is where we build relationship links between things. It is where we might look for duplicated data (Master Data Management), where we check for consistency and we verify the patterns of connections are consistent.
The highest level is where we restructure our graph so that it comes in sub-graphs that are reusable across multiple applications. We can provide consistent APIs that pull data in consistent ways and these interfaces are reusable across many domains.
Central to this pyramid is the concept of continuous enrichment cycles. As we discover new things at a higher level, we sometimes provide feedback to lower levels.
If you go to Google and type "chest pain", you will note that a "Knowledge Summary" box appears on the right side. Note that the keyword is mapped into a preferred term called "Angina", which is the formal medical condition name. You will also note that the term "Ischemic chest pain" is shown as an alternate label. The Knowledge Summary has tabs for ABOUT, SYMPTOMS and TREATMENTS. The summary box also indicates that this is a common condition that impacts over 3 million people per year in the US.
This is an example of using a Graph to group common concepts about a topic. The keyword gets you to the right part of the graph, but the knowledge summary is a machine generated summary that has been carefully reviewed for quality by the Mayo Clinic.
Google's Knowledge Graph contains over 100 billion "facts" about things that people search for. They build this graph by harvesting information from many web pages and using both Natural Language Processing (NLP) and machine learning based on what users actually click on to find the most relevant summary information for any topic in the graph.
There is another way to describe the difference between fixed and schema-agnostic systems. It is related to the way that logic systems make assumptions about unknown data. Many relational databases use logic that implies that missing or unknown data is always false. Graph databases assume that unknown data is simply unknown. This is known as the Open World Assumption (OWA).
https://en.wikipedia.org/wiki/Open-world_assumption
http://www.mkbergman.com/852/the-open-world-assumption-elephant-in-the-room/
Dan
Explainable AI is a key trend.
https://www.darpa.mil/program/explainable-artificial-intelligence