Document Classification with Neo4j 
(graphs)-[:are]->(everywhere) 
© All Rights Reserved 2014 | Neo Technology, Inc. 
@kennybastani 
Neo4j Developer Evangelist
Agenda 
• Introduction to Neo4j 
• Introduction to Graph-based Document Classification 
• Graph-based Hierarchical Pattern Recognition 
• Generating a Vector Space Model for Recommendations 
• Graphify for Neo4j 
• U.S. Presidential Speech Transcript Analysis 
Introduction to Neo4j
The Property Graph Data Model
Nodes:
• John (name: John, age: 27)
• Sally (name: Sally, age: 32)
• Book (title: Graph Databases, authors: Ian Robinson, Jim Webber)
Relationships:
• FRIEND_OF (since: 01/09/2013), shown twice in the diagram
• HAS_READ (on: 2/03/2013, rating: 5)
• HAS_READ (on: 02/09/2013, rating: 4)
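As a rough illustration of the property graph model, the nodes and relationships from the diagram could be held in plain Python structures. This is only a sketch: the identifiers `john`, `sally`, `book`, and the helper `relationships_from` are hypothetical, and which reader gave which rating is an assumption, since the slide text does not say. Neo4j itself stores and queries this data natively through Cypher.

```python
# Nodes: each has an id and a map of properties.
nodes = {
    "john": {"name": "John", "age": 27},
    "sally": {"name": "Sally", "age": 32},
    "book": {"title": "Graph Databases",
             "authors": ["Ian Robinson", "Jim Webber"]},
}

# Relationships: (start node, type, end node, properties).
# Which reader gave which rating is an assumption made for this sketch.
relationships = [
    ("john", "FRIEND_OF", "sally", {"since": "01/09/2013"}),
    ("sally", "FRIEND_OF", "john", {"since": "01/09/2013"}),
    ("john", "HAS_READ", "book", {"on": "2/03/2013", "rating": 5}),
    ("sally", "HAS_READ", "book", {"on": "02/09/2013", "rating": 4}),
]

def relationships_from(node_id, rel_type):
    """Return (end node properties, relationship properties) pairs."""
    return [(nodes[end], props)
            for start, kind, end, props in relationships
            if start == node_id and kind == rel_type]

friends_of_john = [n["name"] for n, _ in relationships_from("john", "FRIEND_OF")]
```

Both nodes and relationships carry their own properties, which is the point of the model: the rating lives on the HAS_READ relationship, not on the person or the book.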
The Relational Table Model 
Tables: Customers, Customer_Accounts, Accounts
The Neo4j Browser
Neo4j Browser - finding help
http://localhost:7474/
Execute Cypher, Visualize
Introduction to Document Classification
Document Classification 
Automatically assign a document to one or more classes 
Documents may be classified according to their subjects or 
according to other attributes 
Automatically classify unlabeled documents to a set of relevant 
classes using labeled training data 
Example Use Cases for Document Classification
Sentiment Analysis for Movie Reviews
Scenario: A movie website allows users to submit reviews describing what they
either liked or disliked about a particular movie.
Problem: The user reviews are unstructured text.
How do I automatically generate a score indicating whether the review was
positive or negative?
Solution: Train a natural language parsing model on a dataset of previous
reviews that have been labeled as either positive or negative.
Recommend Relevant Tags 
Scenario: A Q/A website allows users to submit questions and receive answers 
from other users. 
Problem: Users sometimes do not know what tags to apply to their questions in
order to increase discoverability for receiving answers.
Solution: Automatically recommend the most relevant tags for a question by
classifying its text with a model trained on previous questions.
Recommend Similar Articles 
Scenario: A news website provides hundreds of new articles a day to users on a 
broad range of topics. 
Problem: The site needs to increase user engagement and time spent on the site. 
Solution: Train natural language parsing models for daily articles in order to 
provide recommendations for highly relevant articles at the bottom of each page. 
How Automated Document Classification Works
Step 1: Create a Training Dataset
Supervised Learning
Assign a set of labels that describes the document’s text
[Diagram: four Documents linked to Labels X, Y, and Z]
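A labeled training dataset of this kind might be represented as below. This is a minimal sketch with made-up texts; the field names `text` and `labels` and the helper `documents_by_label` are assumptions for illustration, not part of any particular tool.

```python
from collections import defaultdict

# Made-up training examples: each pairs a document's text with its labels.
training_data = [
    {"text": "Graphs are everywhere.", "labels": ["X", "Y"]},
    {"text": "Documents can carry several labels.", "labels": ["Y"]},
    {"text": "Labels describe the text.", "labels": ["Y", "Z"]},
    {"text": "Training data must be labeled.", "labels": ["X"]},
]

def documents_by_label(dataset):
    """Group document texts under each label assigned to them."""
    grouped = defaultdict(list)
    for example in dataset:
        for label in example["labels"]:
            grouped[label].append(example["text"])
    return dict(grouped)

by_label = documents_by_label(training_data)
```

A document may appear under more than one label, which matches the "one or more classes" definition of document classification.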
Step 2: Train a Natural Language Parsing Model
Deep Learning
Deep feature representations are selected and learned using an evolutionary algorithm
State machines represent predicates that evaluate to 0 or 1 for a text match
State machines map to classes of document labels that matched text during training
[Diagram: a hierarchy of state machines (p = State Machine) mapped to Classes X, Y, and Z]
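The idea of state machines acting as 0/1 predicates that map to classes can be sketched with regular expressions, which are themselves compiled to finite state machines. This is an illustrative stand-in only: the patterns, the class names, and the helpers `evaluate` and `matched_classes` are all hypothetical, and nothing here reproduces how features are actually learned.

```python
import re

# Each feature is a pattern (a stand-in for a state machine) plus the
# classes whose training documents it matched. All values are made up.
features = [
    (re.compile(r"\bgreat\b", re.IGNORECASE), {"positive"}),
    (re.compile(r"\bgreat acting\b", re.IGNORECASE), {"positive"}),
    (re.compile(r"\bboring\b", re.IGNORECASE), {"negative"}),
]

def evaluate(text):
    """Return one 0/1 value per feature: 1 if the pattern matches the text."""
    return [1 if pattern.search(text) else 0 for pattern, _ in features]

def matched_classes(text):
    """Collect the classes mapped to every feature that matched."""
    classes = set()
    for (pattern, labels), hit in zip(features, evaluate(text)):
        if hit:
            classes |= labels
    return classes
```

Note that the second pattern extends the first, which hints at the hierarchy: a longer feature only matches where its shorter parent also matches.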
Step 3: Classify Unlabeled Documents
The natural language parsing model is used to classify other unlabeled documents
[Diagram: an Unlabeled Document compared with Classes X, Y, and Z by cos(θ); scores shown: 0.99, 0.67, 0.01]
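Classification by cosine similarity can be sketched as follows, assuming each class and the unlabeled document have already been turned into feature vectors of equal length. The vectors and the function names `cosine` and `rank_classes` are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_classes(document_vector, class_vectors):
    """Return (class, score) pairs sorted from most to least similar."""
    scores = {name: cosine(document_vector, vec)
              for name, vec in class_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up feature vectors for three classes and one unlabeled document.
class_vectors = {"X": [3, 1, 0], "Y": [1, 1, 1], "Z": [0, 0, 2]}
ranking = rank_classes([2, 1, 0], class_vectors)
```

With these vectors the document ranks closest to class X and shares no features at all with class Z, so Z scores zero.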
Hierarchical Pattern Recognition (HPR)
What is Hierarchical Pattern Recognition (HPR)?
HPR is a graph-based deep learning algorithm I created that learns deep
feature representations in linear time.
I created the algorithm to do graph-based traversals using a hierarchy of
finite state machines (FSM).
Designed for scalable performance in P time: [formula image not preserved]
Influences & Inspirations
Ray Kurzweil (Pattern Recognition Theory of Mind) + Jeff Hawkins (Hierarchical Temporal Memory) = Hierarchical Pattern Recognition
How does feature extraction work?
Hierarchical Pattern Recognition
HPR uses a probabilistic model in combination with an evolutionary algorithm
to generate hierarchies of deep feature representations.
“Deep” feature representations are learned and associated with the labels of
the documents in which each feature was discovered.
The feature hierarchy is translated into a Vector Space Model for
classification on feature vectors generated from unlabeled text.
Graph-based feature learning 
Learning new features from matches on training data
Cost Function for the Generations of Features
Reproduction occurs after a threshold of matches has been exceeded for a feature.
After replication, the cost function is applied to increase that threshold
every time the feature reproduces.
Cost function: [formula image not preserved]
The cost function uses two quantities (their symbols are not preserved):
• the current threshold on the feature node
• the minimum threshold, which I chose as 5 for new features
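The reproduction rule can be sketched as below. Because the cost function itself is not available here, the sketch takes the threshold update as a parameter; the default `double_threshold` is a placeholder chosen purely for illustration and is not the author's formula. The `Feature` class, `record_match`, and the way a child feature is created are likewise hypothetical.

```python
MIN_THRESHOLD = 5  # minimum threshold for new features, as stated above

def double_threshold(current, minimum):
    """Placeholder cost function (an assumption, not the slide's formula)."""
    return max(current * 2, minimum)

class Feature:
    def __init__(self, pattern):
        self.pattern = pattern
        self.matches = 0
        self.threshold = MIN_THRESHOLD
        self.children = []

    def record_match(self, extended_pattern, cost=double_threshold):
        """Count a match; reproduce once matches exceed the threshold."""
        self.matches += 1
        if self.matches > self.threshold:
            # Reproduction: spawn a child feature, then raise the threshold.
            self.children.append(Feature(extended_pattern))
            self.threshold = cost(self.threshold, MIN_THRESHOLD)
            return True
        return False

feature = Feature("the")
reproduced = [feature.record_match("the people") for _ in range(6)]
```

Raising the threshold after each reproduction means a feature has to keep earning matches before it can spawn again, which limits how quickly the hierarchy grows.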
Vector Space Model 
Generating Feature Vectors
The natural language parsing model created during training can be
turned into a global feature index.
This global feature index is a list of Neo4j internal IDs for every feature
in the hierarchy.
Using that global feature index, a multi-dimensional vector space is
created whose number of dimensions equals the number of features in the hierarchy.
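Turning a global feature index into a feature vector might look like the sketch below. The IDs are invented, the function name `to_feature_vector` is hypothetical, and storing match counts (instead of, say, 0/1 flags or weights) is an assumption made for illustration.

```python
# Hypothetical global feature index: Neo4j internal IDs of every feature.
global_feature_index = [12, 15, 27, 31, 48]

def to_feature_vector(matched_feature_ids, index=global_feature_index):
    """Build a vector with one dimension per feature in the index.

    Each position holds the number of times that feature matched the text.
    """
    position = {feature_id: i for i, feature_id in enumerate(index)}
    vector = [0] * len(index)
    for feature_id in matched_feature_ids:
        if feature_id in position:  # ignore IDs outside the index
            vector[position[feature_id]] += 1
    return vector

vector = to_feature_vector([15, 48, 15])
```

Because every vector is built against the same index, any two of them have the same length and the same meaning per position, which is what makes them comparable.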
Relevance Rankings 
“Relevance rankings of documents in a keyword search can be 
calculated, using the assumptions of document similarities theory, by 
comparing the deviation of angles between each document vector and 
the original query vector where the query is represented as the same 
kind of vector as the documents.” - Wikipedia 
Vector-based Cosine Similarity Measure 
In practice, it is easier to calculate the cosine of the angle between the 
vectors, instead of the angle itself: 
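For reference, the standard definition of cosine similarity between a document vector and a query vector with n dimensions is given below. The symbol names d and q are chosen here for illustration.

```latex
\cos\theta
  = \frac{\mathbf{d} \cdot \mathbf{q}}{\lVert \mathbf{d} \rVert \, \lVert \mathbf{q} \rVert}
  = \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}
```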
Cosine Similarity & Vector Space Model 
Vector-based Cosine Similarity Measure 
“The resulting similarity ranges from -1 meaning exactly opposite, to 1 
meaning exactly the same, with 0 usually indicating independence, 
and in-between values indicating intermediate similarity or 
dissimilarity.” 
via Wikipedia
Graphify for Neo4j 
Graphify is a Neo4j unmanaged extension used for
document and text classification using graph-based
hierarchical pattern recognition.
https://github.com/kbastani/graphify
Example Project 
Head over to the GitHub project page and clone it to your 
local machine. 
Follow the directions listed in the README.md to install the 
extension. 
Navigate to the /examples directory of the project. 
Run:
examples/graphify-examples-author/src/java/org/neo4j/nlp/examples/author/main.java
U.S. Presidential Speech Transcript Analysis
Identify the Political Affiliation of a Presidential Speech
This example ingests a set of texts from presidential speeches, with
labels for the author of each speech, in the training phase. After building
the training models, unlabeled presidential speeches are classified in
the test phase.
The Presidents 
• Ronald Reagan 
• labels: liberal, republican, ronald-reagan 
• George H.W. Bush 
• labels: conservative, republican, bush41 
• Bill Clinton 
• labels: liberal, democrat, bill-clinton 
• George W. Bush 
• labels: conservative, republican, bush43 
• Barack Obama 
• labels: liberal, democrat, barack-obama 
Training
Each of the presidents in the example has 6 speeches to analyze.
4 of the speeches are used to build a natural language parsing model.
2 of the speeches are used to test the validity of that model.
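The 4/2 split per president can be sketched as below. The file names are placeholders, and taking the first four speeches for training is an assumption; the source does not say how the training and test speeches were chosen.

```python
def split_speeches(speeches_by_president, train_count=4):
    """Split each president's speeches into training and test sets."""
    training, testing = {}, {}
    for president, speeches in speeches_by_president.items():
        training[president] = speeches[:train_count]
        testing[president] = speeches[train_count:]
    return training, testing

# Placeholder file names; the real example ships its own transcripts.
speeches = {
    "ronald-reagan": [f"reagan-{i}.txt" for i in range(1, 7)],
    "bill-clinton": [f"clinton-{i}.txt" for i in range(1, 7)],
}
training, testing = split_speeches(speeches)
```

Keeping the test speeches out of training is what lets the two held-out speeches per president say anything about how well the model generalizes.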
Get Similar Labels/Classes 
Class Similarity: Ronald Reagan
republican 0.7182046285385341
liberal 0.644281223102398
democrat 0.4854114595950056
conservative 0.4133639188595147
bill-clinton 0.4057969121945167
barack-obama 0.323947855372623
bush41 0.3222644898334092
bush43 0.3161309849153592
Class Similarity: George H.W. Bush
conservative 0.7032274806766954
republican 0.6047256274615608
liberal 0.4439742461594541
democrat 0.39114918238853674
bill-clinton 0.3234223107986785
ronald-reagan 0.3222644898334092
barack-obama 0.2929260544514002
bush43 0.29106733975087984
Class Similarity: Bill Clinton
democrat 0.8375678825642422
liberal 0.7847858060182163
republican 0.5561860529059708
conservative 0.45365774896422445
barack-obama 0.4507676679770066
ronald-reagan 0.4057969121945167
bush43 0.365042482383354
bush41 0.3234223107986785
Class Similarity: George W. Bush
conservative 0.820636570272315
republican 0.7056890956512284
liberal 0.5075788396061254
democrat 0.4505424322086937
bill-clinton 0.365042482383354
barack-obama 0.33801949243378965
ronald-reagan 0.3161309849153592
bush41 0.29106733975087984
Class Similarity: Barack Obama
democrat 0.7668017370739147
liberal 0.7184792203867296
republican 0.4847680475425114
bill-clinton 0.4507676679770066
conservative 0.4149264161292232
bush43 0.33801949243378965
ronald-reagan 0.323947855372623
bush41 0.2929260544514002
Get involved in the Neo4j community 
http://stackoverflow.com/questions/tagged/neo4j
http://groups.google.com/group/neo4j
https://github.com/neo4j/neo4j/issues
http://neo4j.meetup.com/
(Thank You)
Get in touch
Twitter www.twitter.com/kennybastani
LinkedIn www.linkedin.com/in/kennybastani
GitHub www.github.com/kbastani

Document Classification with Neo4j

  • 1. Document Classification with Neo4j (graphs)-[:are]->(everywhere) © All Rights Reserved 2014 | Neo Technology, Inc. @kennybastani Neo4j Developer Evangelist
  • 2. Agenda: Introduction to Neo4j; Introduction to Graph-based Document Classification; Graph-based Hierarchical Pattern Recognition; Generating a Vector Space Model for Recommendations; Graphify for Neo4j; U.S. Presidential Speech Transcript Analysis
  • 3. Introduction to Neo4j
  • 4. The Property Graph Data Model
  • 5. [Figure: a graph with the nodes John, Sally, and the book Graph Databases]
  • 6. [Figure: the same graph with properties] Nodes: (name: John, age: 27); (name: Sally, age: 32); (title: Graph Databases, authors: Ian Robinson, Jim Webber). Relationships: two FRIEND_OF relationships (since: 01/09/2013); HAS_READ (on: 2/03/2013, rating: 5); HAS_READ (on: 02/09/2013, rating: 4)
  • 7. The Relational Table Model
  • 8. [Figure: the tables Customers, Customer_Accounts, and Accounts]
  • 9. The Neo4j Browser
  • 10. Neo4j Browser - finding help: http://localhost:7474/
  • 11. Execute Cypher, Visualize
  • 12. Introduction to Document Classification
  • 13. Document Classification: Automatically assign a document to one or more classes. Documents may be classified according to their subjects or according to other attributes. Automatically classify unlabeled documents to a set of relevant classes using labeled training data.
  • 14. Example Use Cases for Document Classification
  • 15. Sentiment Analysis for Movie Reviews. Scenario: A movie website allows users to submit reviews describing what they either liked or disliked about a particular movie. Problem: The user reviews are unstructured text. How do I automatically generate a score indicating whether the review was positive or negative? Solution: Train a natural language parsing model on a dataset of previous reviews that have been labeled as either positive or negative.
  • 16. Recommend Relevant Tags. Scenario: A Q/A website allows users to submit questions and receive answers from other users. Problem: Users sometimes do not know which tags to apply to their questions to make them more discoverable and more likely to receive answers. Solution: Automatically recommend the most relevant tags for a question by classifying its text with a model trained on previous questions.
  • 17. Recommend Similar Articles. Scenario: A news website provides hundreds of new articles a day to users on a broad range of topics. Problem: The site needs to increase user engagement and time spent on the site. Solution: Train natural language parsing models for daily articles in order to provide recommendations for highly relevant articles at the bottom of each page.
  • 18. How Automated Document Classification Works
  • 19. Step 1: Create a Training Dataset (Supervised Learning). Assign a set of labels that describes the document’s text. [Figure: documents connected to the labels X, Y, and Z]
  • 20. Step 2: Train a Natural Language Parsing Model (Deep Learning). Deep feature representations are selected and learned using an evolutionary algorithm. State machines represent predicates that evaluate to 0 or 1 for a text match. State machines map to classes of document labels that matched text during training. [Figure: p = State Machine; a hierarchy of state machines p mapped to the classes X, Y, and Z]
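
The predicate idea on this slide can be illustrated with a minimal sketch. It assumes regular expressions as a stand-in for the state machines; the feature names, patterns, and labels below are hypothetical and are not taken from the slides or from Graphify.

```python
import re

# Hypothetical features: each maps a pattern (standing in for a state machine)
# to the set of labels it was associated with during training.
FEATURES = {
    "great_movie": (re.compile(r"\bgreat\s+movie\b", re.IGNORECASE), {"positive"}),
    "waste_of": (re.compile(r"\bwaste\s+of\b", re.IGNORECASE), {"negative"}),
}


def match(pattern, text):
    """Predicate: return 1 if the pattern matches anywhere in the text, else 0."""
    return 1 if pattern.search(text) else 0


def matched_labels(text):
    """Collect the labels of every feature whose predicate evaluates to 1."""
    labels = set()
    for pattern, feature_labels in FEATURES.values():
        if match(pattern, text):
            labels |= feature_labels
    return labels
```

Calling `matched_labels("What a great movie!")` returns the labels attached to the one matching feature. How features are actually learned and arranged into a hierarchy is not covered by this sketch.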
  • 21. Step 3: Classify Unlabeled Documents. The natural language parsing model is used to classify other unlabeled documents. [Figure: an unlabeled document compared with the classes X, Y, and Z by cos(θ), with scores 0.99, 0.67, and 0.01]
  • 22. Hierarchical Pattern Recognition (HPR)
  • 23. What is Hierarchical Pattern Recognition (HPR)? HPR is a graph-based deep learning algorithm I created that learns deep feature representations in linear time. It performs graph-based traversals using a hierarchy of finite state machines (FSMs). Designed for scalable performance in P time. [Formula image not preserved]
  • 24. Influences & Inspirations: Ray Kurzweil (Pattern Recognition Theory of Mind) + Jeff Hawkins (Hierarchical Temporal Memory) = Hierarchical Pattern Recognition
  • 25. How does feature extraction work? HPR uses a probabilistic model in combination with an evolutionary algorithm to generate hierarchies of deep feature representations. “Deep” feature representations are learned and associated with the labels of the documents in which each feature was discovered. The feature hierarchy is translated into a Vector Space Model for classification on feature vectors generated from unlabeled text.
  • 26. Graph-based feature learning
  • 27. Learning new features from matches on training data
  • 28. Cost Function for the Generations of Features: Reproduction occurs after a threshold of matches has been exceeded for a feature. After replication, the cost function is applied to increase that threshold every time the feature reproduces. The cost function [formula image not preserved] is defined in terms of the current threshold on the feature node and the minimum threshold, which I chose as 5 for new features.
  • 29. [Figure only; no slide text]
  • 30. Vector Space Model
  • 31. Generating Feature Vectors: The natural language parsing model created during training can be turned into a global feature index. This global feature index is a list of Neo4j internal IDs for every feature in the hierarchy. Using that global feature index, a multi-dimensional vector space is created with a length equal to the number of features in the hierarchy.
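
A minimal sketch of how a global feature index could be turned into a feature vector. The function name, the made-up node IDs, and the choice of binary (0/1) weights are assumptions for illustration; the slides do not say how Graphify weights each dimension.

```python
def build_feature_vector(global_feature_index, matched_feature_ids):
    """Return a vector with one slot per feature in the global index.

    A slot is 1 if that feature matched the text, else 0 (binary weights
    are an assumption made for this sketch).
    """
    # Map each feature ID to its fixed position in the vector space.
    position = {fid: i for i, fid in enumerate(global_feature_index)}
    vector = [0] * len(global_feature_index)
    for fid in matched_feature_ids:
        if fid in position:
            vector[position[fid]] = 1
    return vector


# Hypothetical internal IDs for four features in the hierarchy.
global_feature_index = [12, 40, 57, 103]
vector = build_feature_vector(global_feature_index, {40, 103})
```

Because every vector is built against the same index, vectors for different documents and classes have the same length and can be compared dimension by dimension.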
  • 32. Relevance Rankings: “Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as the same kind of vector as the documents.” (Wikipedia)
  • 33. Vector-based Cosine Similarity Measure: In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself.
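
For reference, the standard definition of the cosine of the angle between two vectors is given here; the symbol names d (document vector) and q (query vector) are chosen for illustration and are not taken from the slide.

```latex
\cos\theta
  = \frac{\mathbf{d}\cdot\mathbf{q}}{\lVert\mathbf{d}\rVert\,\lVert\mathbf{q}\rVert}
  = \frac{\sum_{i=1}^{n} d_i\, q_i}{\sqrt{\sum_{i=1}^{n} d_i^{2}}\;\sqrt{\sum_{i=1}^{n} q_i^{2}}}
```

Here n is the number of dimensions, which in this deck corresponds to the number of features in the hierarchy.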
  • 34. Cosine Similarity & Vector Space Model
  • 35. Vector-based Cosine Similarity Measure: “The resulting similarity ranges from -1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.” (via Wikipedia)
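
A minimal sketch of classification by cosine similarity, assuming each class and each unlabeled document is already represented as a numeric vector of the same length. The function names and the zero-vector convention are assumptions for illustration and do not describe Graphify's actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention for this sketch: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_classes(document_vector, class_vectors):
    """Return (label, score) pairs sorted from most to least similar."""
    scores = {
        label: cosine_similarity(document_vector, vector)
        for label, vector in class_vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With binary feature vectors every component is non-negative, so the scores fall between 0 and 1; negative values only appear when vectors can contain negative components.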
  • 36. Graphify for Neo4j
  • 37. Graphify for Neo4j: Graphify is a Neo4j unmanaged extension used for document and text classification using graph-based hierarchical pattern recognition. https://github.com/kbastani/graphify
  • 38. Example Project: Head over to the GitHub project page and clone it to your local machine. Follow the directions listed in the README.md to install the extension. Navigate to the /examples directory of the project. Run: examples/graphify-examples-author/src/java/org/neo4j/nlp/examples/author/main.java
  • 39. U.S. Presidential Speech Transcript Analysis
  • 40. Identify the Political Affiliation of a Presidential Speech: In the training phase, this example ingests a set of presidential speech texts, each labeled with the author of that speech. After building the training models, unlabeled presidential speeches are classified in the test phase.
  • 41. The Presidents: Ronald Reagan (labels: liberal, republican, ronald-reagan); George H.W. Bush (labels: conservative, republican, bush41); Bill Clinton (labels: liberal, democrat, bill-clinton); George W. Bush (labels: conservative, republican, bush43); Barack Obama (labels: liberal, democrat, barack-obama)
  • 42. Training: Each of the presidents in the example has 6 speeches to analyze. 4 of the speeches are used to build a natural language parsing model. 2 of the speeches are used to test the validity of that model.
  • 43. Get Similar Labels/Classes
  • 44. Class Similarity, Ronald Reagan: republican 0.7182046285385341; liberal 0.644281223102398; democrat 0.4854114595950056; conservative 0.4133639188595147; bill-clinton 0.4057969121945167; barack-obama 0.323947855372623; bush41 0.3222644898334092; bush43 0.3161309849153592
  • 45. Class Similarity, George H.W. Bush: conservative 0.7032274806766954; republican 0.6047256274615608; liberal 0.4439742461594541; democrat 0.39114918238853674; bill-clinton 0.3234223107986785; ronald-reagan 0.3222644898334092; barack-obama 0.2929260544514002; bush43 0.29106733975087984
  • 46. Class Similarity, Bill Clinton: democrat 0.8375678825642422; liberal 0.7847858060182163; republican 0.5561860529059708; conservative 0.45365774896422445; barack-obama 0.4507676679770066; ronald-reagan 0.4057969121945167; bush43 0.365042482383354; bush41 0.3234223107986785
  • 47. Class Similarity, George W. Bush: conservative 0.820636570272315; republican 0.7056890956512284; liberal 0.5075788396061254; democrat 0.4505424322086937; bill-clinton 0.365042482383354; barack-obama 0.33801949243378965; ronald-reagan 0.3161309849153592; bush41 0.29106733975087984
  • 48. Class Similarity, Barack Obama: democrat 0.7668017370739147; liberal 0.7184792203867296; republican 0.4847680475425114; bill-clinton 0.4507676679770066; conservative 0.4149264161292232; bush43 0.33801949243378965; ronald-reagan 0.323947855372623; bush41 0.2929260544514002
  • 49. Get involved in the Neo4j community
  • 50. http://stackoverflow.com/questions/tagged/neo4j
  • 51. http://groups.google.com/group/neo4j
  • 52. https://github.com/neo4j/neo4j/issues
  • 53. http://neo4j.meetup.com/
  • 54. (Thank You)
  • 55. Get in touch: Twitter www.twitter.com/kennybastani; LinkedIn www.linkedin.com/in/kennybastani; GitHub www.github.com/kbastani

Editor's notes

  1. When we think about data, we tend to think about how things are connected. This is a natural part of how we talk about things, and also of the graph model. “This is also a graph, but with some data attached. Here: we’ve attached names to the nodes and described the type of the relationships.”
  2. “We can take this further, and attach arbitrary key/value pairs.” This is the Property Graph Model, which has the following characteristics: It contains Nodes and Relationships, both of which can contain properties (key-value pairs). Relationships are always between exactly 2 nodes. They have a type, and they are directed. “There are other graph models; however, everyone in the industry has converged on the idea that this model is the most obvious and the most useful for real humans and the applications we’re building.”
  3. Let’s review the relational table model, to see how it differs from the property graph model.
  4. Start with Customers and Accounts “We have a customer, Alice.” “She’s got 3 accounts” “To keep track of which accounts Alice owns, we need a 3rd table, to store the mapping. Typically called a join table.”
  5. Dashboard, for monitoring key stats. Node, Relationship, and Property “counts” are just estimates (they actually represent the allocated ID space for each graph entity).
  6. “The Console is where you can run graph queries, written in Cypher.” We’ll be using this starting... now.
  7. Disclaimer: This is a graph-based approach to text classification and pattern recognition. This can be done in many different ways, including SVMs, Bayesian networks, belief networks, and many other approaches. I chose to create this on top of Neo4j because, first, it’s a database and, second, it’s already structured as a network. This gives me the advantage of not having to worry about data storage.
  8. Explain how the genetic algorithm works.
  9. I chose this example project because it’s easy to get presidential speeches online and it seemed like a good example to get others going with Graphify.
  10. “Get involved with the community, attend meetups, browse our open source code libraries, including Neo4j, by visiting us on GitHub.”
  11. “Visit stackoverflow.com with the tag Neo4j to get fast answers to your questions. We have a very active community of contributors that provide thorough answers 24/7. If you get stuck, make sure you head there.”
  12. “The same goes for Google Groups, if you prefer that format over Stack Overflow.”
  13. “You can visit us on GitHub to submit or browse issues.”
  14. “Finally, I urge you to check out our website’s meetup page to find out where meetups are happening all around the world. We also encourage you to share your experience with Neo4j, your applications, and your use cases by speaking at a local meetup. If you’re interested, please reach out to me; my contact details are on the next slide.”
  15. “Thank you for spending some time with me and learning about Neo4j and Cypher.”
  16. “Get in touch with me about meetups and Neo4j community events happening around the world.” “I’ll now open up the floor to questions.”