“Hey, here are those new data files to add. I ‘cleaned’ them myself so it should be easy. Right?”
Words like these strike fear into the hearts of developers, but integrating ‘dirty’, unstructured, denormalized, and text-heavy datasets from multiple locations is becoming the de facto standard when building out data platforms.
In this talk we will look at how we can augment our graph’s attributes using techniques from data mining (e.g. string similarity/distance measures) and Natural Language Processing (e.g. keyword extraction, named entity recognition). We will then walk through an example using this methodology to demonstrate the improvements in the accuracy of the resulting matches.
2. Hello, I’m David Bechberger
Architect and Developer
● Distributed systems
● High-performance, low-latency big data platforms
● Graph databases
● Teach and mentor fellow developers
www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger
4. What is Entity Resolution?
The process of linking digital entities in data to real world entities.
5. I am known by many names, but you may call me:
● Data referencing
● Record Linkage
● Canonicalization
● Coreference resolution
● Merge/purge
● Entity Clustering
● ….
6. Why is it Hard?
● Structured versus Unstructured
● Name Ambiguity
● Typos/Transposition/Data Errors
● Missing/Incomplete Data
● Changing Data
● Abbreviations
7. Two types of ER problems
● Ones with canonical data
● Ones without canonical data
15. Problem - Matching Product Data
● Product catalog data from Amazon and Google*
● Already deduplicated
● ~1300 Amazon Products, ~3200 Google Products
● Contains a list of perfect matches for testing against
*Dataset from the Database Group Leipzig, available at: https://dbs.uni-leipzig.de/de/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution
16. Goal
Match Amazon data with Google data to build out the basis for a master data management solution
17. What are we starting with?
Title | Manufacturer | Description
clickart 950 000 - premier image pack (dvd-rom) | broderbund | (none)
ca international - arcserve lap/desktop oem 30pk | computer associates | oem arcserve backup v11.1 win 30u for laptops and desktops
learning quickbooks 2007 | intuit | learning quickbooks 2007
eu063av aba microsoft windows xp professional | hp | eu063av aba : usually ships in 24 hours...
[Diagram: Product node (ID, Title, Description, Origin) connected by a built_by edge to a Manufacturer node (Name)]
18. How are we going to get there?
1. Bipartite graph and pattern matching
2. Iteratively add attributes to the data
3. Try to match on weighted attributes
25. Find Manufacturers in Amazon data
● Fuzzy match to find unique manufacturers
● Create and link nodes to unique manufacturers
● Found 227 manufacturers
[Diagram: Original vs. Canonical — nodes “Intuit”, “Intuit Corp”, and “Quick Book” connected by built_by edges]
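The fuzzy-match step above can be sketched as a greedy grouping pass. This is a minimal Python illustration (the talk itself used Java libraries); the 0.75 similarity threshold and the sample names are assumptions for the example, not values from the talk:

```python
import difflib

def canonicalize(names, threshold=0.75):
    """Greedy fuzzy grouping: each raw name joins the first canonical
    name it is similar enough to, otherwise it starts a new one.
    The threshold is an illustrative choice, not tuned."""
    canonical = []   # canonical (lower-cased) names found so far
    links = {}       # raw name -> canonical name
    for name in names:
        key = name.strip().lower()
        match = next((c for c in canonical
                      if difflib.SequenceMatcher(None, key, c).ratio() >= threshold),
                     None)
        if match is None:
            canonical.append(key)
            match = key
        links[name] = match
    return links

# Hypothetical raw manufacturer strings
links = canonicalize(["Intuit", "Intuit Inc", "intuit", "Broderbund", "Sony"])
```

Here “Intuit”, “Intuit Inc”, and “intuit” collapse to one canonical name while “Broderbund” and “Sony” each start their own group.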
26. Find Manufacturers in Google data
● ~7% had manufacturers (232/3229)
● 224 products matched existing manufacturers
● Found 8 more unique manufacturers
27. Validate Canonical Manufacturers
● Review and validate the canonical data
● Add edges between data that represent the same entity
[Diagram: “Sony” ↔ “Sony Corp” and “Intuit” ↔ “Intuit Corp” nodes linked by is_same_as edges]
28. Build out the Canonical Manufacturer graph
● Found 235 unique manufacturers
● 14 aliases
● Canonical manufacturers added to graph with aliases
[Diagram: canonical manufacturer nodes “Intuit Corp”, “Microsoft”, and “Sony”, with alias “Intuit” linked via is_same_as]
29. What’s our graph look like now?
[Diagram: graph of “Intuit”, “Intuit Corp”, “Quick Book”, “Turbo Tax”, “Microsoft”, and “Sony” nodes connected by built_by and is_same_as edges]
30. Manufacturer Pattern Matching
● Added manufacturer traversal into pattern match
● Found 534 matches
[Diagram: Original vs. Canonical — “Intuit”, “Intuit Corp”, “Quick Book”, and “Turbo Tax” nodes]
34. A quick word on Similarity Measurements
● Many different algorithms, each solves a different problem
● Know your data
● Research the options and choose the right one for your data
35. Most Google Data Missing Manufacturer
Or is it?
Example:
eu063av aba microsoft windows xp
professional - license and media
- 1 user - cto - english
36. Named Entity Recognition
Process of classifying entities in strings into known categories
microsoft xbox 360: forza motorsport 2
sony playstation 2: karaoke revolution: american idol bundle
ibm(r) viavoice(r) advanced edition 10
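Training and running a real NER model is out of scope here. As an illustrative stand-in, a simple gazetteer lookup over the canonical manufacturer names (which the talk used as NER training data) can be sketched in Python; the gazetteer and title below are invented examples:

```python
def tag_manufacturers(title, gazetteer):
    """Find gazetteer entries that appear as token sequences in a title.
    A crude stand-in for a trained NER model (the talk used OpenNLP,
    trained on the canonical manufacturer list)."""
    tokens = title.lower().split()
    found = []
    for name in gazetteer:
        name_tokens = name.lower().split()
        n = len(name_tokens)
        # slide a window of the entry's length across the title tokens
        if any(tokens[i:i + n] == name_tokens
               for i in range(len(tokens) - n + 1)):
            found.append(name)
    return found

hits = tag_manufacturers(
    "eu063av aba microsoft windows xp professional - license and media",
    ["Microsoft", "Sony", "Intuit", "Computer Associates"])
```

A real NER model generalizes beyond the exact training strings; this lookup only matches names verbatim.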
37. Damerau-Levenshtein Distance
● Measures the edit distance between two strings
● Handles insertions, deletions, transpositions, and substitutions
Example: “Sony” → “Snoy” is distance 1 (one transposition); “Sony” → “Snyo” is distance 2
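A compact implementation (the optimal string alignment flavor of Damerau-Levenshtein) reproduces the Sony/Snoy/Snyo example. This is a Python sketch, not the Java library the talk actually used:

```python
def damerau_levenshtein(a, b):
    """Optimal string alignment variant of Damerau-Levenshtein:
    counts insertions, deletions, substitutions, and transpositions
    of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                       # delete i characters of a
    for j in range(len(b) + 1):
        d[0][j] = j                       # insert j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

d1 = damerau_levenshtein("sony", "snoy")  # one adjacent transposition
d2 = damerau_levenshtein("sony", "snyo")  # two edits
```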
39. Find similarity between titles
Amazon Title | Google Title
ms visual studio 2011 plus | video studio 11 plus
spiderman 3 ps2 | activision 81935 spiderman 3 ps2
kids power fun for girls | topic entertainment kids power fun for girls
40. Jaccard Index
● Set similarity measure between finite sets (A, B)
● Works on n-grams
● Calculated as intersection over union: J(A,B) = |A∩B| / |A⋃B|
N=1 (unigrams): “This is a sentence” → this, is, a, sentence
N=2 (bigrams): “This is a sentence” → this is, is a, a sentence
N=3 (trigrams): “This is a sentence” → this is a, is a sentence
41. Jaccard Index
A = Dragon Natural Speaking 9.0
B = Dragon Natural 9.0 Professional
|A ⋃ B| = 5
|A ∩ B| = 3
Jaccard Index = 3/5 = 0.60
[Diagram: Venn diagram of sets A and B over the tokens Dragon, Natural, Speaking, 9.0, Professional]
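The worked example above can be reproduced in a few lines; a sketch using word n-gram shingles (unigrams here):

```python
def jaccard(a, b, n=1):
    """Jaccard index over word n-grams (shingles) of two strings:
    intersection over union of the shingle sets."""
    def shingles(s):
        toks = s.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    A, B = shingles(a), shingles(b)
    return len(A & B) / len(A | B) if A | B else 0.0

score = jaccard("Dragon Natural Speaking 9.0",
                "Dragon Natural 9.0 Professional")  # 3/5 = 0.6
```

Passing n=2 or n=3 switches to bigram or trigram shingles, which penalize word-order differences more heavily.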
43. Find similarity between descriptions
● TF-IDF finds the relative importance of words in a document
● Cosine similarity compares two vectors and gives the similarity between them
44. TF-IDF
TF = (# of times a term appears in a document) / (# of terms in the document)
IDF = log( (# of documents) / (# of documents containing the term) )
Word | TF-IDF Score
unique | 4.43
bag | 4.34
original | 2.945
professional | 1.336
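Putting the two formulas together, here is a minimal sketch of TF-IDF vectorization plus cosine similarity. The document strings are invented examples, and real pipelines would add tokenization and smoothing beyond this:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vector (term -> weight) for each document, using the
    slide's formulas: TF = count / doc length, IDF = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    n_docs = len(docs)
    return [{t: (c / len(toks)) * math.log(n_docs / df[t])
             for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dicts):
    dot product divided by the product of the magnitudes."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical product descriptions
docs = ["oem arcserve backup for laptops",
        "oem arcserve backup for desktops",
        "learning quickbooks 2007"]
vecs = tfidf_vectors(docs)
```

The first two descriptions share most of their terms, so their cosine similarity is high; the third shares none, so its similarity to the others is 0.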
48. What does our graph look like now?
[Diagram: “Intuit”/“Intuit Corp” linked by is_same_as; “Quick Book” and “Turbo Tax” with built_by edges; candidate-match edges annotated with distance:2, distance:3, jaccard:0.6, and cosine_similarity:0.75]
49. Aggregating Traversal
● Aggregate all the values into a weighted sum*
● The highest sum is the most likely match
Value = cosine_similarity + jaccard + (simplest manufacturer traversal path where distance is <=2 and path length is <=3)
*For this talk I used evenly weighted values; in practice these weights need to be calculated
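The aggregation can be sketched as a simple weighted sum. The even weights mirror the talk's simplification; the function name and parameters are illustrative, and in practice the weights would be calibrated against known matches:

```python
def match_score(cosine_sim, jaccard_idx, has_manufacturer_path,
                weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the similarity signals for one candidate pair.
    Evenly weighted here, as in the talk; real systems should tune
    the weights against a labeled set of known matches."""
    w_cos, w_jac, w_path = weights
    # the graph traversal contributes a binary signal: did a short,
    # low-distance path exist through the canonical manufacturers?
    path_signal = 1.0 if has_manufacturer_path else 0.0
    return w_cos * cosine_sim + w_jac * jaccard_idx + w_path * path_signal

# values taken from the example graph on the previous slide
score = match_score(0.75, 0.6, True)
```

For each Amazon product, the Google candidate with the highest score is taken as the most likely match.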
50. What does our traversal look like?
[Diagram: traversal across the “Intuit”/“Intuit Corp” manufacturer nodes and the “Quick Book” and “Turbo Tax” products]
Value = cosine_similarity + jaccard + (traversal paths < 3)
Not an architect who just draws boxes and lines; I get my hands dirty by actually helping to build these things.
What this means is resolving data from one or more datasets into a canonical representation of that entity.
E.g. I have Facebook, LinkedIn, Google, Twitter, etc., but there is only one singular entity that is me. Entity resolution is the process of taking each of those disparate data sources and linking them to the singular real-world "me" entity.
Entity resolution is not a new problem; it's one that has become more important as we accumulate more and more representations of ourselves and want to mine interesting data from them.
Deduplication, Record Linkage, Data referencing, Canonicalization, Coreference resolution, Merge/purge, Object identification, Entity clustering, Object consolidation, Identity uncertainty, Reference reconciliation
It's easy if you have structured, clean, and consistent data, but in reality you rarely do.
Dave versus David
Misspelled names
Missing items
Wife changed name
Canonical Examples - Countries of the world (195), Fortune 500 companies
Non-canonical examples - probably the most common; the canonical list has to be built from the data
Examples are: people, places, products
Not going to talk about dedupe or blocking/clustering
A little bit on canonicalization, but mostly on linking records
MDM - Getting master data from multiple systems
Customers - linking customers from multiple different internal systems (email, chat, phone)
Rec engines - Linking sales and product data across divisions
Intrusion detection - linking IP spoofs to the same person
Fraud - Linking fraudulent transactions on multiple cards to same person
Combining the best of Graph techniques with standard data mining and NLP techniques to provide a better outcome
Lots of different techniques:
String similarity - The process of comparing two strings and finding out how similar/dissimilar they are
Named Entity Recognition - Process of classifying entities in text into predefined categories
Shingling - process of tokenizing data to gauge similarity
Aggregating Traversals - Using traversals to calculate weighted sums
Pattern Matching - find patterns
Inferring relationships
Path traversals
NER works by using labelled training set data to determine entities
Used canonical manufacturers as training set data
Input the titles
Good for comparing shorter string segments like names
TF-IDF turns each document into a vector of numbers
Vectors are normalized by their magnitude
Cosine similarity is the dot product of the normalized vectors
Produces a normalized vector of relative importance of words
Similar scores are close to 1
Unrelated scores are close to 0
Opposites are close to -1
Summed up the cosine similarity, Jaccard index, and simplest-path traversal (distance <= 2 and length <= 3) between items
Locality Sensitive Hashing - create hash codes for data to find others most like it
Apache Commons for cosine-similarity and Jaccard Index
Java Similarity for Damerau-Levenshtein
OpenNLP - for tokenizing and NER
Apache TinkerPop for traversals