“Hey, here are those new data files to add. I ‘cleaned’ them myself so it should be easy. Right?”
Words like these strike fear into the hearts of developers, but integrating ‘dirty’, unstructured, denormalized, and text-heavy datasets from multiple locations is becoming the de facto standard when building out data platforms.
In this talk we will look at how we can augment our graph’s attributes using techniques from data mining (e.g. string similarity/distance measures) and Natural Language Processing (e.g. keyword extraction, named entity recognition). We will then walk through an example using this methodology to demonstrate the improvements in the accuracy of the resulting matches.
2. Hello, I’m David Bechberger
Architect and Developer
● Distributed systems
● High-performance, low-latency big data platforms
● Graph databases
● Teach and mentor fellow developers
www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger
4. What is Entity Resolution?
The process of linking digital entities in data to real world entities.
5. I am known by many names, but you may call me:
● Data referencing
● Record Linkage
● Canonicalization
● Coreference resolution
● Merge/purge
● Entity Clustering
● ….
6. Why is it Hard?
● Structured versus Unstructured
● Name Ambiguity
● Typos/Transposition/Data Errors
● Missing/Incomplete Data
● Changing Data
● Abbreviations
7. Two types of ER problems
● Ones with canonical data
● Ones without canonical data
15. Problem - Matching Product Data
● Product catalog data from Amazon and Google*
● Already deduplicated
● ~1300 Amazon Products, ~3200 Google Products
● Contains a list of perfect matches for testing against
*Dataset from the Database Group Leipzig, available at: https://dbs.uni-leipzig.de/de/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution
16. Goal
Match Amazon data with Google data to build out the basis for a master data management solution
17. What are we starting with?
Title | Manufacturer | Description
clickart 950 000 - premier image pack (dvd-rom) | broderbund | (none)
ca international - arcserve lap/desktop oem 30pk | computer associates | oem arcserve backup v11.1 win 30u for laptops and desktops
learning quickbooks 2007 | intuit | learning quickbooks 2007
eu063av aba microsoft windows xp professional | hp | eu063av aba : usually ships in 24 hours...
[Diagram: Product node (ID, Title, Description, Origin) connected by a built_by edge to a Manufacturer node (Name)]
18. How are we going to get there?
1. Bipartite graph and pattern matching
2. Iteratively add attributes to the data
3. Try to match on weighted attributes
25. Find Manufacturers in Amazon data
● Fuzzy match to find unique manufacturers
● Create and link nodes to unique manufacturers
● Found 227 manufacturers
[Diagram: Original vs. Canonical — nodes “Intuit”, “Intuit Corp”, and “Quick Book” connected by built_by edges]
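The fuzzy-match step above can be sketched as a greedy grouping pass. This is a minimal Python illustration (the talk itself used Java libraries); the 0.75 similarity threshold and the sample names are assumptions for the example, not values from the talk:

```python
import difflib

def canonicalize(names, threshold=0.75):
    """Greedy fuzzy grouping: each raw name joins the first canonical
    name it is similar enough to, otherwise it starts a new one.
    The threshold is an illustrative choice, not tuned."""
    canonical = []   # canonical (lower-cased) names found so far
    links = {}       # raw name -> canonical name
    for name in names:
        key = name.strip().lower()
        match = next((c for c in canonical
                      if difflib.SequenceMatcher(None, key, c).ratio() >= threshold),
                     None)
        if match is None:
            canonical.append(key)
            match = key
        links[name] = match
    return links

# Hypothetical raw manufacturer strings
links = canonicalize(["Intuit", "Intuit Inc", "intuit", "Broderbund", "Sony"])
```

Here “Intuit”, “Intuit Inc”, and “intuit” collapse to one canonical name while “Broderbund” and “Sony” each start their own group.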
26. Find Manufacturers in Google data
● ~7% had manufacturers (232/3229)
● 224 products matched existing manufacturers
● Found 8 more unique manufacturers
27. Validate Canonical Manufacturers
● Review and validate the canonical data
● Add edges between data that represent the same entity
[Diagram: “Sony” ↔ “Sony Corp” and “Intuit” ↔ “Intuit Corp” nodes linked by is_same_as edges]
28. Build out the Canonical Manufacturer graph
● Found 235 unique manufacturers
● 14 aliases
● Canonical manufacturers added to graph with aliases
[Diagram: canonical manufacturer nodes “Intuit Corp”, “Microsoft”, and “Sony”, with alias “Intuit” linked via is_same_as]
29. What’s our graph look like now?
[Diagram: graph of “Intuit”, “Intuit Corp”, “Quick Book”, “Turbo Tax”, “Microsoft”, and “Sony” nodes connected by built_by and is_same_as edges]
30. Manufacturer Pattern Matching
● Added manufacturer traversal into pattern match
● Found 534 matches
[Diagram: Original vs. Canonical — “Intuit”, “Intuit Corp”, “Quick Book”, and “Turbo Tax” nodes]
34. A quick word on Similarity Measurements
● Many different algorithms, each solves a different problem
● Know your data
● Research the options and choose the right one for your data
35. Most Google Data Missing Manufacturer
Or is it?
Example:
eu063av aba microsoft windows xp
professional - license and media
- 1 user - cto - english
36. Named Entity Recognition
Process of classifying entities in strings into known categories
microsoft xbox 360: forza motorsport 2
sony playstation 2: karaoke revolution: american idol bundle
ibm(r) viavoice(r) advanced edition 10
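Training and running a real NER model is out of scope here. As an illustrative stand-in, a simple gazetteer lookup over the canonical manufacturer names (which the talk used as NER training data) can be sketched in Python; the gazetteer and title below are invented examples:

```python
def tag_manufacturers(title, gazetteer):
    """Find gazetteer entries that appear as token sequences in a title.
    A crude stand-in for a trained NER model (the talk used OpenNLP,
    trained on the canonical manufacturer list)."""
    tokens = title.lower().split()
    found = []
    for name in gazetteer:
        name_tokens = name.lower().split()
        n = len(name_tokens)
        # slide a window of the entry's length across the title tokens
        if any(tokens[i:i + n] == name_tokens
               for i in range(len(tokens) - n + 1)):
            found.append(name)
    return found

hits = tag_manufacturers(
    "eu063av aba microsoft windows xp professional - license and media",
    ["Microsoft", "Sony", "Intuit", "Computer Associates"])
```

A real NER model generalizes beyond the exact training strings; this lookup only matches names verbatim.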
37. Damerau-Levenshtein Distance
● Measures the edit distance between two strings
● Handles insertions, deletions, transpositions, and substitutions
Example: “Sony” → “Snoy” is distance 1 (one transposition); “Sony” → “Snyo” is distance 2
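A compact implementation (the optimal string alignment flavor of Damerau-Levenshtein) reproduces the Sony/Snoy/Snyo example. This is a Python sketch, not the Java library the talk actually used:

```python
def damerau_levenshtein(a, b):
    """Optimal string alignment variant of Damerau-Levenshtein:
    counts insertions, deletions, substitutions, and transpositions
    of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                       # delete i characters of a
    for j in range(len(b) + 1):
        d[0][j] = j                       # insert j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

d1 = damerau_levenshtein("sony", "snoy")  # one adjacent transposition
d2 = damerau_levenshtein("sony", "snyo")  # two edits
```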
39. Find similarity between titles
Amazon Title | Google Title
ms visual studio 2011 plus | video studio 11 plus
spiderman 3 ps2 | activision 81935 spiderman 3 ps2
kids power fun for girls | topic entertainment kids power fun for girls
40. Jaccard Index
● Set similarity measure between finite sets (A, B)
● Works on n-grams
● Calculated as intersection over union: J(A,B) = |A∩B| / |A⋃B|
N=1 (unigrams): “This is a sentence” → this, is, a, sentence
N=2 (bigrams): “This is a sentence” → this is, is a, a sentence
N=3 (trigrams): “This is a sentence” → this is a, is a sentence
41. Jaccard Index
A = Dragon Natural Speaking 9.0
B = Dragon Natural 9.0 Professional
|A ⋃ B| = 5
|A ∩ B| = 3
Jaccard Index = 3/5 = 0.60
[Diagram: Venn diagram of sets A and B over the tokens Dragon, Natural, Speaking, 9.0, Professional]
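The worked example above can be reproduced in a few lines; a sketch using word n-gram shingles (unigrams here):

```python
def jaccard(a, b, n=1):
    """Jaccard index over word n-grams (shingles) of two strings:
    intersection over union of the shingle sets."""
    def shingles(s):
        toks = s.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    A, B = shingles(a), shingles(b)
    return len(A & B) / len(A | B) if A | B else 0.0

score = jaccard("Dragon Natural Speaking 9.0",
                "Dragon Natural 9.0 Professional")  # 3/5 = 0.6
```

Passing n=2 or n=3 switches to bigram or trigram shingles, which penalize word-order differences more heavily.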
43. Find similarity between descriptions
● TF-IDF finds the relative importance of words in a document
● Cosine similarity compares two vectors and gives the similarity between them
44. TF-IDF
TF = (# of times a term appears in a document) / (# of terms in the document)
IDF = log( (# of documents) / (# of documents containing the term) )
Word | TF-IDF Score
unique | 4.43
bag | 4.34
original | 2.945
professional | 1.336
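Putting the two formulas together, here is a minimal sketch of TF-IDF vectorization plus cosine similarity. The document strings are invented examples, and real pipelines would add tokenization and smoothing beyond this:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vector (term -> weight) for each document, using the
    slide's formulas: TF = count / doc length, IDF = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    n_docs = len(docs)
    return [{t: (c / len(toks)) * math.log(n_docs / df[t])
             for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dicts):
    dot product divided by the product of the magnitudes."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical product descriptions
docs = ["oem arcserve backup for laptops",
        "oem arcserve backup for desktops",
        "learning quickbooks 2007"]
vecs = tfidf_vectors(docs)
```

The first two descriptions share most of their terms, so their cosine similarity is high; the third shares none, so its similarity to the others is 0.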
48. What does our graph look like now?
[Diagram: “Intuit”/“Intuit Corp” linked by is_same_as; “Quick Book” and “Turbo Tax” with built_by edges; candidate-match edges annotated with distance:2, distance:3, jaccard:0.6, and cosine_similarity:0.75]
49. Aggregating Traversal
● Aggregate all the values into a weighted sum*
● The highest sum is the most likely match
Value = cosine_similarity + jaccard + (simplest manufacturer traversal path where distance is <=2 and path length is <=3)
*For this talk I used evenly weighted values; in practice these weights need to be calculated
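The aggregation can be sketched as a simple weighted sum. The even weights mirror the talk's simplification; the function name and parameters are illustrative, and in practice the weights would be calibrated against known matches:

```python
def match_score(cosine_sim, jaccard_idx, has_manufacturer_path,
                weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the similarity signals for one candidate pair.
    Evenly weighted here, as in the talk; real systems should tune
    the weights against a labeled set of known matches."""
    w_cos, w_jac, w_path = weights
    # the graph traversal contributes a binary signal: did a short,
    # low-distance path exist through the canonical manufacturers?
    path_signal = 1.0 if has_manufacturer_path else 0.0
    return w_cos * cosine_sim + w_jac * jaccard_idx + w_path * path_signal

# values taken from the example graph on the previous slide
score = match_score(0.75, 0.6, True)
```

For each Amazon product, the Google candidate with the highest score is taken as the most likely match.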
50. What does our traversal look like?
[Diagram: traversal across the “Intuit”/“Intuit Corp” manufacturer nodes and the “Quick Book” and “Turbo Tax” products]
Value = cosine_similarity + jaccard + (traversal paths < 3)
Not an architect who just draws boxes and lines; I get my hands dirty by actually helping to build these things.
What this means is resolving data from one or more datasets into a canonical representation of that entity.
E.g. I have Facebook, LinkedIn, Google, Twitter, etc., but there is only one singular entity that is me. Entity resolution is the process of taking each of those disparate data sources and linking them to the singular real-world "me" entity.
Entity resolution is not a new problem; it's one that has become more important as we accumulate more and more representations of ourselves and want to mine interesting data from them.
Deduplication, Record Linkage, Data referencing, Canonicalization, Coreference resolution, Merge/purge, Object identification, Entity clustering, Object consolidation, Identity uncertainty, Reference reconciliation
It's easy if you have structured, clean, and consistent data, but in reality you rarely do.
Dave versus David
Misspelled names
Missing items
Wife changed name
Canonical Examples - Countries of the world (195), Fortune 500 companies
Non-canonical examples - probably the most common; the canonical list has to be built from the data
Examples are: people, places, products
Not going to talk about dedupe or blocking/clustering
A little bit on canonicalization, but mostly on linking records
MDM - Getting master data from multiple systems
Customers - linking customers from multiple different internal systems (email, chat, phone)
Rec engines - Linking sales and product data across divisions
Intrusion detection - linking IP spoofs to the same person
Fraud - Linking fraudulent transactions on multiple cards to same person
Combining the best of Graph techniques with standard data mining and NLP techniques to provide a better outcome
Lots of different techniques:
String similarity - The process of comparing two strings and finding out how similar/dissimilar they are
Named Entity Recognition - Process of classifying entities in text into predefined categories
Shingling - process of tokenizing data to gauge similarity
Aggregating Traversals - Using traversals to calculate weighted sums
Pattern Matching - find patterns
Inferring relationships
Path traversals
NER works by using labelled training set data to determine entities
Used canonical manufacturers as training set data
Input the titles
Good for comparing shorter string segments like names
TF-IDF turns each document into a vector of numbers
Vectors are normalized by their magnitude
Cosine similarity is the dot product of the normalized vectors
Produces a normalized vector of relative importance of words
Similar scores are close to 1
Unrelated scores are close to 0
Opposites are close to -1
Summed up the cosine similarity, Jaccard index, and simplest-path traversal (distance <= 2 and length <= 3) between items
Locality Sensitive Hashing - create hash codes for data to find others most like it
Apache Commons for cosine-similarity and Jaccard Index
Java Similarity for Damerau-Levenshtein
OpenNLP - for tokenizing and NER
Apache TinkerPop for traversals