Improving Graph Based Entity Resolution Using Data Mining and NLP
Hello, I’m David Bechberger
Architect and Developer
● Distributed systems
● High performance, low latency big data platforms
● Graph Databases
● Teach and mentor fellow developers
www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger
Entity Resolution
What is Entity Resolution?
The process of linking digital entities in data to real world entities.
I am known by many names but you may call me:
● Data referencing
● Record Linkage
● Canonicalization
● Coreference resolution
● Merge/purge
● Entity Clustering
● ….
Why is it Hard?
● Structured versus Unstructured
● Name Ambiguity
● Typos/Transposition/Data Errors
● Missing/Incomplete Data
● Changing Data
● Abbreviations
Two types of ER problems
● Ones with canonical data
● Ones without canonical data
Typical Entity Resolution Steps
● Deduplication
● Canonicalization/Standardization
● Blocking/Clustering
● Linking Records
Wait, I thought we were talking about graphs?
Example Graph Entity Resolution Problems
● Master Data Management
● Linking Customers
● Recommendation Engines
● Intrusion Detection
● Fraud analysis
What are we talking about today?
How can Data Mining/NLP help?
● String Similarity
● Named Entity Recognition
● Shingling
● Active/Machine Learning
How can graphs help?
● Aggregating Traversals
● Pattern Matching
● Inferring Relationships
● Path Traversals
● Clustering
Example - Product Catalogs
Problem - Matching Product Data
● Product catalog data from Amazon and Google*
● Already deduplicated
● ~1300 Amazon Products, ~3200 Google Products
● Contains a list of perfect matches for testing against
*Datasets from the Database Leipzig Group, available at: https://dbs.uni-leipzig.de/de/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution
Goal
Match Amazon data with Google data to build out the basis for a
master data management solution
What are we starting with?
Title | Manufacturer | Description
clickart 950 000 - premier image pack (dvd-rom) | broderbund | (blank)
ca international - arcserve lap/desktop oem 30pk | computer associates | oem arcserve backup v11.1 win 30u for laptops and desktops
learning quickbooks 2007 | intuit | learning quickbooks 2007
eu063av aba microsoft windows xp professional | (blank) | hp eu063av aba : usually ships in 24 hours...
[Figure: graph schema. Product vertex (ID, Title, Description, Origin) linked by a built_by edge to a Manufacturer vertex (Name)]
How are we going to get there?
1. Bipartite and Pattern Matching
2. Iteratively add attributes to data
3. Try and match on weighted attributes
Bipartite/Pattern Matching using Gremlin
Bipartite Graph Matching
● Matched on exact titles
● Found 216 matches
[Figure: QuickBook and TurboTax product nodes]
Bipartite Graph Matching
// Group products by title and keep only titles shared by more than one product
g.V().hasLabel('product').
  group().
    by(values('title').fold()).
  unfold().
  filter(select(values).count(local).is(gt(1)))
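The same exact-title grouping can be tried outside the graph. This is a minimal Python sketch, not the code used in the talk: the record layout (`id`, `origin`, `title`) and the sample rows are assumptions made for illustration, and unlike the Gremlin above it also requires that a matching title appears in both sources.

```python
from collections import defaultdict


def exact_title_matches(products):
    """Group products by normalized title; keep titles present in both sources."""
    by_title = defaultdict(list)
    for product in products:
        # Lower-case and collapse whitespace so trivial differences do not block a match
        key = " ".join(product["title"].lower().split())
        by_title[key].append(product)
    return {
        title: group
        for title, group in by_title.items()
        if {p["origin"] for p in group} >= {"amazon", "google"}
    }


# Hypothetical sample records, not rows from the benchmark dataset
sample = [
    {"id": "a1", "origin": "amazon", "title": "Learning QuickBooks 2007"},
    {"id": "g1", "origin": "google", "title": "learning quickbooks 2007"},
    {"id": "g2", "origin": "google", "title": "microsoft windows xp professional"},
]
matches = exact_title_matches(sample)
```

Here `matches` holds a single title, shared by records `a1` and `g1`.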
Graph Pattern Matching
[Figure: QuickBook and TurboTax product nodes with built_by edges to manufacturer nodes Intuit Corp and Intuit]
● Matched on manufacturer + fuzzy match on title
● Found 354 matches
Graph Pattern Matching
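// Schematic Gremlin, not runnable as-is: has() is shown taking a traversal as its value,
// and tokenFuzzy is a provider search predicate rather than a core TinkerPop one.
g.V().hasLabel('product').or(
  // products that share a manufacturer value with at least one other product
  __.group().by(values('manufacturer').fold()).unfold().
     filter(select(values).count(local).is(gt(1))),
  __.match(
    __.as('a').has('origin', 'amazon').as('amazon'),
    // exact title match against the Google titles
    __.as('amazon').has('title', V().has('origin', 'google').values('title')).as('google'),
    // fuzzy title match within an edit distance of 2
    __.as('amazon').has('title', tokenFuzzy(V().has('origin', 'google').values('title'), 2))
  )
)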
Find Canonical Manufacturers
Find Manufacturers in Amazon data
● Fuzzy match to find unique manufacturers
● Create and link nodes to unique manufacturers
● Found 227 manufacturers
[Figure: QuickBook product node with built_by edges to manufacturer nodes Intuit and Intuit Corp, labeled Canonical and Original]
Find Manufacturers in Google data
● ~7% had manufacturers (232/3229)
● 224 products matched existing manufacturers
● Found 8 more unique manufacturers
Validate Canonical Manufacturers
● Review and validate canonical data
● Add edges between data that represent the same entity
[Figure: manufacturer nodes Sony, Sony Corp, Intuit Corp, and Intuit joined by is_same_as edges]
Build out the Canonical Manufacturer graph
● Found 235 unique manufacturers
● 14 aliases
● Canonical Manufacturers added to graph with aliases
[Figure: Intuit Corp linked to Intuit by an is_same_as edge, alongside Microsoft and Sony nodes]
What does our graph look like now?
[Figure: QuickBook and TurboTax products with built_by edges to manufacturer nodes Intuit and Intuit Corp, which are joined by is_same_as; Microsoft and Sony nodes are also shown]
Manufacturer Pattern Matching
● Added Manufacturer Traversal into Pattern Match
● Found 534 matches
[Figure: QuickBook and TurboTax products connected through manufacturer nodes Intuit Corp and Intuit, labeled Canonical and Original]
Graph Pattern matching
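// Schematic Gremlin, as on the earlier pattern-matching slide, now with a manufacturer traversal.
// Not runnable as-is: has() is shown taking a traversal as its value, and tokenFuzzy is a
// provider search predicate rather than a core TinkerPop one.
g.V().hasLabel('product').or(
  __.group().by(values('manufacturer').fold()).unfold().
     filter(select(values).count(local).is(gt(1))),
  __.match(
    __.as('a').has('origin', 'amazon').as('amazon'),
    __.as('amazon').has('title', V().has('origin', 'google').values('title')).as('google'),
    __.as('amazon').has('title', tokenFuzzy(V().has('origin', 'google').values('title'), 2))
  ),
  // Manufacturer traversal as written on the slide; the labels named are edge labels,
  // so the intent appears to be following built_by and is_same_as edges for a few hops
  V().repeat(out().hasLabel(within('built_by', 'is_same_as'))).limit(3)
)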
~41%
● Found 534 of 1300
Use NLP/Data Mining to add attributes
A quick word on Similarity Measurements
● Many different algorithms, each solves a different problem
● Know your data
● Research the options
● Choose the right one for your data
Most Google Data Missing Manufacturer
Or is it?
Example:
eu063av aba microsoft windows xp professional - license and media - 1 user - cto - english
Named Entity Recognition
Process of classifying entities in strings into known categories
microsoft xbox 360: forza motorsport 2
sony playstation 2: karaoke revolution: american idol bundle
ibm(r) viavoice(r) advanced edition 10
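As a rough illustration of pulling a manufacturer out of a title, here is a plain dictionary lookup in Python. It is a deliberately simple stand-in, not the trained NER model described in the speaker notes: the `CANONICAL_MANUFACTURERS` list and the `find_manufacturers` helper are hypothetical names made up for this sketch.

```python
import re

# Hypothetical canonical manufacturer list; the talk derived its own list from the data
CANONICAL_MANUFACTURERS = ["microsoft", "sony", "ibm", "intuit"]


def find_manufacturers(title, manufacturers=CANONICAL_MANUFACTURERS):
    """Return the known manufacturers that appear as whole tokens in a title."""
    # Split on anything that is not a letter or digit, so "ibm(r)" yields "ibm" and "r"
    tokens = set(re.findall(r"[a-z0-9]+", title.lower()))
    return [name for name in manufacturers if name in tokens]


titles = [
    "microsoft xbox 360: forza motorsport 2",
    "sony playstation 2: karaoke revolution: american idol bundle",
    "ibm(r) viavoice(r) advanced edition 10",
]
found = [find_manufacturers(t) for t in titles]
```

A lookup like this only finds names it already knows and cannot handle misspellings or multi-word names, which is where a trained model or a fuzzy comparison comes in.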
Damerau-Levenshtein Distance
● Measures the edit distance between two strings
● Handles insertions, deletions, transpositions and substitutions
[Figure: edit distances between Sony and the misspellings Snoy and Snyo, with edges labeled 1, 2, and 2]
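To make the edit operations concrete, here is a minimal Python sketch of the optimal string alignment variant of Damerau-Levenshtein (the restricted form, where no substring is edited more than once). It is written from the textbook recurrence for illustration and is not the library implementation used in the talk; `osa_distance` is a name chosen for this sketch.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent characters swapped: count the swap as one edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this function, `osa_distance("Sony", "Snoy")` is 1 (one transposition) and `osa_distance("Sony", "Snyo")` is 2.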
Add distance attribute
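[Figure: QuickBook product with a built_by edge to a manufacturer node; manufacturer nodes Intuit Corp and the Canonical Intuit are shown with distance:2 and distance:3 attributes]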
Find similarity between titles
Amazon Title | Google Title
ms visual studio 2011 plus | video studio 11 plus
Spiderman 3 ps2 | activision 81935 spiderman 3 ps2
kids power fun for girls | Topic entertainment kids power fun for girls
Jaccard Index
● Measures similarity between two finite sets (A, B)
● Works on n-grams
● Calculated as intersection over union
J(A,B) = |A ∩ B| / |A ∪ B|
N=1 (Unigram)
This is a sentence → this, is, a, sentence
N=2 (Bigram)
This is a sentence → this is, is a, a sentence
N=3 (Trigram)
This is a sentence → this is a, is a sentence
A = Dragon Natural Speaking 9.0
B = Dragon Natural 9.0 Professional
|A ∪ B| = 5
|A ∩ B| = 3
Jaccard Index = 3/5 = 0.60
Jaccard Index
[Figure: Venn diagram of A and B. Shared: Dragon, Natural, 9.0. Only in A: Speaking. Only in B: Professional]
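The n-gram and intersection-over-union steps can be sketched in a few lines of Python. This is an illustrative version only, assuming whitespace tokenization and lower-casing; `word_ngrams` and `jaccard_index` are names made up for this sketch, not functions from the toolkits listed at the end of the talk.

```python
def word_ngrams(text, n=1):
    """Return the set of word n-grams (as tuples) in a whitespace-tokenized string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_index(a, b, n=1):
    """Jaccard index of two strings over their word n-gram sets."""
    set_a, set_b = word_ngrams(a, n), word_ngrams(b, n)
    union = set_a | set_b
    if not union:
        return 0.0  # both strings too short to produce any n-gram
    return len(set_a & set_b) / len(union)


score = jaccard_index("Dragon Natural Speaking 9.0", "Dragon Natural 9.0 Professional")
```

On unigrams this reproduces the 0.60 from the slide. Switching to bigrams (`n=2`) makes the measure order-sensitive, and the same pair drops to 0.2 because only "dragon natural" is shared.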
Add jaccard attribute
[Figure: QuickBook and TurboTax products with built_by edges to manufacturer nodes Intuit Corp and Intuit, annotated with jaccard:0.6]
Find similarity between descriptions
● TF-IDF finds the relative importance of words in a document
● Cosine similarity compares two vectors and gives the similarity between them
TF = (# of times a word appears) / (# of words in a document)
IDF = log( (# of documents) / (# of documents with term) )
TF-IDF
Word | TF-IDF Score
unique | 4.43
bag | 4.34
original | 2.945
professional | 1.336
Cosine similarity
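A small Python sketch of both steps, using only the standard library. It follows the TF and IDF definitions from the previous slide with the natural logarithm, which is an assumption since the slide does not give the base; the sample descriptions and the function names are made up for illustration, and the talk itself used a library for this.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn each document into a sparse {term: tf * idf} vector."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical descriptions, not rows from the benchmark dataset
docs = [
    "learning quickbooks 2007",
    "learning quickbooks 2007 intuit",
    "microsoft windows xp professional",
]
vectors = tfidf_vectors(docs)
```

Because these TF-IDF weights are never negative, the scores from this sketch fall between 0 and 1: the first two descriptions score somewhere in between, and the first and third share no words and score 0.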
Add cosine_similarity attribute
[Figure: QuickBook and TurboTax products with built_by edges to manufacturer nodes Intuit Corp and Intuit, annotated with cosine_similarity:0.75]
Putting it all together
What does our graph look like now?
[Figure: QuickBook and TurboTax products with built_by edges to manufacturer nodes; Intuit Corp is linked to Intuit by is_same_as; edges carry distance:2 and distance:3, jaccard:0.6, and cosine_similarity:0.75 attributes]
Aggregating Traversal
● Aggregate all the values into a weighted sum*
● Highest sum was the most likely match
Value = cosine_similarity + jaccard + (manufacturer simplest traversal path where distance is <=2 and path length is <=3)
*For this talk I used evenly weighted values; in practice the weights need to be calculated
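The weighted sum can be sketched in Python as below. This is an illustration under stated assumptions, not the traversal from the talk: the slide does not say how the manufacturer path term becomes a number, so the sketch treats it as 1 when a qualifying path exists and 0 otherwise. The field names, candidate values, and IDs are hypothetical.

```python
def match_score(cosine_similarity, jaccard, manufacturer_linked, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three signals; the default weights are even, as in the talk."""
    w_cos, w_jac, w_man = weights
    # Assumption: the manufacturer path contributes 1 if it exists, else 0
    manufacturer_term = 1.0 if manufacturer_linked else 0.0
    return w_cos * cosine_similarity + w_jac * jaccard + w_man * manufacturer_term


def best_match(candidates, weights=(1.0, 1.0, 1.0)):
    """Pick the candidate with the highest aggregated score."""
    return max(
        candidates,
        key=lambda c: match_score(
            c["cosine_similarity"], c["jaccard"], c["manufacturer_linked"], weights
        ),
    )


# Hypothetical candidate Google products for one Amazon product
candidates = [
    {"google_id": "g1", "cosine_similarity": 0.75, "jaccard": 0.6, "manufacturer_linked": True},
    {"google_id": "g2", "cosine_similarity": 0.80, "jaccard": 0.2, "manufacturer_linked": False},
]
winner = best_match(candidates)
```

Passing a different `weights` tuple is where learned or tuned weights would plug in.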
What does our traversal look like?
[Figure: traversal across the QuickBook and TurboTax products and the Intuit Corp and Intuit manufacturer nodes]
Value = cosine_similarity + jaccard + (traversal paths <3)
So how did we do?
~87%
● Found 1130 of 1300
● ~1.2% error rate
Where do we go from here?
Clustering/Blocking
● N-squared comparisons are expensive
● Blocking and Clustering limit comparisons to only those likely to match
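A minimal Python sketch of blocking, to show how it cuts down the comparisons. The blocking key used here (the first word of the title) is a hypothetical choice for illustration only; the talk does not prescribe one, and the sample records are made up.

```python
from collections import defaultdict
from itertools import product


def blocking_key(record):
    """Hypothetical blocking key: the first word of the title."""
    tokens = record["title"].lower().split()
    return tokens[0] if tokens else ""


def candidate_pairs(amazon, google, key=blocking_key):
    """Yield only the (amazon, google) pairs that fall into the same block."""
    blocks = defaultdict(lambda: ([], []))
    for record in amazon:
        blocks[key(record)][0].append(record)
    for record in google:
        blocks[key(record)][1].append(record)
    for left, right in blocks.values():
        yield from product(left, right)


# Hypothetical sample records
amazon = [{"title": "quickbooks pro 2007"}, {"title": "turbotax deluxe 2006"}]
google = [
    {"title": "quickbooks 2007 pro edition"},
    {"title": "turbotax 2006 deluxe"},
    {"title": "quickbooks premier 2007"},
]
pairs = list(candidate_pairs(amazon, google))
```

Here 3 pairs are compared instead of all 6. The trade-off is that a true match whose key differs (for example a title that starts with the manufacturer name) is never compared, so the key has to be chosen with the data in mind.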
Improve NLP/Data Mining Techniques
● Tune algorithms
● Find accurate weighting with Active Learning
● Locality Sensitive Hashing
Toolkits I used?
Apache Commons - https://commons.apache.org/
Java String Similarity - https://github.com/tdebatty/java-string-similarity
Apache OpenNLP - https://opennlp.apache.org/
Apache TinkerPop - http://tinkerpop.apache.org/
Thanks, any questions?
www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger

Editor's notes

  1. Test text for sizing
  2. Not an architect that just draws boxes and lines, I get my hands dirty by actually helping to build these things
  3. What this means is resolving data from one or more datasets into a canonical representation of that entity. E.g. I have Facebook, LinkedIn, Google, Twitter, etc., but there is only one singular entity that is me. Entity resolution is the process of taking each of those disparate data sources and linking them to the singular real-world "me" entity. Entity Resolution is not a new problem; it's one that has become more important as we get more and more representations of ourselves and want to mine interesting data from them.
  4. Deduplication, Record Linkage, Data referencing, Canonicalization, Coreference resolution, Merge/purge, Object identification, Entity clustering, Object consolidation, Identity uncertainty, Reference reconciliation
  5. It's not hard if you have structured, clean, and consistent data, but in reality you don't: Dave versus David, misspelled names, missing items, a wife who changed her name.
  6. Canonical examples: countries of the world (195), Fortune 500 companies. Non-canonical examples are probably the most common; the canonical list has to be made from the data. Examples are people, places, products.
  7. Not going to talk about dedupe or blocking/clustering. A little bit on canonicalization, but mostly on linking records.
  8. MDM: getting master data from multiple systems. Customers: linking customers from multiple different internal systems (email, chat, phone). Rec engines: linking sales and product data across divisions. Intrusion detection: linking IP spoofs to the same person. Fraud: linking fraudulent transactions on multiple cards to the same person.
  9. Combining the best of Graph techniques with standard data mining and NLP techniques to provide a better outcome
  10. String similarity (lots of different measures): the process of comparing two strings and finding out how similar or dissimilar they are. Named Entity Recognition: the process of classifying entities in text into predefined categories. Shingling: the process of tokenizing data to gauge similarity.
  11. Aggregating traversals: using traversals to calculate weighted sums. Pattern matching: finding patterns. Inferring relationships. Path traversals.
  12. g.V().hasLabel("product"). group(). by(values('title').fold()). unfold(). filter(select(values).count(local).is(gt(1))).count()
  13. g.V().hasLabel("product"). group(). by(values('title').fold()). unfold(). filter(select(values).count(local).is(gt(1))).count()
  14. g.V().hasLabel("product"). group(). by(values('title').fold()). unfold(). filter(select(values).count(local).is(gt(1))).count()
  15. You may wonder why we added unique manufacturers from the google data to our graph if we aren’t matching on them
  16. g.V().hasLabel("product"). group(). by(values('title').fold()). unfold(). filter(select(values).count(local).is(gt(1))).count()
  17. NER works by using labelled training set data to determine entities. Used the canonical manufacturers as training set data. Input the titles.
  18. Good for comparing shorter string segments like names
  19. TF-IDF turns each document into a vector of numbers. Values are then normalized using the dot product. Cosine similarity compares the normalized vectors.
  20. Produces a normalized vector of relative importance of words
  21. Similar scores are close to 1. Unrelated scores are close to 0. Opposites are close to -1.
  22. Summed up the distance between items with cosine similarity, jaccard index and simplest path traversal where distance<=2 and length<=3
  23. Locality Sensitive Hashing - create hash codes for data to find others most like it
  24. Apache Commons for cosine similarity and Jaccard Index. Java String Similarity for Damerau-Levenshtein. OpenNLP for tokenizing and NER. TinkerPop for traversals.