An Open-source Similar-name Finder Dallan Quass  [email_address]
What's the problem?
People can't spell unusual names Maybe a piece of mail addressed to Solverg Quast? Solverg Quast 5934 Phoenix Ave. Shoreview, MN 55126 Johnston Bros. 1256 Bristol St. Mapleton, MN 55126 Should be:  Solveig Quass
People use nicknames John Johnny Jack
Transcribers make typos Jhon
Most of our ancestors didn't know how to read or write  signature
What does it matter?
How do you find records? Johnny Snith John Smith
How do you match people? John Smith Johnny Smithe
Not a new problem
Lots of solutions: Soundex, NYSIIS, Double Metaphone, Refined Soundex, Daitch-Mokotoff, Caverphone, Levenshtein, Jaro-Winkler, Monge-Elkan, Needleman-Wunsch, Smith-Waterman
No Bullseye
Why is this so hard?
How similar are two names? We’re neighbors John Jonny Joe I don’t know those guys
First approach: Coders (Soundex, NYSIIS, Double Metaphone, Refined Soundex, Daitch-Mokotoff, Caverphone)
First approach: Coders Jim John Jane Johan Johannes
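To make the coder idea concrete, here is a minimal sketch of American Soundex, the first coder named on these slides. It is an illustrative implementation written for this transcript, not code from the talk; the function name `soundex` and the mapping name `CODES` are assumptions.

```python
# Minimal American Soundex sketch (illustrative; not taken from the talk).
CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}

def soundex(name):
    """Return the 4-character Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]                    # keep the first letter as-is
    prev = CODES.get(letters[0], "")     # digit of the previous coded letter
    for c in letters[1:]:
        if c in "HW":
            continue                     # H and W do not break a run of equal digits
        digit = CODES.get(c, "")         # vowels map to "" and reset the run
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]            # pad or truncate to 4 characters

for n in ["Jim", "John", "Jane", "Johan", "Johannes"]:
    print(n, soundex(n))
```

With this sketch, Jim, John, Jane and Johan all receive the code J500, while Johannes receives J520. A coder is fast because names can be indexed by code, but the grouping is coarse: Jim lands in the same bucket as John, and Johannes does not.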
Second approach: Distance functions (Levenshtein, Jaro-Winkler, Monge-Elkan, Needleman-Wunsch, Smith-Waterman)
Second approach: Distance functions. Jim, John, Jane, Johan, Johannes. Better results, but doesn't scale well.
Can we do better?
Warning: Machine learning ahead!
Thank you Ancestry!
A closer look at Levenshtein Jon John Bohn -1 -1
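The point of this slide can be checked with a short sketch of the standard Levenshtein distance, in which every insertion, deletion and substitution costs 1. The function name `levenshtein` is an assumption made for this example.

```python
def levenshtein(a, b):
    """Classic edit distance: each insert, delete or substitute costs 1."""
    prev = list(range(len(b) + 1))       # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,             # delete ca
                curr[j - 1] + 1,         # insert cb
                prev[j - 1] + cost,      # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Jon", "John"))   # one inserted letter
print(levenshtein("John", "Bohn"))  # one substituted letter
```

Both calls return 1, so plain Levenshtein treats Jon and Bohn as equally close to John, even though only one of them is a plausible variant. It also has to be evaluated against every candidate name, which is one reason distance functions are harder to scale than coders.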
Maximize your expectations Jon John Bohn high cost low cost Weighted Edit Distance
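A weighted edit distance can be sketched by replacing the fixed cost of 1 with per-letter costs. In the sketch below the cost tables are hand-picked, hypothetical values chosen only to reproduce the "low cost" and "high cost" labels on the slide; they are not learned weights from the talk, and the function and variable names are assumptions.

```python
# Hypothetical costs for illustration only (not learned values from the talk).
HYPOTHETICAL_SUB_COST = {("j", "b"): 1.5}   # swapping j and b is expensive
HYPOTHETICAL_INDEL_COST = {"h": 0.2}        # adding or dropping h is cheap

def weighted_edit_distance(a, b, sub_cost, indel_cost, default=1.0):
    """Edit distance where each operation's cost depends on the letters involved."""
    def sub(x, y):
        if x == y:
            return 0.0
        return sub_cost.get((x, y), sub_cost.get((y, x), default))

    def indel(x):
        return indel_cost.get(x, default)

    prev = [0.0]
    for cb in b:                             # cost of building each prefix of b from ""
        prev.append(prev[-1] + indel(cb))
    for ca in a:
        curr = [prev[0] + indel(ca)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel(ca),         # delete ca
                curr[j - 1] + indel(cb),     # insert cb
                prev[j - 1] + sub(ca, cb),   # substitute or match
            ))
        prev = curr
    return prev[-1]

print(weighted_edit_distance("jon", "john",
                             HYPOTHETICAL_SUB_COST, HYPOTHETICAL_INDEL_COST))
print(weighted_edit_distance("john", "bohn",
                             HYPOTHETICAL_SUB_COST, HYPOTHETICAL_INDEL_COST))
```

With these assumed costs, Jon is much closer to John than Bohn is. With empty cost tables the function falls back to ordinary Levenshtein distance.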
Learn to classify
Wait, isn't this just another distance function? Distance functions don't scale, right?
Right
Back to the basics
x     f(x)
-5    -1
-3    4.5
0     7
2     3.5
4     2
Long tail
Long tail 200,000 Surnames  70,000 Given names ≤   1/5,000,000 names
Long tail Use distance function with table here Use coder here
Result: Table initialized by a function Dallan:  Dalana Daleen Dalen Dalin … Talan Tallon Ryan:  Aaran Aran Arrin … Rian Riana ...
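The combination described on the preceding slides (a table filled in by a distance function for common names, and a coder for the long tail) can be sketched as follows. Everything here is a toy stand-in: `edit_distance` is plain Levenshtein rather than the learned function from the talk, `crude_code` is a made-up coder, the threshold of 2 is arbitrary, and the misspelling "solvieg" is hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance, standing in for a learned distance function."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def crude_code(name):
    """Stand-in coder: first letter, then the remaining letters minus vowels, y, h, w."""
    name = name.lower()
    return name[:1] + "".join(c for c in name[1:] if c not in "aeiouyhw")

def build_table(common_names, max_distance=2):
    """Precompute similar-name sets offline by comparing every pair of common names."""
    table = {name: set() for name in common_names}
    for i, a in enumerate(common_names):
        for b in common_names[i + 1:]:
            if edit_distance(a, b) <= max_distance:
                table[a].add(b)
                table[b].add(a)
    return table

def similar_names(name, table, all_names):
    """Look up common names in the table; fall back to the coder for rare names."""
    name = name.lower()
    if name in table:
        return sorted(table[name])
    code = crude_code(name)
    return sorted(n for n in all_names if n != name and crude_code(n) == code)

common = ["dallan", "dalen", "dalin", "ryan", "rian"]
all_names = common + ["solveig"]          # "solveig" plays the role of a rare name
table = build_table(common)

print(similar_names("Dallan", table, all_names))
print(similar_names("Solvieg", table, all_names))
```

The expensive pairwise comparison happens once, when the table is built, so a lookup at query time is a dictionary access. Because the result is an ordinary mapping, entries can also be added or removed by hand after it has been generated.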
A nice thing about tables... Dallan:  Dalana Daleen Dalen Dalin … Talan Tallon Ryan:  Aaran Aran Arrin … Rian Riana ...
Add to the table Nicknames BehindTheName.com The New American Dictionary of Baby Names by Leslie Dunking and William Gosling A Dictionary of Surnames by Patrick Hanks and Flavia Hodges WeRelate community
Thank you BehindTheName.com! Fascinating  Family Trees for given names
Result: Soundex vs. our approach, precision and recall. Given names: 28% decrease in false negatives. Surnames: 28% decrease in false negatives. Chart values as extracted: 97, 65, 97, 74, 89, 68, 89, 77.
Who is using it?
WeRelate.org
Continuous improvement
Continuous improvement
Community oversight
How do I use it?
Roadmap
Future work
Future work: Remove “chaff” variants from common names
Conclusion. Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license.
 
